QNetworkAccessManager: download page resources



  • Hi, Qt Project.

    I have a task to download a whole web page and store it on the computer, so I need to download the page itself and all of its resources (CSS, images, JS).

    @
    image.png
    style.css
    common.js
    @

    Questions:

    How can I download the whole web page with all its resources?

    What's the best way to implement some kind of cache manager (that's my task) which saves the page and its resources so they can be loaded again later?

    Thanks!



  • See this similar "thread with ideas on how to download a whole website":http://qt-project.org/forums/viewthread/20957, which uses QNetworkAccessManager together with QUrl.



  • You have to find the pieces of HTML that start with href= or src=. The quote after the = can be either " or ', so I split the code on href= and src= and dropped the first character of each piece. Then I cut off everything from the next ' or " onward (for example, the ">some other code... part of http://example.com/style.css">some other code...) and kept only the addresses whose ending matches a regexp for the types I actually need, such as .css, .png, etc.
    Some links look like "/images/blahblah.png", so you need to detect them (easily done with url.toString().startsWith("/")) and prepend the address of the site you downloaded the page from (for example "http://example.com" + "/images/blahblah.png").
    And don't forget to create the folder each file should be saved to; for blahblah.png that is /images inside the directory you are saving everything to.
    Hope this helps :)
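
The steps described in this post can be sketched roughly as follows. This is only a minimal illustration that uses the C++ standard library instead of Qt classes; the function names (extractLinks, hasWantedExtension, resolveLink, localPathFor) and the list of extensions are assumptions made for the example, not anything from the thread. A regexp is also a fragile way to read HTML, so treat it as a starting point.

```cpp
#include <filesystem>
#include <regex>
#include <string>
#include <vector>

// Collect the values of href=... and src=... attributes, quoted with " or '.
std::vector<std::string> extractLinks(const std::string &html)
{
    static const std::regex attr(R"re((?:href|src)\s*=\s*(["'])(.*?)\1)re",
                                 std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(html.begin(), html.end(), attr), end; it != end; ++it)
        links.push_back((*it)[2].str());   // group 2 is the text between the quotes
    return links;
}

// Keep only addresses ending in an extension we care about (hypothetical list).
bool hasWantedExtension(const std::string &link)
{
    static const std::regex wanted(R"(\.(css|js|png|jpe?g|gif)$)", std::regex::icase);
    return std::regex_search(link, wanted);
}

// Turn a root-relative link such as "/images/a.png" into an absolute address.
// Links starting with "//" and other relative forms are returned unchanged.
std::string resolveLink(const std::string &origin, const std::string &link)
{
    const bool rootRelative =
        !link.empty() && link[0] == '/' && (link.size() == 1 || link[1] != '/');
    return rootRelative ? origin + link : link;
}

// Map a root-relative link to a file path below the directory being saved to.
std::filesystem::path localPathFor(const std::filesystem::path &saveRoot,
                                   const std::string &rootRelativeLink)
{
    return saveRoot / rootRelativeLink.substr(1);   // drop the leading '/'
}
```

Before writing a downloaded file, calling std::filesystem::create_directories(target.parent_path()) on the path returned by localPathFor would create the matching folder (for example images) under the save directory.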



  • I wouldn't try to parse the HTML myself. What if some JavaScript is used to load additional content?

    Instead, I'd just use QWebPage with a custom QNetworkAccessManager that simply saves all resources downloaded.


  • To download a website with all its resources, use QNAM::setCache to set a QNetworkDiskCache with your preferred directory to store the data.

    After you have downloaded the page, you can set this attribute on the QNetworkRequest to make an "offline" request that is served from the cache:
    @
    QNetworkRequest rq(QUrl("http://whatever.url"));
    rq.setAttribute(QNetworkRequest::CacheLoadControlAttribute, QNetworkRequest::AlwaysCache);
    @



  • I have the same question. I want to download all the resources of a webpage (css, js, image) by loading the page in QWebPage.

    The problem I have is that QNetworkReply is a sequential device: once QWebPage has consumed the data with read() for its own rendering, there is nothing left for my program to read (and then save to a file).

    I have seen a few posts suggesting that we should use a custom QNetworkAccessManager and a custom QNetworkReply, but I'm new to Qt and don't know exactly how to do this. I would appreciate it if you can give a little bit more information about this. If you have any sample code for this, that would be great too.

