retrieve website html source code (without javascript rendering)



  • Hi all,
    When loading a website with QtWebKit, how can I get the original HTML source code (before any JavaScript has run)? That is, the same text you get when you right-click and choose "View Page Source" in Chrome/Firefox/IE.
    QWebFrame::toHtml() returns the HTML after JavaScript has modified it.


  • Moderators

    @datasunny
    What do you mean by "JavaScript rendering"?
    You can use QNetworkAccessManager::get() to download the initial source.



  • Saving resources of loaded page is a planned feature [1].

    This feature is not hard to implement; it has been delayed only for lack of time. If you are interested, you can implement it yourself: it's a good entry point for a first-time contributor, and I'll help you get started!

    As for the QNAM::get() advice: if you need only the original HTML and you handle HTTP redirects correctly, it can work, provided the loaded page does not perform JavaScript redirects.

    [1] https://github.com/annulen/webkit/issues/105
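
For readers trying the QNAM::get() route, here is a minimal sketch of what "handling HTTP redirects correctly" could look like. It is plain C++ rather than Qt code: `Response`, `Fetch` and `fetchOriginalSource` are hypothetical names invented for this illustration, and the transport is left as a callback that a Qt program might implement on top of QNetworkAccessManager::get(). It assumes the Location header holds an absolute URL and, as noted above, it cannot see JavaScript redirects. Newer Qt 5 releases also offer QNetworkRequest::FollowRedirectsAttribute, which may make a manual loop unnecessary.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical result of a single HTTP GET (not a Qt type).
struct Response {
    int status = 0;        // HTTP status code
    std::string location;  // value of the Location header, empty if absent
    std::string body;      // raw, undecoded response body
};

// Hypothetical transport hook; a Qt program could wrap QNetworkAccessManager::get() here.
using Fetch = std::function<Response(const std::string& url)>;

// True for the status codes that ask the client to follow a Location header.
bool isRedirect(int status) {
    return status == 301 || status == 302 || status == 303 ||
           status == 307 || status == 308;
}

// Follows at most maxHops redirects starting at url.
// On success stores the final body in out and returns true.
// Returns false if the chain is too long, a redirect lacks a Location,
// or the final response is not 200.
bool fetchOriginalSource(const Fetch& fetch, std::string url,
                         std::string& out, int maxHops = 10) {
    assert(maxHops >= 0);
    for (int hop = 0; hop <= maxHops; ++hop) {
        const Response r = fetch(url);
        if (isRedirect(r.status)) {
            if (r.location.empty())
                return false;
            url = r.location;  // assumption: Location is an absolute URL
            continue;
        }
        if (r.status != 200)
            return false;
        out = r.body;
        return true;
    }
    return false;  // hop limit exceeded, possibly a redirect loop
}
```

The hop limit is there so that a page redirecting to itself fails cleanly instead of looping forever.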



  • Could you shed some light on where to start? Thanks a bunch!




  • @datasunny My original plan, using PageSerializer in a way similar to what MHTMLArchive::generateMHTMLData() does, did not work out for your case (*), because it saves modified HTML code, much like toHtml().

    However, I tried a different idea and it turned out to work. Here is a patch (the API is not set in stone, but you can see the idea and start playing with it):

    https://github.com/annulen/webkit/commit/baea600a065241d31dc56da304c10d1d3445d223

    (*) That approach allows saving a page with all its resources, such as CSS and images, optionally filtering them by MIME type.



  • You rock!




  • I got a few errors when compiling. I applied the change on top of Qt 5.5:

    /WebCoreSupport/QWebFrameAdapter.cpp
    qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
    qt/WebCoreSupport/QWebFrameAdapter.cpp:267:44: error: request for member ‘activeDocumentLoader’ in ‘((WebCore::Frame*)((const QWebFrameAdapter*)this)->QWebFrameAdapter::frame)->WebCore::Frame::loader()’, which is of pointer type ‘WebCore::FrameLoader*’ (maybe you meant to use ‘->’ ?)
    auto* documentLoader = frame->loader().activeDocumentLoader();

    So I changed to:
    auto* documentLoader = frame->loader()->activeDocumentLoader();

    Then I got:
    /WebCoreSupport/QWebFrameAdapter.cpp
    In file included from ../WTF/wtf/VectorTraits.h:26:0,
    from ../WTF/wtf/Vector.h:31,
    from ../WTF/wtf/text/StringImpl.h:31,
    from ../WTF/wtf/text/WTFString.h:29,
    from ../WebCore/loader/FormState.h:33,
    from qt/WebCoreSupport/FrameLoaderClientQt.h:33,
    from qt/WebCoreSupport/QWebFrameAdapter.h:23,
    from qt/WebCoreSupport/QWebFrameAdapter.cpp:22:
    ../WTF/wtf/RefPtr.h: In instantiation of ‘WTF::RefPtr<T>::RefPtr(const WTF::PassRefPtr<U>&) [with U = WebCore::ResourceBuffer; T = WebCore::SharedBuffer]’:
    qt/WebCoreSupport/QWebFrameAdapter.cpp:269:68: required from here
    ../WTF/wtf/RefPtr.h:99:28: error: cannot convert ‘WebCore::ResourceBuffer*’ to ‘WebCore::SharedBuffer*’ in initialization
    : m_ptr(o.leakRef())

    I then made the following changes:
    RefPtr<ResourceBuffer> buffer = documentLoader->mainResourceData();

    After that it still reports errors:

    /WebCoreSupport/QWebFrameAdapter.cpp
    qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
    qt/WebCoreSupport/QWebFrameAdapter.cpp:273:29: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
    return QByteArray(buffer->data(), buffer->size());
    ^
    In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
    ../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
    class ResourceBuffer;
    ^
    qt/WebCoreSupport/QWebFrameAdapter.cpp:273:45: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
    return QByteArray(buffer->data(), buffer->size());
    ^
    In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
    ../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
    class ResourceBuffer;

    Sorry for the newbie question.




  • This patch is for the revived QtWebKit, not the legacy version. The easiest thing you can do is pull that commit from GitHub and follow the build instructions in the wiki. Feel free to join #qtwebkit on freenode to get quicker help.



  • One more question: after I convert the data to a QString, non-ASCII characters (e.g. 'ª', 'š') are shown as '�'. I guess it's some kind of encoding issue?
    Neither of the following seems to work:
    return QString::fromUtf8(QByteArray(buffer->data(), buffer->size()));
    return QString::fromUtf8(buffer->data());

    Appreciate your insight, thanks!



  • @datasunny You are right, the buffer may have a different encoding.

    My initial idea for this API was to return a QByteArray, to avoid a useless encoding conversion for those who just need to, e.g., save it to a file. Now I think we had better have an easy API returning QString, and an advanced API returning an object with a QIODevice and properties like encoding and MIME type.
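
To illustrate why QString::fromUtf8() can produce '�' here: the raw bytes are only UTF-8 if the page says so. Below is a rough sketch, in plain C++, of guessing the charset label from the bytes themselves (byte-order mark first, then a `charset=` declaration near the top of the document). `sniffCharset` is a hypothetical helper written for this post, not part of Qt or of the patch, and it is far simpler than what a browser does. In a Qt 5 program the resulting label could probably be handed to QTextCodec::codecForName(), and QTextCodec::codecForHtml() performs a similar guess directly on a QByteArray.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Lowercases ASCII letters only; all other bytes are left untouched.
static std::string toLowerAscii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Guesses the charset label of raw HTML bytes.
// Order of checks: byte-order mark, then a "charset=" declaration within the
// first 1024 bytes, then the caller-supplied fallback.
std::string sniffCharset(const std::string& bytes,
                         const std::string& fallback = "utf-8") {
    assert(!fallback.empty());

    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "utf-8";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "utf-16le";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "utf-16be";

    const std::string head = toLowerAscii(bytes.substr(0, 1024));
    const std::string key = "charset=";
    std::string::size_type pos = head.find(key);
    if (pos == std::string::npos) return fallback;
    pos += key.size();

    // Skip an optional opening quote, as in <meta charset="...">.
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\'')) ++pos;

    // The label runs over letters, digits, '-' and '_'.
    std::string::size_type end = pos;
    while (end < head.size() &&
           (std::isalnum(static_cast<unsigned char>(head[end])) ||
            head[end] == '-' || head[end] == '_')) {
        ++end;
    }
    return end > pos ? head.substr(pos, end - pos) : fallback;
}
```

A real implementation would also consult the charset parameter of the HTTP Content-Type header, which this sketch ignores because only the byte buffer is available at this point.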

