retrieve website html source code (without javascript rendering)

    datasunny
    wrote on 22 Nov 2016, 22:42 last edited by
    #1

    Hi All,
    I'm loading a website with QtWebKit. How can I get the original HTML source (before any JavaScript has run)? I mean what you get when you right-click and choose "View Page Source" in Chrome/Firefox/IE.
    QWebFrame::toHtml() returns the HTML after JavaScript has already modified it.

      raven-worx
      Moderators
      wrote on 23 Nov 2016, 07:05 last edited by
      #2

      @datasunny
      What do you mean by "JavaScript rendering"?
      You can use QNetworkAccessManager::get() to download the initial source.


        Konstantin Tokarev
        wrote on 23 Nov 2016, 09:57 last edited by
        #3

        Saving resources of loaded page is a planned feature [1].

        This feature is not hard to implement; it has been delayed only for lack of time. If you are interested, you can implement it yourself. It's a good entry point for a first-time contributor, and I'll help you get started!

        As for the QNAM::get() advice: if you only need the original HTML and you handle HTTP redirects correctly, it can work, provided the loaded page does no JavaScript redirects.

        [1] https://github.com/annulen/webkit/issues/105
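
To make "handle HTTP redirects correctly" a bit more concrete, here is a rough sketch of the redirect loop in plain C++, with no Qt dependency. `Response`, `Fetch`, `isRedirect` and `fetchOriginalHtml` are hypothetical names invented for this illustration, and the sketch assumes the `Location` header already holds an absolute URL. With QNetworkAccessManager the redirect target would typically be read from the reply's `QNetworkRequest::RedirectionTargetAttribute` and resolved against the reply URL.

```cpp
#include <functional>
#include <optional>
#include <string>

// Hypothetical, transport-agnostic view of one HTTP response.
struct Response {
    int status = 0;
    std::string location;  // Location header, empty when absent
    std::string body;      // raw bytes exactly as received
};

// Hypothetical callback that performs ONE GET without following redirects.
using Fetch = std::function<Response(const std::string& url)>;

// True for the HTTP status codes that redirect via a Location header.
inline bool isRedirect(int status) {
    return status == 301 || status == 302 || status == 303 ||
           status == 307 || status == 308;
}

// Follows at most maxHops redirects and returns the body of the first
// 200 response, or std::nullopt on any other status, on a redirect
// without a target, or when the hop limit is exceeded.
// Assumption: Location is an absolute URL (relative ones need resolving).
inline std::optional<std::string> fetchOriginalHtml(std::string url,
                                                    const Fetch& fetch,
                                                    int maxHops = 10) {
    for (int hop = 0; hop <= maxHops; ++hop) {
        const Response r = fetch(url);
        if (r.status == 200) return r.body;
        if (!isRedirect(r.status) || r.location.empty()) return std::nullopt;
        url = r.location;  // try again at the redirect target
    }
    return std::nullopt;  // too many redirects, possibly a loop
}
```

The hop limit matters because a misconfigured server can redirect in a cycle. Note that this only covers HTTP-level redirects; a redirect done by a script in the page is invisible to this kind of loop, which is the caveat mentioned above.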

          datasunny
          wrote on 23 Nov 2016, 17:36 last edited by
          #4

          Could you shed some light on where to start? Thanks a bunch!


            Konstantin Tokarev
            wrote on 23 Nov 2016, 19:32 last edited by
            #5

            @datasunny My original plan was to use PageSerializer, similar to what MHTMLArchive::generateMHTMLData() does. It did not work out for your case (*), because it saves the modified HTML, much like toHtml().

            However, I tried a different idea and it turned out to work. Here is a patch (the API is not set in stone, but you can see the idea and start playing with it):

            https://github.com/annulen/webkit/commit/baea600a065241d31dc56da304c10d1d3445d223

            (*) That approach makes it possible to save a page with all its resources, such as CSS and images, optionally filtering them by MIME type.

              datasunny
              wrote on 23 Nov 2016, 21:02 last edited by
              #6

              You rock!


                datasunny
                wrote on 24 Nov 2016, 00:08 last edited by datasunny
                #7

                I got a few errors when compiling. I applied the change on top of Qt 5.5:

                /WebCoreSupport/QWebFrameAdapter.cpp
                qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
                qt/WebCoreSupport/QWebFrameAdapter.cpp:267:44: error: request for member ‘activeDocumentLoader’ in ‘((WebCore::Frame*)((const QWebFrameAdapter*)this)->QWebFrameAdapter::frame)->WebCore::Frame::loader()’, which is of pointer type ‘WebCore::FrameLoader*’ (maybe you meant to use ‘->’ ?)
                auto* documentLoader = frame->loader().activeDocumentLoader();

                So I changed to:
                auto* documentLoader = frame->loader()->activeDocumentLoader();

                Then I got:
                /WebCoreSupport/QWebFrameAdapter.cpp
                In file included from ../WTF/wtf/VectorTraits.h:26:0,
                from ../WTF/wtf/Vector.h:31,
                from ../WTF/wtf/text/StringImpl.h:31,
                from ../WTF/wtf/text/WTFString.h:29,
                from ../WebCore/loader/FormState.h:33,
                from qt/WebCoreSupport/FrameLoaderClientQt.h:33,
                from qt/WebCoreSupport/QWebFrameAdapter.h:23,
                from qt/WebCoreSupport/QWebFrameAdapter.cpp:22:
                ../WTF/wtf/RefPtr.h: In instantiation of ‘WTF::RefPtr<T>::RefPtr(const WTF::PassRefPtr<U>&) [with U = WebCore::ResourceBuffer; T = WebCore::SharedBuffer]’:
                qt/WebCoreSupport/QWebFrameAdapter.cpp:269:68: required from here
                ../WTF/wtf/RefPtr.h:99:28: error: cannot convert ‘WebCore::ResourceBuffer*’ to ‘WebCore::SharedBuffer*’ in initialization
                : m_ptr(o.leakRef())

                I then made the following changes:
                RefPtr<ResourceBuffer> buffer = documentLoader->mainResourceData();

                After that it still reports errors:

                /WebCoreSupport/QWebFrameAdapter.cpp
                qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
                qt/WebCoreSupport/QWebFrameAdapter.cpp:273:29: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
                return QByteArray(buffer->data(), buffer->size());
                ^
                In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
                ../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
                class ResourceBuffer;
                ^
                qt/WebCoreSupport/QWebFrameAdapter.cpp:273:45: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
                return QByteArray(buffer->data(), buffer->size());
                ^
                In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
                ../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
                class ResourceBuffer;

                Sorry for the newbie question.


                  Konstantin Tokarev
                  wrote on 24 Nov 2016, 09:46 last edited by
                  #8

                  This patch is for the revived QtWebKit, not the legacy version. The easiest thing you can do is pull that commit from GitHub and follow the build instructions in the wiki. Feel free to join #qtwebkit on freenode to get quicker help.

                    datasunny
                    wrote on 13 Jan 2017, 18:24 last edited by
                    #9


                    One more question: after I convert the data to QString, non-ASCII characters (e.g. 'ª', 'š') are shown as '�'. I guess it's some kind of encoding issue?
                    Neither of the following seems to work:
                    return QString::fromUtf8(QByteArray(buffer->data(), buffer->size()));
                    return QString::fromUtf8(buffer->data());

                    Appreciate your insight, thanks!

                      Konstantin Tokarev
                      wrote on 13 Jan 2017, 18:36 last edited by
                      #10

                      @datasunny You are right, the buffer may be in a different encoding.

                      My initial idea for this API was to return a QByteArray, to avoid a needless encoding conversion for those who just need to, e.g., save it to a file. Now I think it would be better to have an easy API returning a QString, and an advanced API returning an object with a QIODevice and properties like encoding and MIME type.
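
As a rough illustration of why `fromUtf8()` yields '�' here: the raw bytes are only valid UTF-8 if the page was actually served as UTF-8, so the charset has to be determined before decoding. The sketch below is plain C++ with two hypothetical helpers, `sniffCharset` and `latin1ToUtf8`, made up for this example. It only looks for a `charset=` declaration near the start of the markup and only knows how to decode ISO-8859-1. Other encodings (for example windows-1252 or ISO-8859-2, where 'š' lives) need a real codec; in Qt 5 that is probably a job for `QTextCodec::codecForHtml()` followed by `toUnicode()`, and the HTTP `Content-Type` header, when available, should take priority over what the markup says.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the lower-cased charset named in the first `limit` bytes of
// `html` (from <meta charset="..."> or a Content-Type meta tag),
// or an empty string if no declaration is found.
inline std::string sniffCharset(const std::string& html,
                                std::size_t limit = 1024) {
    std::string head = html.substr(0, limit);
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    const std::string key = "charset=";
    std::size_t pos = head.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    // Skip an optional opening quote around the charset name.
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\'')) ++pos;
    std::size_t end = pos;
    while (end < head.size() &&
           (std::isalnum(static_cast<unsigned char>(head[end])) ||
            head[end] == '-' || head[end] == '_')) {
        ++end;
    }
    return head.substr(pos, end - pos);
}

// Converts ISO-8859-1 bytes to UTF-8. In ISO-8859-1 every byte value is
// the Unicode code point itself, so bytes >= 0x80 become two UTF-8 bytes.
inline std::string latin1ToUtf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

A caller would check `sniffCharset(bytes)` first and only pass the buffer to `fromUtf8()` when the result is "utf-8" (or empty and the bytes validate as UTF-8). This is also an argument for the "advanced API" idea above: the raw bytes alone are not enough, the encoding has to travel with them.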
