retrieve website html source code (without javascript rendering)

datasunny · 23 Nov 2016, 07:05

Hi All,
When loading a website with QtWebKit, how can I get the original HTML source code (before any JavaScript has run)? That is, what you get when you right-click and choose "View Page Source" in Chrome/Firefox/IE?
QWebFrame::toHtml() returns the HTML after JavaScript has already modified it.

raven-worx · wrote on 23 Nov 2016, 07:05

@datasunny
What do you mean by "JavaScript rendering"?
You can use QNetworkAccessManager::get() to download the initial source.

Konstantin Tokarev · 23 Nov 2016, 17:36

Saving resources of loaded page is a planned feature [1].

This feature is not hard to implement; it has been delayed only for lack of time. If you are interested, you can implement it yourself: it's a good entry point for a first-time contributor, and I'll help you get started!

As for the QNAM::get() advice: if you only need the original HTML and you handle HTTP redirects correctly, it can work, provided there are no JavaScript redirects in the loaded page.

[1] https://github.com/annulen/webkit/issues/105

datasunny · 23 Nov 2016, 19:32

Could you shed some light on where to start? Thanks a bunch!

Konstantin Tokarev · 23 Nov 2016, 21:02

@datasunny My original plan of using PageSerializer, similarly to what MHTMLArchive::generateMHTMLData() does, did not work out for your case (*), as it saves modified HTML code in a manner similar to toHtml().

However, I tried a different idea and it turns out to work. Here is a patch (the API is not set in stone, but you can see the idea and start playing with it):

https://github.com/annulen/webkit/commit/baea600a065241d31dc56da304c10d1d3445d223

(*) That approach allows saving a page with all its resources, such as CSS and images, optionally filtering them by MIME type.

datasunny · 24 Nov 2016, 00:08

You rock!

datasunny · wrote on 24 Nov 2016, 00:08

I got a few errors when compiling; I applied the change on top of Qt 5.5:

/WebCoreSupport/QWebFrameAdapter.cpp
qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:267:44: error: request for member ‘activeDocumentLoader’ in ‘((WebCore::Frame*)((const QWebFrameAdapter*)this)->QWebFrameAdapter::frame)->WebCore::Frame::loader()’, which is of pointer type ‘WebCore::FrameLoader*’ (maybe you meant to use ‘->’ ?)
auto* documentLoader = frame->loader().activeDocumentLoader();

So I changed it to:
auto* documentLoader = frame->loader()->activeDocumentLoader();

Then I got:
/WebCoreSupport/QWebFrameAdapter.cpp
In file included from ../WTF/wtf/VectorTraits.h:26:0,
from ../WTF/wtf/Vector.h:31,
from ../WTF/wtf/text/StringImpl.h:31,
from ../WTF/wtf/text/WTFString.h:29,
from ../WebCore/loader/FormState.h:33,
from qt/WebCoreSupport/FrameLoaderClientQt.h:33,
from qt/WebCoreSupport/QWebFrameAdapter.h:23,
from qt/WebCoreSupport/QWebFrameAdapter.cpp:22:
../WTF/wtf/RefPtr.h: In instantiation of ‘WTF::RefPtr<T>::RefPtr(const WTF::PassRefPtr<U>&) [with U = WebCore::ResourceBuffer; T = WebCore::SharedBuffer]’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:269:68: required from here
../WTF/wtf/RefPtr.h:99:28: error: cannot convert ‘WebCore::ResourceBuffer*’ to ‘WebCore::SharedBuffer*’ in initialization
: m_ptr(o.leakRef())

I then made the following changes:
RefPtr<ResourceBuffer> buffer = documentLoader->mainResourceData();

After that it still reports errors:

/WebCoreSupport/QWebFrameAdapter.cpp
qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:273:29: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
return QByteArray(buffer->data(), buffer->size());
^
In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
class ResourceBuffer;
^
qt/WebCoreSupport/QWebFrameAdapter.cpp:273:45: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
return QByteArray(buffer->data(), buffer->size());
^
In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
class ResourceBuffer;

Sorry for the newbie question.
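
The last error ("invalid use of incomplete type") is a general C++ situation: DocumentLoader.h only forward-declares ResourceBuffer, which is enough to pass pointers around but not to call members. A minimal stand-alone sketch of the same situation follows; `Buffer`, `currentBuffer` and `copyBytes` are hypothetical stand-ins, not WebKit code. In the real tree the fix would presumably be to include the header that defines WebCore::ResourceBuffer.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Stand-in for a header that only forward-declares the type.
class Buffer;                   // incomplete type from here on
const Buffer* currentBuffer();  // fine: pointers to incomplete types are allowed

// Calling buffer->size() at this point would fail with
// "invalid use of incomplete type 'class Buffer'".

// Stand-in for including the header that contains the full definition.
class Buffer {
public:
    explicit Buffer(std::string bytes) : m_bytes(std::move(bytes)) {}
    const char* data() const { return m_bytes.data(); }
    std::size_t size() const { return m_bytes.size(); }
private:
    std::string m_bytes;
};

const Buffer* currentBuffer() {
    static const Buffer buffer("<html></html>");
    return &buffer;
}

// The type is complete now, so member access compiles.
std::string copyBytes(const Buffer* buffer) {
    return buffer ? std::string(buffer->data(), buffer->size())
                  : std::string();
}
```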

Konstantin Tokarev · 13 Jan 2017, 18:24

This patch is for the revived QtWebKit, not the legacy version. The easiest thing you can do is pull that commit from GitHub and follow the build instructions in the wiki. Feel free to join #qtwebkit on freenode to get quicker help.

datasunny · 13 Jan 2017, 18:36

One more question: after I convert the data to QString, non-ASCII characters (e.g. 'ª', 'š') are shown as '�'. I guess it's some kind of encoding issue?
None of the following seems to work:
return QString::fromUtf8(QByteArray(buffer->data(), buffer->size()));
return QString::fromUtf8(buffer->data());

Appreciate your insight, thanks!
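
A small sketch of why decoding as UTF-8 can produce '�' (U+FFFD, the replacement character): a byte such as 0xAA ('ª' in ISO-8859-1) is not a valid UTF-8 sequence on its own. The helpers `looksLikeUtf8` and `latin1ToUtf8` are hypothetical names, and ISO-8859-1 is only an illustrative assumption; the thread does not say which charset the page actually used, and 'š' is not part of ISO-8859-1 at all. In Qt, QTextCodec is the usual tool for other charsets.

```cpp
#include <cstddef>
#include <string>

// Structural UTF-8 check: every lead byte must be followed by the right
// number of continuation bytes. It does not reject overlong encodings
// or surrogates, so it is a heuristic, not a full validator.
bool looksLikeUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (i + extra >= bytes.size())
            return false;   // sequence is cut off at the end of the buffer
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char next = static_cast<unsigned char>(bytes[i + k]);
            if ((next & 0xC0) != 0x80)
                return false;
        }
        i += extra + 1;
    }
    return true;
}

// Transcodes ISO-8859-1 to UTF-8. Each Latin-1 byte equals its Unicode
// code point, so bytes >= 0x80 become two-byte UTF-8 sequences.
std::string latin1ToUtf8(const std::string& bytes) {
    std::string out;
    for (char ch : bytes) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += ch;
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```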

Konstantin Tokarev · wrote on 13 Jan 2017, 18:36

@datasunny You are right, the buffer may be in a different encoding.

My initial idea for this API was to return QByteArray, to avoid a useless encoding conversion for those who just need, e.g., to save it to a file. Now I think we would be better off with an easy API returning QString, and an advanced API returning an object with a QIODevice and properties like encoding and MIME type.
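
To make the "advanced" idea concrete, here is a rough sketch of a result object that keeps the raw bytes together with MIME type and encoding, filled from a Content-Type header value. This is purely illustrative and written in plain C++: `MainResource`, `trimmedLower` and `describeResource` are hypothetical names, not the QtWebKit API (which, as said above, was not settled), and a real implementation would also have to consider charset declarations inside the HTML itself.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result object: raw bytes plus metadata.
struct MainResource {
    std::string bytes;     // unmodified data as received
    std::string mimeType;  // e.g. "text/html"
    std::string encoding;  // e.g. "windows-1252", empty if not declared
};

// Strips surrounding whitespace and lowercases ASCII letters.
std::string trimmedLower(const std::string& s) {
    std::size_t b = 0, e = s.size();
    while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
    while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
    std::string out = s.substr(b, e - b);
    std::transform(out.begin(), out.end(), out.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return out;
}

// Splits a Content-Type value such as "text/html; charset=UTF-8" into
// MIME type and charset parameter. Other parameters are ignored.
MainResource describeResource(const std::string& bytes,
                              const std::string& contentType) {
    MainResource r;
    r.bytes = bytes;
    std::size_t semi = contentType.find(';');
    r.mimeType = trimmedLower(contentType.substr(0, semi));
    while (semi != std::string::npos) {
        const std::size_t next = contentType.find(';', semi + 1);
        const std::string param = contentType.substr(
            semi + 1,
            next == std::string::npos ? std::string::npos : next - semi - 1);
        const std::size_t eq = param.find('=');
        if (eq != std::string::npos &&
            trimmedLower(param.substr(0, eq)) == "charset") {
            std::string value = trimmedLower(param.substr(eq + 1));
            // The value may be a quoted string.
            if (value.size() >= 2 && value.front() == '"' && value.back() == '"')
                value = value.substr(1, value.size() - 2);
            r.encoding = value;
        }
        semi = next;
    }
    return r;
}
```

A caller that only wants to save the page can write `bytes` to a file untouched, while a caller that wants text can pick a decoder based on `encoding`.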