Retrieve website HTML source code (without JavaScript rendering)

datasunny

Hi all,
When loading a website with WebKit, how can I get the original HTML source code (as it was before JavaScript ran and modified the page)? In other words, what you get when you right-click and choose "View Page Source" in Chrome/Firefox/IE.
QWebFrame::toHtml() returns the HTML after JavaScript has already modified it.

raven-worx

@datasunny
What do you mean by "JavaScript rendering"?
You can use QNetworkAccessManager::get() to download the initial source.

Konstantin Tokarev

Saving the resources of a loaded page is a planned feature [1].

This feature is not hard to implement; it has been delayed only for lack of time. If you are interested, you can implement it yourself. It's a good entry point for a first-time contributor, and I'll help you get started!

As for the QNAM::get() advice: if you need only the original HTML and you handle HTTP redirects correctly, it can work, provided there are no JavaScript redirects in the loaded page.

[1] https://github.com/annulen/webkit/issues/105
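
To make "handle HTTP redirects correctly" a bit more concrete, here is a minimal sketch of the redirect-following loop in plain C++, without Qt. The `Response` struct, the `Fetch` callback, and `fetchOriginalHtml()` are hypothetical names invented for this illustration; they are not part of Qt or QtWebKit. The sketch assumes the Location header holds an absolute URL (relative targets would need to be resolved against the current URL). With QNetworkAccessManager, the redirect target is available from the reply through QNetworkRequest::RedirectionTargetAttribute. Redirects performed by JavaScript are invisible to a loop like this, which is the limitation mentioned above.

```cpp
#include <functional>
#include <string>

// Hypothetical, simplified response type (not a Qt class).
struct Response {
    int status;            // HTTP status code
    std::string location;  // value of the Location header, empty if absent
    std::string body;      // raw, unmodified response body
};

// Hypothetical callback: performs ONE GET and does not follow redirects.
using Fetch = std::function<Response(const std::string& url)>;

// True for the HTTP status codes that point to a Location target.
bool isRedirect(int status) {
    return status == 301 || status == 302 || status == 303 ||
           status == 307 || status == 308;
}

// Follows HTTP redirects (at most maxHops of them) and stores the body of
// the final 2xx response in `html`. Returns false on an error status, a
// redirect without a Location header, or too many redirects (e.g. a loop).
// Assumption: Location is an absolute URL.
bool fetchOriginalHtml(const Fetch& fetch, std::string url,
                       std::string& html, int maxHops = 10) {
    for (int hop = 0; hop <= maxHops; ++hop) {
        const Response r = fetch(url);
        if (isRedirect(r.status)) {
            if (r.location.empty())
                return false;   // malformed redirect
            url = r.location;   // try the new target on the next iteration
            continue;
        }
        if (r.status >= 200 && r.status < 300) {
            html = r.body;      // this is the untouched page source
            return true;
        }
        return false;           // 4xx/5xx or anything unexpected
    }
    return false;               // redirect limit exceeded
}
```

In a Qt application the `Fetch` callback would wrap a QNetworkAccessManager::get() call and wait for the reply to finish; the hop limit guards against redirect loops.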

datasunny

Could you shed some light on where to start? Thanks a bunch!

Konstantin Tokarev

@datasunny My original plan, to use PageSerializer similarly to what MHTMLArchive::generateMHTMLData() does, did not work out for your case (*), because it saves modified HTML code in a manner similar to toHtml().

However, I tried a different idea and it turned out to work. Here is a patch (the API is not set in stone, but you can see the idea and start playing with it):

https://github.com/annulen/webkit/commit/baea600a065241d31dc56da304c10d1d3445d223

(*) That approach allows saving a page with all its resources, such as CSS and images, optionally filtering them by MIME type.

datasunny

You rock!

datasunny

I got a few errors when compiling. I applied the change on top of Qt 5.5:

/WebCoreSupport/QWebFrameAdapter.cpp
qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:267:44: error: request for member ‘activeDocumentLoader’ in ‘((WebCore::Frame*)((const QWebFrameAdapter*)this)->QWebFrameAdapter::frame)->WebCore::Frame::loader()’, which is of pointer type ‘WebCore::FrameLoader*’ (maybe you meant to use ‘->’ ?)
auto* documentLoader = frame->loader().activeDocumentLoader();

So I changed it to:
auto* documentLoader = frame->loader()->activeDocumentLoader();

Then I got:
/WebCoreSupport/QWebFrameAdapter.cpp
In file included from ../WTF/wtf/VectorTraits.h:26:0,
from ../WTF/wtf/Vector.h:31,
from ../WTF/wtf/text/StringImpl.h:31,
from ../WTF/wtf/text/WTFString.h:29,
from ../WebCore/loader/FormState.h:33,
from qt/WebCoreSupport/FrameLoaderClientQt.h:33,
from qt/WebCoreSupport/QWebFrameAdapter.h:23,
from qt/WebCoreSupport/QWebFrameAdapter.cpp:22:
../WTF/wtf/RefPtr.h: In instantiation of ‘WTF::RefPtr<T>::RefPtr(const WTF::PassRefPtr<U>&) [with U = WebCore::ResourceBuffer; T = WebCore::SharedBuffer]’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:269:68: required from here
../WTF/wtf/RefPtr.h:99:28: error: cannot convert ‘WebCore::ResourceBuffer*’ to ‘WebCore::SharedBuffer*’ in initialization
: m_ptr(o.leakRef())

I then made the following changes:
RefPtr<ResourceBuffer> buffer = documentLoader->mainResourceData();

After that it still reports errors:

/WebCoreSupport/QWebFrameAdapter.cpp
qt/WebCoreSupport/QWebFrameAdapter.cpp: In member function ‘QByteArray QWebFrameAdapter::mainResourceData() const’:
qt/WebCoreSupport/QWebFrameAdapter.cpp:273:29: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
return QByteArray(buffer->data(), buffer->size());
^
In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
class ResourceBuffer;
^
qt/WebCoreSupport/QWebFrameAdapter.cpp:273:45: error: invalid use of incomplete type ‘class WebCore::ResourceBuffer’
return QByteArray(buffer->data(), buffer->size());
^
In file included from qt/WebCoreSupport/QWebFrameAdapter.cpp:27:0:
../WebCore/loader/DocumentLoader.h:72:11: note: forward declaration of ‘class WebCore::ResourceBuffer’
class ResourceBuffer;

Sorry for the newbie question.

Konstantin Tokarev

This patch is for the revived QtWebKit, not the legacy version. The easiest thing you can do is pull that commit from GitHub and follow the build instructions in the wiki. Feel free to join #qtwebkit on freenode to get more immediate help.

datasunny

One more question: after I convert the data to QString, non-ASCII characters (e.g. 'ª', 'š') are shown as '�'. I guess it's some kind of encoding issue?
Neither of the following seems to work:
return QString::fromUtf8(QByteArray(buffer->data(), buffer->size()));
return QString::fromUtf8(buffer->data());

Appreciate your insight, thanks!

Konstantin Tokarev

@datasunny You are right, the buffer may be in a different encoding.

My initial idea for this API was to return a QByteArray, to avoid a useless encoding conversion for those who just need, e.g., to save it to a file. Now I think we would be better off with an easy API returning a QString, and an advanced API returning an object with a QIODevice and properties like encoding and MIME type.
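
As an illustration of why QString::fromUtf8() can produce '�' here, below is a rough sketch in plain C++ (no Qt) that checks whether raw page bytes are structurally valid UTF-8 and tries to guess the declared charset from a byte order mark or a `charset=` declaration near the top of the document. `looksLikeUtf8()` and `sniffCharset()` are hypothetical helper names made up for this example, and the heuristics are deliberately simplistic (the HTTP Content-Type header, which usually takes priority, is not considered). In Qt 5, the actual conversion would then typically go through QTextCodec, for example QTextCodec::codecForName() or QTextCodec::codecForHtml().

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Structural UTF-8 check: lead bytes must be followed by the right number
// of continuation bytes (10xxxxxx). Overlong forms are NOT rejected.
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80) extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;  // stray continuation byte or invalid lead byte
        if (extra > s.size() - i - 1) return false;  // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}

// Returns a lowercase charset label guessed from the raw bytes, or an
// empty string if neither a BOM nor a "charset=" declaration is found.
std::string sniffCharset(const std::string& bytes) {
    // 1. Byte order marks.
    if (bytes.compare(0, 3, "\xEF\xBB\xBF") == 0) return "utf-8";
    if (bytes.compare(0, 2, "\xFF\xFE") == 0) return "utf-16le";
    if (bytes.compare(0, 2, "\xFE\xFF") == 0) return "utf-16be";

    // 2. Look for charset=... in the first 1024 bytes, case-insensitively.
    std::string head = bytes.substr(0, 1024);
    std::transform(head.begin(), head.end(), head.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string key = "charset=";
    std::size_t pos = head.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();
    if (pos < head.size() && (head[pos] == '"' || head[pos] == '\'')) ++pos;
    std::size_t end = pos;
    while (end < head.size() &&
           (std::isalnum(static_cast<unsigned char>(head[end])) ||
            head[end] == '-' || head[end] == '_'))
        ++end;
    return head.substr(pos, end - pos);
}
```

For instance, in Windows-1252 the character 'š' is the single byte 0x9A, which is not valid UTF-8 on its own, so a UTF-8 decoder has to substitute the replacement character. Returning the raw bytes together with the detected encoding leaves the choice of decoder to the caller, which matches the "advanced API" idea described above.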
