Can't load the full HTML code of a page
-
If we look at the plain source code of a website, we will see that all ads, or most of them (Flash, Google, others), are inserted as JavaScript code. But if you look at the code in, for example, Firebug for Firefox, you will see that the JavaScript has been replaced with the HTML code of the ad.
I want to load and parse this "full" HTML, and I believed that Qt WebKit could do that. I tried it this way:
@
#include <QFile>
#include <QTextStream>
#include <QWebFrame>
#include <QWebPage>
#include <QWebSettings>

PageLoader::PageLoader(const QUrl &url)
{
    mWebPage = new QWebPage();
    // Run scripts, but skip plugins, images and pop-up windows.
    mWebPage->settings()->setAttribute(QWebSettings::JavascriptEnabled, true);
    mWebPage->settings()->setAttribute(QWebSettings::PluginsEnabled, false);
    mWebPage->settings()->setAttribute(QWebSettings::AutoLoadImages, false);
    mWebPage->settings()->setAttribute(QWebSettings::JavascriptCanOpenWindows, false);
    connect(mWebPage->mainFrame(), SIGNAL(loadFinished(bool)), this, SLOT(processPage()));
    mWebPage->currentFrame()->load(url);
}

void PageLoader::processPage()
{
    // Serialize the frame's document and dump it to a file.
    QWebFrame *frame = mWebPage->currentFrame();
    QString webHtml = frame->toHtml();
    QFile file("/home/ostap/output.txt");
    // Only write if the file could actually be opened.
    if (file.open(QIODevice::WriteOnly | QIODevice::Text)) {
        QTextStream out(&file);
        out << webHtml;
    }
    emit finished();
}
@
But the output file contains only the plain HTML, with links to *.js files in script tags.
What am I doing wrong?
Sorry for my terrible English...
-
You download and display only the HTML of the file that you request from the server. All other files are only linked from that HTML and are stored somewhere in RAM or in temporary files.
You'll have to parse the HTML manually to find the other files and download them in the same way.
-
But when I tried to render the loaded page from mWebPage:
@
void PageLoader::render()
{
    // Make the viewport as large as the page so the whole page is painted.
    mWebPage->setViewportSize(mWebPage->mainFrame()->contentsSize());
    QImage image(mWebPage->viewportSize(), QImage::Format_ARGB32);
    // A freshly constructed QImage holds uninitialized pixels; clear it first.
    image.fill(Qt::transparent);
    QPainter painter(&image);
    mWebPage->mainFrame()->render(&painter);
    painter.end();
    QImage thumbnail = image.scaled(400, 400);
    thumbnail.save("thumbnail.png");
    emit finished();
}
@
The resulting thumbnail.png shows the normal, full view of the web page. I think this means that the QWebPage object holds this full version of the HTML somewhere, with the JavaScript executed, and is able to render it.
-
It definitely has all the JS, image, CSS, etc. files somewhere (in RAM or in temporary internet files). But when I looked through the docs, I didn't find any useful functions to access that data.
So you'll probably have to write your own program that pulls those files out, downloads them, and changes the links between them to match the copies on your hard drive.
What you get is the entire HTML file produced by the server. WebKit then fetches all the other files (JS, CSS, images) that are linked from your "main" file in the same way (note: files can include other files, and so on, recursively).
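A rough sketch of the parsing step described above, in plain standard C++ rather than Qt. It assumes that the `src` attributes are quoted and that a regular expression is good enough for the pages in question; a real HTML parser would be more robust. The names `extractScriptSources` and `localFileName` are made up for this illustration and are not part of any library.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects the value of every quoted src attribute found on a <script> tag.
// Assumption: a regex is only an approximation of real HTML parsing.
std::vector<std::string> extractScriptSources(const std::string &html)
{
    static const std::regex scriptSrc(
        R"re(<script\b[^>]*\bsrc\s*=\s*["']([^"']+)["'])re",
        std::regex::icase);

    std::vector<std::string> sources;
    for (std::sregex_iterator it(html.begin(), html.end(), scriptSrc), end;
         it != end; ++it) {
        sources.push_back((*it)[1].str());
    }
    return sources;
}

// Derives a local file name from a script URL: the text after the last '/',
// with any query string or fragment removed.
std::string localFileName(const std::string &url)
{
    const std::string path = url.substr(0, url.find_first_of("?#"));
    const std::string::size_type slash = path.find_last_of('/');
    return slash == std::string::npos ? path : path.substr(slash + 1);
}
```

You would feed the saved HTML string to `extractScriptSources`, download each returned URL, and use `localFileName` to decide where to store it and what to put into the rewritten link. Inline scripts without a `src` attribute are skipped.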
Regards,
Jake
-
Thank you! I'll try that:)