Parsing page
-
Hello
I would like to load a website and parse it like an XML document, creating a tree structure. The page can be required to conform to a standard such as XHTML 1.0 Strict. I've tried the following:
@
ui->webView->load(QUrl("http://validator.w3.org/"));
QDomDocument xmlDoc;
if (!xmlDoc.setContent(ui->webView->page()->currentFrame()->toHtml()))
{
    qDebug() << "ERROR";
}
else
{
    qDebug() << "OK";
}
@
But although http://validator.w3.org/ is valid XHTML, it still won't parse as an XML document. Are there any good alternatives?
Thanks
Richard -
"QWebElement":https://qt-project.org/doc/qt-4.8/qwebelement.html will be your friend.
To get the 'root' use @webView->page()->mainFrame()->documentElement();@
Then you can walk through it using methods like QWebElement::firstChild(), QWebElement::lastChild(), QWebElement::nextSibling(), ...
I used something like the following to find the element that has focus. You can adapt it for your use case.
@
// Depth-first search for the element that has focus.
// For example, hand over the root element as the parameter.
QWebElement WebViewDerivedClass::findElementWithFocus(const QWebElement& a_element)
{
    // Visit each direct child in document order.
    for (QWebElement child = a_element.firstChild(); !child.isNull(); child = child.nextSibling())
    {
        if (child.hasFocus())
        {
            return child;
        }

        // Descend into the subtree before moving on to the next sibling.
        if (!child.firstChild().isNull())
        {
            QWebElement found = findElementWithFocus(child);
            if (!found.isNull())
            {
                return found;
            }
        }
    }
    return QWebElement(); // null element: nothing below a_element has focus
}
@ -
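For anyone adapting the function above to a different condition, here is a minimal, Qt-free sketch of the same first-child / next-sibling walk, generalized to take a predicate. Node and findFirst are hypothetical names, not Qt API; the firstChild and nextSibling members only mimic QWebElement::firstChild() and QWebElement::nextSibling() so that the snippet compiles without Qt.

```cpp
#include <functional>
#include <string>

// Hypothetical stand-in for QWebElement (an assumption for illustration):
// each node points to its first child and to its next sibling.
struct Node {
    std::string tag;
    bool focused = false;
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
};

// Depth-first search below `parent` (the parent itself is not tested).
// Returns the first descendant, in document order, for which `pred`
// returns true, or nullptr if there is none.
const Node* findFirst(const Node& parent,
                      const std::function<bool(const Node&)>& pred)
{
    for (const Node* child = parent.firstChild; child != nullptr;
         child = child->nextSibling) {
        if (pred(*child)) {
            return child;
        }
        // Search the child's subtree before moving to its next sibling.
        if (const Node* hit = findFirst(*child, pred)) {
            return hit;
        }
    }
    return nullptr;
}
```

With QWebElement the loop condition would presumably be !child.isNull() instead of a pointer check, and the predicate could test hasFocus() or compare tagName().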
Great. Still, I'm wondering why WebKit modifies the source code. When I read the source from a file and print it to the debug output, it looks OK. But as soon as I set it on the web view and read it back from there (toHtml()), all the "/>" are changed to ">", so the elements lack their closing part... Can I prevent this? It's a problem because, e.g., meta tags are normally self-closed.
Thanks
Richard