Parsing page

ThaRez · wrote on 27 Mar 2012, 10:04

Hello
I would like to load a website and parse it like an XML document, creating a tree structure. If necessary, the page can be required to meet a standard such as XHTML 1.0 Strict. I've tried the following:

@ui->webView->load(QUrl("http://validator.w3.org/"));
QDomDocument xmlDoc;
if (!xmlDoc.setContent(ui->webView->page()->currentFrame()->toHtml()))
{
    qDebug("ERROR");
}
else
{
    qDebug() << "OK";
}@

But although http://validator.w3.org/ is valid XHTML, it still won't parse as an XML file. Are there any good alternatives?
Thanks
Richard

KA51O · wrote on 27 Mar 2012, 11:13

"QWebElement":https://qt-project.org/doc/qt-4.8/qwebelement.html will be your friend.
To get the 'root' use @webView->page()->mainFrame()->documentElement();@
Then you can walk through it using methods like QWebElement::firstChild(), QWebElement::lastChild(), QWebElement::nextSibling(), ...

I used something like the following to find the element that has focus. You can adapt it for your use case.
@
// Depth-first search for the element that currently has focus.
// For example, hand over the root element as the parameter.
QWebElement WebViewDerivedClass::findElementWithFocus(const QWebElement& a_element)
{
    for (QWebElement child = a_element.firstChild(); !child.isNull(); child = child.nextSibling())
    {
        if (child.hasFocus())
        {
            return child;
        }
        // Search the subtree below this child; a null result means no match there.
        QWebElement found = findElementWithFocus(child);
        if (!found.isNull())
        {
            return found;
        }
    }
    // Null element: nothing below a_element has focus.
    return QWebElement();
}
@
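
The same firstChild()/nextSibling() pattern can also visit every element instead of stopping at the first match, which is closer to the tree walk asked for in the first post. Below is a minimal sketch, not code from this thread: `walkElements`, `FakeNode` and `FakeElement` are hypothetical names, and `FakeElement` only stands in for QWebElement so that the snippet compiles without Qt. The template itself calls nothing but isNull(), firstChild() and nextSibling(), so it should accept a QWebElement unchanged.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory tree used only to demonstrate the traversal.
struct FakeNode {
    std::string tag;
    std::vector<FakeNode> children;
};

// Hypothetical stand-in for QWebElement: it mimics isNull(), firstChild(),
// nextSibling() and tagName() and nothing else.
class FakeElement {
public:
    FakeElement() : m_siblings(nullptr), m_index(0) {}
    FakeElement(const std::vector<FakeNode>* siblings, std::size_t index)
        : m_siblings(siblings), m_index(index) {}

    bool isNull() const { return m_siblings == nullptr; }

    std::string tagName() const
    {
        return isNull() ? std::string() : (*m_siblings)[m_index].tag;
    }

    FakeElement firstChild() const
    {
        if (isNull() || (*m_siblings)[m_index].children.empty())
            return FakeElement();
        return FakeElement(&(*m_siblings)[m_index].children, 0);
    }

    FakeElement nextSibling() const
    {
        if (isNull() || m_index + 1 >= m_siblings->size())
            return FakeElement();
        return FakeElement(m_siblings, m_index + 1);
    }

private:
    const std::vector<FakeNode>* m_siblings; // list this element lives in
    std::size_t m_index;                     // position within that list
};

// Pre-order walk: calls visit(element, depth) for the element itself and
// then for every descendant, children in document order.
template <typename Element, typename Visitor>
void walkElements(const Element& element, const Visitor& visit, int depth = 0)
{
    if (element.isNull())
        return;
    visit(element, depth);
    for (Element child = element.firstChild(); !child.isNull(); child = child.nextSibling())
        walkElements(child, visit, depth + 1);
}
```

With QtWebKit the call would presumably be walkElements(webView->page()->mainFrame()->documentElement(), visitor), where the visitor is any callable taking (const QWebElement&, int) and the call is made once the page has finished loading.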

ThaRez · wrote on 27 Mar 2012, 12:11

Great. Still, I'm wondering why WebKit modifies the source code. When I read the source from a file and print it to the debug output, it looks OK. But as soon as I set it on the web view and read it back from there (toHtml), all the "/>" are changed to ">", so the elements are no longer closed. Can I prevent this? It's a problem because, for example, meta tags are normally closed directly. Thanks
Richard

KA51O · wrote on 28 Mar 2012, 06:32

Yeah, I noticed that too. I don't know how to stop QtWebKit from doing so. You could try using QDomDocument again to parse your HTML.
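
The missing "/>" is presumably a result of toHtml() writing the parsed page back out in HTML syntax, where void elements such as meta or br carry no closing slash. One possible workaround, which nobody in this thread proposed or tested, is to put the slash back before handing the text to an XML parser. The sketch below is plain C++ without Qt; `selfCloseVoidElements` and `isVoidElement` are hypothetical helper names. It assumes the markup is otherwise tidy and does not treat script or style content specially, so it is not guaranteed that QDomDocument::setContent() accepts the result.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// True if `name` (already lower-case) is an HTML void element, i.e. one that
// never has an end tag.
static bool isVoidElement(const std::string& name)
{
    static const std::set<std::string> kVoid = {
        "area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"
    };
    return kVoid.count(name) != 0;
}

// Rewrites "<meta ...>" as "<meta ... />" for every void element that is not
// already self-closed. Comments are copied verbatim. Assumption: '<' inside
// script or style content is rare enough to ignore in this sketch.
std::string selfCloseVoidElements(const std::string& html)
{
    std::string out;
    out.reserve(html.size() + 16);
    std::size_t i = 0;
    while (i < html.size()) {
        if (html[i] != '<') {
            out += html[i++];
            continue;
        }
        // Copy comments untouched so their content is never treated as a tag.
        if (html.compare(i, 4, "<!--") == 0) {
            std::size_t close = html.find("-->", i + 4);
            std::size_t stop = (close == std::string::npos) ? html.size() : close + 3;
            out.append(html, i, stop - i);
            i = stop;
            continue;
        }
        // Read the tag name; it stays empty for end tags and the doctype.
        std::size_t j = i + 1;
        std::string name;
        while (j < html.size() && std::isalnum(static_cast<unsigned char>(html[j]))) {
            name += static_cast<char>(std::tolower(static_cast<unsigned char>(html[j])));
            ++j;
        }
        // Find the closing '>' while skipping quoted attribute values.
        char quote = 0;
        std::size_t end = j;
        for (; end < html.size(); ++end) {
            char c = html[end];
            if (quote != 0) {
                if (c == quote) quote = 0;
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                break;
            }
        }
        if (end == html.size()) {
            // Unterminated tag: copy the rest unchanged and stop.
            out.append(html, i, std::string::npos);
            break;
        }
        bool alreadyClosed = html[end - 1] == '/';
        out.append(html, i, end - i);
        if (isVoidElement(name) && !alreadyClosed)
            out += " /";
        out += '>';
        i = end + 1;
    }
    return out;
}
```

In a Qt program the input would be the string returned by toHtml(), converted to and from std::string. Even with the slashes restored, other differences between HTML and XML, such as named entities, may still make the XML parser reject the text.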