
Parsing page



  • Hello
    I would like to load a website and parse it like an XML document, creating a tree structure. The page can be assumed to follow a standard such as XHTML 1.0 Strict. I've tried the following:

    @ui->webView->load(QUrl("http://validator.w3.org/"));
    QDomDocument xmlDoc;
    if( !xmlDoc.setContent( ui->webView->page()->currentFrame()->toHtml() ) )
    {
        qDebug("ERROR");
    }
    else
    {
        qDebug() << "OK";
    }@

    But although http://validator.w3.org/ is valid XHTML, it still won't parse as an XML file. Are there any good alternatives?
    Thanks
    Richard



  • "QWebElement":https://qt-project.org/doc/qt-4.8/qwebelement.html will be your friend.
    To get the root element, use @webView->page()->mainFrame()->documentElement();@
    Then you can walk through it using methods like QWebElement::firstChild(), QWebElement::lastChild(), QWebElement::nextSibling(), ...

    I used something like the following to find the element that has focus. You can adapt it for your use case.
    @
    // For example, pass the root element as the parameter.
    // Depth-first search through the descendants of a_element;
    // returns a null QWebElement if none of them has focus.
    QWebElement WebViewDerivedClass::findElementWithFocus(const QWebElement& a_element)
    {
        QWebElement result;
        QWebElement tempWebElement = a_element.firstChild();
        bool done = false;
        while(!done)
        {
            // Stop once the last child has been processed.
            if(tempWebElement == a_element.lastChild())
            {
                done = true;
            }

            if(tempWebElement.hasFocus())
            {
                return tempWebElement;
            }
            // Recurse into elements that have children of their own.
            if(!tempWebElement.firstChild().isNull())
            {
                QWebElement tempWebElement2 = findElementWithFocus(tempWebElement);
                if(!tempWebElement2.isNull())
                {
                    return tempWebElement2;
                }
            }
            tempWebElement = tempWebElement.nextSibling();
        }
        return result;
    }
    @



  • Great. Still, I'm wondering why WebKit modifies the source code. When I read the source from a file and print it to the debug output, it looks OK. But as soon as I set it on the web view and read it back from there (toHtml), every "/>" has been changed to ">", so the elements are no longer closed. Can I prevent this? It's a problem because, for example, meta tags are normally self-closed. Thanks
    Richard



  • Yeah, I noticed that too. I don't know how to stop QtWebKit from doing so. You could try using QDomDocument again to parse your HTML.
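
    If you do go back to QDomDocument, one possible workaround is to re-close the void elements in the string returned by toHtml() before handing it to setContent(). The following is only a minimal sketch in plain C++: selfCloseVoidElements is a hypothetical helper, not part of Qt, and it assumes the markup contains no ">" inside attribute values, comments, or scripts. Other differences in the serialized HTML may still stop it from parsing as XML.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not a Qt API): rewrites HTML void elements such as
// <meta ...> or <br> into the self-closed form <meta ... /> or <br />.
// Regex-based sketch; assumes no '>' appears inside attribute values.
std::string selfCloseVoidElements(const std::string& html)
{
    // Void elements as listed in the HTML specification.
    static const std::regex voidTag(
        "<(area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)"
        "\\b([^<>]*)>",
        std::regex::icase);

    std::string out;
    std::size_t last = 0;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(html.begin(), html.end(), voidTag); it != end; ++it)
    {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(0));

        // Copy the text between the previous match and this one unchanged.
        out.append(html, last, pos - last);

        // Drop trailing whitespace and an existing '/' so that tags which
        // are already self-closed do not end up with a doubled slash.
        std::string attrs = m[2].str();
        while (!attrs.empty() &&
               (attrs.back() == ' ' || attrs.back() == '\t' ||
                attrs.back() == '\n' || attrs.back() == '\r' ||
                attrs.back() == '/'))
        {
            attrs.pop_back();
        }

        out += "<" + m[1].str() + attrs + " />";
        last = pos + static_cast<std::size_t>(m.length(0));
    }

    // Copy whatever follows the last match.
    out.append(html, last, std::string::npos);
    return out;
}
```

    The idea would be to call it on a std::string copy of the page source (for example via QString::toStdString() and QString::fromStdString()) and pass the result to setContent(). Tags that are not void elements, such as <p> or <colgroup>, are left untouched.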

