Parsing page



  • Hello
    I would like to load a website and parse it like an XML document, building a tree structure. The page can be assumed to meet a standard such as XHTML 1.0 Strict. I've tried the following:

    @ui->webView->load(QUrl("http://validator.w3.org/"));
    QDomDocument xmlDoc;
    if (!xmlDoc.setContent(ui->webView->page()->currentFrame()->toHtml()))
    {
        qDebug("ERROR");
    }
    else
    {
        qDebug() << "OK";
    }@

    But although http://validator.w3.org/ is valid XHTML, it still won't parse as an XML file. Are there any good alternatives?
    Thanks
    Richard



  • "QWebElement":https://qt-project.org/doc/qt-4.8/qwebelement.html will be your friend.
    To get the 'root' use @webView->page()->mainFrame()->documentElement();@
    Then you can walk through it using methods like QWebElement::firstChild(), QWebElement::lastChild(), QWebElement::nextSibling(), ...

    I used something like the following to find the element that has focus. You can adapt it for your use case.
    @
    // For example, pass the root element as the parameter.
    QWebElement WebViewDerivedClass::findElementWithFocus(const QWebElement& a_element)
    {
        QWebElement result;
        QWebElement tempWebElement = a_element.firstChild();
        bool done = false;
        while(!done)
        {
            // Stop once the last child has been processed.
            if(tempWebElement == a_element.lastChild())
            {
                done = true;
            }

            if(tempWebElement.hasFocus())
            {
                return tempWebElement;
            }
            // Recurse into elements that have children of their own.
            if(!tempWebElement.firstChild().isNull())
            {
                QWebElement tempWebElement2 = findElementWithFocus(tempWebElement);
                if(!tempWebElement2.isNull())
                {
                    return tempWebElement2;
                }
            }
            tempWebElement = tempWebElement.nextSibling();
        }
        // A null element means nothing below a_element has focus.
        return result;
    }
    @
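
    The same recursive walk can also produce the tree structure asked for in the first post. Below is a minimal sketch that uses a hypothetical stand-in type, Node, instead of QWebElement so that it compiles without Qt; Node and dumpTree are names made up for illustration. With QWebElement you would presumably read the name from tagName() and loop over firstChild()/nextSibling() until isNull().

    ```cpp
    #include <cstddef>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for QWebElement (not a Qt type).
    struct Node {
        std::string tag;
        std::vector<Node> children;
    };

    // Depth-first, pre-order walk: appends one line per element to 'out',
    // indented by two spaces per nesting level.
    void dumpTree(const Node& node, std::size_t depth, std::string& out)
    {
        out += std::string(depth * 2, ' ') + node.tag + '\n';
        for (const Node& child : node.children)
        {
            dumpTree(child, depth + 1, out);
        }
    }
    ```

    Calling dumpTree(root, 0, text) on the root node fills text with an indented outline of the element tree.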



  • Great. Still, I'm wondering why WebKit modifies the source code. When I read the source from a file and print it to the debug output, it looks OK. But as soon as I set it on the web view and read it back (toHtml), all the "/>" are changed to ">", so those tags are no longer closed... Can I prevent this? It's a problem because, e.g., meta tags are normally closed directly. Thanks
    Richard



  • Yeah, I noticed that too. I don't know how to stop QtWebKit from doing so. You could try using QDomDocument to parse your HTML again.
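
    If the toHtml() output has to go through an XML parser anyway, one possible workaround is to patch the string first so that void elements such as meta and br are self-closed again. The following is only a rough sketch using std::regex; closeVoidElements is a made-up helper name, and it assumes that attribute values contain no '>' and that no script or comment text merely looks like one of these tags. It is string patching, not real parsing, so other HTML-only constructs (unquoted attributes, entities such as &nbsp;) can still make the XML parser fail.

    ```cpp
    #include <regex>
    #include <string>

    // Rewrites HTML void elements (<meta ...>, <br>, <img ...>, ...) as
    // self-closing tags (<meta ... />, <br />) so an XML parser accepts them.
    // Assumption: no '>' inside attribute values of these tags.
    std::string closeVoidElements(const std::string& html)
    {
        static const std::regex voidTag(
            R"(<(area|base|br|col|embed|hr|img|input|link|meta|param|source|track|wbr)\b([^>]*?)\s*/?>)",
            std::regex::ECMAScript | std::regex::icase);
        // $1 is the tag name, $2 its attributes (possibly empty).
        return std::regex_replace(html, voidTag, "<$1$2 />");
    }
    ```

    The result could then be handed to QDomDocument::setContent() in place of the raw toHtml() string. Tags that are already self-closed are normalised to the same "<name ... />" form rather than being closed twice.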

