How to parse XML with inline (embedded) tags?



  • Hello,

    I have the following XML that I need to parse using QXmlStreamReader:

    <p>some text <a>link</a> some other text</p>
    

    But I found no examples of how to do this, and the following code failed to get the correct element text after the </a> tag.

        qDebug() << xml.name();
        qDebug() << xml.readElementText();
        qDebug() << xml.name() << xml.isStartElement();
        qDebug() << xml.readElementText();
        qDebug() << xml.readElementText();
        qDebug() << xml.name();
        qDebug() << xml.name();
        qDebug() << xml.readElementText();
    

    I know that the readElementText() function has options to include or skip child elements, but in my case I need to retrieve both the text and the <a> tag.


  • Lifetime Qt Champion

    Hi,

    Do you mean you want to get "some text link some other text"?



  • @SGaist

    No, I want to parse it as if it were HTML.

    I want to be able to read the above XML and get a tree like this:

    root
    |- <p>
    |- text block 1: some text
    |- <a>
    |-- inner text of <a>: "link"
    |- </a>
    |- text block 2: some other text
    |- </p>
    

    then re-render it as

    <para>some text <urllink>link</urllink> some other text</para>
    

  • Lifetime Qt Champion

    So replace p and a with para and urllink?



  • @SGaist that's very hacky and not a generic solution.

    I used <p> and <para> as an example. In reality, the tags could be anything (I won't be able to know in advance).

    My question is about how to properly parse inline tags like the ones above.

    The QXmlStreamReader API doesn't seem to handle this case, which is quite common in HTML.

    What my program actually does is parse the XML files generated by doxygen and re-render them as HTML pages.



  • For example, the above case is relatively easy to handle with rapidxml:

        rapidxml::xml_document<> doc;
        char data[1024] = "<p>some text <a>link</a> some other text</p>";
        doc.parse<0>(data);
    
        qDebug() << "first node" << doc.first_node()->name();
    
        qDebug() << "first node children" << doc.first_node()->first_node()->type();
    
        qDebug() << "content" << doc.first_node()->first_node()->value();
    
        qDebug() << "next" << doc.first_node()->first_node()->next_sibling()->name();
    
        qDebug() << "next next" << doc.first_node()->first_node()->next_sibling()->first_node()->value();
    
        qDebug() << "next next next" << doc.first_node()->first_node()->next_sibling()->next_sibling()->value();
    

    Since rapidxml parses the XML into a tree structure, you can easily query a node's siblings and children.

    Qt's QXmlStreamReader, by contrast, only tokenizes the XML file into a flat token sequence, and it doesn't seem to expect inline tags.
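
A minimal sketch of how a flat token sequence can be folded back into a tree by keeping a stack of open elements. It uses only the C++ standard library, not Qt: `Token`, `Node` and `buildTree` are hypothetical names, and the three token kinds are simply named after QXmlStreamReader's StartElement, Characters and EndElement token types. With the real reader, the tokens would presumably come from repeated readNext() calls rather than from a hand-written vector.

```cpp
#include <memory>
#include <string>
#include <vector>

// Token kinds named after the QXmlStreamReader token types they stand in for.
enum class TokenType { StartElement, Characters, EndElement };

struct Token {
    TokenType type;
    std::string value;  // tag name for Start/EndElement, text for Characters
};

// Tree node: an element (name set, text empty) or a text node (name empty).
struct Node {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Fold a flat token sequence into a tree using a stack of open elements.
// Assumes the tokens are well formed (every start has a matching end).
std::unique_ptr<Node> buildTree(const std::vector<Token>& tokens) {
    auto root = std::make_unique<Node>();  // synthetic root, name and text empty
    std::vector<Node*> open{root.get()};   // innermost open element is at the back
    for (const Token& t : tokens) {
        switch (t.type) {
        case TokenType::StartElement: {
            auto n = std::make_unique<Node>();
            n->name = t.value;
            Node* raw = n.get();
            open.back()->children.push_back(std::move(n));
            open.push_back(raw);           // later tokens become its children
            break;
        }
        case TokenType::Characters: {
            auto n = std::make_unique<Node>();
            n->text = t.value;
            open.back()->children.push_back(std::move(n));
            break;
        }
        case TokenType::EndElement:
            if (open.size() > 1) open.pop_back();  // never pop the synthetic root
            break;
        }
    }
    return root;
}
```

Feeding it the seven tokens for `<p>some text <a>link</a> some other text</p>` yields a single `p` node with three children: a text node, an `a` element holding the text "link", and a second text node.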



  • Hi, Qt has the same kind of XML tree parsing in its DOM classes.
    Add xml to the QT line in your .pro file, e.g. QT += core gui xml, and
    #include <QDomDocument>
    Then you can use code like this:

    QDomDocument doc;
    doc.setContent(QString("<p>some text <a>link</a> some other text</p>"));
    
    for (auto c1 = doc.documentElement().firstChild(); !c1.isNull(); c1 = c1.nextSibling())
    {
        qDebug() << "level1: " << c1.toText().data();
    
        for (auto c2 = c1.firstChild(); !c2.isNull(); c2 = c2.nextSibling())
            qDebug() << "       level2: " << c2.toText().data();
    }
    



  • Lifetime Qt Champion

    @billconan it was meant as a question, not a suggestion.

    How do you know which tag you will replace, and with what? Is it something your application's users will provide?



  • @hskoglund thanks this works



  • @SGaist yes, I know what will be replaced by what.

    My task is converting doxygen XML files into HTML.

    Your solution would work, but it's hacky, because the translation is not always one-to-one the way it is from <para> to <p>.

    Being able to parse the initial XML into a tree structure gives me great flexibility.
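
A rough sketch of why a tree helps when the translation is not strictly tag for tag: once the document is a tree, each source tag can be looked up in a table that maps it to arbitrary opening and closing markup. This is standard-library C++ only, not Qt or doxygen code. `Node`, `TagMap` and `render` are hypothetical names, and the table entries used in the usage note are illustrative guesses rather than a real doxygen-to-HTML mapping.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Element node (name set) or text node (name empty, text set).
struct Node {
    std::string name;
    std::string text;
    std::vector<Node> children;
};

// Source tag -> (opening markup, closing markup). One source tag may expand
// to several target tags, so the mapping need not be one-to-one.
using TagMap = std::map<std::string, std::pair<std::string, std::string>>;

// Recursively re-render a tree, translating tags through the table.
// Text is emitted as-is; escaping and attributes are left out of this sketch.
std::string render(const Node& n, const TagMap& tags) {
    if (n.name.empty()) return n.text;  // text node
    std::string inner;
    for (const Node& c : n.children) inner += render(c, tags);
    auto it = tags.find(n.name);
    if (it == tags.end())               // unknown tag: pass it through unchanged
        return "<" + n.name + ">" + inner + "</" + n.name + ">";
    return it->second.first + inner + it->second.second;
}
```

With a table such as `{"para", {"<p>", "</p>"}}`, `{"ulink", {"<a>", "</a>"}}` and `{"programlisting", {"<pre><code>", "</code></pre>"}}`, a `para` node containing text, a `ulink` child and more text renders with the inline link kept in place between the two text runs.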


  • Lifetime Qt Champion

    Again, it wasn't a suggestion. I was just asking whether you would simply do a tag-for-tag replacement.

    Out of curiosity, since you are using doxygen, why not make it generate the HTML directly?

