
How to extract <br /> tag from XML (XHTML)



  • Hello. I have the following XML document:

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <table>
    <tr>
    <td>new<br />line</td>
    <td>single line</td>
    </tr>
    </table>
    </body>
    </html>
    

    Now I am trying to extract the table with the following code:

    QDomDocument doc;
    doc.setContent(html_table);
    QDomElement docElem = doc.documentElement();
    QDomNodeList tables = docElem.elementsByTagName("table");
    for (int i = 0; i < tables.size(); ++i) {
            QDomNode table = tables.item(i);
            QDomElement row = table.firstChildElement("tr");
    
            while (!row.isNull()) {
                qDebug() << "\trow";
                QDomNodeList cols = row.elementsByTagName("td");
                for (int j = 0; j < cols.size(); ++j) {
                    QDomNode col = cols.item(j);
                    qDebug() << "\t\tcol" << col.toElement().text();
                }
                row = row.nextSiblingElement("tr");
            }
    }
    

    With this I get the following output:

    row
        col "newline"
        col "single line"
    

    The question is: how can I extract the <br /> tag and transform it into a newline?

    The closest I have come is to preprocess the input before parsing the XML, replacing every <br /> with \n so that the newline is already present in the extracted strings.



  • @asc7uni
    You should be able to get at the <br />. I think either you shouldn't call col.toElement().text(), or possibly it is just unclear what qDebug() shows when it encounters one. Have a poke around in QDomNode col and see what is actually in there before you call text() on it. See also my old post at https://forum.qt.io/topic/119756/solved-qdomnode-get-formated-text/2.



  • @JonB
    Thank you for pointing me in the right direction. I had completely missed that the text inside a QDomNode can itself be made up of child QDomNodes.
    The following code solves my case:

        QDomNodeList tables = docElem.elementsByTagName("table");
        for (int i = 0; i < tables.size(); ++i) {
                QDomNode table = tables.item(i);
                QDomElement row = table.firstChildElement("tr");
    
                while (!row.isNull()) {
                    qDebug() << "\trow";
                    QDomNodeList cols = row.elementsByTagName("td");
                    for (int j = 0; j < cols.size(); ++j) {
                        QDomNode col = cols.item(j);
                        qDebug() << "\t\tcol" << col.toElement().text();
    // added v
                        QDomNode inside_row = col.firstChild();
                        while (!inside_row.isNull()) {
                            if (inside_row.nodeType() == QDomNode::TextNode) {
                                qDebug() << "\t\t\t inside col"
                                         << inside_row.toText().data();
                            } else {
                                qDebug() << "\t\t\t inside col"
                                         << inside_row.nodeName();
                            }
                            inside_row = inside_row.nextSibling();
                        }
    // added ^
                    }
                    row = row.nextSiblingElement("tr");
                }
        }
    

    And it prints:

    HTMLTableParser::parse
            row
                    col "newline"
                             inside col "new"
                             inside col "br"
                             inside col "line"
                    col "single line"
                             inside col "single line"
    

    Which is what I needed.

