How to extract <br /> tag from XML (XHTML)
-
Hello. I have the following XML document:
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <table>
      <tr>
        <td>new<br />line</td>
        <td>single line</td>
      </tr>
    </table>
    </body>
    </html>
Now I am trying to extract the table with the following code:
    QDomDocument doc;
    doc.setContent(html_table);
    QDomElement docElem = doc.documentElement();
    QDomNodeList tables = docElem.elementsByTagName("table");
    for (int i = 0; i < tables.size(); ++i) {
        QDomNode table = tables.item(i);
        QDomElement row = table.firstChildElement("tr");
        while (!row.isNull()) {
            qDebug() << "\trow";
            QDomNodeList cols = row.elementsByTagName("td");
            for (int j = 0; j < cols.size(); ++j) {
                QDomNode col = cols.item(j);
                qDebug() << "\t\tcol" << col.toElement().text();
            }
            row = row.nextSiblingElement("tr");
        }
    }
With this I get the following output:
    row
        col "newline"
        col "single line"
The question is how to extract the <br /> tag and transform it into a newline? The closest I have got is to preprocess the input before parsing the XML and change every <br /> to \n, so the newline is already present in the extracted strings.
-
@asc7uni
You should be able to get at the <br />. I think either you shouldn't go col.toElement().text(), or possibly I'm not sure what qDebug() is going to show if it encounters that. Have a poke around in QDomNode col and see what is actually in there before you text() it? See my old post at https://forum.qt.io/topic/119756/solved-qdomnode-get-formated-text/2.
-
@JonB
Thank you for pointing me in the right direction. It seems I completely missed that the text inside a QDomNode can itself be a QDomNode.
So the following code solves my case:

    QDomNodeList tables = docElem.elementsByTagName("table");
    for (int i = 0; i < tables.size(); ++i) {
        QDomNode table = tables.item(i);
        QDomElement row = table.firstChildElement("tr");
        while (!row.isNull()) {
            qDebug() << "\trow";
            QDomNodeList cols = row.elementsByTagName("td");
            for (int j = 0; j < cols.size(); ++j) {
                QDomNode col = cols.item(j);
                qDebug() << "\t\tcol" << col.toElement().text();
                // added v
                QDomNode inside_row = col.firstChild();
                while (!inside_row.isNull()) {
                    if (inside_row.nodeType() == QDomNode::TextNode) {
                        qDebug() << "\t\t\t inside col" << inside_row.toText().data();
                    } else {
                        qDebug() << "\t\t\t inside col" << inside_row.nodeName();
                    }
                    inside_row = inside_row.nextSibling();
                }
                // added ^
            }
            row = row.nextSiblingElement("tr");
        }
    }
And it prints:
    HTMLTableParser::parse
        row
            col "newline"
                 inside col "new"
                 inside col "br"
                 inside col "line"
            col "single line"
                 inside col "single line"
Which is what I needed.