How to extract <br /> tag from XML (XHTML)

    asc7uni
    wrote on 20 Dec 2020, 14:03 last edited by asc7uni
    #1

    Hello. I have the following XML document:

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
    <table>
    <tr>
    <td>new<br />line</td>
    <td>single line</td>
    </tr>
    </table>
    </body>
    </html>
    

    Now I am trying to extract the table with the following code:

    QDomDocument doc;
    doc.setContent(html_table);
    QDomElement docElem = doc.documentElement();
    QDomNodeList tables = docElem.elementsByTagName("table");
    for (int i = 0; i < tables.size(); ++i) {
        QDomNode table = tables.item(i);
        QDomElement row = table.firstChildElement("tr");

        // Walk every <tr> that is a direct child of the table
        while (!row.isNull()) {
            qDebug() << "\trow";
            QDomNodeList cols = row.elementsByTagName("td");
            for (int j = 0; j < cols.size(); ++j) {
                QDomNode col = cols.item(j);
                // text() returns only the text content of the cell
                qDebug() << "\t\tcol" << col.toElement().text();
            }
            row = row.nextSiblingElement("tr");
        }
    }
    

    With this I get the following output:

    row
        col "newline"
        col "single line"
    

    The question is: how can I extract the <br /> tag and transform it into a newline?

    The closest I have got is to preprocess the input before parsing the XML and replace every <br /> with \n, so that the newline is already present in the extracted strings.
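
The preprocessing idea described above can be sketched as a plain text substitution. This is only a minimal illustration, assuming the markup is available as a std::string; the helper name replaceBrWithNewline is hypothetical, and the pattern is an assumption about which spellings of the tag (<br>, <br/>, <br />, any letter case) should be matched.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every <br>, <br/> or <br /> tag
// (case-insensitive) with a single '\n' before the markup is parsed.
std::string replaceBrWithNewline(const std::string &markup)
{
    static const std::regex brTag(R"(<br\s*/?>)",
                                  std::regex::ECMAScript | std::regex::icase);
    return std::regex_replace(markup, brTag, "\n");
}
```

In Qt code the same substitution would more likely be done directly on a QString with QString::replace() and a QRegularExpression. Note that a textual replace is blind to XML structure, so it would also rewrite a <br /> that appears inside a comment or CDATA section.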

      JonB
      wrote on 20 Dec 2020, 14:24 last edited by JonB
      #2

      @asc7uni
      You should be able to get at the <br />. I think either you shouldn't call col.toElement().text(), or possibly (I'm not sure) qDebug() does not show it when it encounters it. Have a poke around in QDomNode col and see what is actually in there before you call text() on it. See also my old post at https://forum.qt.io/topic/119756/solved-qdomnode-get-formated-text/2.

        asc7uni
        wrote on 21 Dec 2020, 07:15 last edited by asc7uni
        #3

        @JonB
        Thank you for pointing me in the right direction. It seems I completely missed that the text inside a QDomNode can itself be a QDomNode (a child text node).
        So the following code solves my case:

        QDomNodeList tables = docElem.elementsByTagName("table");
        for (int i = 0; i < tables.size(); ++i) {
            QDomNode table = tables.item(i);
            QDomElement row = table.firstChildElement("tr");

            while (!row.isNull()) {
                qDebug() << "\trow";
                QDomNodeList cols = row.elementsByTagName("td");
                for (int j = 0; j < cols.size(); ++j) {
                    QDomNode col = cols.item(j);
                    qDebug() << "\t\tcol" << col.toElement().text();
        // added v
                    // Visit each direct child of the cell: text nodes and elements such as <br />
                    QDomNode inside_col = col.firstChild();
                    while (!inside_col.isNull()) {
                        if (inside_col.nodeType() == QDomNode::TextNode) {
                            qDebug() << "\t\t\t inside col"
                                     << inside_col.toText().data();
                        } else {
                            qDebug() << "\t\t\t inside col"
                                     << inside_col.nodeName();
                        }
                        inside_col = inside_col.nextSibling();
                    }
        // added ^
                }
                row = row.nextSiblingElement("tr");
            }
        }
        

        And it prints:

        HTMLTableParser::parse
                row
                        col "newline"
                                 inside col "new"
                                 inside col "br"
                                 inside col "line"
                        col "single line"
                                 inside col "single line"
        

        Which is what I needed.
