QtXmlStreamReader and UTF-8 multibytes characters

    SGaist (Lifetime Qt Champion) wrote (#6):

    Can you provide a minimal compilable example that shows this behaviour ?

    It's also unclear why you would use std::string in this case.

    • In reply to Christian Ehrlicher:

      @SGaist said in QtXmlStreamReader and UTF-8 multibytes characters:

      inputXmlString

      And to add to @SGaist's answer: what type is this (I assume std::string), and are you sure it's really UTF-8 encoded?

      Frenk21 wrote (#7):

      @Christian-Ehrlicher Well, I just put this XML, as you see it, into the std::string and then append it to the QByteArray. I checked the XML and it is UTF-8 encoded. A std::string is just an array of characters, so the euro sign (€) should be stored as 3 chars, or am I wrong? And if the QByteArray holds the euro sign as 3 chars, do I need to do anything special so that QXmlStreamReader can handle it properly?

      • In reply to Frenk21:

        @SGaist Well, at the start I did not have any conversion, but as QXmlStreamReader always fails, I thought that maybe the input std::string does not hold a valid UTF-8 string.

        The real question, though, is why it fails when I use a UTF-8 character that is longer than 1 byte.

        Frenk21 wrote (#8, edited):

        @Frenk21 The implementation was done by someone else. He wrote an abstract C++ layer so that different real XML parsers could be integrated later, and he used QXmlStreamReader as an example. Until now everything worked fine, as no one had tested it with multibyte characters.

        Here is the code snippet of the parsing part:

        NOTE: class XmlElement holds the XML element info (name, node, children)
        
        void XmlReader::parse(const std::string &xmlData)
        {
            std::list<XmlElement> openElements;
        
            QMutexLocker locker(&m_mutex);
        
            // Add data to the XML reader buffer
            (void)m_xmlData.buffer().append(xmlData.c_str());
        
            // Read XML document from the XML data buffer
            bool end = false;
        
            while ((!m_xmlReader.atEnd()) &&
                   (!end))
            {
                QXmlStreamReader::TokenType tokenType = m_xmlReader.readNext();
        
                switch (tokenType)
                {
                    case QXmlStreamReader::NoToken:
                    {
                        // The reader has not yet read anything.
                        break;
                    }
        
                    case QXmlStreamReader::Invalid:
                    {
                        // An error occurred, handle it
                        end = true;
                        break;
                    }
        
                    case QXmlStreamReader::StartElement:
                    {
                        // Read start of the element (element name and attributes)
                        const std::string elementName = m_xmlReader.name().toString().toStdString();
                        QXmlStreamAttributes qtAttributeList = m_xmlReader.attributes();
        
                        for (int32_t i = 0; i < qtAttributeList.size(); ++i)
                        {
                            const QXmlStreamAttribute &attribute = qtAttributeList.at(i);
        
                            const std::string attributeName = attribute.name().toString().toStdString();
                            const std::string attributeValue = attribute.value().toString().toStdString();
                        }
        
                        // Open a new element
                        openElements.push_back(XmlElement(elementName));
                        break;
                    }
        
                    case QXmlStreamReader::EndElement:
                    {
                        // Check if the element is even open
                        if (openElements.size() > 0U)
                        {
                            // Close the element
                            XmlElement currentElement = openElements.back();
                            openElements.pop_back();
        
                            // Check if this is a child element or the root element
                            if (openElements.size() > 0U)
                            {
                                // This is a child element, add it to its parent
                                XmlElement &parentElement = openElements.back();
                                parentElement.addChildElement(currentElement);
                            }
                            else
                            {
                                // This is the root element, save it and finish reading the XML document
                                end = true;
                            }
                        }
                        else
                        {
                            end = true;
                        }
        
                        break;
                    }
        
                    case QXmlStreamReader::Characters:
                    {
                        // It reports characters also when whitespaces remain between XML tags and the
                        // caller must decide if this is a valid
                        const std::string text = m_xmlReader.text().toString().toStdString();
        
                        // Append the text node to the currently open element's text node
                        XmlElement &currentElement = openElements.back();
        
                        currentElement.appendTextNode(text);
                        break;
                    }
        
                    default:
                    {
                        // Unsupported token types are skipped
                        break;
                    }
                }
            }
        }
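
One detail in the snippet above that may be worth ruling out: `append(xmlData.c_str())` passes only a `const char *`, so the number of bytes copied is decided by the first NUL byte rather than by `xmlData.size()`. The sketch below uses plain standard C++ (no Qt) and two hypothetical helper names, `bytesSeenViaCStr` and `bytesStored`, to show the difference. It is an illustration under that assumption, not a diagnosis of the actual failure.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Bytes that a strlen()-based copy (for example append(xmlData.c_str()))
// would take from the string: everything up to the first NUL byte.
// Hypothetical helper, for illustration only.
std::size_t bytesSeenViaCStr(const std::string &xmlData)
{
    return std::strlen(xmlData.c_str());
}

// Bytes actually stored in the string, embedded NUL bytes included.
// Hypothetical helper, for illustration only.
std::size_t bytesStored(const std::string &xmlData)
{
    return xmlData.size();
}
```

For a UTF-8 string holding only the euro sign, both helpers report 3, so multibyte characters alone do not change the length; only embedded NUL bytes do. If that turns out to matter, QByteArray also offers an `append` overload that takes an explicit length.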
        
          ChrisW67 wrote (#9):

          Here is how QXmlStreamReader sees your file:

          <?xml version="1.0" encoding="UTF-8"?>
          <PTL.R01>
           <HDR>
           <HDR.control_id V="XXXX"/>
           <HDR.version_id V="POCT1"/>
           <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
           </HDR>
           <PT>
           <PT.patient_id V="€path"/>
           </PT>
          </PTL.R01>
          
          "StartDocument"
          "StartElement" "PTL.R01"
          "Characters" "\n "
          "StartElement" "HDR"
          "Characters" "\n "
          "StartElement" "HDR.control_id"
            "V" == "XXXX"
          "EndElement" "HDR.control_id"
          "Characters" "\n "
          "StartElement" "HDR.version_id"
            "V" == "POCT1"
          "EndElement" "HDR.version_id"
          "Characters" "\n "
          "StartElement" "HDR.creation_dttm"
            "V" == "2001-11-01T16:32:45-8:00"
          "EndElement" "HDR.creation_dttm"
          "Characters" "\n "
          "EndElement" "HDR"
          "Characters" "\n "
          "StartElement" "PT"
          "Characters" "\n "
          "StartElement" "PT.patient_id"
            "V" == "€path"
          "EndElement" "PT.patient_id"
          "Characters" "\n "
          "EndElement" "PT"
          "Characters" "\n"
          "EndElement" "PTL.R01"
          "EndDocument"
          

          when you use it in the straightforward fashion:

          #include <QCoreApplication>
          #include <QFile>
          #include <QXmlStreamReader>
          #include <QDebug>
          
          int main(int argc, char *argv[])
          {
              QCoreApplication a(argc, argv);
          
              QFile input("/tmp/test/test.xml");
              if (input.open(QFile::ReadOnly)) {
                  // get the test data
                  QByteArray m_xmlData = input.readAll();
          
                  QXmlStreamReader m_xmlReader(m_xmlData);
          
                  while (!m_xmlReader.atEnd()) {
                      QXmlStreamReader::TokenType tokenType = m_xmlReader.readNext();
                      switch (tokenType) {
                      case QXmlStreamReader::NoToken:
                          break;
                      case QXmlStreamReader::Invalid: {
                          qDebug() << m_xmlReader.tokenString();  // never seen
                          break;
                      }
                      case QXmlStreamReader::Characters:   {
                          qDebug() << m_xmlReader.tokenString() << m_xmlReader.text();
                          break;
                      }
                      case QXmlStreamReader::StartElement: {
                          qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
                          const QXmlStreamAttributes attributes = m_xmlReader.attributes();
                          for (const QXmlStreamAttribute &attr : attributes) {
                              qDebug() << " " << attr.name() << "==" << attr.value();
                          }
                          break;
                      }
                      case QXmlStreamReader::EndElement:   {
                          qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
                          break;
                      }
                      default: {
                          qDebug() << m_xmlReader.tokenString();
                          break;
                          }
                      }
                  }
              }
              return 0;
          }
          

          You see that Qt has no problem with correctly encoded data including the Euro sign in an attribute.
          The problem is not QXmlStreamReader.

          Mangled input is one possibility. However, your output shows that the three-byte UTF-8 euro sign has been digested and recognised correctly as an attribute value (the non-UTF console notwithstanding). This suggests that the stream is being asked to read past the end of the input, possibly because the end condition of whatever loop is feeding the parser is broken.

            Frenk21 wrote (#10):

            @ChrisW67 Thank you very much. I assumed this could be the problem; I will check the flow again and try to fix the stream handling. I am still not sure whether the problem is in std::string, QString or QByteArray. It would be good to know what to expect in each of them when the euro sign (€) is passed as multibyte UTF-8, but I did not find anything in the Qt documentation, or maybe I missed something.

              Frenk21 wrote (#11):

              @ChrisW67 I tried your example with the file as test input and I get Invalid. I may have forgotten to mention that I am using Qt 5.15.2. Is there any known issue with QXmlStreamReader in this version? Which version did you use for testing?

                Christian Ehrlicher (Lifetime Qt Champion) wrote (#12):

                @Frenk21 said in QtXmlStreamReader and UTF-8 multibytes characters:

                It would be good to know what to expect in each of them when the euro sign (€) is passed as multibyte UTF-8, but I did not find anything in the Qt documentation

                std::string - no Qt, no encoding, just a bunch of bytes; it is up to you how to interpret them
                QByteArray - just a bunch of bytes; it is up to you how to interpret them
                QString - a UTF-16 encoded string
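
The three points above can be illustrated without Qt. The following is a minimal sketch in standard C++, with `std::u16string` standing in for QString's UTF-16 storage; the helper names `euroUtf8Length` and `euroUtf16Length` are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Number of bytes the euro sign occupies in a byte-oriented container
// such as std::string (QByteArray likewise stores raw bytes).
// Hypothetical helper, for illustration only.
std::size_t euroUtf8Length()
{
    const std::string euro = "\xE2\x82\xAC"; // UTF-8 encoding of U+20AC
    return euro.size();
}

// Number of 16-bit code units the same character occupies in UTF-16,
// the encoding QString uses internally.
// Hypothetical helper, for illustration only.
std::size_t euroUtf16Length()
{
    const std::u16string euro = u"\u20AC";
    return euro.size();
}
```

So the same character is three elements in a byte container and one element in a UTF-16 string; the byte container itself carries no information about which encoding those bytes are in.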

                  ChrisW67 wrote (#13):

                  @Frenk21 Qt 5.15.2 or 6.3.1 on Linux with GCC 9.4.0. I just copied and pasted from my earlier post to create test.xml.

                  I think you need to take the input you are given and, without imposing any character conversions at all, write it to a file in binary mode. Then inspect what is actually in it.

                  For my testing I added an XML declaration, but it also works without one.

                  chrisw@newton:/tmp/test$ od -ta -tx1 test.xml 
                  0000000   <   ?   x   m   l  sp   v   e   r   s   i   o   n   =   "   1
                           3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
                  0000020   .   0   "  sp   e   n   c   o   d   i   n   g   =   "   U   T
                           2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  55  54
                  0000040   F   -   8   "   ?   >  nl   <   P   T   L   .   R   0   1   >
                           46  2d  38  22  3f  3e  0a  3c  50  54  4c  2e  52  30  31  3e
                  0000060  nl  sp   <   H   D   R   >  nl  sp   <   H   D   R   .   c   o
                           0a  20  3c  48  44  52  3e  0a  20  3c  48  44  52  2e  63  6f
                  0000100   n   t   r   o   l   _   i   d  sp   V   =   "   X   X   X   X
                           6e  74  72  6f  6c  5f  69  64  20  56  3d  22  58  58  58  58
                  0000120   "   /   >  nl  sp   <   H   D   R   .   v   e   r   s   i   o
                           22  2f  3e  0a  20  3c  48  44  52  2e  76  65  72  73  69  6f
                  0000140   n   _   i   d  sp   V   =   "   P   O   C   T   1   "   /   >
                           6e  5f  69  64  20  56  3d  22  50  4f  43  54  31  22  2f  3e
                  0000160  nl  sp   <   H   D   R   .   c   r   e   a   t   i   o   n   _
                           0a  20  3c  48  44  52  2e  63  72  65  61  74  69  6f  6e  5f
                  0000200   d   t   t   m  sp   V   =   "   2   0   0   1   -   1   1   -
                           64  74  74  6d  20  56  3d  22  32  30  30  31  2d  31  31  2d
                  0000220   0   1   T   1   6   :   3   2   :   4   5   -   8   :   0   0
                           30  31  54  31  36  3a  33  32  3a  34  35  2d  38  3a  30  30
                  0000240   "   /   >  nl  sp   <   /   H   D   R   >  nl  sp   <   P   T
                           22  2f  3e  0a  20  3c  2f  48  44  52  3e  0a  20  3c  50  54
                  0000260   >  nl  sp   <   P   T   .   p   a   t   i   e   n   t   _   i
                           3e  0a  20  3c  50  54  2e  70  61  74  69  65  6e  74  5f  69
                  0000300   d  sp   V   =   "   b stx   ,   p   a   t   h   "   /   >  nl
                           64  20  56  3d  22  e2  82  ac  70  61  74  68  22  2f  3e  0a
                  0000320  sp   <   /   P   T   >  nl   <   /   P   T   L   .   R   0   1
                           20  3c  2f  50  54  3e  0a  3c  2f  50  54  4c  2e  52  30  31
                  0000340   >  nl
                           3e  0a
                  0000342
                  

                  You should be looking for random rubbish before and after the xml you expected.
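
Where `od` is not available (for example on Windows), the same kind of inspection can be done in code. This is a minimal sketch in standard C++, assuming the received bytes are already held in a `std::string`; the function name `toHex` is hypothetical.

```cpp
#include <cstdio>
#include <string>

// Format every byte of `data` as two lowercase hex digits separated by
// spaces, similar to the hex row of each `od -tx1` line.
// Hypothetical helper, for illustration only.
std::string toHex(const std::string &data)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : data) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}
```

A correctly encoded euro sign shows up as `e2 82 ac`, matching the dump above; anything unexpected before `3c 3f 78 6d 6c` (`<?xml`) or after the closing tag of the root element would be the rubbish to look for.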

                    Frenk21 wrote (#14):

                    @ChrisW67 I see, thanks for all your support and time. I will check it. I just copied what was posted here and tried it.

                    I see now that I should have been more specific from the start and described the environment I am working in. As I mentioned, I am using Qt 5.15.2 on Windows 10, where I tried building with MinGW 8.1.0 64-bit and MSVC2020 (C++ compiler 17.2.32526.322) 64-bit. With both I get Invalid. Sorry that I did not provide this information earlier.

                    1 Reply Last reply
                    0
                    • C ChrisW67

                      @Frenk21 Qt 5.15.2 or 6.3.1 on Linux with GCC 9.4.0. I just copy-n-pasted from my earlier post to create the test.xml.

                      I think you need to take the input your are given and, without imposing any character conversions at all, write it to a file in binary mode. Then inspect what is actually in it.

                      For my testing I added an xml processing instruction, but it works also without.

                      chrisw@newton:/tmp/test$ od -ta -tx1 test.xml 
                      0000000   <   ?   x   m   l  sp   v   e   r   s   i   o   n   =   "   1
                               3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
                      0000020   .   0   "  sp   e   n   c   o   d   i   n   g   =   "   U   T
                               2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  55  54
                      0000040   F   -   8   "   ?   >  nl   <   P   T   L   .   R   0   1   >
                               46  2d  38  22  3f  3e  0a  3c  50  54  4c  2e  52  30  31  3e
                      0000060  nl  sp   <   H   D   R   >  nl  sp   <   H   D   R   .   c   o
                               0a  20  3c  48  44  52  3e  0a  20  3c  48  44  52  2e  63  6f
                      0000100   n   t   r   o   l   _   i   d  sp   V   =   "   X   X   X   X
                               6e  74  72  6f  6c  5f  69  64  20  56  3d  22  58  58  58  58
                      0000120   "   /   >  nl  sp   <   H   D   R   .   v   e   r   s   i   o
                               22  2f  3e  0a  20  3c  48  44  52  2e  76  65  72  73  69  6f
                      0000140   n   _   i   d  sp   V   =   "   P   O   C   T   1   "   /   >
                               6e  5f  69  64  20  56  3d  22  50  4f  43  54  31  22  2f  3e
                      0000160  nl  sp   <   H   D   R   .   c   r   e   a   t   i   o   n   _
                               0a  20  3c  48  44  52  2e  63  72  65  61  74  69  6f  6e  5f
                      0000200   d   t   t   m  sp   V   =   "   2   0   0   1   -   1   1   -
                               64  74  74  6d  20  56  3d  22  32  30  30  31  2d  31  31  2d
                      0000220   0   1   T   1   6   :   3   2   :   4   5   -   8   :   0   0
                               30  31  54  31  36  3a  33  32  3a  34  35  2d  38  3a  30  30
                      0000240   "   /   >  nl  sp   <   /   H   D   R   >  nl  sp   <   P   T
                               22  2f  3e  0a  20  3c  2f  48  44  52  3e  0a  20  3c  50  54
                      0000260   >  nl  sp   <   P   T   .   p   a   t   i   e   n   t   _   i
                               3e  0a  20  3c  50  54  2e  70  61  74  69  65  6e  74  5f  69
                      0000300   d  sp   V   =   "   b stx   ,   p   a   t   h   "   /   >  nl
                               64  20  56  3d  22  e2  82  ac  70  61  74  68  22  2f  3e  0a
                      0000320  sp   <   /   P   T   >  nl   <   /   P   T   L   .   R   0   1
                               20  3c  2f  50  54  3e  0a  3c  2f  50  54  4c  2e  52  30  31
                      0000340   >  nl
                               3e  0a
                      0000342
                      

                      You should be looking for random rubbish before and after the xml you expected.
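One way to do that inspection from inside the program, as a rough sketch: the hypothetical helper `hexDump` below (not part of Qt or of any code in this thread) formats raw bytes as hex so they can be compared with the `od` output. It uses only the C++ standard library and assumes the incoming data is available as a `std::string`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the bytes of `bytes` as lowercase,
// space-separated hex pairs, without any character-set conversion.
std::string hexDump(const std::string &bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        if (i != 0) {
            out += ' ';
        }
        out += digits[b >> 4];    // high nibble
        out += digits[b & 0x0F];  // low nibble
    }
    return out;
}
```

For a correctly encoded `€path` value the dump should begin with `e2 82 ac`, the same three bytes visible at offset 0000300 in the `od` listing.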

                      Frenk21
                      wrote on last edited by Frenk21
                      #15

                      @ChrisW67 Hello Chris,

                      Your example with test.xml works on my side. I did a little more testing and now I see where the problem is. The problem is not in the string conversion: because I read the XML messages from a TCP/IP stream, the error appears when the next XML message is parsed. For example, take the two messages shown below:

                      <?xml version="1.0" encoding="UTF-8"?>
                      <PTL.R01>
                       <HDR>
                       <HDR.control_id V="XXXX"/>
                       <HDR.version_id V="POCT1"/>
                       <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
                       </HDR>
                       <PT>
                       <PT.patient_id V="€path"/>
                       </PT>
                      </PTL.R01>
                      <?xml version="1.0" encoding="UTF-8"?>
                      <PTL.R01>
                       <HDR>
                       <HDR.control_id V="XXXX"/>
                       <HDR.version_id V="POCT1"/>
                       <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
                       </HDR>
                       <PT>
                       <PT.patient_id V="€path"/>
                       </PT>
                      </PTL.R01>
                      

                      The reason is that, without a multibyte character, the XML reader correctly parses the first message and then the second one, reaches the end, and leaves no leftovers in the buffer after the first message. By "no leftovers" I mean that characterOffset() returns the correct location of the last parsed character.

                      But if I add a multibyte character and parse the first message, I see that "1>" remains in the buffer (characterOffset() returns an index pointing to this location), and when I try to parse the second message I get the error: Premature end of document / Start tag expected. So I need to somehow improve how the buffer/stream data is handled and copied.

                      If, after parsing the first message, I manually advance the start of the buffer by 2, then parsing the second message also works (the +2 just corrects the start of the input data buffer of the QXmlStreamReader).

                      So it seems to me that the offset QXmlStreamReader returns from characterOffset() does not count bytes: the € sign (EUR) is 3 bytes long in UTF-8, but the index increases by only 1. If I add more multibyte characters, the difference grows, accumulating over the whole XML message for every multibyte character.
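The mismatch described here can be illustrated with a small sketch. It assumes, as the observation suggests but this thread does not confirm, that the reported offset counts decoded UTF-16 code units rather than bytes. The helper name `utf8ByteOffset` is hypothetical, and the buffer is assumed to hold well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: converts an offset counted in UTF-16 code units
// into the matching byte offset inside a well-formed UTF-8 buffer.
std::size_t utf8ByteOffset(const std::string &utf8, std::size_t utf16Units)
{
    std::size_t bytePos = 0;
    std::size_t units = 0;
    while (bytePos < utf8.size() && units < utf16Units) {
        const unsigned char lead = static_cast<unsigned char>(utf8[bytePos]);
        std::size_t seqLen = 1;          // 0xxxxxxx: single byte
        if (lead >= 0xF0) {
            seqLen = 4;                  // 11110xxx: four-byte sequence
        } else if (lead >= 0xE0) {
            seqLen = 3;                  // 1110xxxx: three-byte sequence
        } else if (lead >= 0xC0) {
            seqLen = 2;                  // 110xxxxx: two-byte sequence
        }
        bytePos += seqLen;
        // A four-byte sequence encodes a code point outside the BMP,
        // which takes two UTF-16 code units (a surrogate pair).
        units += (seqLen == 4) ? 2 : 1;
    }
    // Clamp in case the buffer ends in the middle of a sequence.
    return bytePos < utf8.size() ? bytePos : utf8.size();
}
```

With `€path`, an offset of 1 character maps to byte 3, which matches the +2 correction found by hand for a message containing a single € sign.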

                      What I need is the correct byte index of the end of the XML message, so that I can advance the buffer for the QXmlStreamReader and it can properly process the next message.
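A different way to get that index, sketched under strong assumptions, is to split the byte stream before it reaches the XML reader at all. The hypothetical helper `extractMessages` below searches for the closing tag of the root element on the raw bytes, so multibyte characters cannot shift the cut position. It assumes the closing tag is known in advance (`</PTL.R01>` in the example) and never occurs inside a message for any other reason, which may not hold for every protocol.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: removes every complete message from `buffer`
// and returns them. Incomplete trailing data stays in `buffer` until
// more bytes arrive.
std::vector<std::string> extractMessages(std::string &buffer,
                                         const std::string &closingTag)
{
    std::vector<std::string> messages;
    if (closingTag.empty()) {
        return messages;
    }
    std::size_t end = 0;
    while ((end = buffer.find(closingTag)) != std::string::npos) {
        const std::size_t cut = end + closingTag.size();
        // Skip whitespace left between messages so that each message
        // starts directly with its XML declaration or root element.
        std::size_t start = buffer.find_first_not_of(" \t\r\n");
        if (start == std::string::npos || start > end) {
            start = end;
        }
        messages.push_back(buffer.substr(start, cut - start));
        buffer.erase(0, cut);
    }
    return messages;
}
```

Each returned message could then be handed to a fresh reader, while the remainder of the buffer waits for the next chunk from the socket.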

                        Frenk21
                        wrote on last edited by Frenk21
                        #16

                        @Frenk21 I will just answer myself:

                        Here is a working solution for the example provided:

                                    ...
                                    case QXmlStreamReader::StartElement: {
                                        uint adjustDataBufferOffset = 0;

                                        qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
                                        const QXmlStreamAttributes attributes = m_xmlReader.attributes();
                                        for (const QXmlStreamAttribute &attr : attributes) {
                                            qDebug() << " " << attr.name() << "==" << attr.value();

                                            // attr.value() holds UTF-16 text, so encode it as UTF-8
                                            // to get the length of the value in bytes
                                            const QByteArray valueUtf8 = attr.value().toUtf8();
                                            uint valueLenBytes = static_cast<uint>(valueUtf8.size());
                                            uint valueCharCnt = 0;

                                            // Count UTF-8 characters (one character can take 1..4 bytes):
                                            // only single bytes and lead bytes are counted,
                                            // continuation bytes (10xxxxxx) are skipped
                                            for (char oneByte : valueUtf8)
                                            {
                                                if (((oneByte & 0x80) == 0) || ((oneByte & 0xc0) == 0xc0))
                                                {
                                                    ++valueCharCnt;
                                                }
                                            }

                                            if (valueLenBytes > valueCharCnt)
                                            {
                                                adjustDataBufferOffset += valueLenBytes - valueCharCnt;
                                            }
                                        }

                                        if (adjustDataBufferOffset > 0)
                                        {
                                            QByteArray &dataBuffer = m_xmlData.buffer();

                                            auto index = m_xmlReader.characterOffset();

                                            index += adjustDataBufferOffset;

                                            // FIX: the character offset does not advance to the
                                            // end of the element in bytes if multibyte chars
                                            // are present in the XML message
                                            dataBuffer = dataBuffer.mid(index);
                                        }

                                        break;
                                    }
                                    ....
                        