QtXmlStreamReader and UTF-8 multibytes characters
-
Hello to all,
I have a problem parsing the XML string (inputXmlString):
```
"<PTL.R01>"
" <HDR>"
" <HDR.control_id V=\"XXXX\"/>"
" <HDR.version_id V=\"POCT1\"/>"
" <HDR.creation_dttm V=\"2001-11-01T16:32:45-8:00\"/>"
" </HDR>"
" <PT>"
" <PT.patient_id V=\"€path\"/>"
" </PT>"
"</PTL.R01>");
```
When I put the string into a QByteArray and use it as the input for QXmlStreamReader, the parsing fails (traces):
```
BEFORE </HDR> <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER </HDR> <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 5
BEFORE </HDR> <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER /HDR> <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 6
BEFORE /HDR> <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 5
BEFORE <PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 6
BEFORE PT> <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 4
BEFORE <PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
TOKEN TYPE 6
BEFORE PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER /> </PT></PTL.R01>
TOKEN TYPE 4
ATTRIBUTE NAME V
ATTRIBUTE VALUE Γé¼path
BEFORE /> </PT></PTL.R01>
AFTER /> </PT></PTL.R01>
TOKEN TYPE 5
BEFORE /> </PT></PTL.R01>
AFTER </PT></PTL.R01>
TOKEN TYPE 6
BEFORE </PT></PTL.R01>
AFTER T></PTL.R01>
TOKEN TYPE 5
BEFORE T></PTL.R01>
AFTER 1>
TOKEN TYPE 5
BEFORE 1>
AFTER 1>
TOKEN TYPE 1
INVALID 4: Premature end of document.
```
I have implemented the parsing as usual:
```
QBuffer m_xmlData;
...
// INIT:
if (m_xmlData.open(QIODevice::ReadWrite)) {
    m_xmlReader.setDevice(&m_xmlData);
}
...
// ADD DATA (upper string):
QByteArray inputXmlData;
inputXmlData.append(QString::fromUtf8(inputXmlString.c_str()).toStdString().c_str());
m_xmlData.buffer().append(inputXmlData);
...
// PARSING:
std::cout << "BEFORE " << m_xmlData.buffer().constData() + m_xmlReader.characterOffset() << std::endl;
QXmlStreamReader::TokenType tokenType = m_xmlReader.readNext();
std::cout << "AFTER " << m_xmlData.buffer().constData() + m_xmlReader.characterOffset() << std::endl;
std::cout << "TOKEN TYPE " << tokenType << std::endl;
switch (tokenType) {
case QXmlStreamReader::NoToken: {
    // The reader has not yet read anything.
    break;
}
case QXmlStreamReader::Invalid: {
    // An error occurred, handle it
    result = handleError();
    end = true;
    break;
}
case QXmlStreamReader::StartElement: {
    ...
    break;
}
case QXmlStreamReader::EndElement: {
    ...
    break;
}
case QXmlStreamReader::Characters: {
    const std::string text = m_xmlReader.text().toString().toStdString();
    ...
    break;
}
}
```
If I remove the euro sign before "path", parsing works correctly. What am I missing or doing wrong on my side?
Thanks for any clues or advice, Frenk
-
Here is how QXmlStreamReader sees your file:
```
<?xml version="1.0" encoding="UTF-8"?>
<PTL.R01>
 <HDR>
 <HDR.control_id V="XXXX"/>
 <HDR.version_id V="POCT1"/>
 <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
 </HDR>
 <PT>
 <PT.patient_id V="€path"/>
 </PT>
</PTL.R01>
```
```
"StartDocument"
"StartElement" "PTL.R01"
"Characters" "\n "
"StartElement" "HDR"
"Characters" "\n "
"StartElement" "HDR.control_id"
  "V" == "XXXX"
"EndElement" "HDR.control_id"
"Characters" "\n "
"StartElement" "HDR.version_id"
  "V" == "POCT1"
"EndElement" "HDR.version_id"
"Characters" "\n "
"StartElement" "HDR.creation_dttm"
  "V" == "2001-11-01T16:32:45-8:00"
"EndElement" "HDR.creation_dttm"
"Characters" "\n "
"EndElement" "HDR"
"Characters" "\n "
"StartElement" "PT"
"Characters" "\n "
"StartElement" "PT.patient_id"
  "V" == "€path"
"EndElement" "PT.patient_id"
"Characters" "\n "
"EndElement" "PT"
"Characters" "\n"
"EndElement" "PTL.R01"
"EndDocument"
```
when you use it in the straightforward fashion:
```
#include <QCoreApplication>
#include <QFile>
#include <QXmlStreamReader>
#include <QDebug>

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    QFile input("/tmp/test/test.xml");
    if (input.open(QFile::ReadOnly)) {
        // get the test data
        QByteArray m_xmlData = input.readAll();

        QXmlStreamReader m_xmlReader(m_xmlData);
        while (!m_xmlReader.atEnd()) {
            QXmlStreamReader::TokenType tokenType = m_xmlReader.readNext();
            switch (tokenType) {
            case QXmlStreamReader::NoToken:
                break;
            case QXmlStreamReader::Invalid: {
                qDebug() << m_xmlReader.tokenString(); // never seen
                break;
            }
            case QXmlStreamReader::Characters: {
                qDebug() << m_xmlReader.tokenString() << m_xmlReader.text();
                break;
            }
            case QXmlStreamReader::StartElement: {
                qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
                for (QXmlStreamAttribute &attr: m_xmlReader.attributes()) {
                    qDebug() << " " << attr.name() << "==" << attr.value();
                }
                break;
            }
            case QXmlStreamReader::EndElement: {
                qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
                break;
            }
            default: {
                qDebug() << m_xmlReader.tokenString();
                break;
            }
            }
        }
    }
    return 0;
}
```
You see that Qt has no problem with correctly encoded data including the Euro sign in an attribute.
The problem is not QXmlStreamReader. Mangling the input is one option. Your output shows that the three-byte UTF-8 Euro has been digested and recognised correctly as an attribute value (the non-UTF console notwithstanding). This suggests the stream is being asked to read past the end of the input, possibly because the end condition of whatever loop is feeding the parser is broken.
-
@Frenk21 Short update.
According to the example on Wikipedia (https://en.wikipedia.org/wiki/UTF-8), the Euro sign is 3 bytes long in UTF-8:
€ U+20AC, code point bits 0010 0000 1010 1100, UTF-8 bits 11100010 10000010 10101100, UTF-8 hex E2 82 AC
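A minimal sketch, in plain C++ without Qt, of how the three bytes E2 82 AC follow from the code point U+20AC. The function name encodeUtf8 is made up for illustration, and the sketch does not reject surrogates or out-of-range values.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Illustration only: no validation of the input value.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx (the Euro sign lands here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0x20AC) gives a 3-byte string, while an ASCII character such as 'p' stays at 1 byte.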
Somehow QXmlStreamReader cannot handle this properly: 1-byte characters are handled fine, but longer ones are not.
As far as I can see from the traces, when QXmlStreamReader reads a value containing such a UTF-8 character, it does not advance the offset properly:
With "path":
```
BEFORE PT.patient_id V="path"/> </PT></PTL.R01>
AFTER </PT></PTL.R01>
TOKEN TYPE 4
ATTRIBUTE NAME V
ATTRIBUTE VALUE path
BEFORE </PT></PTL.R01>
AFTER </PT></PTL.R01>
TOKEN TYPE 5
```
With "€path":
```
BEFORE PT.patient_id V="Γé¼path"/> </PT></PTL.R01>
AFTER /> </PT></PTL.R01>
TOKEN TYPE 4
ATTRIBUTE NAME V
ATTRIBUTE VALUE Γé¼path <-- USING std::cout and this is probably the reason it is shown differently
BEFORE /> </PT></PTL.R01>
AFTER /> </PT></PTL.R01>
TOKEN TYPE 5
```
-
Hi,
@Frenk21 said in QtXmlStreamReader and UTF-8 multibytes characters:
```
QByteArray inputXmlData;
inputXmlData.append(QString::fromUtf8(inputXmlString.c_str()).toStdString().c_str());
m_xmlData.buffer().append(inputXmlData);
...
```
Just: why?
There are a lot of useless conversions being done here.
QXmlStreamReader can take a QByteArray, a QString or even a QIODevice, which QFile is, so why are you doing all these conversions?
-
@SGaist said in QtXmlStreamReader and UTF-8 multibytes characters:
inputXmlString
And to add to @SGaist's answer: what type is this (I assume std::string), and are you sure it's really UTF-8 encoded?
-
@SGaist Well, at the start I did not have any conversion, but as QXmlStreamReader was always failing I thought that maybe the input std::string did not hold a valid UTF-8 string.
The real question, though, is why it fails if I use a UTF-8 character longer than 1 byte.
-
Can you provide a minimal compilable example that shows this behaviour?
It's also unclear why you would use std::string in this case.
-
@Christian-Ehrlicher Well, I just put this XML, as you see it, into the std::string and then copy/add it to the QByteArray. I checked the XML and it is UTF-8 encoded. std::string is just an array of characters, so the "EUR" sign should be stored as 3 chars, or am I wrong? And if the QByteArray holds the "EUR" sign as 3 chars, do I need to do anything special so that QXmlStreamReader can handle it properly?
-
@Frenk21 The implementation was done by someone else. He wrote an abstract C++ layer so that different real XML parsers could be integrated later, and he used Qt's QXmlStreamReader as an example. Until now everything was working fine, as no one had tested it with multibyte characters.
Here is the code snippet of the parsing part:
```
// NOTE: class XmlElement holds the xml element info (name, node, children)
void XmlReader::parse(const std::string &xmlData)
{
    std::list<XmlElement> openElements;
    QMutexLocker locker(&m_mutex);

    // Add data to the XML reader buffer
    (void)m_xmlData.buffer().append(xmlData.c_str());

    // Read XML document from the XML data buffer
    bool end = false;
    while ((!m_xmlReader.atEnd()) && (!end)) {
        QXmlStreamReader::TokenType tokenType = m_xmlReader.readNext();
        switch (tokenType) {
        case QXmlStreamReader::NoToken: {
            // The reader has not yet read anything.
            break;
        }
        case QXmlStreamReader::Invalid: {
            // An error occurred, handle it
            end = true;
            break;
        }
        case QXmlStreamReader::StartElement: {
            // Read start of the element (element name and attributes)
            const std::string elementName = m_xmlReader.name().toString().toStdString();
            QXmlStreamAttributes qtAttributeList = m_xmlReader.attributes();
            for (int32_t i = 0; i < qtAttributeList.size(); ++i) {
                const QXmlStreamAttribute &attribute = qtAttributeList.at(i);
                const std::string attributeName = attribute.name().toString().toStdString();
                const std::string attributeValue = attribute.value().toString().toStdString();
            }
            // Open a new element
            openElements.push_back(XmlElement(elementName));
            break;
        }
        case QXmlStreamReader::EndElement: {
            // Check if the element is even open
            if (openElements.size() > 0U) {
                // Close the element
                XmlElement currentElement = openElements.back();
                openElements.pop_back();
                // Check if this is a child element or the root element
                if (openElements.size() > 0U) {
                    // This is a child element, add it to its parent
                    XmlElement &parentElement = openElements.back();
                    parentElement.addChildElement(currentElement);
                } else {
                    // This is the root element, save it and finish reading the XML document
                    end = true;
                }
            } else {
                end = true;
            }
            break;
        }
        case QXmlStreamReader::Characters: {
            // It reports characters also when whitespaces remain between XML tags and the
            // caller must decide if this is a valid
            const std::string text = m_xmlReader.text().toString().toStdString();
            // Append the text node to the currently open element's text node
            XmlElement &currentElement = openElements.back();
            currentElement.appendTextNode(text);
            break;
        }
        default: {
            // Not supported types are skipped
            break;
        }
        }
    }
}
```
-
@ChrisW67 Thank you very much. I assumed this could be the problem, so I will check the flow again and try to fix the stream handling. I am still not sure if the problem is std::string, QString or QByteArray. Would be fine to know what to expect in each of them when sign "EURO" is passed as multibyte utf8, but did not find anything in the QT documentation or maybe I missed something.
-
@Frenk21 said in QtXmlStreamReader and UTF-8 multibytes characters:
Would be fine to know what to expect in each of them when sign "EURO" is passed as multibyte utf8, but did not find anything in the QT documentation
std::string - no Qt, no encoding, just a bunch of bytes; it is up to you how to interpret them
QByteArray - just a bunch of bytes; it is up to you how to interpret them
QString - a UTF-16 encoded string
-
@Frenk21 Qt 5.15.2 or 6.3.1 on Linux with GCC 9.4.0. I just copy-n-pasted from my earlier post to create the test.xml.
I think you need to take the input you are given and, without imposing any character conversions at all, write it to a file in binary mode. Then inspect what is actually in it.
For my testing I added an XML processing instruction, but it also works without it.
```
chrisw@newton:/tmp/test$ od -ta -tx1 test.xml
0000000   <   ?   x   m   l  sp   v   e   r   s   i   o   n   =   "   1
         3c  3f  78  6d  6c  20  76  65  72  73  69  6f  6e  3d  22  31
0000020   .   0   "  sp   e   n   c   o   d   i   n   g   =   "   U   T
         2e  30  22  20  65  6e  63  6f  64  69  6e  67  3d  22  55  54
0000040   F   -   8   "   ?   >  nl   <   P   T   L   .   R   0   1   >
         46  2d  38  22  3f  3e  0a  3c  50  54  4c  2e  52  30  31  3e
0000060  nl  sp   <   H   D   R   >  nl  sp   <   H   D   R   .   c   o
         0a  20  3c  48  44  52  3e  0a  20  3c  48  44  52  2e  63  6f
0000100   n   t   r   o   l   _   i   d  sp   V   =   "   X   X   X   X
         6e  74  72  6f  6c  5f  69  64  20  56  3d  22  58  58  58  58
0000120   "   /   >  nl  sp   <   H   D   R   .   v   e   r   s   i   o
         22  2f  3e  0a  20  3c  48  44  52  2e  76  65  72  73  69  6f
0000140   n   _   i   d  sp   V   =   "   P   O   C   T   1   "   /   >
         6e  5f  69  64  20  56  3d  22  50  4f  43  54  31  22  2f  3e
0000160  nl  sp   <   H   D   R   .   c   r   e   a   t   i   o   n   _
         0a  20  3c  48  44  52  2e  63  72  65  61  74  69  6f  6e  5f
0000200   d   t   t   m  sp   V   =   "   2   0   0   1   -   1   1   -
         64  74  74  6d  20  56  3d  22  32  30  30  31  2d  31  31  2d
0000220   0   1   T   1   6   :   3   2   :   4   5   -   8   :   0   0
         30  31  54  31  36  3a  33  32  3a  34  35  2d  38  3a  30  30
0000240   "   /   >  nl  sp   <   /   H   D   R   >  nl  sp   <   P   T
         22  2f  3e  0a  20  3c  2f  48  44  52  3e  0a  20  3c  50  54
0000260   >  nl  sp   <   P   T   .   p   a   t   i   e   n   t   _   i
         3e  0a  20  3c  50  54  2e  70  61  74  69  65  6e  74  5f  69
0000300   d  sp   V   =   "   b stx   ,   p   a   t   h   "   /   >  nl
         64  20  56  3d  22  e2  82  ac  70  61  74  68  22  2f  3e  0a
0000320  sp   <   /   P   T   >  nl   <   /   P   T   L   .   R   0   1
         20  3c  2f  50  54  3e  0a  3c  2f  50  54  4c  2e  52  30  31
0000340   >  nl
         3e  0a
0000342
```
You should be looking for random rubbish before and after the xml you expected.
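If od is not at hand (for example on Windows), roughly the same inspection can be done from inside the program by dumping the raw bytes as hex before they reach the parser. This is only a sketch in plain C++; hexDump is a hypothetical helper, not part of Qt or of the code posted in this thread.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Return the bytes of `bytes` as lowercase hex pairs separated by spaces.
// No character conversion is applied: every byte is printed as stored.
std::string hexDump(const std::string &bytes)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof buf, "%02x", value);
        if (i != 0) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

For a buffer holding €path this gives e2 82 ac 70 61 74 68, the same byte values that od shows for the attribute value.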
-
@ChrisW67 I see, thanks for all your support and time. I will check it. I just copied what was posted here and tried it.
I see now that I should have been more specific at the start and also written which environment I am working in. As I mentioned, I am using Qt 5.15.2 on Windows 10, where I tried to build with mingw 8.1.0 64 bit and MSVC2020 (c++ compile 17.2.32526.322) 64 bit. On both I get Invalid. Sorry that I did not give this information before.
-
@ChrisW67 Hello Chris,
Your example with test.xml works on my side. I did a little more testing and I see where the problem is now. The problem is not in the string conversion. As I am using a TCP/IP stream for reading the XML messages, the error appears when we parse the next XML message. For example, 2 messages as shown below:
```
<?xml version="1.0" encoding="UTF-8"?>
<PTL.R01>
 <HDR>
 <HDR.control_id V="XXXX"/>
 <HDR.version_id V="POCT1"/>
 <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
 </HDR>
 <PT>
 <PT.patient_id V="€path"/>
 </PT>
</PTL.R01>
<?xml version="1.0" encoding="UTF-8"?>
<PTL.R01>
 <HDR>
 <HDR.control_id V="XXXX"/>
 <HDR.version_id V="POCT1"/>
 <HDR.creation_dttm V="2001-11-01T16:32:45-8:00"/>
 </HDR>
 <PT>
 <PT.patient_id V="€path"/>
 </PT>
</PTL.R01>
```
The reason is that without a multibyte character, the XML reader properly parses the first message and then the second one, reaches the end, and no leftovers remain in the buffer after the first message. By leftovers I mean that characterOffset() returns the proper location of the last parsed character.
But if I add a multibyte character and parse the first message, then "1>" remains in the buffer (characterOffset() returns the index of this location), and when I try to parse the second message I get the error: Premature end of document / Start tag expected. So I need to somehow improve the buffer/stream data handling and the adding/copying.
If, after parsing the first message, I manually increase the index (move the start of the buffer by +2), then parsing of the second message also works fine (the +2 just corrects the start of the input data buffer of the QXmlStreamReader).
So it seems to me that characterOffset() does not return the offset I expect: the Euro sign is 3 bytes long, but the index is increased by 1 only. If I add more multibyte characters, the difference grows, because it accumulates over the whole XML message for all multibyte characters.
What I need is the proper index of the end of the XML message, so that I can advance the buffer for the QXmlStreamReader and it can properly process the next message.
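One possible reading of the numbers above is that characterOffset() counts decoded characters (UTF-16 code units, as in QString) rather than bytes of the UTF-8 input. That is an assumption, but it would match the +2 for a single 3-byte Euro sign. Under that assumption, a character offset can be mapped back to a byte index by walking the UTF-8 data. The sketch below is plain C++ without Qt; utf8ByteOffset is a hypothetical helper and it expects well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Map an offset counted in UTF-16 code units to a byte index in `utf8`.
// Assumes `utf8` is well-formed UTF-8 (no validation is done here).
std::size_t utf8ByteOffset(const std::string &utf8, std::size_t charOffset)
{
    std::size_t units = 0; // UTF-16 code units seen so far
    std::size_t i = 0;     // byte index into utf8
    while (i < utf8.size() && units < charOffset) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::size_t len = 1;            // 0xxxxxxx
        if (lead >= 0xF0) {
            len = 4;                    // 11110xxx
        } else if (lead >= 0xE0) {
            len = 3;                    // 1110xxxx
        } else if (lead >= 0xC0) {
            len = 2;                    // 110xxxxx
        }
        // A 4-byte sequence becomes a surrogate pair, i.e. 2 code units.
        units += (len == 4) ? 2 : 1;
        i += len;
    }
    return i < utf8.size() ? i : utf8.size();
}
```

For a buffer that contains one Euro sign before the requested position, the returned byte index is 2 larger than the character offset, which is the same correction that was applied by hand above. Using the result in place of the raw characterOffset() value when trimming the QBuffer is not tested here.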
-
@Frenk21 I will just answer myself:
Here is a working solution for the example you provided:
```
...
case QXmlStreamReader::StartElement: {
    uint adjustDataBufferOffset = 0;
    qDebug() << m_xmlReader.tokenString() << m_xmlReader.qualifiedName();
    for (QXmlStreamAttribute &attr: m_xmlReader.attributes()) {
        qDebug() << " " << attr.name() << "==" << attr.value();
        // Get length of the values in bytes
        uint valueLenBytes = attr.value().size();
        uint valueCharCnt = 0;
        // Count UTF8 characters (one character size can vary from 1..4 bytes)
        for (auto oneChar: attr.value()) {
            if (((oneChar & 0x80) == 0) || ((oneChar & 0xc0) == 0xc0)) {
                ++valueCharCnt;
            }
        }
        if (valueLenBytes > valueCharCnt) {
            adjustDataBufferOffset += valueLenBytes - valueCharCnt;
        }
    }
    if (adjustDataBufferOffset > 0) {
        QByteArray &dataBuffer = m_xmlData.buffer();
        auto index = m_xmlReader.characterOffset();
        index += adjustDataBufferOffset;
        // FIX as xml reading does not advance properly to
        // the end of the element if multibyte chars are present
        // in the XML message
        dataBuffer = dataBuffer.mid(index);
    }
    break;
}
....
```