How to really save XML special chars to UTF-8
-
I am having trouble writing a special UTF-8 character into an XML document. The character comes from a UTF-8 encoded XML document, where it is encoded on 4 bytes. I simply need to save it in another XML document.
Depending on the method I use, I get either:
- a string that does not represent my special char at all
- an invalid entity
- an empty element
while I would expect to have my special char coded on 4 bytes again.
@
QDomDocument xmlDoc;

// create a string containing an utf-8 char encoded on 4 bytes
// (note, this is a valid char coming from a valid XML file encoded in UTF-8)
QByteArray originalSpecialChar;
originalSpecialChar.append(0xF0);
originalSpecialChar.append(0x9D);
originalSpecialChar.append(0x8C);
originalSpecialChar.append(0x86);

// put it in a string (thus converted in UNICODE but it keeps the right character)
QString originalSpecialCharInString = QString::fromUtf8(originalSpecialChar.constData(), 4);

{ // add this string into a new XML doc (encoded in UTF-8)
    xmlDoc.appendChild(xmlDoc.createProcessingInstruction("xml", "version=\"1.0\" encoding=\"UTF-8\""));
    QDomElement rootNode = xmlDoc.createElement("RootNode");
    xmlDoc.appendChild(rootNode);
    QDomText textNode = xmlDoc.createTextNode(originalSpecialCharInString);
    rootNode.appendChild(textNode);

    // at this point, the specialChar is still correct in the QDomDocument
    // (so the conversion from UTF-8 -> Unicode -> UTF-8 actually works !)
    if (textNode.nodeValue().toUtf8() != originalSpecialChar)
        qDebug() << "invalid (1)"; // this does not show
}

{ // save the xml doc into a QByteArray (using save)
    QByteArray xmlContent;
    QTextStream textStream(&xmlContent);
    xmlDoc.save(textStream, 0, QDomNode::EncodingFromDocument);
    // note: same result if I force the textStream codec to UTF-8 and use EncodingFromTextStream
    qDebug() << xmlContent;
    // shows <?xml version="1.0" encoding="UTF-8"?><RootNode>#xdf06;</RootNode>
    // the node contains the string "#xdf06". This is really not the character I expect
}

{ // save with toString()
    qDebug() << xmlDoc.toString(0);
    // shows <?xml version="1.0" encoding="UTF-8"?><RootNode>�</RootNode>
    // Qt is actually able to read this document but, not a C# client because it actually contains an invalid entity.
    // If I use QDomImplementation::DropInvalidChars, the element is empty so, Qt knows it is invalid.
    qDebug() << xmlDoc.toString(0).toUtf8();
    // shows <?xml version="1.0" encoding="UTF-8"?><RootNode>#xdf06;</RootNode>
    // it does not help !
}

// what I would expect (this is actually what my original xml file looked like):
// <?xml version="1.0" encoding="UTF-8"?><RootNode>my original char coded on 4 bytes in the utf-8 doc</RootNode>
@
-
I can confirm this behavior and would regard it as a bug.
All the string handling is OK; the error occurs while the XML document is being written, because the writer does not take surrogate pairs into account. The character causing the trouble lies outside the Basic Multilingual Plane and needs two QChars in the QString, and the XML writer does not handle that case.
I would recommend opening an issue in the "public bug tracker":https://bugreports.qt-project.org/
-
Bug filed here:
https://bugreports.qt-project.org/browse/QTBUG-25291
-
Thanks for the bug report. Just as a side note: to format code in Jira just wrap it between two {code} tags.