[Solved] QXmlStreamReader encoding guess



  • Hi,

    I'm parsing XML files using QXmlStreamReader (after reading them using QFile::readAll). Unfortunately I will have to deal with some files that declare the wrong encoding in the header line (<?xml version="1.0" encoding="UTF-16"?>).
    I've read that QTextCodec once had a function to guess the encoding of a QByteArray, but it was removed in Qt 4 because guessing encodings is unreliable.
    I also didn't find a function that lets me verify the encoding of data (something like QTextCodec::isThisLegalUTF8).

    However, I found that if I remove the header from the XML file altogether, QXmlStreamReader parses all the files I've tested correctly.
    That means QXmlStreamReader does guess the encoding (at least it interprets the BOM if it's there), right? Why use this unreliable guessing internally but not provide it as an API?
    Does anyone have a better idea for dealing with my dilemma than ignoring the header altogether? I'd rather use the declared encoding and resort to guesswork only if the declared encoding is definitely wrong, but as far as I can see I'd have to parse the file once, then, on error, try to work out whether the error came from the header line, and if so throw away the header and parse again. That seems awkward.

    Thanks in advance
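
On the missing "isThisLegalUTF8"-style check: a well-formedness test for UTF-8 is small enough to write by hand. What follows is a minimal sketch in plain C++, not a Qt API; `isValidUtf8` is a hypothetical name, and the data is assumed to be available as a `std::string` of raw bytes.

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8.
// Hypothetical helper for illustration, not part of Qt.
// Rejects overlong forms, UTF-16 surrogates and code points above U+10FFFF.
bool isValidUtf8(const std::string &bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;   // continuation bytes expected after b0
        unsigned long cp = 0;    // code point being assembled
        unsigned long min = 0;   // smallest code point legal for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min = 0x10000; }
        else return false;       // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;           // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;      // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // surrogate range
        if (cp > 0x10FFFF) return false;               // beyond Unicode
        i += extra + 1;
    }
    return true;
}
```

A pass does not prove that the data is UTF-8 (BOM-less UTF-16 text consisting only of ASCII characters also passes, since NUL bytes are valid UTF-8), but a failure does rule UTF-8 out.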



  • As you said, the problem is:

    bq. Unfortunately I will have to deal with some files that provide the wrong encoding in the header line (<?xml version="1.0" encoding="UTF-16"?>).


    As we all know, a plain text file does not record which charset it uses. That is why modern file formats such as XML and HTML declare their charset themselves. And Qt supports this very well, doesn't it?

    For Unicode-encoded text, the BOM can be used to guess the charset of a byte stream. QTextCodec, QTextStream, QXmlStreamReader, ... all support this, and a public API exists for it, such as

    @
    QTextCodec::codecForUtfText()
    @
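
To make the BOM check concrete, here is a rough sketch of the idea in plain C++. It is not Qt's implementation of codecForUtfText(); `encodingFromBom` is a hypothetical helper, and the returned names are only labels chosen for this example.

```cpp
#include <cstddef>
#include <string>

// Returns the encoding implied by a leading byte order mark,
// or an empty string if the data does not start with one.
// Hypothetical helper for illustration, not part of Qt.
std::string encodingFromBom(const std::string &data)
{
    // True if the first `len` bytes of data equal `bom`.
    auto startsWith = [&data](const char *bom, std::size_t len) {
        return data.size() >= len && data.compare(0, len, bom, len) == 0;
    };

    // Check the 4-byte BOMs first: FF FE 00 00 (UTF-32LE)
    // begins with FF FE, which is the UTF-16LE BOM.
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (startsWith("\xFE\xFF", 2))         return "UTF-16BE";
    if (startsWith("\xFF\xFE", 2))         return "UTF-16LE";
    return "";
}
```

An empty result only means that no BOM is present; it says nothing about which encoding the data actually uses.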



  • XML without an encoding explicitly stated in the header defaults to UTF-8 as the file encoding. So there's no guessing going on if QXmlStreamReader conforms to the standard, which I will assume here.



  • In fact, QXmlStreamReader guesses the encoding from the BOM when it starts reading the XML file; if no BOM exists, UTF-8 is used.

    Then it parses the "encoding" specified in the XML header, and if it's valid, the new codec is used.

    Debao
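
The second step, reading the encoding name from the declaration, can be sketched in plain C++ as below. This is an illustration only, not how QXmlStreamReader does it; `declaredEncoding` is a hypothetical helper that assumes the bytes are ASCII-compatible and that any BOM has already been stripped.

```cpp
#include <cstddef>
#include <string>

// Extracts the value of the encoding pseudo-attribute from an XML
// declaration, or returns "" if there is no declaration or no encoding.
// Hypothetical helper for illustration, not part of Qt.
// Assumes ASCII-compatible bytes with no leading BOM.
std::string declaredEncoding(const std::string &data)
{
    if (data.compare(0, 5, "<?xml") != 0) return "";   // no XML declaration
    const std::size_t end = data.find("?>");
    if (end == std::string::npos) return "";           // declaration never closes
    const std::string decl = data.substr(0, end);

    std::size_t pos = decl.find("encoding");
    if (pos == std::string::npos) return "";           // encoding not declared

    // The value may be wrapped in single or double quotes.
    pos = decl.find_first_of("\"'", pos);
    if (pos == std::string::npos) return "";
    const char quote = decl[pos];
    const std::size_t close = decl.find(quote, pos + 1);
    if (close == std::string::npos) return "";
    return decl.substr(pos + 1, close - pos - 1);
}
```

One option for files with unreliable headers is to compare this value with what the BOM implies and, when the two disagree, to prefer the BOM.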



  • Thanks for your replies.
    I think codecForUtfText() should work for my case. I don't know how I missed it :)

