
How to automatically detect the codec of a text file?



  • Is it possible to automatically determine which encoding is used in a file?
    Or can you somehow determine that an error occurred while decoding text from the file?
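    For the second part, a minimal sketch of what I have in mind, assuming QTextCodec::ConverterState counts invalid byte sequences (QTextCodec is in QtCore in Qt 4/5; in Qt 6 it moved to the Core5Compat module):

        #include <QByteArray>
        #include <QString>
        #include <QTextCodec>

        // Try to decode 'data' with the given codec and report whether any
        // byte sequences were invalid for it. A failure hints that the codec
        // guess was wrong; a clean pass proves nothing, though.
        bool decodesCleanly(const QByteArray &data, QTextCodec *codec,
                            QString *result)
        {
            QTextCodec::ConverterState state;
            *result = codec->toUnicode(data.constData(), data.size(), &state);
            return state.invalidChars == 0;
        }

        // Usage: QString text;
        //        bool ok = decodesCleanly(bytes,
        //                                 QTextCodec::codecForName("UTF-8"),
        //                                 &text);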



  • No, it is not. It's the curse of "plain" text files: they don't exist (http://www.joelonsoftware.com/articles/Unicode.html). But I guess you figured that out just based on posting this question.



  • [quote author="Andre" date="1324122197"]No, it is not. It's the curse of "plain" text files: they don't exist (http://www.joelonsoftware.com/articles/Unicode.html). But I guess you figured that out just based on posting this question.[/quote]
    Yes, I see. A full solution to this problem would mean writing something like OCR for encodings, which is beyond what is necessary. Right now I am writing a project that generates code listings from sources, and I ran into the fact that UI files are encoded in UTF-8 while the sources use the native Windows encoding. Does Qt already have a solution for this?



  • UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.
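    For illustration, a sketch of reading a .ui file that way (the file name is made up); QXmlStreamReader evaluates the encoding attribute of the <?xml version="1.0" encoding="UTF-8"?> declaration on its own, so no codec has to be set manually:

        #include <QDebug>
        #include <QFile>
        #include <QXmlStreamReader>

        int main()
        {
            QFile file("mainwindow.ui");   // hypothetical .ui file
            if (!file.open(QIODevice::ReadOnly))
                return 1;

            QXmlStreamReader xml(&file);   // encoding taken from the XML header
            while (!xml.atEnd()) {
                if (xml.readNext() == QXmlStreamReader::StartElement)
                    qDebug() << xml.name().toString();
            }
            return xml.hasError() ? 1 : 0;
        }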



  • [quote author="Volker" date="1324167608"]UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.[/quote]
    That is understandable, but I am writing a program for all kinds of sources. I did not want to use special reading methods for specific types of sources.



  • Detecting the encoding of text files is mostly plain guessing.

    For example, UTF-8 and Latin-1 are completely identical in the first 128 code points (0-127). So you might have a file whose first non-ASCII character appears only after 5 MB of text. You would need to read up to that point to discover this.
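    To make the guessing concrete, a rough sketch for exactly this UTF-8 vs. Latin-1 case. It is a heuristic only (it ignores overlong sequences and other fine points), and it may indeed have to walk the whole file before it can tell anything:

        #include <QByteArray>

        // Returns "ASCII" if every byte is < 0x80 (UTF-8 and Latin-1 then
        // decode identically), "UTF-8" if all non-ASCII bytes form valid
        // UTF-8 multi-byte sequences, and "Latin-1" otherwise.
        const char *guessEncoding(const QByteArray &data)
        {
            bool sawNonAscii = false;
            for (int i = 0; i < data.size(); ) {
                const unsigned char b = static_cast<unsigned char>(data.at(i));
                if (b < 0x80) { ++i; continue; }          // plain ASCII byte
                sawNonAscii = true;
                const int extra = (b & 0xE0) == 0xC0 ? 1  // 110xxxxx: 1 trail byte
                                : (b & 0xF0) == 0xE0 ? 2  // 1110xxxx: 2 trail bytes
                                : (b & 0xF8) == 0xF0 ? 3  // 11110xxx: 3 trail bytes
                                : -1;                     // invalid lead byte
                if (extra < 0 || i + extra >= data.size())
                    return "Latin-1";
                for (int j = 1; j <= extra; ++j)          // trail bytes: 10xxxxxx
                    if ((static_cast<unsigned char>(data.at(i + j)) & 0xC0) != 0x80)
                        return "Latin-1";
                i += extra + 1;
            }
            return sawNonAscii ? "UTF-8" : "ASCII";
        }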



  • Basically all text encodings are the same for the first 128 code points :-)



  • Yeah, right, ok. All text codecs that are relevant to your work today, I meant.



  • Well, that's not true -- take UTF-16, for example.



  • Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).



  • But to talk about code points you need the encoding beforehand, which is where the thread started :-)



  • Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)



  • Unfortunately, the Unicode standard doesn't make the BOM mandatory :(



  • Unfortunately, the bytes of the UTF-8 and UTF-16 byte order marks are valid characters in 8-bit encodings too.

    UTF-8:
    BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

    UTF-16:
    Big endian BOM = FE FF = þÿ
    Little endian BOM = FF FE = ÿþ

    Using other 8-bit code pages just yields other valid screen representations.

    While having these three or two bytes as the very first bytes in a file is a strong sign of the use of Unicode in the respective file, it is neither necessary (the BOM is not mandatory) nor sufficient to identify a UTF-8/16 encoded file.

    If it were so easy to detect a file's encoding, there wouldn't be so much software that fails miserably at that job...
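    Still, sniffing those BOM bytes is about the best cheap hint there is. A sketch (the codec names are the ones QTextCodec understands; an empty result means "no BOM, start guessing"):

        #include <QByteArray>

        // Inspect the first bytes of a file for a Unicode byte order mark.
        // Returns a codec name, or an empty array when no BOM is present,
        // which, as said above, proves nothing about the actual encoding.
        QByteArray codecNameFromBom(const QByteArray &head)
        {
            if (head.startsWith("\xEF\xBB\xBF"))
                return "UTF-8";
            if (head.startsWith("\xFE\xFF"))
                return "UTF-16BE";
            if (head.startsWith("\xFF\xFE"))
                return "UTF-16LE";   // note: also a prefix of the UTF-32LE BOM
            return QByteArray();     // no BOM: fall back to guessing
        }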



  • So back to square 1: There is no such thing as plain text. :-)

