How to automatically detect the codec of a text file?
-
[quote author="Volker" date="1324167608"]UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.[/quote]
That is understandable, but I am writing a program that handles all kinds of source files. I did not want to use special reading methods for specific file types.
-
Detecting the encoding of text files is mostly plain guessing.
For example, UTF-8 and Latin-1 are completely identical in the first 128 code points (0 to 127, the ASCII range). So you might have a file whose first non-ASCII character only appears after 5 MB. You would need to read that much of the file to discover it.
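To make that concrete, here is a minimal Python sketch (the function name `first_non_ascii_offset` and the chunk size are just made up for illustration). It reads the data in chunks and reports where the first byte above 0x7F sits. Everything before that offset looks exactly the same in UTF-8 and Latin-1, so a detector that stops reading earlier has nothing to go on.

```python
import io

def first_non_ascii_offset(stream, chunk_size=64 * 1024):
    """Return the offset of the first byte >= 0x80, or None for pure ASCII."""
    offset = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return None  # reached the end: nothing distinguishes UTF-8 from Latin-1
        if not chunk.isascii():  # bytes.isascii() needs Python 3.7+
            for i, byte in enumerate(chunk):
                if byte >= 0x80:
                    return offset + i
        offset += len(chunk)

# 5 MB of plain ASCII followed by a single "é" encoded as UTF-8 (bytes C3 A9).
data = b"a" * (5 * 1024 * 1024) + "é".encode("utf-8")
print(first_non_ascii_offset(io.BytesIO(data)))  # 5242880
```

Even once that byte is found, it only tells you the file is not plain ASCII. Which 8-bit or multi-byte encoding it actually is remains a guess.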
-
"EBCDIC":http://en.wikipedia.org/wiki/EBCDIC :-P
-
Unfortunately, the bytes that make up the byte order marks of UTF-8 and UTF-16 are also valid characters in 8-bit code pages such as ISO-8859-1.
UTF-8:
BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)
UTF-16:
Big Endian BOM = FE FF = þÿ
Little Endian BOM = FF FE = ÿþ
Using other 8-bit code pages just yields other valid screen representations.
While having these three or two bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
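A rough Python sketch of such a BOM check could look like the following. The function name `guess_encoding` and the "reason" strings are invented for this example, and the fallback to Latin-1 is an arbitrary assumption rather than a recommendation. The result is a guess, not a detection: a Latin-1 file that happens to start with "þÿ" gets reported as UTF-16.

```python
import codecs

# Only the UTF-8 and UTF-16 BOMs listed above are checked in this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]

def guess_encoding(data):
    """Return (encoding_name, reason). This is a guess, never a proof."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "BOM hint"
    try:
        data.decode("utf-8")  # strict decoding; pure ASCII also passes
        return "utf-8", "decodes cleanly"
    except UnicodeDecodeError:
        # Assumed fallback: Latin-1 can decode any byte sequence,
        # so this branch says nothing about the real encoding.
        return "latin-1", "fallback"

print(guess_encoding(b"\xef\xbb\xbfhello"))       # ('utf-8-sig', 'BOM hint')
print(guess_encoding("þÿ".encode("latin-1")))     # ('utf-16-be', 'BOM hint'), a false positive
print(guess_encoding("café".encode("latin-1")))   # ('latin-1', 'fallback')
```

The second call shows the "not sufficient" part: the input is legitimate Latin-1 text, yet the BOM check claims UTF-16.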
If it were that easy to detect a file's encoding, there wouldn't be so much software that fails miserably at that job...