How to automatically detect the codec of a text file?

goetz

Detecting the encoding of text files is mostly plain guessing.

For example, UTF-8 and Latin-1 are byte-for-byte identical for the first 128 code points (the ASCII range, 0 to 127). So you might have a file whose first non-ASCII character only shows up after 5 MB. You would need to read that much text to discover it.
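To illustrate the point, here is a minimal Python sketch. The helper name `first_non_ascii_offset` and the sample data are made up for this example. It shows that an ASCII-only prefix decodes identically as UTF-8 and Latin-1, and that the only way to find the first telling byte is to scan up to it.

```python
def first_non_ascii_offset(data: bytes) -> int:
    """Return the offset of the first byte >= 0x80, or -1 if the data is pure ASCII."""
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            return offset
    return -1


# Hypothetical sample: a long ASCII-only prefix followed by one Latin-1 encoded line.
ascii_prefix = b"plain ASCII text\n" * 1000
sample = ascii_prefix + "caf\u00e9\n".encode("latin-1")

# The ASCII-only part gives the same text under both encodings,
# so it carries no information about which one was used.
same_text = ascii_prefix.decode("utf-8") == ascii_prefix.decode("latin-1")

# Only the byte at this offset can tell the two encodings apart.
offset = first_non_ascii_offset(sample)
```

For a real file you would scan in chunks rather than load everything into memory, but the worst case stays the same: the whole file has to be read.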

andre

Basically all text encodings are the same for the first 128 code points :-)

goetz

"EBCDIC":http://en.wikipedia.org/wiki/EBCDIC :-P

andre

Yeah, right, ok. All text codecs that are relevant to your work today, I meant.

dangelog

Well, that's not true -- take UTF-16, for example.

andre

Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
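A small Python sketch of the difference between code points and bytes (the helper name `encoded_forms` is only for illustration): the code point numbers are the same whatever the encoding, while the bytes on disk differ.

```python
def encoded_forms(text: str) -> dict:
    """Encode the same text with several codecs and return the raw bytes of each."""
    codecs_to_try = ("latin-1", "utf-8", "utf-16-le", "utf-16-be")
    return {name: text.encode(name) for name in codecs_to_try}


# The code point of "A" is 65 no matter which encoding is used later.
code_point = ord("A")

forms = encoded_forms("A")
# latin-1 and utf-8 store "A" as the single byte 0x41;
# the UTF-16 variants store it as two bytes, one of them 0x00.
```

So the statement holds at the code point level, but a detector only ever sees the bytes.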

goetz

But to talk about code points you need the encoding beforehand, which is where the thread started :-)

andre

Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

dangelog

Unfortunately the Unicode standard doesn't require the BOM :(

goetz

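Unfortunately, the bytes that make up the byte order marks of UTF-8 and UTF-16 are also valid characters in 8-bit encodings such as Latin-1. (ASCII itself is only 7 bits, so strictly speaking these are not ASCII code points.)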

UTF-8:
BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

UTF-16:
Big Endian BOM = FE FF = þÿ
Little Endian BOM = FF FE = ÿþ

Other 8-bit code pages generally just map the same bytes to different printable characters.

While having these two or three bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
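As a rough Python sketch of such a check (the function name `sniff_bom` and the returned labels are my own choice, not any library's API): it looks for the three BOMs listed above and returns a hint, nothing more.

```python
import codecs

# BOM byte sequences from the standard library, paired with a label for the hint.
BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),  # FF FE
)


def sniff_bom(data: bytes):
    """Return a label if the data starts with a known BOM, otherwise None.

    The result is only a hint: a missing BOM proves nothing, and a Latin-1
    file may legitimately start with the very same bytes.
    """
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


# Not necessary: perfectly valid UTF-8 without a BOM yields no hint.
no_bom = sniff_bom("h\u00e9llo".encode("utf-8"))

# Not sufficient: Latin-1 text that begins with the characters U+00FE U+00FF
# starts with the bytes FE FF and is reported as UTF-16 BE.
false_positive = sniff_bom("\u00fe\u00ffabc".encode("latin-1"))
```

In practice a hit is a good reason to try that decoder first, but the decoded result still has to be sanity-checked.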

If it were that easy to detect a file's encoding, there wouldn't be so much software that fails miserably at the job...
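To show how such a guess goes wrong, here is a minimal Python sketch of one common heuristic. This is an assumption of mine for illustration, not something proposed earlier in the thread: try a strict UTF-8 decode and fall back to Latin-1 if it fails. The name `guess_encoding` is hypothetical.

```python
def guess_encoding(data: bytes) -> str:
    """Guess 'utf-8' if the bytes decode strictly as UTF-8, else fall back to 'latin-1'.

    Latin-1 assigns a character to every byte value, so the fallback never
    fails to decode. That is exactly why it cannot confirm anything.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin-1"
    return "utf-8"


# Works for the easy cases: U+00E9 encoded either way.
utf8_guess = guess_encoding("\u00e9".encode("utf-8"))      # bytes C3 A9
latin1_guess = guess_encoding("\u00e9".encode("latin-1"))  # byte E9

# Fails for Latin-1 text that happens to be valid UTF-8: the two characters
# U+00C3 U+00A9 encode to C3 A9 in Latin-1, the same bytes as UTF-8 U+00E9.
wrong_guess = guess_encoding("\u00c3\u00a9".encode("latin-1"))
```

The last case is guessed as UTF-8 even though the file was written as Latin-1, and nothing in the bytes alone can tell the two apart.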

andre

So back to square 1: There is no such thing as plain text. :-)