How can I automatically detect the codec of a text file?
-
Basically all text encodings are the same for the first 127 code points :-)
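A quick way to see this claim in action: the same ASCII-range string produces identical bytes under several common 8-bit encodings (the encoding names below are just the usual Python codec aliases).

```python
# The same ASCII-range text encodes to identical bytes in ASCII,
# UTF-8, Latin-1 and Windows-1252 -- they all agree below U+0080.
text = "plain old text"
encoded = {enc: text.encode(enc) for enc in ("ascii", "utf-8", "latin-1", "cp1252")}
assert len(set(encoded.values())) == 1  # all byte sequences are identical
```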
-
"EBCDIC":http://en.wikipedia.org/wiki/EBCDIC :-P
-
Yeah, right, ok. I meant all text codecs that are relevant to your work today.
-
Well, that's not true -- take UTF-16, for example.
-
Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
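To illustrate the distinction: the code points are the same, but the bytes are not. A small sketch (Python, using the built-in codecs):

```python
text = "abc"
utf16 = text.encode("utf-16-be")  # big-endian, without a BOM
# Each ASCII-range code point becomes two bytes: a zero byte plus the ASCII byte.
assert utf16 == b"\x00a\x00b\x00c"
assert len(utf16) == 2 * len(text)
# Decoded back, both encodings yield the very same code points.
assert utf16.decode("utf-16-be") == text.encode("utf-8").decode("utf-8")
```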
-
But to talk about code points you need the encoding beforehand, which is where the thread started :-)
-
Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)
-
Unfortunately the Unicode standard doesn't make the BOM required :(
-
Unfortunately, the byte order marks of UTF-8 and UTF-16 are also valid byte sequences in 8-bit encodings.

UTF-8 BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)
UTF-16:
Big Endian BOM = FE FF = þÿ
Little Endian BOM = FF FE = ÿþ

Using other 8-bit code pages just yields other valid screen representations.
While having these three or two bytes as the very first bytes of a file is a strong sign that the file uses Unicode, it is neither necessary (the BOM is not mandatory) nor sufficient to identify a UTF-8/16 encoded file.
If it was so easy to detect a file's encoding, there wouldn't be so much software that fails miserably on that job...
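For what it's worth, the BOM-sniffing part really is the easy bit. A minimal sketch (the `sniff_bom` helper is hypothetical, not part of any library; it only covers the UTF-8/16 BOMs discussed above and deliberately returns nothing when no BOM is present):

```python
import codecs

def sniff_bom(data: bytes):
    """Guess an encoding from a leading BOM, if any (hypothetical helper)."""
    # Check the three-byte UTF-8 BOM first; the UTF-16 checks only look
    # at two bytes, so order matters.
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return None  # no BOM: the encoding stays ambiguous

assert sniff_bom(b"\xef\xbb\xbfhello") == "utf-8-sig"
assert sniff_bom(b"\xfe\xff\x00h") == "utf-16-be"
assert sniff_bom(b"plain") is None
```

The `None` branch is exactly the hard case this thread is about: without a BOM, the bytes alone don't tell you the encoding.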
-
So back to square 1: There is no such thing as plain text. :-)