How to automatically detect the codec of a text file?
-
wrote on 17 Dec 2011, 11:31 last edited by
Is it possible to automatically determine which encoding is used in a file?
Or can you at least somehow determine that an error occurred while decoding text from a file? -
wrote on 17 Dec 2011, 11:43 last edited by
No, there is no reliable way. It's the curse of "plain" text files: "they don't exist":http://www.joelonsoftware.com/articles/Unicode.html . But I guess you figured that out already, given that you posted this question.
-
wrote on 17 Dec 2011, 14:01 last edited by
[quote author="Andre" date="1324122197"]No, there is not. It's the curse of "plain" text files: "they don't exist":http://www.joelonsoftware.com/articles/Unicode.html . But I guess you figured that out just based on posting this question. [/quote]
Yes, I see; solving this problem properly would mean writing something like OCR for encodings, which is beyond what's necessary here. Right now I'm writing a program that generates code listings from source files, and I ran into the fact that UI files are encoded in UTF-8 while the sources use the native Windows encoding. Is there already a solution or workaround for this in Qt? -
wrote on 18 Dec 2011, 00:20 last edited by
UI files are XML and are thus read and written with the appropriate XML classes, which evaluate the encoding declared in the XML header. That's completely different from regular C/C++ source files.
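For illustration, a minimal sketch of that mechanism (not what uic does internally, just the general idea; the file name is made up): QXmlStreamReader resolves the encoding on its own, from the BOM and/or the encoding="..." pseudo-attribute of the XML declaration, so the caller never has to guess.

```cpp
#include <QFile>
#include <QXmlStreamReader>
#include <QDebug>

int main()
{
    QFile file("mainwindow.ui");            // hypothetical .ui file
    if (!file.open(QIODevice::ReadOnly))
        return 1;

    QXmlStreamReader xml(&file);
    while (!xml.atEnd()) {
        xml.readNext();
        // documentEncoding() is valid while the reader sits on StartDocument
        if (xml.isStartDocument())
            qDebug() << "declared encoding:" << xml.documentEncoding().toString();
    }
    return xml.hasError() ? 1 : 0;
}
```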
-
wrote on 18 Dec 2011, 11:47 last edited by
[quote author="Volker" date="1324167608"]UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.[/quote]
That's understandable, but I'm writing a program for all kinds of sources. I didn't want to use special reading methods for specific types of sources. -
wrote on 18 Dec 2011, 13:34 last edited by
Detecting the encoding of text files is mostly plain guessing.
For example, UTF-8 and Latin-1 are completely identical for the first 128 code points. So you might have a file whose first non-ASCII character appears only after 5 MB of text, and you would need to read that far to discover it.
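To make the "plain guessing" concrete, here is a minimal sketch of the usual fallback heuristic, assuming Qt 4's QTextCodec API: decode the whole buffer as UTF-8 and fall back to Latin-1 when invalid sequences turn up. Note that it has to look at the entire file, for exactly the reason given above.

```cpp
#include <QFile>
#include <QString>
#include <QTextCodec>

// Guess between UTF-8 and Latin-1: pure ASCII decodes cleanly as UTF-8,
// while any byte sequence at all is "valid" Latin-1, so Latin-1 is the
// last-resort fallback.
QString readWithGuessedCodec(const QString &fileName)
{
    QFile file(fileName);
    if (!file.open(QIODevice::ReadOnly))
        return QString();

    const QByteArray data = file.readAll();

    QTextCodec::ConverterState state;
    QTextCodec *utf8 = QTextCodec::codecForName("UTF-8");
    const QString text = utf8->toUnicode(data.constData(), data.size(), &state);

    if (state.invalidChars == 0)
        return text;                        // decodes cleanly as UTF-8

    return QTextCodec::codecForName("ISO 8859-1")->toUnicode(data);
}
```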
-
wrote on 18 Dec 2011, 14:21 last edited by
Basically all text encodings are the same for the first 128 code points :-)
-
wrote on 18 Dec 2011, 14:35 last edited by
"EBCDIC":http://en.wikipedia.org/wiki/EBCDIC :-P
-
wrote on 18 Dec 2011, 14:36 last edited by
Yeah, right, ok. I meant: all text codecs that are relevant to your work today.
-
wrote on 18 Dec 2011, 14:42 last edited by
Well, that's not true -- take UTF-16, for example.
-
wrote on 18 Dec 2011, 14:46 last edited by
Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
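A quick worked example of that "for the most part" caveat, using QString (whose internal storage is UTF-16): a code point outside the Basic Multilingual Plane takes two 16-bit code units, i.e. four bytes.

```cpp
#include <QString>
#include <QDebug>

int main()
{
    const uint gClef = 0x1D11E;                  // U+1D11E, outside the BMP
    const QString s = QString::fromUcs4(&gClef, 1);

    qDebug() << s.size();                               // 2 -- two code units
    qDebug() << QString::number(s.at(0).unicode(), 16); // "d834" (high surrogate)
    qDebug() << QString::number(s.at(1).unicode(), 16); // "dd1e" (low surrogate)
    return 0;
}
```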
-
wrote on 18 Dec 2011, 14:56 last edited by
But to talk about code points you need the encoding beforehand, which is where the thread started :-)
-
wrote on 18 Dec 2011, 16:48 last edited by
Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)
-
wrote on 18 Dec 2011, 17:02 last edited by
Unfortunately the Unicode standard doesn't make the BOM required :(
-
wrote on 18 Dec 2011, 22:57 last edited by
Unfortunately, the bytes of the UTF-8 and UTF-16 byte order marks are valid characters in 8-bit codepages such as Latin-1, too.

UTF-8:
BOM = EF BB BF =  (in ISO-8859-1 = Latin-1)

UTF-16:
Big endian BOM = FE FF = þÿ
Little endian BOM = FF FE = ÿþ

Using other 8-bit code pages just yields other valid screen representations.
While having these two or three bytes as the very first bytes of a file is a strong sign of Unicode being used in that file, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
If it was so easy to detect a file's encoding, there wouldn't be so much software that fails miserably on that job...
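For completeness, the BOM sniffing itself is essentially a one-liner in Qt 4.6+; this sketch (the file name is made up) uses QTextCodec::codecForUtfText(), which inspects the leading bytes (EF BB BF, FE FF, FF FE, ...) and picks the matching UTF codec. When no BOM is found it simply returns the fallback codec you pass in, and when the first bytes merely happen to look like a BOM it will guess wrong: the caveat above, expressed in code.

```cpp
#include <QFile>
#include <QTextCodec>
#include <QDebug>

int main()
{
    QFile file("listing.txt");               // hypothetical input file
    if (!file.open(QIODevice::ReadOnly))
        return 1;

    const QByteArray head = file.peek(4);    // longest BOM (UTF-32) is 4 bytes
    QTextCodec *fallback = QTextCodec::codecForName("ISO 8859-1");
    QTextCodec *codec = QTextCodec::codecForUtfText(head, fallback);

    qDebug() << "detected codec:" << codec->name();
    return 0;
}
```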
-
wrote on 19 Dec 2011, 05:54 last edited by
So back to square 1: There is no such thing as plain text. :-)