
How to automatically detect the codec text file?

16 Posts 4 Posters 13.1k Views
    Hronom
    wrote on 17 Dec 2011, 11:31 last edited by
    #1

    Is it possible to automatically determine which encoding a file uses?
    Or can you at least detect that an error occurred while decoding the text from a file?

      andre
      wrote on 17 Dec 2011, 11:43 last edited by
      #2

      No, there is not. It's the curse of "plain" text files: "they don't exist":http://www.joelonsoftware.com/articles/Unicode.html . But I guess you figured that out just based on posting this question.

        Hronom
        wrote on 17 Dec 2011, 14:01 last edited by
        #3

        [quote author="Andre" date="1324122197"]No, there is not. It's the curse of "plain" text files: "they don't exist":http://www.joelonsoftware.com/articles/Unicode.html . But I guess you figured that out just based on posting this question. [/quote]
        Yes, I see. Solving this properly seems to require writing something like an OCR for encodings, which goes beyond what I need. I am currently writing a project that generates a code listing from source files, and I ran into the fact that UI files are encoded in UTF-8 while the source files use the native Windows encoding. Does Qt perhaps already have a solution or a recommendation for this?

          goetz
          wrote on 18 Dec 2011, 00:20 last edited by
          #4

          UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.

          http://www.catb.org/~esr/faqs/smart-questions.html

            Hronom
            wrote on 18 Dec 2011, 11:47 last edited by
            #5

            [quote author="Volker" date="1324167608"]UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.[/quote]
            Understood, but I am writing a program for all kinds of source files. I did not want to use a separate reading method for each specific type of source.

              goetz
              wrote on 18 Dec 2011, 13:34 last edited by
              #6

              Detecting the encoding of text files is mostly plain guessing.

              For example, UTF-8 and Latin-1 are completely identical for the first 128 code points (0 to 127, i.e. plain ASCII). So a file might contain its first non-ASCII character only after 5 MB, and you would need to read that much text to discover it.

                andre
                wrote on 18 Dec 2011, 14:21 last edited by
                #7

                Basically all text encodings are the same for the first 128 code points :-)

                  goetz
                  wrote on 18 Dec 2011, 14:35 last edited by
                  #8

                  "EBCDIC":http://en.wikipedia.org/wiki/EBCDIC :-P

                    andre
                    wrote on 18 Dec 2011, 14:36 last edited by
                    #9

                    Yeah, right, OK. All text codecs that are relevant to your work today, I meant.

                      dangelog
                      wrote on 18 Dec 2011, 14:42 last edited by
                      #10

                      Well, that's not true -- take UTF-16, for example.

                      Software Engineer
                      KDAB (UK) Ltd., a KDAB Group company

                        andre
                        wrote on 18 Dec 2011, 14:46 last edited by
                        #11

                        Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
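
The code point versus byte distinction can be sketched in plain C++. The helper `toUtf16LE` below is hypothetical and assumes its argument is a valid Unicode scalar value (not a surrogate, not above U+10FFFF). It shows that `A` is still code point 65, but occupies two bytes, and that code points outside the Basic Multilingual Plane need two 16-bit units (the "for the most part" above).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as UTF-16, little endian, no BOM.
// Assumption: cp is a valid Unicode scalar value.
std::string toUtf16LE(char32_t cp)
{
    std::string out;
    // Appends one 16-bit code unit, low byte first.
    auto put16 = [&out](std::uint32_t unit) {
        out.push_back(static_cast<char>(unit & 0xFF));
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));
    };

    const std::uint32_t value = static_cast<std::uint32_t>(cp);
    if (value < 0x10000) {
        put16(value);                       // one code unit, two bytes
    } else {
        const std::uint32_t v = value - 0x10000;
        put16(0xD800 + (v >> 10));          // high surrogate
        put16(0xDC00 + (v & 0x3FF));        // low surrogate
    }
    return out;
}
```

So `toUtf16LE(U'A')` yields the bytes `41 00`, whereas ASCII, Latin-1 and UTF-8 all store the same code point as the single byte `41`.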

                          goetz
                          wrote on 18 Dec 2011, 14:56 last edited by
                          #12

                          But to talk about code points you need the encoding beforehand, which is where the thread started :-)

                            andre
                            wrote on 18 Dec 2011, 16:48 last edited by
                            #13

                            Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

                              dangelog
                              wrote on 18 Dec 2011, 17:02 last edited by
                              #14

                              Unfortunately, the Unicode standard doesn't make the BOM mandatory :(

                                goetz
                                wrote on 18 Dec 2011, 22:57 last edited by
                                #15

                                Unfortunately, the byte order marks of UTF-8 and UTF-16 are also valid character sequences in 8-bit extended-ASCII encodings.

                                UTF-8
                                BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

                                UTF-16:
                                Big Endian BOM = FE FF = þÿ
                                Little Endian BOM = FF FE = ÿþ

                                Other 8-bit code pages just yield other valid screen representations of the same bytes.

                                While having these three or two bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.

                                If it was so easy to detect a file's encoding, there wouldn't be so much software that fails miserably on that job...
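
For completeness, a minimal sketch of such a BOM check in plain C++ (no Qt). The names `BomHint` and `sniffBom` are hypothetical. In line with the caveats above, the result is only a hint: a Latin-1 file that really starts with `þÿ` would be reported as UTF-16 big endian, and a file without a BOM tells you nothing.

```cpp
#include <string>

enum class BomHint { None, Utf8, Utf16BE, Utf16LE };

// Hypothetical helper: inspects the first bytes of a file for a byte order mark.
// The result is a hint, not a proof of the file's encoding.
BomHint sniffBom(const std::string &head)
{
    if (head.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return BomHint::Utf8;
    if (head.compare(0, 2, "\xFE\xFF") == 0)
        return BomHint::Utf16BE;
    // Note: FF FE is also the start of the UTF-32 little endian BOM (FF FE 00 00),
    // which this sketch does not try to distinguish.
    if (head.compare(0, 2, "\xFF\xFE") == 0)
        return BomHint::Utf16LE;
    return BomHint::None;
}
```

When the hint is `None`, the remaining options are a declared encoding (as in XML), a validity check of the content, or plain guessing.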

                                  andre
                                  wrote on 19 Dec 2011, 05:54 last edited by
                                  #16

                                  So back to square 1: There is no such thing as plain text. :-)
