How to automatically detect the codec of a text file?

    Hronom wrote (#1):

    Is it possible to automatically determine which encoding a text file uses?
    Or, failing that, is there a way to detect that an error occurred while decoding the text from a file?

      andre wrote (#2):

      No, there is not. It's the curse of "plain" text files: they don't exist (http://www.joelonsoftware.com/articles/Unicode.html). But I guess you figured that out just based on posting this question.

        Hronom wrote (#3):

        [quote author="Andre" date="1324122197"]No, there is not. It's the curse of "plain" text files: "they don't exist":http://www.joelonsoftware.com/articles/Unicode.html . But I guess you figured that out just based on posting this question. [/quote]
        Yes, I see. Solving this problem in general would mean writing something like OCR for encodings, which goes beyond what I need. I am currently writing a project that generates code listings from source files, and I ran into the fact that UI files are encoded in UTF-8 while the other source files use the native Windows encoding. Maybe Qt already has a solution for this?

          goetz wrote (#4):

          UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.

          http://www.catb.org/~esr/faqs/smart-questions.html

            Hronom wrote (#5):

            [quote author="Volker" date="1324167608"]UI files are XML and thus read and written with the appropriate classes that evaluate the encoding denoted in the XML header. That's completely different to regular C/C++ source files.[/quote]
            Understood, but I am writing a program that handles all kinds of source files. I did not want to use a separate reading method for each specific file type.

              goetz wrote (#6):

              Detecting the encoding of text files is mostly plain guessing.

              For example, UTF-8 and Latin-1 are identical for the first 128 code points (0 to 127, the ASCII range). So you might have a file whose first non-ASCII character only appears after 5 MB, and you would need to read that much text before you could even begin to tell the two encodings apart.
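
A rough sketch of the kind of guessing described above, in plain C++: scan for the first byte outside the ASCII range, then check whether the data is well-formed UTF-8. The function names `firstNonAscii` and `isValidUtf8` are made up for this illustration. Note that passing the check is only a hint: some Latin-1 byte sequences happen to be well-formed UTF-8 too, so this narrows the guess rather than proving anything.

```cpp
#include <cstddef>
#include <string>

// Offset of the first byte >= 0x80, or std::string::npos if the data is
// pure ASCII (in which case UTF-8 and Latin-1 cannot be told apart at all).
std::size_t firstNonAscii(const std::string& data)
{
    for (std::size_t i = 0; i < data.size(); ++i)
        if (static_cast<unsigned char>(data[i]) >= 0x80)
            return i;
    return std::string::npos;
}

// True if the bytes form well-formed UTF-8: correct continuation bytes,
// no overlong forms, no surrogates, nothing above U+10FFFF.
bool isValidUtf8(const std::string& data)
{
    static const unsigned long minCodePoint[] = {0, 0x80, 0x800, 0x10000};
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        if (lead < 0x80) { ++i; continue; }   // single-byte ASCII

        std::size_t extra = 0;                // number of continuation bytes
        unsigned long cp = 0;                 // decoded code point
        if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
        else return false;                    // stray continuation or invalid lead

        if (data.size() - i <= extra)
            return false;                     // sequence is cut off at the end

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;                 // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < minCodePoint[extra])   return false;   // overlong encoding
        if (cp > 0x10FFFF)              return false;   // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

For a file that is ASCII up to the 5 MB mark, `firstNonAscii` has to walk all of those bytes before it finds anything to decide on, which is exactly the cost described in the post.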

                andre wrote (#7):

                Basically all text encodings are the same for the first 128 code points :-)

                  goetz wrote (#8):

                  EBCDIC (http://en.wikipedia.org/wiki/EBCDIC) :-P

                    andre wrote (#9):

                    Yeah, right, OK. I meant all text codecs that are relevant to your work today.

                      dangelog wrote (#10):

                      Well, that's not true -- take UTF-16, for example.

                      Software Engineer
                      KDAB (UK) Ltd., a KDAB Group company

                        andre wrote (#11):

                        Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
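
The difference between code points and bytes can be made concrete with a small sketch. The function `utf16leBytes` below is a hypothetical name chosen for this illustration. It assumes its input is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF) and performs no validation.

```cpp
#include <vector>

// Encodes one code point as UTF-16 little-endian bytes.
// Assumption: cp is a valid Unicode scalar value; it is not checked here.
std::vector<unsigned char> utf16leBytes(char32_t cp)
{
    std::vector<unsigned char> out;

    // Appends one 16-bit code unit, low byte first (little-endian).
    auto put = [&out](unsigned unit) {
        out.push_back(static_cast<unsigned char>(unit & 0xFF));
        out.push_back(static_cast<unsigned char>((unit >> 8) & 0xFF));
    };

    if (cp < 0x10000) {
        // Basic Multilingual Plane: a single 16-bit unit, i.e. two bytes.
        put(static_cast<unsigned>(cp));
    } else {
        // Everything else needs a surrogate pair, i.e. four bytes.
        const unsigned v = static_cast<unsigned>(cp) - 0x10000;
        put(0xD800 + (v >> 10));    // high surrogate
        put(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

The letter `A` is code point U+0041 in ASCII, Latin-1, UTF-8 and UTF-16 alike, but in UTF-16LE it is stored as the two bytes `41 00` rather than the single byte `41`. Code points above U+FFFF are the "for the most part" exception: they take four bytes.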

                          goetz wrote (#12):

                          But to talk about code points you need the encoding beforehand, which is where the thread started :-)

                            andre wrote (#13):

                            Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

                              dangelog wrote (#14):

                              Unfortunately, the Unicode standard doesn't make the BOM mandatory :(

                                goetz wrote (#15):

                                Unfortunately, the bytes that make up the UTF-8 and UTF-16 byte order marks are also valid characters in 8-bit single-byte encodings.

                                UTF-8:
                                BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

                                UTF-16:
                                Big Endian BOM = FE FF = þÿ
                                Little Endian BOM = FF FE = ÿþ

                                Other 8-bit code pages just yield other valid screen representations.

                                While having these three or two bytes as the very first bytes in a file is a strong sign that the file uses Unicode, a BOM is neither necessary (it is not mandatory) nor sufficient to identify a UTF-8/16 encoded file.

                                If it was so easy to detect a file's encoding, there wouldn't be so much software that fails miserably on that job...
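
The BOM byte patterns listed in this post can be checked with a few comparisons. The sketch below is plain C++, and the names `Bom` and `detectBom` are invented for this illustration. In line with the caveat above, a match is only a strong hint: a Latin-1 file that really begins with the text "þÿ" would be reported as UTF-16 big endian, and a file without a BOM may still be Unicode.

```cpp
#include <cstddef>
#include <string>

enum class Bom { None, Utf8, Utf16BE, Utf16LE };

// Looks at the first bytes of a file and reports which BOM pattern, if any,
// they match. This is a heuristic, not proof of the file's encoding.
// Caveat: FF FE followed by 00 00 is also the UTF-32 little-endian BOM,
// which this sketch does not distinguish.
Bom detectBom(const std::string& head)
{
    // Reads byte i as an unsigned value so comparisons with 0xEF etc. work.
    auto at = [&head](std::size_t i) {
        return static_cast<unsigned char>(head[i]);
    };

    if (head.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (head.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    if (head.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;
    return Bom::None;
}
```

A result of `Bom::None` says nothing either way, so a practical reader would still need a fallback such as a user-selected default encoding.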

                                  andre wrote (#16):

                                  So back to square 1: There is no such thing as plain text. :-)
