How to automatically detect the codec text file?

    goetz wrote (#6):

    Detecting the encoding of text files is mostly plain guessing.

    For example, UTF-8 and Latin-1 are byte-for-byte identical for the 128 ASCII code points (0 to 127). So you might have a file whose first non-ASCII character only shows up after 5 MB, and you would need to read that much text to discover it.
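
A small Python sketch of this point: as long as the data is pure ASCII, decoding it as UTF-8 or as Latin-1 gives the same text, and only the first byte at or above 0x80 can tell the two apart. The helper name `first_non_ascii_offset` is made up for this illustration.

```python
def first_non_ascii_offset(data: bytes):
    """Return the index of the first byte >= 0x80, or None if data is pure ASCII."""
    for index, byte in enumerate(data):
        if byte >= 0x80:
            return index
    return None


# Pure ASCII: both decoders agree, so the content carries no hint about the encoding.
ascii_only = b"plain text " * 1000
assert ascii_only.decode("utf-8") == ascii_only.decode("latin-1")

# Append a single non-ASCII character at the very end, encoded as UTF-8.
late = ascii_only + "é".encode("utf-8")
offset = first_non_ascii_offset(late)

# Only from this offset onwards do the two interpretations differ.
as_utf8 = late.decode("utf-8")
as_latin1 = late.decode("latin-1")
print(offset, repr(as_utf8[-1:]), repr(as_latin1[-2:]))
```

A real detector would read the file in chunks rather than hold it in memory, but the conclusion is the same: it may have to scan the whole file before it sees any evidence at all.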

    http://www.catb.org/~esr/faqs/smart-questions.html

      andre wrote (#7):

      Basically all text encodings are the same for the first 128 code points :-)

        goetz wrote (#8):

        EBCDIC (http://en.wikipedia.org/wiki/EBCDIC) :-P

          andre wrote (#9):

          Yeah, right, OK. I meant all text codecs that are relevant to your work today.

            dangelog wrote (#10):

            Well, that's not true -- take UTF-16, for example.

            Software Engineer
            KDAB (UK) Ltd., a KDAB Group company

              andre wrote (#11):

              Still true. I was talking about code points, not bytes. UTF-16 just encodes each code point in two bytes (or four bytes, as a surrogate pair, for code points above U+FFFF).
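
To make the code point versus byte distinction concrete, here is a short Python sketch. The helper `encoded_sizes` is a made-up name for this example. The code point of "A" is 65 in every case, but the bytes on disk differ between encodings, which is why the ASCII range does not look the same at byte level in UTF-16.

```python
def encoded_sizes(char: str) -> dict:
    """Map a few encoding names to the number of bytes needed for one character."""
    return {name: len(char.encode(name)) for name in ("utf-8", "utf-16-le", "utf-16-be")}


letter = "A"                 # code point U+0041 (65)
emoji = "\U0001F600"         # code point U+1F600, outside the Basic Multilingual Plane

# Same code point, different byte sequences.
print(ord(letter), letter.encode("utf-8"), letter.encode("utf-16-le"), letter.encode("utf-16-be"))

# One byte in UTF-8, two in UTF-16; the emoji needs a four-byte surrogate pair in UTF-16.
print(encoded_sizes(letter))
print(encoded_sizes(emoji))
```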

                goetz wrote (#12):

                But to talk about code points you need the encoding beforehand, which is where the thread started :-)

                  andre wrote (#13):

                  Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

                    dangelog wrote (#14):

                    Unfortunately, the Unicode standard does not require a BOM :(

                      goetz wrote (#15):

                      Unfortunately, the bytes of the UTF-8 and UTF-16 byte order marks are also valid characters in 8-bit encodings such as ISO-8859-1 (Latin-1).

                      UTF-8:
                      BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

                      UTF-16:
                      Big Endian BOM = FE FF = þÿ
                      Little Endian BOM = FF FE = ÿþ

                      Other 8-bit code pages just yield other valid screen representations of the same bytes.

                      While having these three or two bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
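
A hedged Python sketch of BOM sniffing that shows both halves of this argument. The function `sniff_bom` and the table `BOMS` are names invented for this example; the BOM constants come from the standard `codecs` module. It only covers the three BOMs listed above, so treat its result as a hint, not a proof.

```python
import codecs

# The three byte order marks discussed above, paired with a Python codec name.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return the codec name hinted at by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


# Not necessary: valid UTF-8 without a BOM gives no hint at all.
no_bom = "grüße".encode("utf-8")
print(sniff_bom(no_bom))

# Not sufficient: Latin-1 text that happens to start with "ÿþ" looks like UTF-16 LE.
latin1_text = "ÿþ is how this Latin-1 file starts".encode("latin-1")
print(sniff_bom(latin1_text))
```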

                      If it were that easy to detect a file's encoding, there wouldn't be so much software that fails miserably at the job...
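
For illustration, here is the kind of guessing such software typically does, as a Python sketch. `guess_encoding` is a hypothetical helper, and its order of checks (BOM first, then strict UTF-8 validation, then a fallback) is one common heuristic and an assumption of this example, not a method proposed in this thread. The last lines show it guessing wrong.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Best-effort guess: BOM first, then strict UTF-8 validation, then a fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # Python's "utf-16" codec reads the BOM to pick the byte order
    try:
        data.decode("utf-8")  # strict mode raises on invalid byte sequences
        return "utf-8"
    except UnicodeDecodeError:
        return fallback       # Latin-1 accepts any byte sequence, so this never fails


print(guess_encoding("héllo".encode("utf-8")))    # valid UTF-8
print(guess_encoding("héllo".encode("latin-1")))  # invalid UTF-8, falls back

# A wrong guess: the Latin-1 text "Ã©" is byte-identical to the UTF-8 text "é".
ambiguous = "Ã©".encode("latin-1")
print(guess_encoding(ambiguous), repr(ambiguous.decode(guess_encoding(ambiguous))))
```

The fallback is always a guess: any byte sequence decodes as Latin-1, so a successful decode proves nothing about the original encoding.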

                        andre wrote (#16):

                        So back to square 1: There is no such thing as plain text. :-)
