How to automatically detect the codec text file?

16 Posts 4 Posters 14.1k Views 1 Watching
  • goetz wrote (#6):

    Detecting the encoding of text files is mostly plain guessing.

    For example, UTF-8 and Latin-1 are byte-for-byte identical for the first 128 code points (U+0000 to U+007F, i.e. plain ASCII). So a file might contain its first non-ASCII character only after 5 MB, and you would have to read that far before you could even tell the two apart.
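
To make that concrete, here is a minimal sketch of the scan, assuming the whole file is already in memory as a `std::string`. `firstNonAsciiOffset` is a made-up helper name for illustration, not a Qt or standard library function.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first byte >= 0x80,
// or std::string::npos if the buffer is pure 7-bit ASCII. Up to that
// offset, UTF-8 and Latin-1 decode the data identically.
std::size_t firstNonAsciiOffset(const std::string& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (static_cast<unsigned char>(bytes[i]) >= 0x80)
            return i;
    }
    return std::string::npos;
}
```

If the result is `std::string::npos`, the choice between UTF-8 and Latin-1 makes no difference for that file; otherwise the guessing only starts at the returned offset.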

    http://www.catb.org/~esr/faqs/smart-questions.html

    • andre wrote (#7):

      Basically all text encodings are the same for the first 128 code points :-)

      • goetz wrote (#8):

        EBCDIC (http://en.wikipedia.org/wiki/EBCDIC) :-P

        • andre wrote (#9):

          Yeah, right, OK. I meant all text codecs that are relevant to your work today.

          • dangelog wrote (#10):

            Well, that's not true -- take UTF-16, for example.

            Software Engineer
            KDAB (UK) Ltd., a KDAB Group company

            • andre wrote (#11):

              Still true. I was talking about code points, not bytes. UTF-16 just encodes each code point in two bytes (for the most part anyway; code points above U+FFFF take four bytes, as a surrogate pair).
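
A rough sketch of what "for the most part" means, assuming the input is a valid Unicode scalar value (at most U+10FFFF and not itself a surrogate). `encodeUtf16` is a hypothetical name used only for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid scalar value; no error handling in this sketch.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    // Code points up to U+FFFF fit into a single 16-bit unit (two bytes).
    if (cp < 0x10000)
        return { static_cast<std::uint16_t>(cp) };

    // Everything above is split into a high and a low surrogate (four bytes).
    cp -= 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))
    };
}
```

So `encodeUtf16(0x41)` yields the single unit 0x0041, while `encodeUtf16(0x1F600)` yields the pair 0xD83D 0xDE00. Either way, even plain ASCII text looks different at the byte level than it does in UTF-8 or Latin-1.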

              • goetz wrote (#12):

                But to talk about code points you need the encoding beforehand, which is where the thread started :-)

                • andre wrote (#13):

                  Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

                  • dangelog wrote (#14):

                    Unfortunately the Unicode standard doesn't make the BOM required :(

                    • goetz wrote (#15):

                      Unfortunately, the bytes that make up the UTF-8 and UTF-16 byte order marks are also valid characters in 8-bit, ASCII-based encodings.

                      UTF-8:
                      BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

                      UTF-16:
                      Big Endian BOM = FE FF = þÿ
                      Little Endian BOM = FF FE = ÿþ

                      Using other 8-bit code pages just yields other valid screen representations of the same bytes.
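
To check the Latin-1 column above, here is a small sketch that decodes bytes as ISO-8859-1 and re-encodes them as UTF-8 for display. It relies on the fact that every Latin-1 byte value maps to the code point with the same number; `latin1ToUtf8` is a made-up name for this example.

```cpp
#include <string>

// Hypothetical helper: interpret each byte as an ISO-8859-1 character
// and re-encode it as UTF-8. Latin-1 byte N is code point U+00NN, so
// bytes below 0x80 pass through and the rest become two-byte sequences.
std::string latin1ToUtf8(const std::string& bytes)
{
    std::string out;
    for (char c : bytes) {
        const unsigned char b = static_cast<unsigned char>(c);
        if (b < 0x80) {
            out += static_cast<char>(b);
        } else {
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Feeding it the three UTF-8 BOM bytes EF BB BF produces the UTF-8 text for ï»¿, which is exactly what an editor shows when it opens a BOM-prefixed UTF-8 file as Latin-1. No byte is ever rejected, which is the whole problem.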

                      While having these three (or two) bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
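
A minimal sketch of such a BOM check, assuming the start of the file has been read into a `std::string`. The names `BomHint` and `detectBom` are invented for this example; the result is deliberately called a hint, because a Latin-1 file that happens to start with þÿ would match as well.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: a hint only, not a verdict.
enum class BomHint { None, Utf8, Utf16BigEndian, Utf16LittleEndian };

// Hypothetical helper: look at the first two or three bytes only.
// Note: FF FE followed by 00 00 is also the UTF-32 little-endian BOM,
// which this sketch does not try to distinguish.
BomHint detectBom(const std::string& bytes)
{
    auto at = [&bytes](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return BomHint::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return BomHint::Utf16BigEndian;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return BomHint::Utf16LittleEndian;
    return BomHint::None;
}
```

`BomHint::None` says nothing either way: the file may still be UTF-8 or UTF-16 without a BOM, so a real detector would have to fall back to guessing from the content.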

                      If it were that easy to detect a file's encoding, there wouldn't be so much software that fails miserably at the job...

                      • andre wrote (#16):

                        So back to square 1: There is no such thing as plain text. :-)
