How to automatically detect the codec text file?

16 Posts 4 Posters 13.1k Views 1 Watching
    andre
    wrote on 18 Dec 2011, 14:21 last edited by
    #7

    Basically all text encodings are the same for the first 128 code points (0 to 127) :-)

      goetz
      wrote on 18 Dec 2011, 14:35 last edited by
      #8

      EBCDIC (http://en.wikipedia.org/wiki/EBCDIC) :-P

      http://www.catb.org/~esr/faqs/smart-questions.html

        andre
        wrote on 18 Dec 2011, 14:36 last edited by
        #9

        Yeah, right, OK. All text codecs that are relevant to your work today, I meant.

          dangelog
          wrote on 18 Dec 2011, 14:42 last edited by
          #10

          Well, that's not true -- take UTF-16, for example.

          Software Engineer
          KDAB (UK) Ltd., a KDAB Group company

            andre
            wrote on 18 Dec 2011, 14:46 last edited by
            #11

            Still true. I was talking about code points, not bytes. UTF-16 just encodes code points in two bytes (for the most part anyway).
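To make the "two bytes, for the most part" remark concrete, here is a minimal sketch in plain standard C++ (no Qt). The function names `encodeUtf16` and `toBytesLE` are made up for this illustration, and the code assumes its input is a valid Unicode scalar value. It shows that the code point for 'A' keeps the value 0x41, yet its little-endian UTF-16 byte sequence is `41 00` rather than the single ASCII byte `41`, and that code points above U+FFFF need two 16-bit units.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Assumption: cp is a valid scalar value (not a surrogate, <= 0x10FFFF).
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit unit holding the code point itself.
        return { static_cast<std::uint16_t>(cp) };
    }
    // Supplementary planes: split the remaining 20 bits into a surrogate pair.
    cp -= 0x10000;
    return {
        static_cast<std::uint16_t>(0xD800 + (cp >> 10)),   // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF))  // low surrogate
    };
}

// Serialize UTF-16 code units as little-endian bytes, as they would sit in a file.
std::vector<std::uint8_t> toBytesLE(const std::vector<std::uint16_t>& units)
{
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : units) {
        out.push_back(static_cast<std::uint8_t>(u & 0xFF));  // low byte first
        out.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return out;
}
```

So the code points agree with ASCII, but the bytes on disk do not, which is the distinction the posts around this one are arguing about.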

              goetz
              wrote on 18 Dec 2011, 14:56 last edited by
              #12

              But to talk about code points you need the encoding beforehand, which is where the thread started :-)

                andre
                wrote on 18 Dec 2011, 16:48 last edited by
                #13

                Isn't UTF-16 supposed to start with a byte order mark? If so, you can at least detect that one quite reliably :-)

                  dangelog
                  wrote on 18 Dec 2011, 17:02 last edited by
                  #14

                  Unfortunately, the Unicode standard doesn't make the BOM required :(

                    goetz
                    wrote on 18 Dec 2011, 22:57 last edited by
                    #15

                    Unfortunately, the bytes that make up the UTF-8 and UTF-16 byte order marks are also valid characters in 8-bit, ASCII-based encodings.

                    UTF-8:
                    BOM = EF BB BF = ï»¿ (in ISO-8859-1 = Latin-1)

                    UTF-16:
                    Big Endian BOM = FE FF = þÿ
                    Little Endian BOM = FF FE = ÿþ

                    Other 8-bit code pages just yield other valid screen representations.

                    While having these three or two bytes as the very first bytes in a file is a strong sign that the file uses Unicode, it is neither necessary (there is no mandatory BOM) nor sufficient to identify a UTF-8/16 encoded file.
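As a rough illustration of the BOM check described above, here is a small sketch in plain standard C++ (no Qt). The names `BomGuess` and `sniffBom` are hypothetical and only used for this example. It looks at the first two or three bytes of a buffer and reports which BOM, if any, they match. As the post says, the result is only a hint: a Latin-1 file that happens to start with "þÿ" gets the same answer as a genuine UTF-16 file.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: which BOM (if any) the buffer starts with.
enum class BomGuess { None, Utf8, Utf16BE, Utf16LE };

// Inspect the leading bytes of `data` for a UTF-8 or UTF-16 byte order mark.
// A match is a strong hint, not proof; None does not rule out Unicode either.
// Caveat: FF FE is also the start of the UTF-32 little-endian BOM (FF FE 00 00),
// which this sketch does not distinguish.
BomGuess sniffBom(const std::string& data)
{
    auto byteAt = [&data](std::size_t i) {
        return static_cast<unsigned char>(data[i]);
    };

    if (data.size() >= 3 && byteAt(0) == 0xEF && byteAt(1) == 0xBB && byteAt(2) == 0xBF)
        return BomGuess::Utf8;
    if (data.size() >= 2 && byteAt(0) == 0xFE && byteAt(1) == 0xFF)
        return BomGuess::Utf16BE;
    if (data.size() >= 2 && byteAt(0) == 0xFF && byteAt(1) == 0xFE)
        return BomGuess::Utf16LE;
    return BomGuess::None;
}
```

A caller would typically read the first few bytes of the file into a `std::string` and treat `BomGuess::None` as "unknown", falling back to a user-selected or default codec.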

                    If it was so easy to detect a file's encoding, there wouldn't be so much software that fails miserably on that job...
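For files without a BOM, one common heuristic (not something proposed in this thread, just a sketch to show why detection stays a guess) is to test whether the bytes are well-formed UTF-8. The function name `looksLikeUtf8` is made up for this example, and the code is plain standard C++. A `false` result reliably rules UTF-8 out, but `true` only means "not ruled out": pure ASCII passes, and so does Latin-1 text such as "Ã©" (bytes C3 A9), which is also a valid UTF-8 encoding of "é".

```cpp
#include <cstddef>
#include <string>

// Returns true if `data` is well-formed UTF-8 (no overlong forms, no surrogates,
// nothing above U+10FFFF). This cannot prove the file *is* UTF-8.
bool looksLikeUtf8(const std::string& data)
{
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(data[i]);
        if (lead < 0x80) { ++i; continue; }  // plain ASCII byte

        std::size_t extra = 0;               // number of continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the first continuation byte
        if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;
            if (lead == 0xE0) lo = 0xA0;     // reject overlong 3-byte forms
            if (lead == 0xED) hi = 0x9F;     // reject UTF-16 surrogates
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;
            if (lead == 0xF0) lo = 0x90;     // reject overlong 4-byte forms
            if (lead == 0xF4) hi = 0x8F;     // reject values above U+10FFFF
        } else {
            return false;                    // stray continuation or invalid lead byte
        }

        if (n - i - 1 < extra) return false; // sequence truncated at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(data[i + k]);
            if (c < lo || c > hi) return false;
            lo = 0x80; hi = 0xBF;            // later continuation bytes use the full range
        }
        i += extra + 1;
    }
    return true;
}
```

Real detectors combine checks like this with statistics about byte frequencies, and they still get it wrong sometimes, which is the point being made above.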

                      andre
                      wrote on 19 Dec 2011, 05:54 last edited by
                      #16

                      So back to square 1: There is no such thing as plain text. :-)
