Way to determine utf-8 without BOM encoding

    Budeykin
    wrote on last edited by
    #1

    Hello. I am reading text from a file and trying to determine the file's encoding using encodingForData:

    const QByteArray data = file.read( BufferSize ); //first 5 bytes of the file
    encoding = QStringConverter::encodingForData( data );
    

    It detects UTF-8 with a BOM correctly, but without a BOM it doesn't work. How can I determine whether a file is UTF-8 (without BOM) or ANSI (encodingForData doesn't detect ANSI either)? I need a Qt6 way, without using QTextCodec.

      Christian Ehrlicher
      Lifetime Qt Champion
      wrote on last edited by
      #2

      Read more data until a UTF-8 multi-byte sequence is found (or not, in which case it's ANSI).

        Budeykin
        wrote on last edited by
        #3

        @Christian-Ehrlicher It doesn't work, I've tried. I think encodingForData can't detect UTF-8 without a BOM.

          Chris Kawa
          Lifetime Qt Champion
          wrote on last edited by Chris Kawa
          #4

          @Budeykin encodingForData only looks for a BOM in up to the first 4 bytes to determine which UTF variant it is.

          You need to do what Christian said - read more data until you find a character that can differentiate the two.

          But first of all - do you really mean ANSI or ASCII?
          ASCII is a subset of UTF-8, so you can just look for a character out of its range and then you know it's UTF. Simple.

          ANSI is not a single encoding, but a common name for a family of different localized encodings. Most of them are not a subset of UTF, so there might not be a way to clearly distinguish the two without a UTF BOM.
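
The ASCII case mentioned above can be sketched in a few lines. This is only an illustration in plain C++ without Qt; the function name `isPureAscii` is made up for this example. Any byte with the high bit set (0x80 or above) is outside the ASCII range.

```cpp
#include <string>

// Hypothetical helper: returns true when every byte is 7-bit ASCII (0x00..0x7F).
// Such data decodes identically as ASCII, as UTF-8 and as the common ANSI code pages.
bool isPureAscii(const std::string &bytes)
{
    for (unsigned char c : bytes) {
        if (c >= 0x80)
            return false;  // high bit set: not ASCII
    }
    return true;
}
```

A `false` result only says the data is not plain ASCII. It does not by itself prove the data is UTF-8, which is the point of the ANSI remark above.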

            Budeykin
            wrote on last edited by
            #5

            @Chris-Kawa I meant ANSI. And I need to detect UTF-8 without a BOM. If the file content is UTF-8, I can use Encoding::UTF8 and it will work correctly even without a BOM. If it is not UTF-8, I can use Encoding::System (ANSI in my case) and it will work correctly too. So I don't really need to know exactly which ANSI encoding it is; all I need is to determine whether it is UTF-8.

            Is there any way to do this automatically with Qt tools? Or do I really need to analyze the bytes one by one?

              Christian Ehrlicher
              Lifetime Qt Champion
              wrote on last edited by
              #6

              @Budeykin said in Way to determine utf-8 without BOM encoding:

              Is there any way to do this automatically with Qt tools? Or do I really need to analyze the bytes one by one?

              As we already said: read until you encounter a UTF-8 byte sequence. How else should it work?

                Chris Kawa
                Lifetime Qt Champion
                wrote on last edited by
                #7

                @Budeykin But ANSI and UTF-8 are not exclusive. A string can be both valid ANSI and UTF-8 at the same time. If you know the exact encoding and it is exclusive with UTF then you can look for the differentiating character, otherwise there is no way to say.

                I don't know of any Qt function that attempts to do that, as there simply is no reliable way to tell without a BOM.

                  hskoglund
                  wrote on last edited by
                  #8

                  Hi, also if you know what language the text files are written in, that could help determine the encoding flavor.

                    ChrisW67
                    wrote on last edited by
                    #9

                    @Budeykin said in Way to determine utf-8 without BOM encoding:

                    It determines utf-8 BOM correctly but without BOM it doesn't work.

                    That's because it only checks for byte-order-marks in various flavours, or the optional expected first character.

                    It's not too taxing on one's Google-fu to find generic code or a library that will check whether a block of data could be entirely valid UTF-8 text. The data may still be something else, though, because, as @Chris-Kawa points out, there are valid strings in arbitrary eight-bit encodings that are also valid UTF-8.

                    Take, for example, this string of bytes (hex).

                    C3 A0 20 C3 A1 20 C3 A2 20 C3 A3 20 C3 A4 20 C3 A5 20 C3 A6
                    

                    If treated as valid UTF-8, these are the characters:
                    à á â ã ä å æ
                    If treated as Windows-1252 (commonly, imprecisely called ANSI) or ISO-8859-1 encoded:
                    Ã⍽ Ã¡ Ã¢ Ã£ Ã¤ Ã¥ Ã¦
                    (where ⍽ is a placeholder for a non-breaking space).
                    They are both equally valid interpretations that require context the computer does not have to select between.
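
To make the two readings concrete, here is a small sketch in plain C++ (no Qt). The helper name `latin1ToUtf8` is invented for this illustration. It treats every input byte as an ISO-8859-1 character and re-encodes it as UTF-8, which is the second interpretation above. It assumes ISO-8859-1 rather than Windows-1252; the two differ only in the 0x80..0x9F range, which the example bytes do not use.

```cpp
#include <string>

// Hypothetical helper: interpret each byte as an ISO-8859-1 character
// (code points U+0000..U+00FF) and re-encode the text as UTF-8.
std::string latin1ToUtf8(const std::string &bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII stays one byte
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return out;
}
```

Passing the two bytes `C3 A1` through this function yields `C3 83 C2 A1`, the UTF-8 form of "Ã¡". Leaving the same two bytes untouched and reading them as UTF-8 gives "á". Neither result is an error, which is why no function can pick between them without outside context.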

                      SimonSchroeder
                      wrote on last edited by SimonSchroeder
                      #10

                      Have a quick look at UTF-8 on Wikipedia: there are certain rules that a byte sequence must follow to be a valid UTF-8 character. Bytes that start with 0... are also valid ASCII (ASCII only specifies the first 128 characters; the others are used differently depending on the language). If you find a byte that starts with 1..., you know you have a possible multibyte sequence. However, a multibyte sequence can only start with 110..., 1110..., or 11110... It cannot start with 10..., because 10... always marks a following byte of a multibyte sequence. 110... means one byte with 10... follows, 1110... is followed by two bytes with 10..., and 11110... is followed by three bytes with 10... If this pattern is not met, you don't have UTF-8 encoding.

                      In ANSI-encoded text in many languages (using the Latin alphabet) there will often be just a single letter in the upper range (byte starting with 1...) immediately followed by a regular ASCII character from the lower range (byte starting with 0...), which is invalid UTF-8. This makes it feasible to distinguish between ANSI encodings and UTF-8 based on the byte patterns. This will work for regular text most of the time.
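
The byte-pattern rules described above can be sketched as follows. This is a minimal illustration in plain C++ without Qt, not a complete validator: the names `Utf8Check` and `checkUtf8` are invented here, and the code checks only the lead/continuation bit patterns from the post. It does not reject overlong encodings, surrogate code points, or values above U+10FFFF, which a strict validator would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
enum class Utf8Check { AsciiOnly, ValidUtf8, NotUtf8 };

// Checks only the bit patterns: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx
// as lead bytes, each followed by the right number of 10xxxxxx bytes.
Utf8Check checkUtf8(const std::string &bytes)
{
    bool sawMultiByte = false;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t continuation = 0;
        if (lead < 0x80) {                 // 0xxxxxxx: plain ASCII
            ++i;
            continue;
        } else if ((lead & 0xE0) == 0xC0) {  // 110xxxxx: one byte follows
            continuation = 1;
        } else if ((lead & 0xF0) == 0xE0) {  // 1110xxxx: two bytes follow
            continuation = 2;
        } else if ((lead & 0xF8) == 0xF0) {  // 11110xxx: three bytes follow
            continuation = 3;
        } else {
            return Utf8Check::NotUtf8;     // stray 10xxxxxx or 11111xxx
        }
        if (i + continuation >= n)
            return Utf8Check::NotUtf8;     // sequence cut off at end of data
        for (std::size_t k = 1; k <= continuation; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return Utf8Check::NotUtf8; // expected a 10xxxxxx byte
        }
        sawMultiByte = true;
        i += continuation + 1;
    }
    return sawMultiByte ? Utf8Check::ValidUtf8 : Utf8Check::AsciiOnly;
}
```

One design point to be aware of: the sketch treats a sequence cut off at the end of the data as invalid. When only the first chunk of a file is checked, a multibyte character may be split at the chunk boundary, so a caller may want to tolerate that case or read the whole file. For the question in this thread, `ValidUtf8` and `AsciiOnly` would both map to decoding as UTF-8, and `NotUtf8` to the system encoding.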

                      Only downside: as long as only ASCII characters are used you cannot distinguish between ANSI or UTF-8 because they will be exactly the same. It is up to you to decide if handling this case always as UTF-8 is alright (only matters if you also write and not just read).

                      Qt also provides some functions that can be (mis-)used: https://stackoverflow.com/questions/18227530/check-if-utf-8-string-is-valid-in-qt . You can either use ConverterState when trying to interpret bytes as UTF-8 or provide the alternative encoding if UTF-8 fails (see comment of accepted answer on StackOverflow).
