Way to determine utf-8 without BOM encoding

Budeykin

Hello. I am trying to read text from a file and determine the file's encoding using encodingForData:

const QByteArray data = file.read( BufferSize ); //first 5 bytes of the file
encoding = QStringConverter::encodingForData( data );

It detects UTF-8 with a BOM correctly, but without a BOM it doesn't work. How can I determine whether a file is UTF-8 (without BOM) or ANSI (encodingForData doesn't detect ANSI either)? I need a Qt 6 way, without using QTextCodec.

Christian Ehrlicher

Read more data until a UTF-8 character is found (or not, in which case it's ANSI).

Budeykin

@Christian-Ehrlicher It doesn't work, I've tried. I think encodingForData can't detect UTF-8 without a BOM.

Chris Kawa

@Budeykin encodingForData only looks for a BOM in the first 4 bytes at most to determine which UTF variant it is.

You need to do what Christian said - read more data until you find a character that can differentiate the two.

But first of all - do you really mean ANSI or ASCII?
ASCII is a subset of UTF-8, so you can just look for a character out of its range and then you know it's UTF. Simple.
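As a rough illustration of that ASCII check, here is a minimal sketch in plain C++. The function name `isPureAscii` is made up for this example and is not a Qt API; it only reports whether any byte falls outside the 7-bit ASCII range.

```cpp
#include <algorithm>
#include <string_view>

// Hypothetical helper: true if every byte is in the 7-bit ASCII range (0x00-0x7F).
// If this returns false, the data contains at least one non-ASCII byte.
bool isPureAscii(std::string_view data)
{
    return std::all_of(data.begin(), data.end(), [](char c) {
        return static_cast<unsigned char>(c) < 0x80;
    });
}
```

Note that a `false` result only tells you the data is not plain ASCII; it does not by itself prove the data is UTF-8.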

ANSI is not a single encoding, but a common name for a family of different localized encodings. Most of them are not a subset of UTF, so there might not be a way to clearly distinguish the two without a UTF BOM.

Budeykin

@Chris-Kawa I meant ANSI. And I need to detect UTF-8 without a BOM. If the file content is UTF-8, I can use Encoding::Utf8 and it will work correctly even without a BOM. If it is not UTF-8, I can use Encoding::System (ANSI in my case) and it will work correctly too. So I don't really need to know exactly which ANSI encoding it is; all I need is to determine whether it is UTF-8.

Is there any way to do this automatically with Qt tools? Or do I really need to analyze the bytes one by one?

Christian Ehrlicher

@Budeykin said in Way to determine utf-8 without BOM encoding:

Is there any way to do this automatically with Qt tools? Or do I really need to analyze the bytes one by one?

As we already said: read until you encounter a UTF-8 byte sequence. How else should it work?

Chris Kawa

@Budeykin But ANSI and UTF-8 are not mutually exclusive. A string can be both valid ANSI and valid UTF-8 at the same time. If you know the exact encoding and it is mutually exclusive with UTF-8, then you can look for the differentiating character; otherwise there is no way to say.

I don't know of any Qt function that attempts to do that, as there simply is no reliable way to tell without a BOM.

hskoglund

Hi, also if you know what language the text files are written in, that could help determine the encoding flavor.

ChrisW67

@Budeykin said in Way to determine utf-8 without BOM encoding:

It detects UTF-8 with a BOM correctly, but without a BOM it doesn't work.

That's because it only checks for byte-order-marks in various flavours, or the optional expected first character.

It's not too taxing on one's Google-fu to find generic code or a library that will check whether a block of data could be entirely valid UTF-8 text. It may still be something other than that though because, as @Chris-Kawa points out, there are valid strings in arbitrary eight-bit encodings that are also valid UTF-8.

Take, for example, this string of bytes (hex).

C3 A0 20 C3 A1 20 C3 A2 20 C3 A3 20 C3 A4 20 C3 A5 20 C3 A6

If treated as UTF-8, these are the characters:
à á â ã ä å æ
If treated as Windows-1252 (commonly, if imprecisely, called ANSI) or ISO-8859-1:
Ã⍽ Ã¡ Ã¢ Ã£ Ã¤ Ã¥ Ã¦
(where ⍽ is a placeholder for a non-breaking space).
They are both equally valid interpretations that require context the computer does not have to select between.
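To make the second interpretation concrete, here is a small sketch in plain C++ that treats each input byte as one ISO-8859-1 character and re-encodes it as UTF-8 for display. The helper name `latin1ToUtf8` is hypothetical, and it assumes ISO-8859-1 rather than Windows-1252; the two differ only in the 0x80-0x9F range, which the sample bytes above do not touch.

```cpp
#include <string>
#include <string_view>

// Hypothetical helper: interpret every byte as one ISO-8859-1 character
// (code point == byte value) and return the same text encoded as UTF-8.
std::string latin1ToUtf8(std::string_view bytes)
{
    std::string out;
    out.reserve(bytes.size() * 2);
    for (char ch : bytes) {
        const auto b = static_cast<unsigned char>(ch);
        if (b < 0x80) {
            // ASCII range: identical in both encodings.
            out.push_back(static_cast<char>(b));
        } else {
            // U+0080..U+00FF need two bytes in UTF-8: 110xxxxx 10xxxxxx.
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));
        }
    }
    return out;
}
```

Feeding the bytes C3 A0 through this function yields C3 83 C2 A0, which is the UTF-8 encoding of "Ã" followed by a non-breaking space: the same input bytes, read under a different assumption.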

SimonSchroeder

Have a quick look at UTF-8 on Wikipedia: there are certain rules a byte sequence must follow to be a valid UTF-8 character. Bytes that start with 0... are also valid ASCII (ASCII only specifies the first 128 characters; the others are used differently depending on the language).

If you find a byte that starts with 1..., you know you have a possible multibyte sequence. However, a multibyte sequence can only start with 110..., 1110..., or 11110... It cannot start with 10..., because 10... always marks a continuation byte of a multibyte sequence. 110... means one 10... byte follows, 1110... is followed by two 10... bytes, and 11110... is followed by three 10... bytes. If this pattern is not met, you don't have UTF-8 encoding.

In many languages (using the Latin alphabet), ANSI-encoded text will often contain just a single letter in the upper range (byte starting with 1...) immediately followed by a regular ASCII character from the lower range (byte starting with 0...), which is invalid UTF-8. This makes it feasible to distinguish between ANSI encodings and UTF-8 based on the byte patterns. This will work for regular text most of the time.

Only downside: as long as only ASCII characters are used, you cannot distinguish between ANSI and UTF-8 because the bytes will be exactly the same. It is up to you to decide whether always handling this case as UTF-8 is alright (it only matters if you also write and not just read).
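The byte-pattern rules above can be sketched roughly as follows, in plain C++ without Qt. The function name `looksLikeUtf8` is hypothetical. Beyond the lead/continuation pattern, this sketch also rejects overlong encodings, surrogates, and code points above U+10FFFF, which the UTF-8 specification forbids. It is a heuristic under those assumptions, not a guarantee that the text really is UTF-8.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: true if `data` consists only of well-formed UTF-8
// sequences. Pure ASCII input also returns true (ASCII is a subset of UTF-8).
bool looksLikeUtf8(std::string_view data)
{
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n) {
        const auto b = static_cast<unsigned char>(data[i]);
        if (b < 0x80) { ++i; continue; }           // 0xxxxxxx: ASCII

        std::size_t extra = 0;                     // continuation bytes expected
        std::uint32_t cp = 0;                      // decoded code point
        std::uint32_t min = 0;                     // smallest value allowed for this length
        if ((b & 0xE0) == 0xC0)      { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;                         // stray 10xxxxxx or invalid lead byte

        if (i + extra >= n) return false;          // sequence cut off at end of data
        for (std::size_t k = 1; k <= extra; ++k) {
            const auto c = static_cast<unsigned char>(data[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a 10xxxxxx continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;                          // overlong, out of range, or surrogate
        i += extra + 1;
    }
    return true;
}
```

With Qt, the contents of a QByteArray could be passed in via its `data()` and `size()`, and the result used to choose between QStringConverter::Utf8 and QStringConverter::System. One caveat: if only a fixed-size chunk of the file is checked, a multibyte sequence may be cut off at the end of the buffer, and this sketch would then report the chunk as invalid.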

Qt also provides some functions that can be (mis-)used: https://stackoverflow.com/questions/18227530/check-if-utf-8-string-is-valid-in-qt . You can either use ConverterState when trying to interpret bytes as UTF-8 or provide the alternative encoding if UTF-8 fails (see comment of accepted answer on StackOverflow).