Is toLocal8bit() safe in my situation?

General and Desktop · 6 Posts · 3 Posters · 1.0k Views
#1 qwe3 wrote:

    Hi,

    I have a file encoded as UTF-8 without a BOM. I have to load data from that file (only one line) and hold it in a QString variable. Then I have to convert this data from that QString into another QString, but with UTF-8 encoding. So I have:

    QFile file(R"(C:\Users\tom\Desktop\myFile.txt)");
    file.open(QIODevice::ReadOnly);
    QTextStream textStream(file.readAll());
    textStream.setCodec(QTextCodec::codecForLocale());
    QString stringVar = textStream.readLine();
    
    ...
    
    QTextStream textStream2(stringVar.toLocal8Bit(), QIODevice::ReadOnly);
    textStream2.setCodec(QTextCodec::codecForName("UTF-8"));
    QString stringVar2 = textStream2.readLine();
    

    And it works. But the toLocal8Bit() documentation says:

    Returns the local 8-bit representation of the string as a QByteArray. The returned byte 
    array is undefined if the string contains characters not supported by the local 8-bit encoding.
    

    Windows-1250 (my locale codec) defines 251 of the 256 possible byte values. I tried a case where the file contains a two-byte character that includes a byte which does not appear in Windows-1250 (0x81): qDebug showed 'Ă\u0081' when I printed stringVar, and then the proper character when I printed stringVar2. Everything was OK. So is using toLocal8Bit() safe in my situation?

    I can't load the data from the file using UTF-8 - I have to load it using the locale codec.

#2 ChrisW67 wrote:

      Your requirement seems to be, retrieve the first line of a file and output it UTF-8 encoded. I assume you mean you want the output to have a UTF-8 byte-order-mark.

      You tell us you have a file known to be UTF-8 encoded text. Why are you telling QTextStream to treat it as your Windows 8-bit encoding (i.e. QTextCodec::codecForLocale())? This will probably not end well if there is anything other than basic single-byte characters in the input file.

      I think you are overthinking the problem.

      Using this as input, which contains UTF-8 multi-byte characters:

      Lorem convert αβγ ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
      tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam...
      

      and this code:

      #include <QCoreApplication>
      #include <QFile>
      #include <QTextStream>
      #include <QTextCodec>
      
      int main(int argc, char *argv[])
      {
          QCoreApplication a(argc, argv);
      
          QFile file("/tmp/testin");
          if (file.open(QIODevice::ReadOnly)) {
            QTextStream inStream(&file);
            inStream.setCodec("UTF-8");
            QString firstLine = inStream.readLine();
            file.close();
      
            QFile outFile("/tmp/testout");
            if (outFile.open(QIODevice::WriteOnly)) {
                QTextStream outStream(&outFile);
                outStream.setCodec("UTF-8");
                outStream.setGenerateByteOrderMark(true);
                outStream << firstLine << Qt::endl;
                outFile.close();
            }
          }
      
          return 0;
      }
      

      I get this output:

      Lorem αβγ ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
      

      which is this as characters and hex:

      $ od -a -tx1 testout 
      0000000   o   ;   ?   L   o   r   e   m  sp   N   1   N   2   N   3  sp
               ef  bb  bf  4c  6f  72  65  6d  20  ce  b1  ce  b2  ce  b3  20
      0000020   i   p   s   u   m  sp   d   o   l   o   r  sp   s   i   t  sp
               69  70  73  75  6d  20  64  6f  6c  6f  72  20  73  69  74  20
      0000040   a   m   e   t   ,  sp   c   o   n   s   e   c   t   e   t   u
               61  6d  65  74  2c  20  63  6f  6e  73  65  63  74  65  74  75
      0000060   r  sp   a   d   i   p   i   s   c   i   n   g  sp   e   l   i
               72  20  61  64  69  70  69  73  63  69  6e  67  20  65  6c  69
      0000100   t   ,  sp   s   e   d  sp   d   o  sp   e   i   u   s   m   o
               74  2c  20  73  65  64  20  64  6f  20  65  69  75  73  6d  6f
      0000120   d  sp  nl
               64  20  0a
      0000123
      

      You can see that the multi-byte UTF-8 characters survive and the output has a BOM.

#3 qwe3 wrote:

        @ChrisW67 Thank you.

        Maybe I should explain more. I don't know which encoding my input file will have. There are two possibilities:

        1. UTF-8 without BOM
        2. windows-1250

        If the file has Windows-1250 encoding and I load it as UTF-8, I can lose data. So I would like to load both kinds of file as Windows-1250. When the file is encoded as Windows-1250, there is no problem. When the file is UTF-8 without a BOM, maybe there is no problem.

        Can you tell me more about this statement?

            This will probably not end well if there is anything other than basic single-byte characters in the input file.

        For example, my file is encoded as UTF-8 and I load it as Windows-1250. The file contains a multi-byte character, for example U+00C1, whose UTF-8 encoding is C3 81. Windows-1250 does not define the byte 0x81; see https://en.wikipedia.org/wiki/Windows-1250. So my question is: is this safe? I tried loading this character (U+00C1) from a file into my app using the locale codec and everything was OK.

#4 ChrisW67 wrote:

          @qwe3: To handle encoding changes accurately you need to know the encodings involved. That is the reason the result of mismatches is undefined (your quote from the docs) and why I said it will probably not end well.

          If you use my code and change "UTF-8" to "Windows-1250" on the inStream you have the equivalent of what you describe. With similar UTF-8 encoded input:

          $ od -t c -t x1 testin
          0000000   L   o   r   e   m     316 261 316 262 316 263       i   p   s
                   4c  6f  72  65  6d  20  ce  b1  ce  b2  ce  b3  20  69  70  73
          0000020   u   m       d   o   l   o   r       s   i   t       a   m   e
                   75  6d  20  64  6f  6c  6f  72  20  73  69  74  20  61  6d  65
          0000040   t   ,       c   o   n   s   e   c   t   e   t   u   r       a
                   74  2c  20  63  6f  6e  73  65  63  74  65  74  75  72  20  61
          0000060   d   i   p   i   s   c   i   n   g       e   l   i   t   ,    
                   64  69  70  69  73  63  69  6e  67  20  65  6c  69  74  2c  20
          0000100   s   e   d       d   o       e   i   u   s   m   o   d      \n
                   73  65  64  20  64  6f  20  65  69  75  73  6d  6f  64  20  0a
          0000120
          

          you get this out:

          $ od -t c -t x1 testout
          0000000 357 273 277   L   o   r   e   m     303 216 302 261 303 216 313
                   ef  bb  bf  4c  6f  72  65  6d  20  c3  8e  c2  b1  c3  8e  cb
          0000020 233 303 216 305 202       i   p   s   u   m       d   o   l   o
                   9b  c3  8e  c5  82  20  69  70  73  75  6d  20  64  6f  6c  6f
          0000040   r       s   i   t       a   m   e   t   ,       c   o   n   s
                   72  20  73  69  74  20  61  6d  65  74  2c  20  63  6f  6e  73
          0000060   e   c   t   e   t   u   r       a   d   i   p   i   s   c   i
                   65  63  74  65  74  75  72  20  61  64  69  70  69  73  63  69
          0000100   n   g       e   l   i   t   ,       s   e   d       d   o    
                   6e  67  20  65  6c  69  74  2c  20  73  65  64  20  64  6f  20
          0000120   e   i   u   s   m   o   d      \n
                   65  69  75  73  6d  6f  64  20  0a
          0000131
          

           That is, the multi-byte UTF-8 code points in testin (ce b1 ce b2 ce b3) get mangled into (c3 8e c2 b1 c3 8e cb 9b c3 8e c5 82) in the output. If this were really "working", you would expect a UTF-8 input to pass through unchanged.

          It is possible that on Windows these conversions happen slightly differently (perhaps because Qt uses a system function) but it is plain that it is not portable or reliable.
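The byte-doubling that mangles the Greek letters can be sketched without Qt. The helper below (a made-up name, not from the thread) decodes each raw byte as its own code point and re-encodes it as UTF-8; Latin-1 is used as a simplified stand-in for Windows-1250 (the two codecs happen to agree for the bytes CE and B1, so this reproduces the first four mangled output bytes, c3 8e c2 b1, though not the Windows-1250-specific cb 9b and c5 82).

```cpp
#include <string>

// Decode each input byte as the code point with the same numeric value
// (Latin-1) and re-encode as UTF-8. Every byte >= 0x80 expands to two
// bytes, which is exactly the doubling visible in the od dump above.
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                  // ASCII unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return utf8;
}
```

Feeding the UTF-8 bytes of α (CE B1) through this yields C3 8E C2 B1: the round trip through an 8-bit codec is not lossless for multi-byte input.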

#5 qwe3 wrote:

            @ChrisW67 Do you know a way to check if some UTF-8 character exists in windows-1250?

             For example, I have the letter sequence "ćąą" in a file with Windows-1250 encoding. When I load it, my app sees the bytes 0xE6 0xB9 0xB9, and interpreted as UTF-8 that is a valid CJK character. Of course it does not exist in Windows-1250. I would like a way to detect this using Qt.

            I tried:

            QByteArray byteArray;
            byteArray.resize(3);
            byteArray[0]=0xE6;
            byteArray[1]=0xB9;
            byteArray[2]=0xB9;
            QTextStream textStream(byteArray);
            textStream.setCodec(QTextCodec::codecForName("UTF-8"));
            QString zzz = textStream.readLine();
            qDebug()<<QTextCodec::codecForLocale()->canEncode(zzz); // "true"
            

             So this is not the correct way.

#6 SimonSchroeder wrote:

               I know that problem personally (with Latin-1 encoding used before switching to UTF-8). For ambiguous files you have to decide whether you would rather interpret them as UTF-8 or Windows-1250. However, genuine ambiguity is very unlikely. UTF-8 is a superset of 7-bit ASCII, so for regular ASCII characters the highest bit is 0. If the highest bit of the first byte is 1, multiple bytes encode one character: the count of leading 1s in that byte tells us whether the character has 2, 3, or 4 bytes, and every following continuation byte has to start with the bits 10. It is highly unlikely that an older 8-bit encoding produces multiple characters that follow exactly these UTF-8 restrictions throughout an entire file. It is thus (mostly) safe to first try to read a file as UTF-8 and, on failure, switch to Windows-1250.
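The byte-pattern rules above can be sketched as a standalone structural check (plain C++ rather than Qt, so the snippet stands alone; the function name is made up for illustration):

```cpp
#include <cstddef>

// Structural UTF-8 check following the rules described above: ASCII bytes
// start with a 0 bit, lead bytes start with two to four 1-bits, and every
// continuation byte must start with the bits 10. (A full validator would
// also reject overlong encodings, surrogates, and code points > U+10FFFF.)
bool looksLikeUtf8(const unsigned char *data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        std::size_t extra;
        if      ((b & 0x80) == 0x00) extra = 0;  // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return false;                       // stray continuation byte
        if (extra >= len - i) return false;      // sequence truncated
        for (std::size_t j = 1; j <= extra; ++j)
            if ((data[i + j] & 0xC0) != 0x80)    // must be 10xxxxxx
                return false;
        i += extra + 1;
    }
    return true;
}
```

Note that this also illustrates the "ćąą" example from post #5: its Windows-1250 bytes E6 B9 B9 happen to form one valid 3-byte UTF-8 sequence, which is why the only robust strategy is try-UTF-8-first with a fallback, not a perfect test.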

              Here is some code I pulled from our project:

                  QTextCodec::ConverterState state;
                  QTextCodec *codec = QTextCodec::codecForName("UTF-8");
                  QByteArray byteArray(text);
                  QString str = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
                  if (state.invalidChars > 0)
                  {
                      str = QString::fromLatin1(text);
                  }
              

               text is a char* read in through an old C API. In your case you can read the contents of the file directly into a QByteArray. I would guess that you will have to handle the BOM separately (just check the first three bytes of the QByteArray and strip them if necessary).
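The BOM handling mentioned above can be sketched like this (plain std::string so the snippet stands alone; the function name is made up, and with a QByteArray the equivalent check would be startsWith on the same three bytes):

```cpp
#include <string>

// Strip a leading UTF-8 byte-order mark (EF BB BF) if present, as
// suggested above; any other input is returned unchanged.
std::string stripUtf8Bom(const std::string &bytes)
{
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return bytes.substr(3);  // drop the three BOM bytes
    return bytes;
}
```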

