Is toLocal8bit() safe in my situation?

General and Desktop · 6 Posts · 3 Posters · 1.0k Views
#1 qwe3 wrote:

    Hi,

    I have a file encoded as UTF-8 without a BOM. I have to load data from that file (only one line) and hold it in a QString variable. Then I have to convert this data from that QString into another QString, but with UTF-8 encoding. So I have:

    QFile file(R"(C:\Users\tom\Desktop\myFile.txt)");
    file.open(QIODevice::ReadOnly);
    QTextStream textStream(file.readAll());
    textStream.setCodec(QTextCodec::codecForLocale());
    QString stringVar = textStream.readLine();
    
    ...
    
    QTextStream textStream2(stringVar.toLocal8Bit(), QIODevice::ReadOnly);
    textStream2.setCodec(QTextCodec::codecForName("UTF-8"));
    QString stringVar2 = textStream2.readLine();
    

    And it works. But the toLocal8Bit() documentation says:

    Returns the local 8-bit representation of the string as a QByteArray. The returned byte 
    array is undefined if the string contains characters not supported by the local 8-bit encoding.
    

    Windows-1250 (my locale codec) defines 251 of the 256 possible byte values. I tried a case where the file contains a two-byte character that includes a byte which does not appear in Windows-1250 (0x81): qDebug showed 'Ă\u0081' when I printed stringVar, and then the proper character when I printed stringVar2. Everything was OK. So is using toLocal8Bit() safe in my situation?

    I can't load the data from the file using UTF-8 - I have to load it using the locale codec.

#2 ChrisW67 wrote:

      Your requirement seems to be, retrieve the first line of a file and output it UTF-8 encoded. I assume you mean you want the output to have a UTF-8 byte-order-mark.

      You tell us you have a file known to be UTF-8 encoded text. Why are you telling QTextStream to treat it as your Windows 8-bit encoding (i.e. QTextCodec::codecForLocale())? This will probably not end well if there is anything other than basic single-byte characters in the input file.

      I think you are overthinking the problem.

      Using this as input, which contains UTF-8 multi-byte characters:

      Lorem convert αβγ ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
      tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam...
      

      and this code:

      #include <QCoreApplication>
      #include <QFile>
      #include <QTextStream>
      #include <QTextCodec>
      
      int main(int argc, char *argv[])
      {
          QCoreApplication a(argc, argv);
      
          QFile file("/tmp/testin");
          if (file.open(QIODevice::ReadOnly)) {
            QTextStream inStream(&file);
            inStream.setCodec("UTF-8");
            QString firstLine = inStream.readLine();
            file.close();
      
            QFile outFile("/tmp/testout");
            if (outFile.open(QIODevice::WriteOnly)) {
                QTextStream outStream(&outFile);
                outStream.setCodec("UTF-8");
                outStream.setGenerateByteOrderMark(true);
                outStream << firstLine << Qt::endl;
                outFile.close();
            }
          }
      
          return 0;
      }
      

      I get this output:

      Lorem αβγ ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod 
      

      which is this as characters and hex:

      $ od -a -tx1 testout 
      0000000   o   ;   ?   L   o   r   e   m  sp   N   1   N   2   N   3  sp
               ef  bb  bf  4c  6f  72  65  6d  20  ce  b1  ce  b2  ce  b3  20
      0000020   i   p   s   u   m  sp   d   o   l   o   r  sp   s   i   t  sp
               69  70  73  75  6d  20  64  6f  6c  6f  72  20  73  69  74  20
      0000040   a   m   e   t   ,  sp   c   o   n   s   e   c   t   e   t   u
               61  6d  65  74  2c  20  63  6f  6e  73  65  63  74  65  74  75
      0000060   r  sp   a   d   i   p   i   s   c   i   n   g  sp   e   l   i
               72  20  61  64  69  70  69  73  63  69  6e  67  20  65  6c  69
      0000100   t   ,  sp   s   e   d  sp   d   o  sp   e   i   u   s   m   o
               74  2c  20  73  65  64  20  64  6f  20  65  69  75  73  6d  6f
      0000120   d  sp  nl
               64  20  0a
      0000123
      

      You can see that the multi-byte UTF-8 characters survive and the output has a BOM.

#3 qwe3 wrote:

        @ChrisW67 Thank you.

        Maybe I should explain more. I don't know which encoding my input file will have. There are two possibilities:

        1. UTF-8 without BOM
        2. windows-1250

        If the file has Windows-1250 encoding and I load it as UTF-8, I can lose data. So I would like to load both kinds of file as Windows-1250. When the file is encoded as Windows-1250, there is no problem. When the file is UTF-8 without a BOM, maybe there is no problem.

        Can you tell me more about this statement?

            This will probably not end well if there is anything other than basic single-byte characters in the input file.

        For example, my file is encoded as UTF-8 and I load it as Windows-1250. The file contains a multi-byte character, for example U+00C1, whose UTF-8 encoding is C3 81. Windows-1250 does not define the byte 0x81; see https://en.wikipedia.org/wiki/Windows-1250. So my question is: is this safe? I tried loading this character (U+00C1) from a file into my app using the locale codec and everything was OK.

#4 ChrisW67 wrote:

          @qwe3: To handle encoding changes accurately you need to know the encodings involved. That is the reason the result of mismatches is undefined (your quote from the docs) and why I said it will probably not end well.

          If you use my code and change "UTF-8" to "Windows-1250" on the inStream you have the equivalent of what you describe. With similar UTF-8 encoded input:

          $ od -t c -t x1 testin
          0000000   L   o   r   e   m     316 261 316 262 316 263       i   p   s
                   4c  6f  72  65  6d  20  ce  b1  ce  b2  ce  b3  20  69  70  73
          0000020   u   m       d   o   l   o   r       s   i   t       a   m   e
                   75  6d  20  64  6f  6c  6f  72  20  73  69  74  20  61  6d  65
          0000040   t   ,       c   o   n   s   e   c   t   e   t   u   r       a
                   74  2c  20  63  6f  6e  73  65  63  74  65  74  75  72  20  61
          0000060   d   i   p   i   s   c   i   n   g       e   l   i   t   ,    
                   64  69  70  69  73  63  69  6e  67  20  65  6c  69  74  2c  20
          0000100   s   e   d       d   o       e   i   u   s   m   o   d      \n
                   73  65  64  20  64  6f  20  65  69  75  73  6d  6f  64  20  0a
          0000120
          

          you get this out:

          $ od -t c -t x1 testout
          0000000 357 273 277   L   o   r   e   m     303 216 302 261 303 216 313
                   ef  bb  bf  4c  6f  72  65  6d  20  c3  8e  c2  b1  c3  8e  cb
          0000020 233 303 216 305 202       i   p   s   u   m       d   o   l   o
                   9b  c3  8e  c5  82  20  69  70  73  75  6d  20  64  6f  6c  6f
          0000040   r       s   i   t       a   m   e   t   ,       c   o   n   s
                   72  20  73  69  74  20  61  6d  65  74  2c  20  63  6f  6e  73
          0000060   e   c   t   e   t   u   r       a   d   i   p   i   s   c   i
                   65  63  74  65  74  75  72  20  61  64  69  70  69  73  63  69
          0000100   n   g       e   l   i   t   ,       s   e   d       d   o    
                   6e  67  20  65  6c  69  74  2c  20  73  65  64  20  64  6f  20
          0000120   e   i   u   s   m   o   d      \n
                   65  69  75  73  6d  6f  64  20  0a
          0000131
          

           That is, the multi-byte UTF-8 code points in testin (ce b1 ce b2 ce b3) get mangled into (c3 8e c2 b1 c3 8e cb 9b c3 8e c5 82) in the output. If this were really "working", you would expect a UTF-8 input to pass through unchanged.

          It is possible that on Windows these conversions happen slightly differently (perhaps because Qt uses a system function) but it is plain that it is not portable or reliable.
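The byte-doubling that mangles the Greek letters can be sketched without Qt. The helper below (a made-up name, not from the thread) decodes each raw byte as its own code point and re-encodes it as UTF-8; Latin-1 is used as a simplified stand-in for Windows-1250 (the two codecs happen to agree for the bytes CE and B1, so this reproduces the first four mangled output bytes, c3 8e c2 b1, though not the Windows-1250-specific cb 9b and c5 82).

```cpp
#include <string>

// Decode each input byte as the code point with the same numeric value
// (Latin-1) and re-encode as UTF-8. Every byte >= 0x80 expands to two
// bytes, which is exactly the doubling visible in the od dump above.
std::string latin1ToUtf8(const std::string &latin1)
{
    std::string utf8;
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            utf8 += static_cast<char>(c);                  // ASCII unchanged
        } else {
            utf8 += static_cast<char>(0xC0 | (c >> 6));    // lead byte 110000xx
            utf8 += static_cast<char>(0x80 | (c & 0x3F));  // continuation 10xxxxxx
        }
    }
    return utf8;
}
```

Feeding the UTF-8 bytes of α (CE B1) through this yields C3 8E C2 B1: the round trip through an 8-bit codec is not lossless for multi-byte input.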

#5 qwe3 wrote:

            @ChrisW67 Do you know a way to check if some UTF-8 character exists in windows-1250?

             For example, I have the letter sequence "ćąą" in a file with Windows-1250 encoding. When I load it, my app sees the bytes 0xE6 0xB9 0xB9, and interpreted as UTF-8 that is a valid CJK character. Of course it does not exist in Windows-1250. I would like a way to detect this using Qt.

            I tried:

            QByteArray byteArray;
            byteArray.resize(3);
            byteArray[0]=0xE6;
            byteArray[1]=0xB9;
            byteArray[2]=0xB9;
            QTextStream textStream(byteArray);
            textStream.setCodec(QTextCodec::codecForName("UTF-8"));
            QString zzz = textStream.readLine();
            qDebug()<<QTextCodec::codecForLocale()->canEncode(zzz); // "true"
            

             So this is not the correct way.

#6 SimonSchroeder wrote:

               I know that problem personally (with Latin-1 encoding used before switching to UTF-8). For ambiguous files you have to decide whether you would rather interpret them as UTF-8 or Windows-1250. However, genuine ambiguity is very unlikely. UTF-8 is a superset of 7-bit ASCII, so for regular ASCII characters the highest bit is 0. If the highest bit of the first byte is 1, multiple bytes encode one character: the count of leading 1s in that byte tells us whether the character has 2, 3, or 4 bytes, and every following continuation byte has to start with the bits 10. It is highly unlikely that an older 8-bit encoding produces multiple characters that follow exactly these UTF-8 restrictions throughout an entire file. It is thus (mostly) safe to first try to read a file as UTF-8 and, on failure, switch to Windows-1250.
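The byte-pattern rules above can be sketched as a standalone structural check (plain C++ rather than Qt, so the snippet stands alone; the function name is made up for illustration):

```cpp
#include <cstddef>

// Structural UTF-8 check following the rules described above: ASCII bytes
// start with a 0 bit, lead bytes start with two to four 1-bits, and every
// continuation byte must start with the bits 10. (A full validator would
// also reject overlong encodings, surrogates, and code points > U+10FFFF.)
bool looksLikeUtf8(const unsigned char *data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        unsigned char b = data[i];
        std::size_t extra;
        if      ((b & 0x80) == 0x00) extra = 0;  // 0xxxxxxx: ASCII
        else if ((b & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((b & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((b & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return false;                       // stray continuation byte
        if (extra >= len - i) return false;      // sequence truncated
        for (std::size_t j = 1; j <= extra; ++j)
            if ((data[i + j] & 0xC0) != 0x80)    // must be 10xxxxxx
                return false;
        i += extra + 1;
    }
    return true;
}
```

Note that this also illustrates the "ćąą" example from post #5: its Windows-1250 bytes E6 B9 B9 happen to form one valid 3-byte UTF-8 sequence, which is why the only robust strategy is try-UTF-8-first with a fallback, not a perfect test.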

              Here is some code I pulled from our project:

                  QTextCodec::ConverterState state;
                  QTextCodec *codec = QTextCodec::codecForName("UTF-8");
                  QByteArray byteArray(text);
                  QString str = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
                  if (state.invalidChars > 0)
                  {
                      str = QString::fromLatin1(text);
                  }
              

               text is a char* read in through an old C API. In your case you can read the contents of the file directly into a QByteArray. I would guess that you will have to handle the BOM separately (just check the first three bytes of the QByteArray and strip them if necessary).
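The BOM handling mentioned above can be sketched like this (plain std::string so the snippet stands alone; the function name is made up, and with a QByteArray the equivalent check would be startsWith on the same three bytes):

```cpp
#include <string>

// Strip a leading UTF-8 byte-order mark (EF BB BF) if present, as
// suggested above; any other input is returned unchanged.
std::string stripUtf8Bom(const std::string &bytes)
{
    if (bytes.size() >= 3 && bytes.compare(0, 3, "\xEF\xBB\xBF") == 0)
        return bytes.substr(3);  // drop the three BOM bytes
    return bytes;
}
```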

