ISO-8859-1 and UTF-8 Hell 🤯

10 Posts 4 Posters 3.4k Views
    Adam Crowe
    wrote on 28 Apr 2020, 09:58 last edited by
    #1

    Hello!

    Due to unfortunate circumstances, a file that was created in C# was saved with ISO-8859-1 encoding even though it also contained Unicode characters. For those familiar with C#, the method used was Encoding.Default.GetBytes(). The problem never came to light because on Windows, in the original app, the file decodes correctly via the same method. I am now porting the app to Qt and am having trouble decoding the Unicode characters in the file.

    Take this character for example:

    ↕
    

    It is saved as 8 bytes in the file instead of 3. The eight bytes are:

    0xC3 0xA2 0xE2 0x80 0xA0 0xE2 0x80 0xA2
    

    So it gets interpreted as:

    â†•
    

    I have managed to use two online tools to do the conversion manually but cannot find a way to replicate this process in Qt code. I can take the three characters â†• and convert them to their respective hex representations, using ISO-8859-1 as the character encoding, here:

    https://www.rapidtables.com/convert/number/ascii-to-hex.html

    This way â†• becomes:

    0xE2 0x86 0x95
    

    I can then paste these three hex bytes to the Hex to UTF-8 tool below and get the right character:

    https://onlineutf8tools.com/convert-hexadecimal-to-utf8

    I need to somehow take the original eight bytes, or the three characters they get interpreted as, and make one Unicode character out of them. I think I've tried every permutation of QTextEncoder with ISO-8859-1 and UTF-8 codecs but I never get the right results.

    Any help would be massively appreciated!

      KroMignon
      wrote on 28 Apr 2020, 12:16 last edited by
      #2

      @Adam-Crowe Just to be sure I understand your problem correctly: you want a way to import your ISO-8859-1 content from a file into a QString?
      IMHO, the easiest way is to read the file content into a QByteArray and then use QString::fromLatin1().

      Like this:

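      #include <QFile>
      #include <QString>

      QFile latin1File(filePath);
      QString fileData;
      if (latin1File.open(QIODevice::ReadOnly))
      {
          // Read the raw bytes first, then decode them as Latin-1 (ISO-8859-1).
          const QByteArray rawData = latin1File.readAll();
          latin1File.close();
          fileData = QString::fromLatin1(rawData);
      }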
      
      

        Adam Crowe
        wrote on 28 Apr 2020, 18:40 last edited by
        #3

        @KroMignon Thank you very much for this but sadly it doesn't work as expected. The characters that are meant to be unicode get read as:

        ââ\u0080 Ã¢\u0080\u009C
        

        It's like the original ↕ character is encoded twice or something.
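The "encoded twice" effect can be reproduced with a minimal sketch in plain C++ (no Qt). It assumes the writer took the UTF-8 bytes of ↕ (E2 86 95), read each byte as a Windows-1252 character, and then saved those characters as UTF-8. Windows-1252 rather than strict ISO-8859-1 is an assumption here, based on bytes 0x86 and 0x95 turning into † and •, which exist in Windows-1252 but not in ISO-8859-1. The names kCp1252High, fromCp1252, appendUtf8 and doubleEncode are made up for illustration.

```cpp
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. Unassigned bytes (0x81, 0x8D,
// 0x8F, 0x90, 0x9D) are mapped to the C1 control of the same value (assumption).
static const char32_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Decode one Windows-1252 byte to a Unicode code point.
char32_t fromCp1252(unsigned char b) {
    if (b >= 0x80 && b <= 0x9F) return kCp1252High[b - 0x80];
    return static_cast<char32_t>(b);
}

// Append one code point as UTF-8. Only code points below U+10000 are handled,
// which covers every character Windows-1252 can produce.
void appendUtf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Treat every byte of already-encoded UTF-8 text as a Windows-1252 character
// and encode the resulting characters as UTF-8 a second time.
std::string doubleEncode(const std::string& utf8Bytes) {
    std::string out;
    for (char c : utf8Bytes)
        appendUtf8(out, fromCp1252(static_cast<unsigned char>(c)));
    return out;
}
```

Calling doubleEncode("\xE2\x86\x95") yields the bytes C3 A2 E2 80 A0 E2 80 A2, which match the bytes listed in the first post.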

          Pablo J. Rogina
          wrote on 28 Apr 2020, 18:59 last edited by
          #4

          @Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

          Take this character for example:
          ↕

          It is saved as 8 bytes in the file instead of 3. The eight bytes are:
          0xC3 0xA2 0xE2 0x80 0xA0 0xE2 0x80 0xA2

          So it gets interpreted as:
          â†•

          Well, the interpretation seems to be OK. A UTF-8 character takes between 1 and 4 bytes, and these bytes split into sequences of 2, 3 and 3 bytes, so in this case you end up with 3 characters.

          I pasted all those hex values into the online tool you mentioned (https://onlineutf8tools.com/convert-hexadecimal-to-utf8) and I got... well, 3 characters.

          in the original app, the file decodes correctly via the same method.

          Perhaps the original app was doing something else beyond interpreting Unicode chars, i.e. "composing" the arrow char from all those 3 resulting Unicode characters.
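To check how the bytes from the first post split into UTF-8 characters, here is a small sketch in plain C++ that reads the length of each sequence from its lead byte. It does not validate continuation bytes, and the function names utf8SequenceLength and utf8SequenceLengths are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of a UTF-8 sequence judged from its lead byte; 0 means the byte
// cannot start a sequence (it is a continuation byte or an invalid value).
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Walk a byte string and collect the length of each UTF-8 sequence.
// Stops early if a byte is not a lead byte or the input is truncated.
std::vector<int> utf8SequenceLengths(const std::string& bytes) {
    std::vector<int> lengths;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const int n = utf8SequenceLength(static_cast<unsigned char>(bytes[i]));
        if (n == 0 || i + static_cast<std::size_t>(n) > bytes.size()) break;
        lengths.push_back(n);
        i += static_cast<std::size_t>(n);
    }
    return lengths;
}
```

For the bytes C3 A2 E2 80 A0 E2 80 A2 this returns the lengths 2, 3 and 3, which is why a UTF-8 decoder reports three characters.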

            Adam Crowe
            wrote on 28 Apr 2020, 20:41 last edited by
            #5

            @Pablo-J-Rogina That's exactly what I experienced. For whatever reason, in C# those same 3 characters that you ended up with get decoded (encoded?) into one single Unicode arrow char. And it's breaking my head how that's happening and why it's working correctly in .NET.

              Pablo J. Rogina
              wrote on 28 Apr 2020, 20:52 last edited by
              #6

              @Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

              how that's happening

              Don't you have access to the C# source code to check?

              why it's working correctly in .Net.

              I wouldn't say "correctly" :-)
              Getting 3 Unicode characters combined into just one doesn't sound good, unless the application is doing that on purpose

                KroMignon
                wrote on 28 Apr 2020, 21:46 last edited by
                #7

                @Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

                That's exactly what I experienced. For whatever reason, in C# those 3 characters that you ended up with as well get decoded (encoded?) into one single unicode arrow char. And it's breaking my head how that's happening and why it's working correctly in .Net.

                Hmm, that is very strange. Perhaps you can try some other converters:

                fileData = QString::fromLocal8Bit(fileData);
                

                Or try a different QTextCodec.
                First find out the active code page on your Windows system with chcp:

                c:\>chcp
                Active code page: 850
                

                And then use the corresponding codec:

                QTextCodec *codec = QTextCodec::codecForName("IBM850");
                fileData = codec->toUnicode(fileData);
                

                  Adam Crowe
                  wrote on 29 Apr 2020, 20:09 last edited by
                  #8

                  @KroMignon Thank you very much but unfortunately I have tried this. The Windows encoding used is definitely ISO-8859-1. I can debug the encoder code and that's what it reports.

                  I tried QString::fromLocal8Bit and many variations of QTextCodec processing :(

                  I fear that I will have to fix this at the source with the Windows software. Bugger.

                    Christian Ehrlicher
                    Lifetime Qt Champion
                    wrote on 29 Apr 2020, 20:58 last edited by Christian Ehrlicher
                    #9

                    @Adam-Crowe said in ISO-8859-1 and UTF-8 Hell 🤯:

                    I fear that I will have to fix this at the source with the Windows software.

                    Correct, since the bytes you give in your first post are neither valid Latin-1 nor UTF-8 nor anything else.

                    Run iconv -f utf-8 -t cp1252 file.txt, then interpret the result as UTF-8 data.

                    /edit: got it

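                    #include <QDebug>
                    #include <QTextCodec>

                    const char *str = "\xC3\xA2\xE2\x80\xA0\xE2\x80\xA2";
                    // Step 1: decode the stored bytes as UTF-8, giving the three characters â†•.
                    const QString utf8_1 = QString::fromUtf8(str);
                    // Step 2: encode those characters as CP1252, recovering the bytes E2 86 95.
                    QTextCodec *tc = QTextCodec::codecForName("CP1252");
                    const QByteArray ba = tc->fromUnicode(utf8_1);
                    // Step 3: decode the recovered bytes as UTF-8 to get the single character.
                    const QString utf8_2 = QString::fromUtf8(ba);
                    qDebug() << utf8_2.size() << utf8_2 << QString::number(utf8_2.at(0).unicode(), 16);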
                    

                    -->
                    1 "↕" "2195"
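Why this works can be sketched without Qt: decoding the stored bytes as UTF-8 gives the characters â†•, and each of those characters corresponds to exactly one Windows-1252 byte, so mapping them back recovers the original UTF-8 bytes of ↕. The plain C++ below is only an illustration under that assumption. The names kCp1252High, toCp1252 and repairMojibake are hypothetical, the decoder does only basic validation, and the handling of the five unassigned Windows-1252 bytes is a guess.

```cpp
#include <cstddef>
#include <string>

// Windows-1252 code points for bytes 0x80..0x9F. Unassigned bytes (0x81, 0x8D,
// 0x8F, 0x90, 0x9D) are mapped to the C1 control of the same value (assumption).
static const char32_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178};

// Map a code point back to its Windows-1252 byte, or -1 if it has none.
int toCp1252(char32_t cp) {
    for (int i = 0; i < 32; ++i)
        if (kCp1252High[i] == cp) return 0x80 + i;
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) return static_cast<int>(cp);
    return -1;
}

// Decode `in` as UTF-8 and write the Windows-1252 byte of every decoded
// character to `out`. Returns false on malformed UTF-8 or on a character
// that Windows-1252 cannot represent.
bool repairMojibake(const std::string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        if (lead < 0x80) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;
        if (len == 0 || i + len > in.size()) return false;

        // Payload bits of the lead byte, then 6 bits per continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (c & 0x3F);
        }

        const int byte = toCp1252(cp);
        if (byte < 0) return false;
        out += static_cast<char>(byte);
        i += len;
    }
    return true;
}
```

Passing the bytes C3 A2 E2 80 A0 E2 80 A2 to repairMojibake fills the output with E2 86 95, the UTF-8 encoding of U+2195. A conversion like this is only meaningful for text known to be double-encoded: correctly encoded non-ASCII text would be turned into single bytes that are no longer valid UTF-8.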

                      Adam Crowe
                      wrote on 30 Apr 2020, 19:41 last edited by
                      #10

                      @Christian-Ehrlicher Holy Smokes Batman! It actually worked!!!!!!!!!!!!

                      How on Earth?!?!? I would have never thought of this solution. I'm really shocked. It's working. It's working perfectly.

                      Hats off and a massive thank you Sir!!!! 🙌
