Converting accented characters to std::string returns mangled text

12 Posts 6 Posters 4.2k Views 2 Watching
    jacobP
    wrote on last edited by jacobP
    #1

    If my QString's value is

    WeŠ.txt
    

    the function QString::toStdString() will return

    WeÅ .txt
    

    whereas the function QString::toStdU16String() will return the actual string

    WeŠ.txt.
    

    Why is that? Š can be encoded in UTF-8, as can be seen here: http://www.fileformat.info/info/charset/UTF-8/list.htm.

      jsulm
      Lifetime Qt Champion
      wrote on last edited by
      #2

      @jacobP How do you show this string? In a console? It could just be an issue with the font or encoding in your console.

          jacobP
          wrote on last edited by jacobP
          #4

          Hi @jsulm ,

          I am viewing the strings using the Visual Studio watcher while debugging. Below is the code I have currently:

          std::string u8 = entry.toUtf8().constData();
          auto u16 = entry.toStdU16String();
          auto u32 = entry.toStdU32String();
          auto wstd = entry.toStdWString();
          auto stds = entry.toStdString();
          

          The variables u8 and stds have the value WeÅ .txt (notice the space between the Å and .txt) while the rest have WeŠ.txt.

          I am trying to use a C library that only takes const char* as input, and it is currently crashing because the strings are mangled.
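
A possible reading of the watch-window output above, offered as an assumption rather than something confirmed in this thread: u8 and stds may hold perfectly valid UTF-8 bytes, and the debugger may simply be showing each byte as a separate 8-bit character. The sketch below uses only standard C++ (no Qt), and the helper name viewBytesAsLatin1 is hypothetical. It re-reads a UTF-8 byte string one byte at a time, the way an 8-bit viewer would, and returns the text such a viewer would display.

```cpp
#include <string>

// Hypothetical helper: treat every byte of `bytes` as one Latin-1 character
// and return the resulting text encoded as UTF-8.
// Assumption: Latin-1 stands in for the viewer's code page. Windows-1252
// agrees with Latin-1 for 0xA0..0xFF but differs in the 0x80..0x9F range.
std::string viewBytesAsLatin1(const std::string& bytes)
{
    std::string shown;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII bytes look the same in both encodings.
            shown += static_cast<char>(b);
        } else {
            // Code points 0x80..0xFF need two bytes in UTF-8.
            shown += static_cast<char>(0xC0 | (b >> 6));
            shown += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return shown;
}
```

Given the 8 bytes W, e, 0xC5, 0xA0, ., t, x, t, the helper returns We, then Å, then a no-break space, then .txt. The input bytes themselves are untouched, which would be consistent with the display being at fault rather than the conversion.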

            mrjj
            Lifetime Qt Champion
            wrote on last edited by
            #5

            @jacobP
            Hi, I wonder what encoding the input is in?
            https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

              jacobP
              wrote on last edited by
              #6

              @mrjj

              They are Windows file names. From what I just looked up, Windows file names are encoded in UTF-16.

                mrjj
                Lifetime Qt Champion
                wrote on last edited by
                #7

                @jacobP
                And are you sure they are not already mangled at the source,
                for example when you read them?

                  jacobP
                  wrote on last edited by
                  #8

                  @mrjj

                  Using the Visual Studio watcher,

                  entry
                  

                  by itself will return WeŠ.txt but

                  entry.toUtf8()
                  

                  will return WeÅ .txt.

                  I can see from my file explorer that the file name is WeŠ.txt.

                    jacobP
                    wrote on last edited by
                    #9

                    I also wanted to point this out: the character Š does not fit in a single byte. Its UTF-8 encoding is the two bytes 197 160 (0xC5 0xA0), and its Unicode code point is 352 (U+0160). Trying to fit this character into char will result in 2 chars, Å (197) and a no-break space (160) respectively.
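
The arithmetic in the post above can be checked with a small sketch. This is standard C++ only (no Qt), the function name encodeUtf8 is hypothetical, and it assumes a valid code point as input (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// Sketch only: assumes `cp` is a valid code point and does no validation.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(352) takes the two-byte branch: 352 >> 6 is 5, so the first byte is 0xC0 | 5 = 0xC5 (197), and 352 & 0x3F is 0x20, so the second byte is 0x80 | 0x20 = 0xA0 (160). Those are the two values quoted above.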

                      mrjj
                      Lifetime Qt Champion
                      wrote on last edited by
                      #10

                      @jacobP
                      Yes, so I do wonder how to get a correct ASCII file name out of that.
                      Give it some time, some of the others might have inputs.

                        hskoglund
                        wrote on last edited by
                        #11

                        @jacobP That C library that only takes const char* as inputs, how old is it? Maybe it's for FAT file systems and not NTFS? (Qt and NTFS are about the same age, roughly 25 years, which is why QString also uses UTF-16.)

                        You could try the technology used before Unicode was invented: code pages. Pros: everything fits in single bytes. Cons: depending on what codepage you set your system for, different characters will be displayed for the same byte :-(

                        QString has a function for converting down from UTF-16 to your current Windows code page: toLocal8Bit(). Example
                        (you also need to #include <windows.h> to enable the ::GetACP() function):

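                        // Assumption: this source file is compiled as UTF-8.
                        QString s("WeŠ.txt");
                        qDebug() << s.toUcs4();       // code point of each character
                        qDebug() << ::GetACP();       // active Windows (ANSI) code page
                        qDebug() << s.toLocal8Bit();  // bytes in the local 8-bit encoding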
                        

                        Output:

                        QVector(87, 101, 352, 46, 116, 120, 116)
                        1252
                        "We\x8A.txt"
                        

                        First, I use toUcs4() to display the contents of the QString as Unicode code points, and the Š is as you say 352. (UCS-4, also known as UTF-32, stores every code point in one 32-bit unit, whereas UTF-16 uses one or two 16-bit units.)

                        Then I query Windows for which code page will be used for the toLocal8Bit() function; on my machine it is 1252, but this will vary from country to country.

                        The final line reveals that on code page 1252 the Š character has the code 0x8A (138 decimal), which fits into a byte. Try giving that QByteArray to your C library...
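
To illustrate the pro and con of code pages described above, here is a minimal sketch in standard C++ (no Qt, no Windows API). The function name toCp1252Sketch is hypothetical and its table is deliberately partial: it covers ASCII, the range U+00A0..U+00FF, and a handful of Windows-1252 specials, and replaces everything else with '?'. Real code would more likely rely on toLocal8Bit() as shown above, or on a full conversion routine.

```cpp
#include <string>

// Hypothetical helper: convert UTF-16 text to Windows-1252 bytes.
// Sketch only: the table is partial, and each half of a surrogate pair
// falls through to '?', so one such character becomes "??".
std::string toCp1252Sketch(const std::u16string& text)
{
    std::string out;
    for (char16_t u : text) {
        if (u < 0x80 || (u >= 0xA0 && u <= 0xFF)) {
            // ASCII and U+00A0..U+00FF keep the same value in Windows-1252.
            out += static_cast<char>(u);
            continue;
        }
        switch (u) {
        case 0x20AC: out += '\x80'; break; // EURO SIGN
        case 0x0160: out += '\x8A'; break; // S WITH CARON (capital)
        case 0x0161: out += '\x9A'; break; // s with caron (small)
        case 0x017D: out += '\x8E'; break; // Z WITH CARON (capital)
        case 0x017E: out += '\x9E'; break; // z with caron (small)
        default:     out += '?';    break; // not representable in this sketch
        }
    }
    return out;
}
```

For the file name in this thread, toCp1252Sketch(u"We\u0160.txt") gives the 7 bytes of "We\x8A.txt", matching the toLocal8Bit() output above. A character outside the code page, such as U+4E2D, comes out as '?', which is the lossy side of this approach.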

                          SGaist
                          Lifetime Qt Champion
                          wrote on last edited by
                          #12

                          Hi,

                          What C library is that?
