About Qt5.5.1's default encoding setting.

    sdjskr wrote (post #1):

    I'm using Qt 5.5.1, the latest version.
    I'm having trouble setting the default encoding. The default encoding was "System"; when I change it to "UTF-8", Qt works well. However, when I change the default encoding to "UTF-16", "UTF-16LE", or "UTF-16BE", the strings in the editor turn into garbage.
    I have to set it back for it to work correctly.

    So I have one question:
    Is there another setting I need to change to make this work, or is this just a Qt bug?

      SGaist (Lifetime Qt Champion) wrote (post #2):

      Hi and welcome to devnet,

      Are you talking about Qt or Qt Creator?

        sdjskr wrote (post #3):

        @SGaist Hi, I am talking about Qt Creator. Do I have to install a Unicode character set manually for Qt, or configure something? As far as I know, Windows has been Unicode-based since Windows NT. I'm currently using Windows 10.

          JKSH (Moderators) wrote (post #4):

          @sdjskr said:

          when I change it to "UTF-8", Qt works well. However, when I change the default encoding to "UTF-16", "UTF-16LE", or "UTF-16BE", the strings in the editor turn into garbage.

          I presume you used Tools -> Options -> Text Editor -> Behavior -> File Encodings -> Default encoding?

          This option tells Qt Creator how to interpret your source code files.

          • If your files are encoded in UTF-8 and you tell Qt Creator to interpret them as UTF-8, then your code will be displayed correctly.
          • If your files are encoded in UTF-16 and you tell Qt Creator to interpret them as UTF-16, then your code will be displayed correctly.
          • If your files are encoded in UTF-8 but you tell Qt Creator to interpret them as UTF-16, then your code will be displayed as garbage.
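
The third case can be illustrated with a minimal sketch in standard C++ (not Qt). It pairs up raw bytes the way a decoder would if it were told the data is UTF-16LE. The function name `reinterpretAsUtf16LE` is made up for this illustration, and nothing here is taken from Qt Creator's actual implementation.

```cpp
#include <cstddef>
#include <string>

// Pair up raw bytes as little-endian 16-bit code units, the way a decoder
// would if it were told the data is UTF-16LE. A trailing odd byte is dropped.
std::u16string reinterpretAsUtf16LE(const std::string& bytes) {
    std::u16string units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}
```

For the ASCII bytes of `main` (6D 61 69 6E) this yields the code units 0x616D and 0x6E69, which fall in the CJK ideograph range rather than being Latin letters. That is the kind of garbage an editor shows when UTF-8 source is read as UTF-16.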

          So my question is: How are your files encoded?

            sdjskr wrote (post #5):

            @JKSH said:

            This option tells Qt Creator how to interpret your source code files.

            Hi!

            That explains everything. I thought the file encoding setting in the Tools menu was for creating a project with the specific encoding that I set. Actually, it is only about how to interpret the files!

            Then how do I create a project encoded in UTF-16LE from the start?
            I haven't found a related option so far.

            Thank you, @JKSH!

              JKSH (Moderators) wrote (post #6):

              You're welcome :)

              @sdjskr said:

              Then how do I create a project encoded in UTF-16LE from the start?
              I haven't found a related option so far.

              I'm not sure, sorry... I've never done that before.

              May I ask why you want to encode your project files in UTF-16LE?

                hskoglund wrote (post #7):

                Hi, just want to add to @JKSH's answer: while it's not possible to create new Qt projects in UTF-16LE, what you can do, once you've created your project and have the files in UTF-8 format, is use iconv to convert them from UTF-8 to UTF-16, e.g.
                iconv -f UTF-8 -t UTF-16 ../main.cpp -o main.cpp

                Note that it's best to specify UTF-16 instead of UTF-16LE as the output format, so that a BOM is written. Then Qt Creator will read and compile your C++ files just fine. However, when I tried it, I couldn't get moc to process the .h files :-( (maybe moc supports UTF-8 flavored files only).
                Also: iconv is a Linux utility; on Windows you have to download it.

                Finally (to repeat @JKSH's question): UTF-8 is the future and UTF-16LE is a format from the '90s; everything will be easier for you if you can use UTF-8 :-)

                  sdjskr wrote (post #8):

                  @hskoglund Hi, thank you for the information.

                  The reason I want to use UTF-16LE is that I ran into some limitations of Qt's basic types when handling UTF-8 encoded files.

                  For example, QChar is two bytes, which means it can hold a letter that fits within 2 bytes, like 'a', 'b', 'c' and so on.
                  However, Korean letters in UTF-8 encoding occupy 3 bytes per letter in memory, e.g. 'e3' '84' 'b1' for 'ㄱ'.

                  That being said, the following code does not make sense.

                  #include <QCoreApplication>
                  #include <QtCore>
                  
                  QTextStream cout(stdout, QIODevice::WriteOnly);
                  
                  int main(int argc, char *argv[])
                  {
                      QCoreApplication a(argc, argv);
                  
                      QChar korean_letter = 'ㄱ';
                  
                      cout << korean_letter << endl;
                      return a.exec();
                  }
                  
                  

                  That shows nothing on the screen.
                  Even this basic code does not work for Asian characters.

                  To make this work with Korean characters I have to use conversion functions with QString.

                   QString letter = QString::fromUtf8("가");
                  

                  There is no option for QChar to convert from a UTF-8 letter, even though QChar itself is in UTF-16 format.
                  Only QChar::fromLatin1() exists. There ought to be a corresponding option like QChar::fromUtf8 or fromLocal8Bit.

                  Anyway, UTF-16 characters are uniformly 2 bytes. That is quite handy for software that needs word counting.

                  In UTF-8 encoded files, some letters are 2 bytes and some are 3 bytes.
                  I have to consider the memory size of each character when two languages are mixed in a sentence. It's time-consuming and a headache.

                  Different solutions for different situations!
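
The byte counts mentioned in this post follow from how UTF-8 maps a code point to 1 to 4 bytes. A minimal sketch in standard C++, with a hypothetical helper name `encodeUtf8`; it does no validation and is only meant to make the arithmetic visible.

```cpp
#include <string>

// Encode one Unicode code point (U+0000..U+10FFFF) as UTF-8 bytes.
// Surrogate values and out-of-range input are not rejected in this sketch.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: includes Korean letters
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: beyond U+FFFF
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this, `encodeUtf8(0x3131)` ('ㄱ') produces the bytes E3 84 B1 quoted above, while `encodeUtf8(U'a')` produces the single byte 61.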

                    Chris Kawa (Lifetime Qt Champion) wrote (post #9):

                    Anyway, UTF-16 characters are uniformly 2 bytes

                    That's not true. UTF-16 is a variable-length encoding (like UTF-8). In UTF-16 a code point is 16 bits. A character can consist of one or more code points. Despite the misleading name, QChar represents a code point, not a character, so some characters may require several QChars to represent them. Note, for example, that there's a surrogateToUcs4 function to convert two QChars to a single UCS-4 letter stored in 32 bits.

                    There is no option to convert from UTF-8 to QChar because, as you pointed out, some UTF-8 characters don't fit into a single UTF-16 code point. To create a sequence of QChars representing a 3-byte UTF-8 character you would use QString::fromUtf8().
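
The variable-length nature of UTF-16 can be sketched in standard C++. One terminology note: the Unicode standard calls the 16-bit values discussed here "code units" and reserves "code point" for the number U+0000..U+10FFFF, so the comments below use those terms. The helper name `encodeUtf16` is hypothetical.

```cpp
#include <string>

// Encode one Unicode code point as UTF-16: one 16-bit code unit for values
// up to U+FFFF, or a surrogate pair (two code units) above that.
// Input validation is omitted in this sketch.
std::u16string encodeUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        const char32_t v = cp - 0x10000;                     // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

Here `encodeUtf16(0xAC00)` ('가') gives a single unit, while `encodeUtf16(0x1F600)` gives the two units 0xD83D 0xDE00, i.e. what would occupy two QChars.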

                      hskoglund wrote (post #10):

                      Hi, I understand your problem a bit more now. (I use Swedish UTF-8 letters in Qt and it's OK, but my problem is with Notepad: if I open a UTF-8 .cpp file containing Swedish letters in Notepad by mistake, Notepad adds a BOM, MSVC2013 compiles it differently, and bom, I get gibberish instead.)

                      Anyway, you shouldn't need to think about which letters are 2 bytes and which are 3 bytes. For example, if we test two Korean letters and one Western letter together:

                          QString threeLetters = "가A가";
                          for (auto c : threeLetters)
                              cout << c << endl;
                      

                      then Qt's string handling will correctly step to the next character, so the output correctly appears on 3 lines (note: this is correct on my Ubuntu 14.04; in a Windows CMD window I also get three lines, but two of them are ?).

                      So my point is, let QString worry about how many bytes each character takes. For example, this will return the correct count of 3:
                      cout << threeLetters.count();

                      P.S. For even more advanced Unicode string handling, you should look at Apple's Swift, where it's forbidden to index into a string because of this problem with 2 or 3 (or even 4) bytes; see the StackOverflow discussion.
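
For comparison, here is a rough sketch of counting letters directly on UTF-8 bytes in standard C++, which is the bookkeeping QString spares you. The helper name `countCodePointsUtf8` is hypothetical and the input is assumed to be well-formed UTF-8. Note that `QString::count()` counts QChars, which matches the letter count here because each of these letters fits in one QChar.

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-8 by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx.
std::size_t countCodePointsUtf8(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte starts a new code point
        }
    }
    return count;
}
```

For the UTF-8 bytes of "가A가" (EA B0 80, 41, EA B0 80) the string is 7 bytes long but the function returns 3.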

                        sdjskr wrote (post #11):

                        @Chris-Kawa

                        Hi Chris!

                        Unlike UTF-8, all UTF-16 code point characters consist of two bytes(16 bits).

                        '0061' for 'a', '0062' for 'b', and as for Korean characters, 'ac00' for '가' and 'b098' for '나'.

                        All of the codes above occupy exactly 2 bytes in memory, whether English or Korean,
                        and the bytes are stored in reverse order on little-endian machines.

                        Like this: '6100' '6200' '00ac' '98b0'

                        You're omitting 1 byte, '00'
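
The byte order described in this post can be reproduced with a small sketch in standard C++. The helper name `hexDumpUtf16` is made up for illustration; it prints each 16-bit unit as two bytes in the chosen order. It only covers the byte layout and says nothing about letters outside the 16-bit range.

```cpp
#include <cstdio>
#include <string>

// Render UTF-16 code units as a hex string of bytes, in little-endian or
// big-endian byte order.
std::string hexDumpUtf16(const std::u16string& units, bool littleEndian) {
    std::string out;
    char buf[3];
    for (char16_t unit : units) {
        const unsigned lo = unit & 0xFF;
        const unsigned hi = (unit >> 8) & 0xFF;
        std::snprintf(buf, sizeof buf, "%02x", littleEndian ? lo : hi);
        out += buf;
        std::snprintf(buf, sizeof buf, "%02x", littleEndian ? hi : lo);
        out += buf;
    }
    return out;
}
```

Calling it with the units for 'a' and '가' in little-endian order gives "610000ac", matching the '6100' and '00ac' values listed above.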

                          sdjskr wrote (post #12):

                          @hskoglund
                          Hi again.

                          Yes, QString handles it exactly as I expected.

                          According to the QString manual,

                          "Internally, QString stores the string using the UTF-16 encoding. Each of the 2 bytes of UTF-16 is represented using a QChar. "

                          If that's true, QChar should support a 2-byte Unicode letter without problems.
                          Actually, it doesn't.

                              QChar english = 'a';
                              QChar korean = 'ㄱ';

                              cout << english << endl;  // working
                              cout << korean << endl;   // not working
                             
                          

                          By the way, the C++ Standard Library handles wide characters without issues.

                               wchar_t korean_letter = L'ㄱ';
                               wcout.imbue(locale("korean"));
                               wcout << korean_letter << endl;  // this shows 'ㄱ' correctly.
                          

                          QChar is the basic unit, yet its behavior is not basic when it comes to Unicode.
                          The conclusion is to use only QString in Qt.

                          Thank you anyway. Best regards!
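
One plausible explanation for the difference, assuming the source file is saved as UTF-8: the narrow literal 'ㄱ' spans three bytes, so it is not a single `char`, whereas L'ㄱ' is one wide character. The sketch below shows both forms of the same letter in standard C++; the constant names are made up for illustration.

```cpp
#include <cstddef>

// The letter U+3131 written as a single UTF-16 code unit...
constexpr char16_t kLetterUtf16 = u'\u3131';
// ...and as the three bytes a UTF-8 encoded source file would contain.
constexpr char kLetterUtf8[] = "\xE3\x84\xB1";
// Length in bytes, not counting the terminating '\0'.
constexpr std::size_t kLetterUtf8Length = sizeof(kLetterUtf8) - 1;

static_assert(kLetterUtf16 == 0x3131, "fits in one 16-bit unit");
static_assert(kLetterUtf8Length == 3, "needs three narrow chars");
```

QChar can also be constructed from a numeric 16-bit value, so something like `QChar(0x3131)` sidesteps the narrow literal. Treat that as a suggestion to try rather than a verified fix for this particular setup; the console still has to be able to display the letter.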

                            Chris Kawa (Lifetime Qt Champion) wrote (post #13):

                            @sdjskr said:

                            Unlike UTF-8, all UTF-16 code point characters consist of two bytes(16 bits).

                            Nope, not true. You are thinking of UCS-2. And you are mixing things up. A code point is not the same as a character. There's no such thing as a "code point character". UTF-16 is a variable-length encoding. A character can be one or two 16-bit code points, i.e. one character can occupy 2 or 4 bytes. A QChar represents a code point, not a character, so some characters will need one QChar and some will need two.

                            UTF-8 is also a variable-length encoding, but with 8-bit code points, and each character can consist of 1 to 4 code points, i.e. a character can occupy from 1 to 4 bytes.

                            From the above it should be clear that not all UTF-8 characters can be converted into a single UTF-16 code point. Some UTF-8 characters require two UTF-16 code points, i.e. two QChars.

                            QString holds a sequence of QChars; that's why you can convert a UTF-8 string into a QString. The number of QChars in the QString can differ from the number of characters.

                            "Internally, QString stores the string using the UTF-16 encoding. Each of the 2 bytes of UTF-16 is represented using a QChar. "

                            That's basically what I said. A QChar represents every 2 bytes (i.e. code point) of UTF-16. It doesn't mean a QChar represents a character. For some it will, for some it's just a half of a character.
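
The difference between the number of 16-bit units and the number of characters can be shown with a short sketch in standard C++. In Unicode terminology the 16-bit values are "code units", and the comments below use that term. The helper name `countCodePointsUtf16` is hypothetical and the input is assumed to be well-formed UTF-16.

```cpp
#include <cstddef>
#include <string>

// Count code points in well-formed UTF-16. A low (trailing) surrogate always
// continues the code point started by the preceding high surrogate, so it is
// not counted on its own.
std::size_t countCodePointsUtf16(const std::u16string& units) {
    std::size_t count = 0;
    for (char16_t unit : units) {
        const bool isLowSurrogate = unit >= 0xDC00 && unit <= 0xDFFF;
        if (!isLowSurrogate) {
            ++count;
        }
    }
    return count;
}
```

For `u"a\uAC00"` both the unit count and the result are 2. For `u"\U0001F600"` the string holds 2 units but the function returns 1, which is the "half of a character" situation described above.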

                              sdjskr wrote (post #14):

                              @Chris-Kawa

                              A code point is also composed of characters, so "code point character" can be used to refer to the code point. Humans are not supposed to speak only in words from the dictionary. We are not robots.

                              Technically, each code point in UTF-16 is basically a 2-byte (16-bit) unit. A 4-byte code point actually holds lead bytes and tail bytes. The basic unit is still 2 bytes. And the 4-byte units are assigned to rarely used characters, which means we don't need to care about 4-byte code points in UTF-16.

                              So saying that UTF-16 is uniformly 2 bytes does make sense.

                              @Chris Kawa said:

                              “That's basically what I said. A QChar represents every 2 bytes (i.e. code point) of UTF-16. It doesn't mean a QChar represents a character. For some it will, for some it's just a half of a character.”

                              If Korean characters were 4-byte code points, that would be reasonable. But all Korean characters are 2-byte code points. QChar treats the same kind of 2-byte code differently: it shows 'a' but not 'ㄱ'.

                              It works for Latin letters, but not for Korean letters.

                              The funny thing is that QChar itself lacks the ability to convert between encodings, while the job gets done inside QString by using some functions.
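
The lead and tail units mentioned in this post can be recognised by their value ranges and combined into one code point. A minimal sketch in standard C++ with hypothetical helper names; it plays the same role as the surrogateToUcs4 function mentioned earlier in the thread, but it is not Qt's implementation.

```cpp
// A lead (high) surrogate lies in D800..DBFF, a tail (low) one in DC00..DFFF.
constexpr bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Combine a lead and a tail surrogate into one code point above U+FFFF.
// The caller is expected to have checked both with the predicates above.
constexpr char32_t combineSurrogates(char16_t high, char16_t low) {
    return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10)
                   + (static_cast<char32_t>(low) - 0xDC00);
}
```

For example, `combineSurrogates(0xD83D, 0xDE00)` yields 0x1F600. The precomposed Hangul syllables (U+AC00 to U+D7A3) sit below the surrogate range, so each of them does fit in a single 16-bit unit.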
