About Qt5.5.1's default encoding setting.
-
@sdjskr said:
When I change it to "UTF-8", Qt works well. However, when I change the default encoding to "UTF-16", "UTF-16LE", or "UTF-16BE", the strings in the editor become garbage.
I presume you used Tools -> Options -> Text Editor -> Behavior -> File Encodings -> Default encoding?
This option tells Qt Creator how to interpret your source code files.
- If your files are encoded in UTF-8 and you tell Qt Creator to interpret them as UTF-8, then your code will be displayed correctly.
- If your files are encoded in UTF-16 and you tell Qt Creator to interpret them as UTF-16, then your code will be displayed correctly.
- If your files are encoded in UTF-8 but you tell Qt Creator to interpret them as UTF-16, then your code will be displayed as garbage.
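To make the third case concrete, here is a minimal sketch in plain C++ (no Qt) of what happens when UTF-8 bytes are read as if they were UTF-16LE. The helper name `readAsUtf16Le` is made up for this illustration, and this is not how Qt Creator is implemented; it only shows why the result looks like garbage.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Reinterpret raw bytes as UTF-16LE code units (low byte first).
// Hypothetical helper for illustration; a trailing odd byte is ignored.
std::vector<std::uint16_t> readAsUtf16Le(const std::string& bytes) {
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}
```

For example, the UTF-8 bytes of "A가" are 41 EA B0 80. Read as UTF-16LE they become the two units 0xEA41 and 0x80B0, neither of which is 'A' (0x0041) or '가' (0xAC00).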
So my question is: How are your files encoded?
-
@JKSH said:
This option tells Qt Creator how to interpret your source code files.
Hi!
"This option tells Qt Creator how to interpret your source code files."
That explains everything. I thought the file encoding setting in the Tools menu was for creating a project with the specific encoding that I set. Actually, it is just about how to interpret the files!
Then how do I create a project encoded in UTF-16LE from the start?
I haven't found the related option so far.
Thank you, @JKSH!
-
You're welcome :)
@sdjskr said:
Then how do I create a project encoded in UTF-16LE from the start?
I haven't found the related option so far.
I'm not sure, sorry... I've never done that before.
May I ask why you want to encode your project files in UTF-16LE?
-
Hi, I just want to add to @JKSH's answer: while it's not possible to create new Qt projects in UTF-16LE, once you've created your project and have the files in UTF-8 format you can use iconv to convert them from UTF-8 to UTF-16LE, e.g.
iconv -f UTF-8 -t UTF-16 ../main.cpp -o main.cpp
Note that it's best to specify UTF-16 instead of UTF-16LE as the output format, so that a BOM is created. Then Qt Creator will read and compile your C++ files just fine. However, when I tried, I couldn't get moc to process the .h files :-( (maybe moc supports UTF-8 flavored files only).
Also: iconv is a Linux utility; on Windows you have to download it.
Finally (to repeat @JKSH's question): UTF-8 is the future and UTF-16LE is a format from the 90s; everything will be easier for you if you can use UTF-8 :-)
-
@hskoglund Hi, thank you for the information.
The reason I want to use UTF-16LE is that I felt some limitation of Qt basic types when handling UTF-8 encoded files.
For example, QChar is two bytes, which means it can hold any letter that fits in 2 bytes, like 'a', 'b', 'c' and so on.
However, Korean letters in UTF-8 encoding occupy 3 bytes per letter in memory, e.g. 'e3' '84' 'b1' for 'ㄱ'.
That being said, the following code does not work as intended.
#include <QCoreApplication>
#include <QtCore>

QTextStream cout(stdout, QIODevice::WriteOnly);

int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);
    QChar korean_letter = 'ㄱ';
    cout << korean_letter << endl;
    return a.exec();
}
That shows nothing on the screen.
Even this basic code does not work for Asian characters.
To accomplish this with a Korean character I have to use a conversion function with QString.
QString letter = QString::fromUtf8("가");
There is no way for QChar to convert from a UTF-8 letter, even though QChar itself is in UTF-16 format.
Only QChar::fromLatin1() exists. We ought to have a corresponding function like QChar::fromUtf8 or fromLocal8Bit.
Anyway, UTF-16 characters are uniformly 2 bytes. That's quite handy for software that needs to count words.
In UTF-8 encoded files, some letters are 2 bytes and some are 3 bytes.
I have to consider the memory size of each character when two languages are mixed in a sentence. It's time-consuming and a headache.
Various solutions for various situations!
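For reference, the 3 bytes of a UTF-8 Korean letter and its 2-byte UTF-16 value are the same code point, just packed differently. Below is a minimal sketch of decoding one UTF-8 sequence into its code point in plain C++. `decodeFirstUtf8` is a made-up helper, not a Qt function, and it does not reject overlong forms or surrogates.

```cpp
#include <cstddef>
#include <string>

// Decode the first UTF-8 encoded code point in `s`.
// Returns U+FFFD (replacement character) on empty or malformed input.
char32_t decodeFirstUtf8(const std::string& s) {
    if (s.empty()) return 0xFFFD;
    const auto b0 = static_cast<unsigned char>(s[0]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (b0 < 0x80)                { len = 1; cp = b0; }         // 0xxxxxxx
    else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }  // 110xxxxx
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }  // 1110xxxx
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }  // 11110xxx
    else return 0xFFFD;
    if (s.size() < len) return 0xFFFD;
    for (std::size_t i = 1; i < len; ++i) {
        const auto b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return 0xFFFD;  // continuation bytes are 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}
```

With this, the bytes E3 84 B1 decode to 0x3131 ('ㄱ') and EA B0 80 decode to 0xAC00 ('가'), both of which fit in 16 bits.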
-
Anyway, UTF-16 characters are uniformly 2Bytes
That's not true. UTF-16 is a variable-length encoding (like UTF-8). In UTF-16 a code unit is 16 bits, and a character can consist of one or more code units. Despite its misleading name, QChar represents a 16-bit code unit, not a character, so some characters require several QChars to represent them. Note, for example, that there's a surrogateToUcs4 function to convert two QChars into a single UCS-4 value stored in 32 bits.
There is no function to convert from UTF-8 to QChar because, as you pointed out, some UTF-8 characters don't fit into a single UTF-16 code unit. To create a sequence of QChars representing a 3-byte UTF-8 character you would use
QString::fromUtf8()
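To show what "two QChars for one character" means, here is a small sketch of the surrogate arithmetic in plain C++ using `char16_t`. The function names `toUtf16` and `fromSurrogates` are made up for this example (the second does the same kind of calculation as the surrogateToUcs4 function mentioned above), and the sketch assumes valid input, i.e. a code point up to U+10FFFF that is not itself a surrogate.

```cpp
#include <string>

// Encode one code point as UTF-16: one unit for U+0000..U+FFFF,
// a high/low surrogate pair for everything above.
std::u16string toUtf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Combine a surrogate pair back into a single 32-bit code point.
char32_t fromSurrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

'가' (U+AC00) comes out as a single unit, while U+1F600 comes out as the pair D83D DE00.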
-
Hi, I understand your problem a bit more now. (I use Swedish UTF-8 letters in Qt and it's OK, but my problem is with Notepad: if I open a UTF-8 .cpp file containing Swedish letters in Notepad by mistake, Notepad adds a BOM, MSVC2013 compiles it differently, and boom, I get gibberish instead.)
Anyway, you shouldn't need to think about which letters are 2 bytes and which are 3 bytes. For example, if we test 2 Korean letters and one Western letter together:
QString threeLetters = "가A가";
for (auto c : threeLetters)
    cout << c << endl;
then Qt's string handling will correctly step to the next character, so the output is correctly split over 3 lines (note: correct on my Ubuntu 14.04; in a Windows CMD window I also get three lines, but two of them are '?').
So my point is: let QString worry about how many bytes each character takes. For example, this will return the correct count of 3:
cout << threeLetters.count();
P.S. For even more advanced Unicode string handling, you should look at Apple's Swift, where it's forbidden to index into a string because of this problem with 2 or 3 (or even 4) bytes; see the StackOverflow discussion.
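If you ever do have to count letters in raw UTF-8 yourself (without QString), it's less painful than it sounds. A minimal sketch, assuming well-formed UTF-8 input; `countCodePoints` is a made-up name, and it counts code points, which is not always the same as user-visible characters:

```cpp
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is
// NOT a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the UTF-8 bytes of "가A가" (7 bytes in total) this gives 3, the same number as in the QString example above.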
-
Hi Chris!
Unlike UTF-8, all UTF-16 code point characters consist of two bytes (16 bits).
'0061' for 'a', '0062' for 'b', and as for Korean characters, 'ac00' for '가', 'b098' for '나'.
All of the codes above occupy exactly 2 bytes in memory, whether English or Korean,
and they are stored byte-swapped on little-endian machines, like this: '6100' '6200' '00ac' '98b0'.
You're omitting 1 byte, '00'
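The byte order described above can be sketched like this in plain C++. `toUtf16LeBytes` is a hypothetical helper that writes each 16-bit unit low byte first, independent of the machine it runs on:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units as little-endian bytes (low byte, then high byte).
std::vector<std::uint8_t> toUtf16LeBytes(const std::u16string& units) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        bytes.push_back(static_cast<std::uint8_t>(u & 0xFF));
        bytes.push_back(static_cast<std::uint8_t>(u >> 8));
    }
    return bytes;
}
```

For 'a' followed by '가' (units 0x0061 and 0xAC00) the resulting bytes are 61 00 00 AC.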
-
@hskoglund
Hi again.
Yes, QString handles it exactly as I expected.
According to the QString manual,
"Internally, QString stores the string using the UTF-16 encoding. Each of the 2 bytes of UTF-16 is represented using a QChar. "
If that's true, QChar should support a 2-byte Unicode letter without problems.
Actually, it doesn't.
QChar english = 'a';
QChar korean = 'ㄱ';
cout << english << endl; // working
cout << korean << endl; // not working
By the way, C++ Standard Library handles wide characters without issues.
wchar_t korean_letter = L'ㄱ';
wcout.imbue(locale("korean"));
wcout << korean_letter << endl; // this shows 'ㄱ' correctly
QChar is the basic unit, yet its behavior is not basic when it comes to Unicode.
The conclusion is to use only QString in Qt.
Thank you anyway. Best regards!
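A likely explanation for the difference (this is my assumption, not something confirmed in the thread): in a UTF-8 source file the plain literal 'ㄱ' has three bytes between the quotes, so it cannot be a single `char`, and compilers typically treat it as an implementation-defined multi-character constant. The wide literal L'ㄱ' asks the compiler for one wide value, which is why the wchar_t version behaves. The sketch below shows the same distinction in plain C++, with escapes so it does not depend on the file's encoding; the constant names are made up.

```cpp
// U+3131 (ㄱ) as a single UTF-16 code unit.
constexpr char16_t kKiyeok = u'\u3131';

// The same letter as UTF-8 bytes: three bytes plus the terminating '\0'.
constexpr char kKiyeokUtf8[] = "\xE3\x84\xB1";

static_assert(kKiyeok == 0x3131, "one 16-bit unit holds the letter");
static_assert(sizeof(kKiyeokUtf8) - 1 == 3, "the UTF-8 form needs three bytes");
```

Under the same assumption, constructing the QChar from the numeric value, e.g. `QChar(0x3131)`, should avoid the literal problem, although whether it then prints correctly still depends on the stream's codec and the console.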
-
@sdjskr said:
Unlike UTF-8, all UTF-16 code point characters consist of two bytes(16 bits).
Nope, not true. You are thinking of UCS-2. And you are mixing things up. A code point is not the same as a character. There's no such thing as a "code point character". UTF-16 is a variable-length encoding. It uses one or two 16-bit code units per code point, i.e. one character can occupy 2 or 4 bytes. A QChar represents a code unit, not a character, so some characters will need one QChar and some two.
UTF-8 is also a variable-length encoding, but with 8-bit code units, and each character can consist of 1 to 4 code units, i.e. a character can occupy from 1 to 4 bytes.
From the above it should be clear that not all UTF-8 characters can be converted into a single UTF-16 code unit. Some UTF-8 characters require two UTF-16 code units, i.e. two QChars.
QString holds a sequence of QChars; that's why you can convert a UTF-8 string into a QString. The number of QChars in the QString can differ from the number of characters.
"Internally, QString stores the string using the UTF-16 encoding. Each of the 2 bytes of UTF-16 is represented using a QChar. "
That's basically what I said. A QChar represents every 2 bytes (i.e. one code unit) of UTF-16. That doesn't mean a QChar represents a character. For some characters it will; for others it's just half of a character.
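To illustrate how the number of 16-bit units can differ from the number of characters, here is a minimal sketch with `std::u16string` standing in for a QString. `countCodePoints16` is a made-up helper and assumes well-formed UTF-16 (every low surrogate is preceded by a high surrogate).

```cpp
#include <cstddef>
#include <string>

// Count code points in UTF-16 by skipping low surrogates (0xDC00..0xDFFF),
// which are always the second half of a pair.
std::size_t countCodePoints16(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        if (u < 0xDC00 || u > 0xDFFF) ++n;
    }
    return n;
}
```

For '가', 'A' and U+1F600 in one string, `size()` reports 4 units while the function reports 3 code points.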
-
A code point is also made up of characters, so "code point character" could be used to refer to a code point. Humans are not supposed to speak only the words in the dictionary. We are not robots.
Technically, each code point in UTF-16 is basically a 2-byte (16-bit) unit. A 4-byte code point actually holds lead bytes and tail bytes. Still, the basic unit is 2 bytes. And the 4-byte form is assigned to rarely used characters, which means we don't need to care about 4-byte code points in UTF-16.
So, "UTF-16 is uniformly 2 bytes" does make sense.
@Chris Kawa said:
“That's basically what I said. A QChar represents every 2 bytes (i.e. one code unit) of UTF-16. That doesn't mean a QChar represents a character. For some characters it will; for others it's just half of a character.”
If Korean characters were 4-byte code points, that would be reasonable. But all Korean characters are 2-byte code points. QChar handles the same kind of 2-byte code differently: it shows 'a' but not 'ㄱ'.
For Latin letters it works; for Korean letters it does not.
The funny thing is that QChar itself lacks the ability to convert between encodings, while the job gets done inside QString by using some functions.