QString::fromUtf16 useless size parameter?
-
I'm reading the documentation for QString::fromUtf16. It says:

"Returns a QString initialized with the first size characters of the Unicode string unicode (ISO-10646-UTF-16 encoded)."

I tested it out, and "characters" doesn't mean bytes. I guess size is the number of symbols it tries to parse? But how am I supposed to know that before parsing the string, since the encoding is variable length? It seems like a chicken-and-egg problem, making size (unless defaulting to -1) quite useless.

Or does size mean the number of two-byte pairs? In that case it's kind of confusing, since the number of two-byte pairs isn't necessarily the number of UTF-16 symbols due to the variable length.

In my case the bytes lie in a buffer I cannot control, which isn't null terminated. It seems like my best shot is to copy the string over to another buffer, null terminate it and then pass size -1, which is kind of dumb.

Is there something I'm missing here? Is the documentation for QString::fromUtf16 broken, or is the size parameter just kind of useless?
-
It's the count of utf16 characters.
-
Now I've got two different replies here: one says it is the number of UTF-16 characters, the other that it's the number of two-byte pairs.
Are you sure it is the number of two-byte pairs? Because if I read the specification for ISO-10646-UTF-16, which the documentation refers to, it specifically defines a character like this: "In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers".
-
@potatis Dude, think about this from the programming side: why does QString::fromUtf16 take const ushort * as the input data type? Because it treats one ushort as one character. You can also check in the QString source code how it gets the size when you pass -1:

    if (size < 0) {
        size = 0;
        while (unicode[size] != 0)
            ++size;
    }
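
To make the behaviour of that loop concrete, here is a minimal standalone sketch of the same scan. The function name utf16UnitLength is hypothetical and this is not Qt's actual code; it only assumes, as the snippet above suggests, that the length is counted in 16-bit units up to the first zero unit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the length scan quoted above (not Qt's code).
// Counts 16-bit code units up to, but not including, the first zero unit.
std::size_t utf16UnitLength(const char16_t *unicode)
{
    assert(unicode != nullptr);
    std::size_t size = 0;
    while (unicode[size] != 0)
        ++size;
    return size;
}
```

For a zero-terminated array holding 0x0026, 0xD800, 0xDE83, this scan counts three units, even though the last two units together encode a single character.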
-
@potatis
I see the problem with these "unusual" characters which actually require a pair of 16-bit wide characters. However, these are obscure characters, and for all I know they may actually cause problems in UTF-16 code, from what I have seen looking around.

I cannot justify it or be certain, but I believe you will find that the size required, and returned by wcsnlen_s(), is always the number of 16-bit words required. Just IMO. You might like to actively seek out an example where a pair of 16-bit words is required for a "character" and verify it, so that you know.
-
I tried it out now with:

    #include <qstring.h>
    #include <iostream>

    int main(int argc, char *argv[])
    {
        // Unicode characters AMPERSAND (U+0026) and LYCIAN LETTER BH (U+10283)
        ushort p[] = {0x0026, 0xd800, 0xde83};
        QString s = QString::fromUtf16(p, 3);
        std::cout << s.size() << " " << s.toStdString() << std::endl;
    }

It prints out the two characters, but with a size of 3, which is confusing since size() should print out the number of characters in the string. And in the documentation Qt has a different definition of "character" than the UTF-16 standard they refer to. But anyway, that's how things are then. It clears some things up for me.

In conclusion: the size parameter of QString::fromUtf16 means the number of two-byte pairs (16-bit code units), and not the number of characters as per ISO-10646-UTF-16.
-
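
Given that conclusion, the size for a raw byte buffer that is not null terminated would simply be the byte count divided by two, with no parsing needed. The following is a rough sketch under stated assumptions: the helper name unitsFromBytes is hypothetical, the buffer is assumed to hold native-endian UTF-16, and the standard library is used instead of Qt so the example stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: copies a raw, non-terminated byte buffer that is
// assumed to hold native-endian UTF-16 into a std::u16string.
std::u16string unitsFromBytes(const unsigned char *bytes, std::size_t byteCount)
{
    assert(bytes != nullptr || byteCount == 0);
    // Number of whole 16-bit code units; a trailing odd byte is ignored.
    const std::size_t unitCount = byteCount / sizeof(char16_t);
    std::u16string units(unitCount, u'\0');
    if (unitCount > 0)
        std::memcpy(&units[0], bytes, unitCount * sizeof(char16_t));
    return units;
}
```

The memcpy sidesteps alignment concerns with the foreign buffer. Assuming a Qt version that provides the char16_t overload of QString::fromUtf16, the resulting data() and size() could then be passed to it directly; if the original buffer is already suitably aligned, the unit count alone should be enough and the copy can be skipped.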
See https://doc.qt.io/qt-6/qstring.html#details
"QString stores a string of 16-bit QChars, where each QChar corresponds to one UTF-16 code unit. (Unicode characters with code values above 65535 are stored using surrogate pairs, that is, two consecutive QChars.)"
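
If the number of Unicode characters (code points) is what you actually want, it has to be computed separately from the number of 16-bit units. Here is a minimal sketch of such a count; countCodePoints is a hypothetical helper, not a Qt function, and it treats an unpaired surrogate as one character.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts Unicode code points in `size` UTF-16 code units.
// A high surrogate (0xD800..0xDBFF) directly followed by a low surrogate
// (0xDC00..0xDFFF) counts as one code point; anything else counts singly.
std::size_t countCodePoints(const char16_t *data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < size; ++i) {
        const bool high = data[i] >= 0xD800 && data[i] <= 0xDBFF;
        const bool nextLow = i + 1 < size
            && data[i + 1] >= 0xDC00 && data[i + 1] <= 0xDFFF;
        if (high && nextLow)
            ++i; // skip the low half of the surrogate pair
        ++count;
    }
    return count;
}
```

For the three units 0x0026, 0xD800, 0xDE83 used in the test program earlier in the thread, this count is 2, while the number of 16-bit units (and the value reported by size() there) is 3.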