[SOLVED] Problem using C functions in QtCreator
-
wrote on 2 Dec 2014, 22:34 last edited by Chris Kawa 4 Aug 2021, 22:38
Hi,
Over the last few weeks I wrote a small program to handle my file resources. I'm using the Windows resource system.
I did not have QtCreator on my computer during that time, so I called g++ directly from the Windows shell.
The following code worked without errors and the QMessageBox was never shown, which means everything worked fine:
QString File = "C:/Path/To/File/file.exe";
QByteArray bFile = File.toStdString().c_str();
LPCTSTR cFile = (LPCTSTR)bFile.data();
HANDLE hFile = BeginUpdateResource(cFile, false);
if(hFile == NULL)
    QMessageBox::information(&win_main, "Warning", "Unexpected error occured: failed to open library");
Now, being on my own computer again, I created a project using QtCreator and tested the code above. No errors occurred while compiling, but the QMessageBox is always shown, which means the Windows function "BeginUpdateResource" was not called correctly or Qt can't work with the data type "HANDLE"...
Does anyone have a clue what I need to change in my QtCreator to be able to use windows functions/C data types like the one above?
Thank you in anticipation!
-
Qt works just fine with Windows types. Also QtCreator is just an IDE (a fancy text editor) so it doesn't actually compile your code.
The actual type of LPCTSTR depends on the UNICODE define: it is const char* when UNICODE is not defined and const wchar_t* when it is.
My guess is that you previously did not define UNICODE anywhere, so the type was const char* and using c_str() worked.
Creator probably defines UNICODE for you (the non-UNICODE WinAPI is discouraged as obsolete) and so the type becomes const wchar_t*.
You are using a C-style cast (that's why they are evil!) to LPCTSTR, so it compiles without warnings but produces a garbage file name.
To fix this you should use toWCharArray() instead of c_str() and the C-style cast.
Another workaround would be to use BeginUpdateResourceA (notice the A at the end) that uses the char* version explicitly.
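To illustrate why the cast produces a garbage file name, here is a small portable sketch in standard C++ (no Qt or WinAPI needed). The helper name viewAsUtf16LittleEndian is made up for this example; it pairs up the bytes of a narrow string the way a little-endian machine would if that buffer were read as 16-bit units, which is roughly what a W function sees after the C-style cast.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of Qt or WinAPI): pairs up the bytes of a
// narrow string the way a little-endian machine would if the same buffer
// were read as 16-bit units.
std::vector<std::uint16_t> viewAsUtf16LittleEndian(const std::string& narrow)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < narrow.size(); i += 2) {
        const auto low  = static_cast<unsigned char>(narrow[i]);
        const auto high = static_cast<unsigned char>(narrow[i + 1]);
        units.push_back(static_cast<std::uint16_t>(low | (high << 8)));
    }
    return units;
}
```

For "C:/f" this yields the units 0x3A43 and 0x662F, neither of which is 'C', ':', '/' or 'f'. A static_cast from const char* to const wchar_t* would not compile at all, which is the reason to prefer it over the C-style cast.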
-
wrote on 4 Dec 2014, 08:59 last edited by Chris Kawa 4 Aug 2021, 22:38
Hi,
thank you for your reply. Well, that could be the solution for my problem, but I'm not sure.
I tried the following:
QString File = "C:/Path/To/File/file.exe";
wchar_t* wFile;
File.toWCharArray(wFile);
LPCTSTR cFile = (LPCTSTR)*wFile;
HANDLE hFile = BeginUpdateResource(cFile, false);
if(hFile == NULL)
    QMessageBox::information(&win_main, "Warning", "Unexpected error occured: failed to open library");
I also tried it without the C cast, directly passing wFile to the function 'BeginUpdateResource'... the result is always the same:
QMessageBox is displayed, and after 20 seconds my program crashes...
Do you know where the problem could be then?
Where do I check which text format I use (like UNICODE) in QtCreator?
And how would you cast wchar_t* or char* to LPCTSTR without using a C cast?
-
Lifetime Qt Champion wrote on 4 Dec 2014, 10:47 last edited by Chris Kawa 8 Oct 2018, 18:35
Ouch, wFile pointer is not initialized. You're writing over random memory :/
The easy way to check for UNICODE is something like this:
#ifdef UNICODE
qDebug() << "UNICODE!";
#else
qDebug() << "ANSI:(";
#endif
The slightly harder way is to go to the Qt installation directory and open <Qt DIR>/mkspecs/<YOUR COMPILER>/qmake.conf
Near the top there's the DEFINES += line, and for MSVC it indeed defines UNICODE.
As for the proper way to do this, some examples:
QString fileName = "C:/Path/To/File/file.exe";
//toStdWString returns std::wstring and its c_str returns const wchar_t*
//btw. FALSE not false, it's a WinAPI constant
BeginUpdateResource(fileName.toStdWString().c_str(), FALSE);

//if you can't "inline" the conversion.
//toWCharArray returns written size and does not append \0
auto arr = std::make_unique<wchar_t[]>(fileName.size() + 1); //+1 for \0
arr[fileName.toWCharArray(arr.get())] = '\0';
BeginUpdateResource(arr.get(), FALSE);

//or if you don't like C++14:
wchar_t* arr = new wchar_t[fileName.size() + 1];
arr[fileName.toWCharArray(arr)] = '\0';
BeginUpdateResource(arr, FALSE);
delete[] arr; //delete[] because it was allocated with new[]

//or if you don't want dynamic alloc:
wchar_t arr[MAX_PATH]; //MAX_PATH is a define from WinAPI
//assumes fileName is shorter than MAX_PATH, otherwise this overflows
arr[fileName.toWCharArray(arr)] = '\0';
BeginUpdateResource(arr, FALSE);
You don't need to explicitly have LPCTSTR in your code, it's just a typedef for either const char* or const wchar_t*.
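The "allocate size + 1, then write the terminator yourself" pattern can be tried without Qt. The sketch below is only an illustration: copyUnits is a made-up stand-in that mimics the contract described above for toWCharArray (returns the number of characters written, appends no \0), and std::vector is used so nothing has to be deleted by hand.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in with the contract described for QString::toWCharArray:
// writes the characters into dst, returns how many were written,
// and does NOT append a terminating L'\0'.
std::size_t copyUnits(const std::wstring& src, wchar_t* dst)
{
    std::copy(src.begin(), src.end(), dst);
    return src.size();
}

// Builds a buffer that is safe to hand to an API expecting a terminated string.
std::vector<wchar_t> toTerminatedBuffer(const std::wstring& src)
{
    std::vector<wchar_t> buffer(src.size() + 1); // +1 for the terminator
    const std::size_t written = copyUnits(src, buffer.data());
    buffer[written] = L'\0'; // the vector is zero-filled already; a raw array would need this
    return buffer;
}
```

The buffer's data() pointer stays valid for as long as the vector itself is alive, so it can be kept in a named variable across several calls.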
-
wrote on 4 Dec 2014, 23:31 last edited by
Hi again,
that is a very good explanation of how a wchar_t array is defined, thank you for that.
I see, the only problem was using QString::toStdString instead of QString::toStdWString. So UNICODE is defined in my Qt Creator.
Now, just to understand what that means:
Whether UNICODE is defined or not is only important for string handling in my editor? Example:
@QString mystring = "This is a string";@
When I define strings in Qt creator like that, compiler reserves more or less memory for each character, depending on what format is defined (UNICODE or ASCII)? I just don't know why this is relevant for the WinApi function BeginUpdateResource(LPCTSTR, FALSE)? I mean, the LPCTSTR is a fix data type, doesn't matter which format Qt Creator defines for my code, or am I wrong?
My idea is, when my code worked using QString::toStdString().c_str(), I come to the result that the needed parameter is a char* array and it works with my ASCII.
When I now use the Qt Creator and UNICODE is defined, then what changed? What would be the result of using QString::toStdString() on a QString used in Qt Creator with UNICODE ? A double sized char* array?
And what happens then when I use QString::toStdWString() instead?
I also try to get some information about that stuff by googling a bit, but if you have another five minutes for me, I really like your explanations :-)
Thank you in anticipation.
-
Lifetime Qt Champion wrote on 5 Dec 2014, 00:58 last edited by Chris Kawa 8 Oct 2018, 18:36
No problem. Unfortunately you got it a bit backwards ;)
UNICODE is a define used by WinAPI functions. Qt cares nothing about it and it has zero impact on QString or any other Qt type (That is true for the purpose of this topic. The reality is, as usual, a bit more complicated).
QString is always stored as a 2 byte UTF-16 string internally. Nothing changes that.
The reason the Qt config defines UNICODE for you is so that the proper functions from WinAPI are used when needed. This is all because of how Microsoft decided to transition from ANSI to UNICODE versions without changing any user code. The idea was (back around Windows 95) that you just define UNICODE, recompile and your app is "magically" unicode aware. While the solution was good at the time, it has caused countless problems in the decades since.
All (the older) WinAPI functions that take string parameters are defined something like this (for example SetWindowText):
#ifdef UNICODE
#define SetWindowText SetWindowTextW
#else
#define SetWindowText SetWindowTextA
#endif
The W functions take LPCWSTR and A functions take LPCSTR as a parameter.
LPCWSTR is basically (after some macro redirections) a typedef for const wchar_t* (LP stands for "Long Pointer", C stands for "Const" and W stands for "Wide").
LPCSTR is (after the same redirections) const char*.
There is also LPCTSTR (T stands for "Type" if I remember correctly) and it is defined like this:
#ifdef UNICODE
#define LPCTSTR LPCWSTR
#else
#define LPCTSTR LPCSTR
#endif
If you're working with Windows types only the way you would have text is something like this:
LPCTSTR sz = TEXT("Whatever");
The TEXT macro expands the definition to either L"Whatever" when UNICODE is defined or plain "Whatever" when it's not.
So all this works nicely like this:
LPCTSTR sz = TEXT("Whatever");
SetWindowText(hWnd, sz);

//When UNICODE is defined this expands to
const wchar_t* sz = L"Whatever";
SetWindowTextW(hWnd, sz);

//and when it's not
const char* sz = "Whatever";
SetWindowTextA(hWnd, sz);
So you can have both ANSI and UNICODE version of your app just by defining UNICODE. For a long time many apps offered downloads of both versions. Then one version would save a config file in ANSI, another tried to read it as unicode and things would go downhill from there. It was a mess. Thankfully it is (mostly) gone these days. Microsoft discourages the A versions (they are still there for compatibility though) and new APIs only come in the unicode flavor without that UNICODE switch awareness.
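The switching mechanism can be reproduced in miniature without <windows.h>. Everything in this sketch (MY_UNICODE, MyChar, MY_TEXT, describe, describeA, describeW) is invented for illustration and only imitates the shape of the real macros; it is not how the Windows headers are actually written.

```cpp
#include <string>

// All names here are made up for illustration; the real ones live in <windows.h>.
// Uncomment the next line to build the "wide" flavor.
// #define MY_UNICODE

std::string describeA(const char*)    { return "narrow version called"; }
std::string describeW(const wchar_t*) { return "wide version called"; }

#ifdef MY_UNICODE
using MyChar = wchar_t;
#define MY_TEXT(s) L##s
#define describe describeW
#else
using MyChar = char;
#define MY_TEXT(s) s
#define describe describeA
#endif

// The calling code is identical in both builds; only the macros change its meaning.
std::string callThroughMacro()
{
    const MyChar* text = MY_TEXT("Whatever");
    return describe(text);
}
```

With MY_UNICODE left undefined, callThroughMacro() goes through describeA. Defining it flips both the literal type and the function in one go, which is exactly why code that hard-codes char* on one side stops matching.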
-
Lifetime Qt Champion wrote on 5 Dec 2014, 00:58 last edited by Chris Kawa 8 Oct 2018, 18:40
Of course this would all be fine and dandy except WinAPI is not the only API out there and very few libraries care for that UNICODE define nonsense (Qt doesn't). This means that the responsibility to convert strings properly falls to the programmer again, whenever you need to mix WinAPI with other libs.
So we arrive at yet another type of string: std::string and std::wstring. The first one holds plain old chars, and the second one is an abomination, because its wchar_t elements can be whatever the platform wishes. On Windows it's a 2 byte short, on some other platforms it's something like a 4 byte int. But as long as we're on Windows we can assume 2 byte shorts. It's unlikely to change for compatibility reasons.
Both of these types have a c_str method that returns a pointer to the underlying data. It's const char* for std::string and const wchar_t* for std::wstring.
So now we need to transition somehow from that QString 2 byte UTF-16 to either LPCWSTR when using W functions from WinAPI or to LPCSTR when we use A versions. We can do that with arrays (using toWCharArray and toLatin1) or indirectly, converting to the std:: types first and then taking a pointer to their data.
One way to approach this is to use the W and A functions explicitly and not through the general define. This way you are sure what parameters it takes. If you choose the old A versions the downside is that the newer APIs don't offer this variant anymore.
You can also decide to only support one version and use the general functions assuming silently that they will expand to what you want. Of course this will crash if someone later on tries to compile your code with the opposite UNICODE setting (like you did here). If you go for this option it's good to place something like this somewhere in your code:
#ifndef UNICODE
#error This program requires UNICODE define.
#endif
This would at least tell the poor next guy what is going on instead of giving horrible unreadable errors about type mismatches.
But if you want to be "general" and support the obsolete A versions you need to do some more work:
QString str("Whatever");
#ifdef UNICODE
SetWindowText(hWnd, str.toStdWString().c_str());
#else
//toLocal8Bit() may suit non-ASCII text better here, since A functions expect the ANSI code page
SetWindowText(hWnd, str.toStdString().c_str());
#endif
One important thing is to be noted here as this is a common mistake.
This way of converting can only be used inlined in the parameter like above, i.e. this is a (horrible) bug:
QString str("Whatever");
const wchar_t* ptr = str.toStdWString().c_str(); //bug here
SetWindowText(hWnd, ptr);
This is because of the magic of ";".
At the point of ; the temporary instance of std::wstring returned by toStdWString is destroyed and the ptr points to an invalid memory region.
This is also invalid:
QString str("Whatever");
wchar_t* ptr;
str.toWCharArray(ptr);
toWCharArray does not allocate memory. It assumes the pointer you gave it points to enough space to do the conversion. If it doesn't it will happily write over invalid memory region for you and crash the app (if you're lucky).
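For comparison, here are the two safe shapes in plain standard C++, as a rough sketch. makePath and lengthSeenByApi are hypothetical stand-ins (for a function returning a std::wstring by value, like toStdWString, and for a W function that only reads the pointer during the call).

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical stand-in for QString::toStdWString(): returns a std::wstring by value.
std::wstring makePath() { return L"C:/Path/To/File/file.exe"; }

// Hypothetical stand-in for a W function: uses the pointer only during the call.
std::size_t lengthSeenByApi(const wchar_t* text) { return std::wcslen(text); }

std::size_t callInline()
{
    // The temporary lives until the end of this full expression, so the call is safe.
    return lengthSeenByApi(makePath().c_str());
}

std::size_t callWithNamedCopy()
{
    const std::wstring path = makePath(); // lives until the end of the function
    const wchar_t* ptr = path.c_str();    // valid while path is alive and unmodified
    return lengthSeenByApi(ptr);
}
```

The second form is the one to reach for when the pointer is needed on more than one line: give the string a name and the pointer stays valid for as long as that name is in scope.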
And a small epilogue to all that is why your app worked when you called g++ directly and stopped when used from Creator.
When you compile directly the "environment" is clear ie. there are no defines and compiler knows nothing about qmake or Qt. It just compiles some code and doesn't care about much. You had (unknowingly) an assumption in your code that UNICODE is not defined and so the QString to LPCSTR conversion you used via the std::string worked just fine.
When you created a Creator project it is aware that you are using Qt. So it looks for the Qt kit you chose, opens the qmake.conf of that kit and defines whatever is specified there for you. One of those things happens to be the UNICODE define (because that is what MS recommends these days). This means that now the LPCTSTR is actually LPCWSTR (wchar_t*) and WinAPI functions expect wchar_t*. You provided it with char* (using std::string::c_str()). Usually that would not compile but you used the (oh so evil) C-style cast so the compiler just shrugged and assumed you knew what you're doing. That of course did not work like you wanted.
If you made it this far without falling asleep - thanks for reading ;) I hope this is informative and will help you in the future.
-
wrote on 6 Dec 2014, 10:27 last edited by Chris Kawa 4 Aug 2021, 22:39
Hi,
thank you so much for spending lots of time explaining this topic that way.
I knew, or rather I had heard, about the transition from ASCII/ANSI to UNICODE, but I did not really know what it means and in which cases I need to take care of that stuff.
So, if I understood you right, the APIs (WinAPI and other APIs on other OSes) don't handle strings (char arrays) as 1B char* anymore (as they did in the ANSI/ASCII days) but now use 2B wchar_t* (UNICODE format). For reasons of compatibility, WinAPI still offers an obsolete version of its functions (ending with 'A') which still use the 1B char* format.
To tell the WinAPI that I'd like to use the obsolete functions while calling a function like "SetWindowText(handle, str_ptr)", I only need to define ANSI/ASCII. The other way round, I only need to define UNICODE to tell Windows that I'd like to use the WinAPI functions which require a 2B wchar_t* ptr.
What I'm asking myself then is: when I use an API other than WinAPI, can I be sure that this API also provides the obsolete ANSI/ASCII functions and automatically switches when defining the macros (ANSI or UNICODE)?
If not, why should I ever use ANSI or even 1B char* arrays again?
I mean, when I'm right with my assumption, Qt also thinks like I do:
[quote]QString is always stored as a 2 byte UTF-16 string internally. Nothing changes that.[/quote]
So I think it is much better only to use the 2B wchar_t* format as Qt does it by standard?
If the only thing against this is that it could crash if someone other than me works with my code and wants to go back to the ANSI format, then I don't care about that, because no one else but me will work with my code :-P
What I still didn't get is your example of the invalid wchar_t* definition:
QString str("Whatever");
const wchar_t* ptr = str.toStdWString().c_str(); //bug here
SetWindowText(hWnd, ptr);
[quote]At the point of ; the temporary instance of std::wstring returned by toStdWString is destroyed and the ptr points to invalid memory region.[/quote]
Why is the instance of std::wstring destroyed before it could be converted to a char* array?
[quote]If you made it this far without falling asleep – thanks for reading ;) I hope this is informative and will help you in the future.[/quote]
Au contraire! I'm happy to get detailed information about stuff I don't understand and I'm still missing a "like" button for great replies ;-)
-
Lifetime Qt Champion wrote on 6 Dec 2014, 11:18 last edited by Chris Kawa 8 Oct 2018, 18:42
the APIs (WinAPI and other APIs on other OS) don’t handle strings (char arrays) as 1B char* anymore (like it was in times using ANSI/ASCII) but now they use 2B wchar_t*
Windows uses 1 byte char* for ANSI and 2 byte wchar_t* for UNICODE. There's also MBCS (Multi Byte Character Set) but I'm too afraid to talk about it and they already try to phase it out in newer VS :)
Other OSes took a totally different path and they usually use variable length char* utf-8 strings. Which is great but doesn't mix with Windows well at all.
I only need to define ANSI/ASCII
No. There's no such define. No UNICODE define means ANSI, UNICODE means.. well, UNICODE :)
can I be sure that this API also provides the obsolete ANSI/ASCII functions
If it's a 3rd party library there's a 99.999999% chance it doesn't know anything about this commotion. It's a WinAPI only mechanism as far as you are concerned.
If not, why should I ever use ANSI or even 1B char* arrays again
You shouldn't. Not in a new project at least. Only if you have to work with old code or compiled libraries that used it. A single 8 bit unit is not enough to hold much beyond the English alphabet, digits and a handful of extra symbols. char arrays are ok though when used with a variable length encoding like utf-8, where the length of a single character can vary from one byte to several. But WinAPI knows nothing about utf-8.
I mean, when I’m right with my assumption, Qt also thinks like I do
Qt had the pleasure to be created more or less after the whole industry switched to unicode and agreed what unicode is (yes, that varies too). So they did the right thing and ditched 8 bit strings from the start. QString does provide many conversion functions for compatibility though.
So I think it is much better only to use the 2B wchar_t* format as Qt does it by standard?
Qt doesn't use wchar_t as far as I know. As I said wchar_t is not very portable because its size differs on different platforms. Qt uses "some" 2 byte type, whatever that happens to be on the given platform. It's abstracted away from you. You don't have to know what type that is. For the user it's just QChar and it's 2 byte, whatever the underlying language type this might be. That's why Qt is so awesome ;) Use QString whenever you can and only do (careful) conversion on API boundary.
Why is the instance of std::wstring destroyed before it could be conveted to char* array?
First of all when c_str() is called there's no conversion at all. It just returns a pointer to underlying data of std::string. That means that when a string is destroyed the pointer returned by c_str becomes invalid.
Hm, how would I illustrate it better...
Let's say this is our poor-mans std::string:
class MyString {
public:
    MyString() { qDebug() << "constructor"; }
    ~MyString() { qDebug() << "destructor"; }
    char* c_str() { return data; }
private:
    char data[42];
};
Now let's simulate what happens:
//let's say this is our poor-mans QString::toStdString:
MyString toMyString() { return MyString(); }

//this is the call site:
qDebug() << "Hello";
char* ptr = toMyString().c_str();
qDebug() << "Bye";
qDebug() << "I'd like to use the ptr now :(";
The output is:
Hello
constructor
destructor
Bye
I'd like to use the ptr now :(
So as you see the MyString is destroyed before we arrive at using the ptr. It is destroyed at ; because it's a temporary (an r-value in c++ lingo). It has no name or a reference you could access it through. E's passed on! This string is no more! It has ceased to be! E's expired and gone to meet 'is maker! E's a stiff! Bereft of life, 'e rests in peace... sorry, I got a little Monty Pythonic there :P
Now let's modify this to use it in a parameter:
//let's say this is one of our WinAPI functions that take char*
void someFunc(char* ptr) { qDebug() << "ptr is still valid! yay!"; }

//We'll do the conversion "in place"
qDebug() << "Hello";
someFunc(toMyString().c_str());
qDebug() << "Bye";
Now the output is:
Hello
constructor
ptr is still valid! yay!
destructor
Bye
Which is what we wanted. This works because the string returned from toMyString() is still destroyed at the ; but this time that happens after we already used it inside the function. Note though, that if that function stores the pointer somewhere and tries to use it later, it will be invalid just like before. But WinAPI functions don't do that. They use the string and don't hold on to it after the return.
-
Lifetime Qt Champion wrote on 6 Dec 2014, 11:42 last edited by Chris Kawa 8 Oct 2018, 18:43
So I think it is much better only to use the 2B wchar_t* format as Qt does it by standard?
Basically when working with WinAPI you should prefer the UNICODE version of it. This is especially important when working with user provided data, like paths. If you try to convert something like "C:\bądźmy\poważni\zdjęcie.jpg" to std::string and pass a pointer to it to an A function, the result will be garbage because of the characters from outside of ASCII.
-
wrote on 6 Dec 2014, 13:17 last edited by
Ok, I'm not sure if I got it. You say, ANSI/ASCII is a format only used by windows, but the basic datatype char is platform independent, right? Please correct me if I'm wrong, but I understand it like that:
The basic datatype char is always 1B and contains numbers. ANSI/ASCII is only a "decryption table" that changes numbers to characters (letters, signs, numbers). Each OS has its own "decryption table".
UNICODE is an OS independent "decryption table" that changes one wchar_t (like a char with double size for storing more different signs) to a character (letter or sign or number).
Is that the way it goes?
[quote]So as you see the MyString is destroyed before we arrive at using the ptr. It is destroyed at ; because it’s a temporary (an r-value in c++ lingo). It has no name or a reference you could access it through.[/quote]
Ahh, the problem is that the temporary std::wstring has no identifier, so it is only "created" while compiling to work with it, but it is never stored into any memory (stack or heap), so the c_str() pointer can never point to anything?
Sounds logical! Thanks :-)
-
Lifetime Qt Champion wrote on 6 Dec 2014, 15:54 last edited by Chris Kawa 8 Oct 2018, 18:55
so it is only “created” while compiling to work with it, but it is never stored into any memory (stack or heap)
No, it is very much created and in memory(on the stack). It's just that it is destroyed right away. It's exactly like if you wrote something like this in your code:
QString();
An instance of a string is created and then destroyed right away because you didn't give it any name. If you wrote instead
QString s;
It would create an instance of a string, bind it to name "s" and it will be destroyed when "s" goes out of scope.
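The difference between the unnamed and the named instance can be observed directly. This is a minimal sketch in standard C++; Tracker and runDemo are hypothetical names, and the log vector simply records the order of events instead of printing them.

```cpp
#include <string>
#include <utility>
#include <vector>

// Records its own construction and destruction in a log owned by the caller.
struct Tracker {
    Tracker(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name)) { log_.push_back(name_ + " created"); }
    ~Tracker() { log_.push_back(name_ + " destroyed"); }
    Tracker(const Tracker&) = delete;
    Tracker& operator=(const Tracker&) = delete;
private:
    std::vector<std::string>& log_;
    std::string name_;
};

std::vector<std::string> runDemo()
{
    std::vector<std::string> log;
    {
        Tracker{log, "temporary"};   // unnamed: destroyed at the ';'
        log.push_back("after temporary");
        Tracker named{log, "named"}; // named: destroyed at the closing brace
        log.push_back("end of scope");
    }
    return log;
}
```

The log shows the temporary being destroyed before "after temporary" is recorded, while the named object survives until the scope ends.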
As for the chars and ANSIs... sigh, it's a topic so vast it's easy to make unintentional shortcuts and mislead people.
You need to distinguish various things.
First there are storage types: char, wchar_t, short, int etc. These are just units of storage. The fact they have "char" in the name is unfortunate, because they might hold anything really, not only characters. Maybe it would be better if char was called "byte" and wchar_t was called "multibyte", but we're stuck with those names. But don't think of them as characters because it's easy to get lost later. They are units of storage, nothing else.
The second thing are encodings. These are the "tables", as you called them, that assign numbers to characters. There are countless encodings. One of them is ASCII (7 bit). There are also "families" of encodings, like ANSI, which is an extension of ASCII and represents 8 bit encodings of national characters. Examples include Latin-1 (ISO 8859-1) and the closely related Windows-1252. Unicode is also a family and there are different encodings that get called that.
Third thing is the character encoding, which is basically how these encodings above are stored using the physical storage units.
For example: ASCII needs 7 bits of storage to encode a single character. char type in c++ is 8 bit in size (on most platforms, not all) so it is commonly used to store ASCII encoded characters.
UTF-8 is one of the Unicode encodings and takes 1 to 4 bytes per character. The ASCII subset is also valid UTF-8 and is stored in 1 byte. Some national characters require 2 bytes and some Chinese characters or weird pictograms can require up to 4 bytes. This means UTF-8 is a variable length encoding, e.g. "być" has length 3 but requires 4 bytes to encode in UTF-8 (1 for b, 1 for y, 2 for ć). The smallest UTF-8 letter requires 8 bits of storage so, again, char is the usual storage unit used for it. Now you see why char is not a good name? In UTF-8 a single character can require anywhere from 1 to 4 chars (bytes).
You can look for more info on encodings like UTF-16 or UCS-2 on the web. Wikipedia has this described in great detail.
UNICODE is just a (very poorly named) define on the Windows platform used to make WinAPI use UTF-16 encoding. UTF-16 needs at least 2 bytes of storage for a character and it so happens that on Windows wchar_t is 2 bytes in size, so that's what was chosen specifically for WinAPI. Other libraries store UTF-16 in different ways, e.g. in a short, since on 32 bit Windows that is also 2 bytes. Some others use uint16_t and so on. wchar_t is not portable and it would be (if I recall correctly) 4 bytes on Linux, which is wasteful for UTF-16 there, but might be used e.g. for UTF-32. Similarly UTF-16 on Linux is rarely used. The most common encoding there is UTF-8 stored in chars.
WinAPI uses either ANSI (whichever codepage is used for your language) or UTF-16 (which they just unfortunately call UNICODE).
At some point they also recognized the existence of variable length encodings like UTF-8 and so they created something called MBCS (Multi Byte Character Set) and also methods to convert stuff to the general UTF-16. To make it easier for everybody they named it horribly, like MultiByteToWideChar which converts anything to UTF-16. Stay away from these if you can. Qt has it all covered and named a lot better.
If you're not using WinAPI, defining UNICODE or not does nothing to anything.
If you do use WinAPI you should define UNICODE (qmake does that for you) and stick to the wchar_t for the strings, or the WinAPI defines like LPCTSTR.
If you also use Qt, stick to QString, and if you need to pass it to WinAPI make sure UNICODE is defined and use the conversions I posted previously.
-
wrote on 7 Dec 2014, 09:37 last edited by
Ok, so the r-values are created on stack and immediately deleted after the ';'.
What I'm asking myself then is, why does the following line in my code work:
@QString myString = "Some String";
LPVOID lpMyString = (LPVOID)myString.toStdWString().c_str();@
I think the pointer returned by c_str() will point to invalid memory, so the long pointer void (LPVOID) should also do that, right? I'm wondering why there is no error and my program works just fine..
[quote]Third thing is the character encoding, which is basically how these encodings above are stored using the physical storage units.
For example: ASCII needs 7 bits of storage to encode a single character. char type in c++ is 8 bit in size (on most platforms, not all) so it is commonly used to store ASCII encoded characters.
UTF-8 is one of Unicode encodings that takes 1 to 4 characters, eg. the ASCII subset is also a valid utf-8 character set and is stored on 1 byte. Some national characters require 2 bytes and some chinese characters or weird pictograms can require up to 4 bytes. This means UTF-8 is a variable length encoding eg. “być” has length 3 but requires 4 bytes to encode in utf-8 (1 for b, 1 for y, 2 for ć). smallest UTF-8 letter requires 8 bits of storage so, again, char is a usual storage unit used for it. Now you see why char is not a good name? In UTF-8 a single character can require anywhere from 1 to 4 chars (bytes).[/quote]
So do I understand you right:
On every OS, the datatype 'char' (1B) exists and is used to store every encoding format (ANSI/ASCII, UNICODE etc..). The only difference is, ANSI/ASCII always uses only 1 char (1B) for each sign, while Unicode encodings (e.g. UTF-8) contain 'big' signs like pictograms, which use from 1 to 4 chars (1 to 4 bytes). So a Unicode sign can either be a single char or a char array char [4], right?
The different thing now is UTF-16, because it does not use the datatype char but another, non-portable datatype like wchar_t (Windows), short, uint16_t and some more. So with UTF-16, the minimum physical storage unit is 2B.
Is this all correct?
If yes, I have three other questions:
- Why isn't there a portable 2B datatype, like there is for 1B (char), which can be used on every OS?
- You said, UTF-8 uses from 1 till 4 chars (1-4B), how many wchar_ts/uint16_ts.. does UTF-16 use? I think, because UTF-8 is able to contain every needed sign in every language with a range of 1-4Byte, so UTF-16 should only need from 1-2 wchar_ts, right?
- Why does WinAPI use UTF-16 instead of UTF-8? I mean, UTF-8 uses the portable char datatype, so there is no conversion needed, right?
-
Lifetime Qt Champion wrote on 7 Dec 2014, 12:07 last edited by Chris Kawa 8 Oct 2018, 19:02
I’m wondering why there is no error and my program works just fine..
Because this is the wonder of undefined behavior in C++ ;) There is no "syntax error". Compiler can't help you here. It means that it can be anything from working just fine (right now on your computer in debug mode for example) to formatting hard drives and calling dragons (on a user's computer in release mode under full moon next year). Plug in static analysis tool and it will tell you the same.
It works (or it might seem that) because there's high chance that there's nothing allocated in that place for the next few lines, so the bytes are still there (deleting an object does not zero the memory). But if for example MS Paint happens to allocate a byte in that spot or the next variable on the stack is created in the same place you're out of luck. Try allocating something big in the next line and before using that pointer eg. int foo[ 1000 ] and fill it with data. There's a chance (not a certainty) that you will see your string corrupted.
So a Unicode sign can either be a single char or a char array char [4] right?
Well yes, except of course when you have a string you don't allocate each character separately. For example:
const char* foo = u8"być"; //u8 is c++11 utf-8 string prefix
foo[0] == 0x62 == 'b'
foo[1] == 0x79 == 'y'
foo[2] == 0xC4 == first half of 2 byte character 'ć'
foo[3] == 0x87 == second half of 2 byte character 'ć'
0xC4 0x87 == 'ć'
The sad consequence of this encoding is that all the standard string functions from C like strlen don't work, because they really count bytes, not characters. so strlen(u8"być") will return 4 :(
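Counting characters instead of bytes is still possible, it just has to be done by hand. The following is a minimal sketch, assuming the input is valid UTF-8; countCodePoints is a made-up helper name. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so only the other bytes start a new code point.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes (10xxxxxx).
// Assumes the input is valid UTF-8; no validation is done.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++count; // lead byte or plain ASCII
    }
    return count;
}
```

For the bytes of "być" (written with explicit escapes as "by\xC4\x87") strlen reports 4 while countCodePoints reports 3. Note that this counts code points, which is still not always what a user would call a "letter" (combining accents are separate code points).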
So do I understand you right:
Mostly yes. One thing to note is that not all Unicode encodings are variable length. There are things like UCS-2 but more on that in a moment.
As to your questions:
-
History, legacy, hardware. There's too much of it all. For example there's a large group of people (eg. the Linux crowd) that think 2 bytes for every character is a waste. So they prefer the variable length encodings like utf-8. Some old hardware can't handle a word (2 bytes) in a single instruction etc. There are maaany unrelated reasons. You need to remember that c++ is not a new thing. It was created in times when dinosaurs roamed free and hardware it had to run on was very limited (eg. 1MHz processor was a monster). Have you wondered why the LPCTSTR starts with Long Pointer? Is there a Short Pointer and what the heck is it? Back then there was and it was pretty important thing ;)
C++11 actually defines portable 2 byte types: uint16_t and char16_t (the latter is meant for UTF-16 code units). But there's just too much software/hardware that was created before 2011 ;)
-
Sigh, this is one of these shortcuts I was talking about. In principle you are right. 2 code units are enough to encode the whole Unicode character set in UTF-16. Unfortunately before there was UTF-16 there was also UCS-2, which is basically a subset of UTF-16, just fixed at a size of 2 bytes and not variable length like UTF-16. For most practical uses UCS-2 is enough and there is a lot of software claiming support for UTF-16 when it actually supports only UCS-2. I'm not really sure what's the case with Qt here. It is supposed to use UTF-16 and I'm reluctant to second guess that but I never checked if it can correctly hold characters greater than 0xFFFF. A nice exercise if you're bored ;)
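To make the "1 or 2 code units" point concrete, here is a small sketch of the UTF-16 encoding rule in standard C++. encodeUtf16 is a hypothetical helper written for this illustration and assumes it is given a valid code point (not a surrogate, at most 0x10FFFF).

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (not a surrogate, at most 0x10FFFF).
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit; this is the part UCS-2 can represent.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: 20 bits split into two 10-bit halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

U+0107 (ć) comes out as the single unit 0x0107, while U+1F600 (an emoji) becomes the pair 0xD83D 0xDE00. Software that only handles UCS-2 treats such a pair as two separate "characters".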
-
WinAPI is almost as old as I am (which is pretty old :P). It was originally all about ANSI and code pages, until around Windows 95 MS realized that "holy crackers! there's a world outside USA and they don't speak English!". They already had a small hell with all the codepages and a lot of software that did crazy things with them. Imagine what a word processor had to do to correctly handle e.g. Polish, Swedish and Japanese characters in the same sentence. The language support in Win95 was abysmal to say the least and MS had a tough cookie to crack because they wanted to ship to Asian markets and there was no single codepage that could handle that.
UTF-8 was created around 1993. It was new and slow. Why slow? To do a simple string length calculation you need to look at every character. Can't just do (end - begin) like with the constant size characters (see the strlen example). Sorting and other operations also become more complicated. If you get a pointer to the middle of a string you can't right away tell what letter is there because you might be in the middle of it. Back then computers were too slow for that so it was a big no-no. A standard for the future if you will.
But MS needed something there and then so if a single byte was not enough the obvious choice was to just double that and worry later (as they usually did back then). This way it was also easier to do the ANSI to UNICODE switch that I described.
All in all Unicode is hard to handle correctly because there's no simple < relation between letters anymore and it can vary between languages. It's a complex design really. Many programmers don't appreciate that thinking that it's as easy to use as ASCII. It's not.
Variable length encodings are good for compression (that's why the web adopted UTF-8 so fast) but constant size is a lot easier to transform, split and generally work with.
UTF-16 is not a bad encoding. A little wasteful most of the time (for ASCII subset) but that's not a big deal these days (IMHO). Qt seems to do a very good job with it and with the various SSE optimizations it performs pretty well.
-
-
wrote on 7 Dec 2014, 12:50 last edited by Chris Kawa 4 Aug 2021, 23:01
Thank you very much for that great lesson. I'm happy to know a little bit about that stuff now and I'm sure it will be helpful for others who visit this topic anytime in the future :-)
Well, maybe one last thing you mentioned at the end:
Variable length encodings are good for compression (that’s why the web adopted UTF-8 so fast)
Let's take an ANSI string with 10 characters --> so the char array has a length of 10B. How could any UTF format compress this string to less than 10B?
-
Sorry, compression might not have been a good word. They both zip pretty good probably :)
I meant the amount of information to storage size ratio. The same character from ASCII subset needs 7 bits out of 8 in UTF-8 and 7 out of 16 in UTF-16.
Since HTML syntax is all ASCII, UTF-8 takes a lot less bandwidth than UTF-16, while still is able to hold the whole Unicode range if needed. There's no difference in size between UTF-8 and ASCII HTML (not taking into account the actual content, just the syntax: tags, braces etc.).
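The size ratio is easy to compute. This is a rough sketch with two made-up helpers, utf8Bytes and utf16Bytes, which take text as a sequence of code points (std::u32string) and return how many bytes each encoding would need, assuming all code points are valid.

```cpp
#include <cstddef>
#include <string>

// Bytes needed to store the given code points in UTF-8.
std::size_t utf8Bytes(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text) {
        if (cp < 0x80) bytes += 1;         // ASCII
        else if (cp < 0x800) bytes += 2;   // e.g. most European national letters
        else if (cp < 0x10000) bytes += 3; // rest of the Basic Multilingual Plane
        else bytes += 4;                   // everything above 0xFFFF
    }
    return bytes;
}

// Bytes needed to store the given code points in UTF-16.
std::size_t utf16Bytes(const std::u32string& text)
{
    std::size_t bytes = 0;
    for (char32_t cp : text) bytes += (cp < 0x10000) ? 2 : 4;
    return bytes;
}
```

For the pure ASCII markup U"<p>Hi</p>" this gives 9 bytes in UTF-8 against 18 in UTF-16, while for U"by\u0107" (być) it is 4 against 6.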
Of course for a text with a lot of national characters this difference shrinks.
-
wrote on 7 Dec 2014, 15:54 last edited by
Ahh ok, that makes sense to me :-)
Thank you really much one more time!