Converting accented characters to std::string returns mangled text
If my QString's value is WeŠ.txt, the function QString::toStdString() will return WeÅ .txt,
whereas the function QString::toStdU16String() will return the actual string.
Why is that? Š is a UTF-8 character, as can be seen here: http://www.fileformat.info/info/charset/UTF-8/list.htm.
@jacobP How do you show this string? In console? Could be just an issue with the font/encoding in your console.
Hi @jsulm ,
I am viewing the strings using the Visual Studio watcher while debugging. Below is the code I have currently:
std::string u8 = entry.toUtf8().constData();
auto u16 = entry.toStdU16String();
auto u32 = entry.toStdU32String();
auto wstd = entry.toStdWString();
auto stds = entry.toStdString();
The variables u8 and stds have the value WeÅ .txt (notice the space between the A and .txt) while the rest have WeŠ.txt.
I am trying to use a C library that only takes const char* as input, and it is currently crashing because the strings are mangled.
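One plausible explanation for what the watcher shows: the std::string holds valid UTF-8 bytes, but the viewer displays each byte as one character of a single-byte code page. Here is a minimal sketch of that effect in plain C++, assuming a Latin-1-like code page; viewAsLatin1 is a hypothetical helper, not a Qt or Visual Studio function.

```cpp
#include <string>

// Hypothetical helper: treat every byte of `bytes` as one Latin-1 character
// and re-encode it as UTF-8. This mimics a viewer that assumes a single-byte
// encoding when it displays the contents of a std::string.
std::string viewAsLatin1(const std::string& bytes)
{
    std::string out;
    for (unsigned char c : bytes) {
        if (c < 0x80) {
            out += static_cast<char>(c);                 // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // UTF-8 lead byte
            out += static_cast<char>(0x80 | (c & 0x3F)); // UTF-8 continuation byte
        }
    }
    return out;
}
```

Feeding it the UTF-8 bytes of the file name ("We\xC5\xA0.txt") gives a string that renders as We, Å, a no-break space, then .txt, which matches what the watcher displays. If that is what is happening, the bytes themselves are intact and only the display is misleading.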
Hi, I wonder what encoding the input is in?
They are Windows file names. From what I just looked up Windows file names are encoded in UTF-16.
And you are sure they are not mangled at the source?
As in, when you read them?
Using the Visual Studio watcher,
by itself will return WeŠ.txt but
will return WeÅ .txt.
I can see from my file explorer that the file name is WeŠ.txt.
I also wanted to point this out: the character Š does not fit in a single byte. Its UTF-8 encoding is 197 160; its Unicode code point is 352. Trying to fit this character in a char will result in 2 chars, Å (197) and <No break space> (160) respectively.
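To check those numbers, here is a small sketch of UTF-8 encoding for a single code point in standard C++. encodeUtf8 is a name made up for illustration, and it does no validation of surrogates or out-of-range values.

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Sketch only: surrogates and values above U+10FFFF are not rejected.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For Š (code point 352, i.e. 0x160) this takes the 2-byte branch and yields 0xC5 0xA0, which is 197 160 in decimal.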
Yes, so I do wonder how to get a correct ASCII file name out of that.
Give it some time, some of the others might have inputs.
@jacobP That C library that only takes const char* as inputs, how old is it? Maybe it's for FAT file systems and not NTFS? (Qt and NTFS are about the same age (~25 years) that's why QString also uses UTF-16).
You could try the technology used before Unicode was invented: code pages. Pros: everything fits in single bytes. Cons: depending on what codepage you set your system for, different characters will be displayed for the same byte :-(
QString has a function for converting down from UTF-16 to your current Windows codepage: toLocal8Bit, example:
(also you need to #include "windows.h" to enable the ::GetACP() function)
QString s("WeŠ.txt");
qDebug() << s.toUcs4();
qDebug() << ::GetACP();
qDebug() << s.toLocal8Bit();
QVector(87, 101, 352, 46, 116, 120, 116)
1252
"We\x8A.txt"
First, I use toUcs4() to display the contents of the QString as Unicode code points, and the Š is as you say 352. (UCS-4 is a bigger and newer brother to UTF-16.)
Then I query Windows for which code page will be used for the toLocal8Bit() function. On my machine it is 1252; this will vary from country to country.
The final line reveals that on code page 1252 the Š character has the code 0x8A (138 decimal), which fits into a byte. Try giving that QByteArray to your C library...
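For anyone curious what that down-conversion amounts to, here is a rough, portable sketch of mapping code points to code page 1252 bytes. It is not how Qt or Windows implement it: toCp1252 and encodeCp1252 are hypothetical names, the table covers only a few of the 0x80-0x9F entries, and the '?' fallback for unmapped characters is an assumption of this sketch.

```cpp
#include <string>

// Partial, illustrative mapping from a code point to a Windows-1252 byte.
// Returns false for anything this sketch does not cover.
bool toCp1252(char32_t cp, unsigned char& byte)
{
    // ASCII and the 0xA0-0xFF range have the same values as in Unicode.
    if (cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF)) {
        byte = static_cast<unsigned char>(cp);
        return true;
    }
    switch (cp) {                          // a few entries from 0x80-0x9F
    case 0x20AC: byte = 0x80; return true; // €
    case 0x0160: byte = 0x8A; return true; // Š
    case 0x0161: byte = 0x9A; return true; // š
    case 0x017D: byte = 0x8E; return true; // Ž
    case 0x017E: byte = 0x9E; return true; // ž
    default:     return false;
    }
}

// Convert a UTF-32 string to CP1252 bytes, using '?' for unmapped characters.
std::string encodeCp1252(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        unsigned char byte = 0;
        out += toCp1252(cp, byte) ? static_cast<char>(byte) : '?';
    }
    return out;
}
```

With this, encodeCp1252(U"We\u0160.txt") gives the bytes "We\x8A.txt", the same as the toLocal8Bit() output above. Keep in mind that any character missing from the active code page cannot be represented this way, so a file name containing such characters would not survive the conversion.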
What C library is that?