Open Unicode file name in Qt 4.7
-
wrote on 16 Dec 2012, 04:46 last edited by
Hi everyone,
I have a problem opening Unicode file names. I use QFileDialog::getOpenFileNames() to get a list of songs, but if a song name contains Unicode characters I can't get the proper file name. For example: "phải chăng là muộn màng.mp3" --> "ph?i chang là mu?n màng.mp3", so I can't play the file.
Thanks everyone! -
wrote on 16 Dec 2012, 17:50 last edited by
What exactly do you mean with "can’t get the proper file name"?
getOpenFileNames() gives you a QStringList, i.e. a list of QStrings. And QString fully supports Unicode.
Actually, QString internally uses the UTF-16 encoding.
So the problem is, most likely, not that the strings themselves are wrong, but the way you use them.
How do you access or print out the QStrings to get the "wrong" string with the "?" characters in it?
If you do something like toLatin1(), toAscii() or toLocal8Bit(), it is not surprising if the string gets screwed up.
Also printing Unicode strings on the console is inherently problematic, especially on Windows.
Try something like:
@
QStringList files = QFileDialog::getOpenFileNames(/* ... */);
while (!files.isEmpty())
{
    QMessageBox::information(this, tr("Test"), files.takeFirst());
}
@
Does this show the strings correctly?
-
wrote on 17 Dec 2012, 05:33 last edited by
I can get the correct file name with QMessageBox::information() instead of qDebug(), but unfortunately I use the BASS library to play MP3 files, and the library needs a const char* file name. So I have to convert the QString to a const char* as follows, and I still can't play the file:
@
bool Player::play(const QString &song)
{
    QMessageBox::information(Glc::win, tr("Test"), song);
    QByteArray ba = song.toUtf8();
    const char *file = ba.data(); // 'file' is then handed to BASS
    // ... BASS playback call elided ...
    return false;
}
@
Can you suggest another way to convert a Unicode QString to const char*?
I tried toLatin1(), toAscii() or toLocal8Bit() and got the same problem. -
wrote on 17 Dec 2012, 07:56 last edited by
On Windows the char* encoding is rarely UTF-8. Try toLocal8Bit() instead.
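Applied to the snippet above, that would be something like:
@
QByteArray ba = song.toLocal8Bit();   // local ANSI codepage on Windows
const char *file = ba.constData();
@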
-
wrote on 17 Dec 2012, 12:53 last edited by
qDebug() internally uses the 8-bit local codepage. In my experience, passing Unicode strings through qDebug() doesn't work, even if you use QString::toUtf8() to pass the string in. The only thing that ever worked for me is QString::toUtf8().toBase64(). Of course this only makes sense if you install your own message handler function and then do QString::fromUtf8(QByteArray::fromBase64(msg)). Not really nice, though.
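A minimal sketch of that workaround, assuming Qt 4's qInstallMsgHandler(); the handler body and the call site are just illustrations:
@
#include <QtGlobal>
#include <QString>
#include <QByteArray>

// Custom handler: every message is assumed to be Base64-encoded UTF-8.
static void base64MsgHandler(QtMsgType type, const char *msg)
{
    Q_UNUSED(type);
    QString decoded = QString::fromUtf8(QByteArray::fromBase64(msg));
    // ... log or display 'decoded' through a Unicode-aware channel ...
    Q_UNUSED(decoded);
}

// At startup:
//   qInstallMsgHandler(base64MsgHandler);
// At the call site:
//   qDebug("%s", song.toUtf8().toBase64().constData());
@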
Your much bigger problem, though: if the API of the BASS DLL is defined with a char* type and doesn't explicitly state that it expects UTF-8 encoding, then it expects the local ANSI codepage, which means certain characters simply cannot be represented, and you are lost. You'd need an API that expects a char* string with UTF-8 encoding (possible on Windows, but rarely used). Either that, or you need an API that expects a wchar_t* string with UTF-16 encoding (standard on Windows, but not all applications provide it, even in the year 2012, sadly).
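For completeness: getting a lossless wchar_t* out of a QString is easy. A minimal sketch, assuming the library (or a wrapper around it) offers a wide-string entry point; the BASS_StreamCreateFile()/BASS_UNICODE usage in the comment is an assumption, so check the BASS documentation for your version:
@
#include <QString>

void playWide(const QString &song)
{
    // On Windows wchar_t is 16 bits, so QString's internal UTF-16
    // buffer (null-terminated, exposed via utf16()) can be passed
    // directly to a wide-string API.
    const wchar_t *widePath =
        reinterpret_cast<const wchar_t *>(song.utf16());
    (void)widePath; // used only in the hypothetical call below

    // Hypothetical wide-string call -- verify against your BASS docs:
    // BASS_StreamCreateFile(FALSE, widePath, 0, 0, BASS_UNICODE);
}
@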
If you really need to use a legacy DLL that only has an ANSI interface and you need to pass Unicode file names, then one trick/hackaround would be using "short" (8.3) file names. But this doesn't always work and has other drawbacks. You'd better try to make the DLL author fix their API... or do it yourself ;-)
[quote author="Tobias Hunger" date="1355731009"]On windows the char* encoding rarely is Utf8. Try toLocal8Bit() instead.[/quote]
Won't work. And if it does work, it is sheer luck. Whatever ANSI codepage happens to be configured on the computer, there will always be (Unicode) characters that cannot be represented in that codepage. Even worse: you cannot know which codepage the user will have configured on his machine, so the same string that works fine on your computer might fail on the user's. Big pain! Stay away from local codepages. Use UTF-8 or UTF-16.
BTW: One of the reasons I initially made the switch from Delphi 7 to C++ and Qt was the lack of Unicode support in Delphi 7 (and I really had enough of the workarounds).
-
wrote on 17 Dec 2012, 16:14 last edited by
toLocal8Bit() should be fairly safe here, considering that the filename was probably turned into a QString using fromLocal8Bit().
But yeah, you are right. APIs that do not support Unicode in this day and age should be avoided.
-
wrote on 17 Dec 2012, 17:21 last edited by
[quote author="Tobias Hunger" date="1355760879"]toLocal8Bit() should be fairly save here, considering that the filename was probably turned into a QString using fromLocal8Bit().[/quote]
Well, he gets those strings from a Qt file-open dialog, so he could get any (Unicode) character that is allowed in a file name. Just imagine you are working on an English system (with a Latin-1 codepage) and want to open some files that have Chinese, Cyrillic and Arabic characters in their names. There is no chance to get any useful result with toLocal8Bit() in that case. And with music files this is especially problematic, because you quite often have artist or title names in the file name, so the example with Chinese, Cyrillic and Arabic characters is not beside the point. An 8-bit codepage that could represent all those characters at the same time doesn't even exist...
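You can even see the problem in code with a simple round-trip check (the helper name is made up for illustration):
@
#include <QString>

// Returns true only if every character of 's' survives the local
// ANSI codepage; characters that don't typically come back as '?'.
bool survivesLocal8Bit(const QString &s)
{
    return QString::fromLocal8Bit(s.toLocal8Bit()) == s;
}
@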
-
wrote on 18 Dec 2012, 00:10 last edited by
Just checked with Wikipedia: NTFS actually uses 16-bit chars, so you are right. I thought Windows was the last holdout of codepage fanboys, but it actually looks like someone did do the right thing in the filesystem layer :-)
gamowaxaky will need to figure out the encoding that his method expects and convert to that.
-
wrote on 18 Dec 2012, 00:27 last edited by
Windows has full Unicode support since Windows "NT" :-)
Though instead of extending the existing char* APIs to use UTF-8, which would have made pretty much all existing code support Unicode without modification, there is still no UTF-8 support in the Win32 API or the MSVC CRT as of today. Instead, for Unicode support, we must rewrite all code to use the new wchar_t* APIs with UTF-16 encoding. In my opinion Linux did much better by going the UTF-8 route.
Anyway, the situation has been like this for 20 years now (Windows NT 3.1 was released in 1993), so there is pretty much no excuse to still write or use Windows code that is not Unicode-aware nowadays...
-
wrote on 18 Dec 2012, 00:38 last edited by
Well, there still is the codepage mess whenever you end up in cmd... or am I 20 years behind the times with that, too?
Sometimes it does show that I stopped using Windows regularly about 20 years ago... right around the time NT came out ;-)
-
wrote on 18 Dec 2012, 00:51 last edited by
Passing Unicode strings into command-line tools does work fine, as long as that tool uses wmain() or retrieves the command-line args via GetCommandLineW(). If the tool uses main(), it will get the command-line args converted to the local ANSI codepage, and then we have the mess again...
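A minimal sketch of the second route, for a program with a plain main() entry point; CommandLineToArgvW() and GetCommandLineW() are standard Win32 API:
@
#include <windows.h>
#include <shellapi.h>  // CommandLineToArgvW (link against shell32)

int main()
{
    int argc = 0;
    wchar_t **argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    // argv[0..argc-1] now hold the arguments as unmangled UTF-16,
    // independent of the local ANSI codepage.
    // ...
    LocalFree(argv);
    return 0;
}
@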
Printing Unicode strings to the console is a problem of its own. You can change the codepage of the Windows console to UTF-8 via SetConsoleOutputCP(), but then you still need to get your UTF-8 string into the console without the CRT messing it up! printf() and cout<< don't work in my experience, because they are not UTF-8 aware and will mangle your UTF-8 strings. wprintf() and wcout<< will take UTF-16 (wchar_t) strings as input, but then convert them to the local ANSI codepage before passing them to the console!
My solution is using the Win32 API directly, i.e. printing the string via GetStdHandle() and WriteFile(), bypassing the CRT. Either that, or using "setmode(fileno(stdout), O_BINARY)" to switch stdout to binary mode, so printf() won't mangle your UTF-8 strings anymore. The latter has the drawback that, afterwards, using wprintf() or wcout<< will crash the CRT...
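A minimal sketch of the direct Win32 route described above; all calls are standard Win32 API, error handling omitted for brevity:
@
#include <windows.h>
#include <string>

// Writes a UTF-8 string to the console, bypassing the CRT entirely.
void printUtf8(const std::string &utf8)
{
    SetConsoleOutputCP(CP_UTF8);  // tell the console to expect UTF-8
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;
    WriteFile(out, utf8.data(), static_cast<DWORD>(utf8.size()),
              &written, NULL);
}
@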