How to get the language type of a string?

justforfun

I want to get the language type (if one exists, e.g. 1: English, 2: Chinese, etc.) of an arbitrary string.

For example, the string '你好' should be recognized as Chinese, and the string '우리가 세상 아르 ' as Korean. Is this possible? If so, how can it be done?

Thanks!

tobias.hunger

You could try to get a rough guess at the language based on which Unicode "blocks":ftp://ftp.unicode.org/Public/6.0.0/ucd/Blocks.txt the characters are in. Of course this is only a wild guess: some blocks are used by several languages, a string may contain characters from more than one block, etc.

In particular, telling whether a string is English, German, French, etc. is mostly impossible with this approach. Even some East Asian languages share Unicode code points, although the actual glyphs can look very different! For some of these languages you need to know the language as well as the Unicode string to get readable output!
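
To make the block idea concrete, here is a minimal sketch in plain C++ (no Qt). The names @guessScript@ and @dominantScript@ are made up for illustration, and the table covers only a handful of blocks from Blocks.txt, so treat it as a starting point rather than a real language detector. It tells you which script block a code point sits in, not which language the text is written in.

```cpp
#include <map>
#include <string>

// Rough guess at the Unicode block of a single code point.
// Hypothetical helper: only a few blocks are listed here.
std::string guessScript(char32_t cp)
{
    if (cp <= 0x007F)                   return "Basic Latin";
    if (cp >= 0x3040 && cp <= 0x309F)   return "Hiragana";
    if (cp >= 0x30A0 && cp <= 0x30FF)   return "Katakana";
    if (cp >= 0x4E00 && cp <= 0x9FFF)   return "CJK Unified Ideographs";
    if (cp >= 0xAC00 && cp <= 0xD7AF)   return "Hangul Syllables";
    if (cp >= 0x20000 && cp <= 0x2A6DF) return "CJK Unified Ideographs Extension B";
    return "Unknown";
}

// Returns the block that occurs most often in the text,
// ignoring code points that guessScript() does not know.
std::string dominantScript(const std::u32string& text)
{
    std::map<std::string, int> counts;
    for (char32_t cp : text) {
        const std::string script = guessScript(cp);
        if (script != "Unknown")
            ++counts[script];
    }

    std::string best = "Unknown";
    int bestCount = 0;
    for (const auto& entry : counts) {
        if (entry.second > bestCount) {
            best = entry.first;
            bestCount = entry.second;
        }
    }
    return best;
}
```

A string of CJK ideographs could still be Chinese or Japanese, which is exactly the ambiguity described above.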

justforfun

Thank you for your reply!
I know this is not easy to do, and that some languages use similar characters.
In my program, I have:

@ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@

And the result shown in the infoEdit is below:

@Celina Jade___曾经心痛.mp3
Successful to write data!
43 65 6c 69 6e 61 20 4a 61 64 65 5f 5f 5f ffffffffffffffe6 ffffffffffffff9b ffffffffffffffbe ffffffffffffffe7 ffffffffffffffbb ffffffffffffff8f ffffffffffffffe5 ffffffffffffffbf ffffffffffffff83 ffffffffffffffe7 ffffffffffffff97 ffffffffffffff9b 2e 6d 70 33 0 3b @

As far as I know (from the link below), UTF-8 uses 3 bytes to encode one Chinese character, so I should be able to recognize it that way.

"UTF-8":http://en.wikipedia.org/wiki/UTF-8
11100000-11101111 E0-EF 224-239 Start of 3-byte sequence
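
As a rough illustration of that table row, the sketch below classifies a UTF-8 lead byte by the length of the sequence it starts. The function name @utf8SequenceLength@ is hypothetical. Note that a 3-byte sequence only says the code point lies in U+0800 to U+FFFF: Hangul syllables and Japanese kana are encoded in 3 bytes as well, so the byte count alone cannot single out Chinese.

```cpp
#include <cstddef>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 for continuation bytes (0x80-0xBF) and for bytes that
// can never start a valid sequence (0xC0, 0xC1, 0xF5-0xFF).
std::size_t utf8SequenceLength(unsigned char lead)
{
    if (lead <= 0x7F)                 return 1; // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx
    return 0;
}
```

The function takes an @unsigned char@ on purpose. The @ffffffffffffff@ prefixes in the dump above look like sign extension from printing a signed @char@ as a wider integer (an assumption, since the dumping code is not shown); casting each byte to @unsigned char@ before printing would avoid that.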

tobias.hunger

Oh, I did not say it was not easy, I said it is impossible to do correctly;-)

Why do you convert purefilename to utf-8 and then straight back from utf-8? That conversion is rather costly and completely unnecessary.

Are you aware of surrogate pairs? They are UTF-16's extension mechanism for supporting more than 64k characters in Unicode: basically, a character at a code point above U+FFFF is encoded as a sequence of two 16-bit code units. IIRC there are some Chinese characters in that "extension space", so you might not get away with ignoring this mechanism. It is better to compare Unicode code points to blocks than to compare sequences of UTF-8/UTF-16 code units.
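
For illustration, here is a minimal sketch of how surrogate pairs can be combined into code points, written in plain C++ without Qt. The function name @utf16ToCodePoints@ is made up for this example, and replacing unpaired surrogates with U+FFFD is just one possible policy. In Qt, @QString::toUcs4()@ should give you the code points of a QString directly.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into Unicode code points.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) is combined into one code point above U+FFFF.
// Unpaired surrogates become U+FFFD in this sketch.
std::u32string utf16ToCodePoints(const std::u16string& units)
{
    std::u32string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t u = units[i];
        const bool isHigh = u >= 0xD800 && u <= 0xDBFF;
        const bool isLow  = u >= 0xDC00 && u <= 0xDFFF;

        if (isHigh && i + 1 < units.size()
            && units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            const char32_t low = units[i + 1];
            out.push_back(0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed
        } else if (isHigh || isLow) {
            out.push_back(0xFFFD); // unpaired surrogate
        } else {
            out.push_back(u);      // ordinary BMP character
        }
    }
    return out;
}
```

The resulting code points can then be compared against block ranges, including the CJK extension blocks above U+FFFF.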

Check "unicode.org":http://www.unicode.org/ for specifications on all the unicode related stuff:-)

justforfun

[quote author="Tobias Hunger" date="1287477303"]Oh, I did not say it was not easy, I said it is impossible to do correctly;-)

Why do you convert purefilename to utf-8 and then straight back from utf-8? That conversion is rather costly and completely unnecessary.

Are you aware of surrogate pairs? Those are an extension mechanism to allow for more than 64k characters in unicode: Basically you encode characters at a codepoint > 64k as a sequence of two characters. IIRC there are some Chinese characters in that "extension space", so you might not get away with ignoring this mechanism. Better compare the unicode codepoints to blocks, not a sequence of utf-8/utf-16 characters.

Check "unicode.org":http://www.unicode.org/ for specifications on all the unicode related stuff:-)[/quote]

@ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@
I used this code only because I tested it and it displayed the Chinese characters correctly. I had not noticed that Qt has 'fromUcs4', which works with Unicode code points at 4 bytes per character; I will try it later.

And thank you for the hints! I have noted them and will dig into this further.