Qt Forum

    How to get the language type of a string?

      justforfun wrote:

      I want to determine the language of an arbitrary string and get it as a code, if one exists (e.g. 1: English, 2: Chinese, etc.).

      For example, the string '你好' should be recognized as Chinese, and the string '우리가 세상 아르 ' as Korean. Is this possible, and if so, how?

      Thanks!

        tobias.hunger wrote:

        You could try to make a rough guess at the language based on which Unicode "blocks":ftp://ftp.unicode.org/Public/6.0.0/ucd/Blocks.txt the characters are in. Of course this is only a wild guess: some blocks are used by several languages, a string may contain characters from more than one block, etc.

        In particular, telling whether a string is English, German, French, etc. is mostly impossible with this approach, since those languages share the same Latin blocks. But even some East Asian languages share Unicode code points, even though the actual glyphs are different! For some of these languages you need to know the language as well as the Unicode string to get readable output!
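To make the block idea concrete, here is a minimal sketch in plain C++ (no Qt), assuming the input is well-formed UTF-8. The function names (`decodeUtf8`, `scriptOf`, `guessScript`) are made up for this illustration, and only a handful of ranges from Blocks.txt are listed. It guesses a script, not a language: "Han" could be Chinese or Japanese, and "Latin" could be almost anything.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Assumes well-formed input: invalid lead
// bytes are skipped and continuation bytes are not validated.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        char32_t cp;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else { ++i; continue; }          // stray continuation or invalid byte
        if (i + len > s.size()) break;   // truncated sequence at end of input
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Map a code point to a script name using a few Unicode block ranges.
// Returns nullptr for anything not listed (digits, punctuation, ...).
const char* scriptOf(char32_t cp) {
    if (cp >= 0x0041 && cp <= 0x005A) return "Latin";    // A-Z
    if (cp >= 0x0061 && cp <= 0x007A) return "Latin";    // a-z
    if (cp >= 0x3040 && cp <= 0x309F) return "Hiragana";
    if (cp >= 0x30A0 && cp <= 0x30FF) return "Katakana";
    if (cp >= 0x4E00 && cp <= 0x9FFF) return "Han";      // CJK Unified Ideographs
    if (cp >= 0xAC00 && cp <= 0xD7AF) return "Hangul";   // Hangul Syllables
    if (cp >= 0x20000 && cp <= 0x2A6DF) return "Han";    // CJK Extension B
    return nullptr;
}

// Return the script with the most characters, or "Unknown" if none matched.
// Ties go to the alphabetically first script name.
std::string guessScript(const std::string& utf8) {
    std::map<std::string, int> counts;
    for (char32_t cp : decodeUtf8(utf8))
        if (const char* script = scriptOf(cp)) ++counts[script];
    std::string best = "Unknown";
    int bestCount = 0;
    for (const auto& kv : counts)
        if (kv.second > bestCount) { best = kv.first; bestCount = kv.second; }
    return best;
}
```

A majority vote is just one possible policy. For a mixed string such as a file name with a Latin artist name and a Chinese title, it reports whichever script has more characters, so returning the full set of scripts found may be more useful in practice.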

          justforfun wrote:

          Thank you for your reply!
          I know this is not easy to do, and that some languages use similar characters.
          In my program I have:

          @ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@

          The result shown in infoEdit is:

          @Celina Jade___曾经心痛.mp3
          Successful to write data!
          43 65 6c 69 6e 61 20 4a 61 64 65 5f 5f 5f ffffffffffffffe6 ffffffffffffff9b ffffffffffffffbe ffffffffffffffe7 ffffffffffffffbb ffffffffffffff8f ffffffffffffffe5 ffffffffffffffbf ffffffffffffff83 ffffffffffffffe7 ffffffffffffff97 ffffffffffffff9b 2e 6d 70 33 0 3b @

          As far as I know (from the link below), UTF-8 uses 3 bytes to encode one Chinese character, so I should be able to recognize Chinese that way.

          "UTF-8":http://en.wikipedia.org/wiki/UTF-8
          Binary 11100000-11101111 | Hex E0-EF | Decimal 224-239 | Start of 3-byte sequence
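The lead-byte rule from that table can be sketched as below, in plain C++ without Qt. The helper names (`utf8SequenceLength`, `hexDump`) are hypothetical. Two caveats: a 3-byte sequence only tells you the code point lies between U+0800 and U+FFFF, which also covers Korean Hangul, Japanese kana and many other scripts, so byte length alone cannot identify Chinese. Also, the `ffffffffffffffe6` values in the dump above most likely come from printing a signed `char` without casting it to `unsigned char` first, which sign-extends every byte above 0x7F.

```cpp
#include <cstdio>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 0 for continuation bytes (0x80-0xBF) and invalid lead bytes.
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;                   // 0xxxxxxx: ASCII
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // 110xxxxx (C0/C1 are invalid)
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // 11110xxx, up to U+10FFFF
    return 0;
}

// Hex dump of raw bytes, two digits per byte, separated by spaces.
// Casting through unsigned char avoids sign extension of bytes >= 0x80.
std::string hexDump(const std::string& bytes) {
    std::string out;
    char buf[4];
    for (char c : bytes) {
        unsigned value = static_cast<unsigned char>(c);
        std::snprintf(buf, sizeof buf, "%02x ", value);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

With this, the first Chinese character in the dump above prints as `e6 9b be`, and `utf8SequenceLength(0xe6)` reports a 3-byte sequence.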

            tobias.hunger wrote:

            Oh, I did not say it was not easy; I said it is impossible to do correctly ;-)

            Why do you convert purefilename to UTF-8 and then straight back from UTF-8? That conversion is rather costly and completely unnecessary.

            Are you aware of surrogate pairs? They are the mechanism UTF-16 uses to represent more than 64k characters: basically, a character at a code point above 64k (beyond U+FFFF) is encoded as a sequence of two 16-bit code units. IIRC there are some Chinese characters in that "extension space", so you might not get away with ignoring this mechanism. Better to compare the Unicode code points to blocks, not a sequence of UTF-8/UTF-16 code units.

            Check "unicode.org":http://www.unicode.org/ for specifications on all the Unicode-related stuff :-)
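For reference, combining a surrogate pair into a code point looks roughly like this. It is a sketch in plain C++ on a `std::u16string`, not Qt code, and the function names are invented for the example. Unpaired surrogates are simply passed through unchanged here, which is an arbitrary choice.

```cpp
#include <cstddef>
#include <string>
#include <vector>

bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Convert UTF-16 code units to Unicode code points, joining each
// high/low surrogate pair into a single code point above U+FFFF.
std::vector<char32_t> utf16ToCodePoints(const std::u16string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        char16_t u = s[i];
        if (isHighSurrogate(u) && i + 1 < s.size() && isLowSurrogate(s[i + 1])) {
            char32_t hi = u - 0xD800;         // top 10 bits
            char32_t lo = s[i + 1] - 0xDC00;  // bottom 10 bits
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i;                              // consume the low surrogate too
        } else {
            out.push_back(u);  // BMP character, or an unpaired surrogate
        }
    }
    return out;
}
```

For example, the code units D840 DC00 combine to U+20000, the first character of CJK Extension B. In Qt, `QString::toUcs4()` should give you the code points directly, so you would not normally write this by hand.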

              justforfun wrote:


              @ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@
              I only used this code because, when I tested it, it displayed the Chinese characters correctly. I hadn't noticed that Qt's 'fromUcs4' (and its counterpart 'toUcs4') converts between a QString and UCS-4, i.e. one 4-byte code point per character. I will try it later.

              Thank you for the hints! I have noted them and will dig into this further.
