How to get the language type of a string?

5 Posts 2 Posters 3.5k Views
  • justforfun wrote (#1):

    I want to determine the language of an arbitrary string and map it to a code (e.g. 1: English, 2: Chinese, etc.), if that is possible at all.

    E.g. the string '你好' should be recognized as Chinese, and the string '우리가 세상 아르 ' as Korean. Is this possible, and if so, how?

    Thanks!

    • tobias.hunger wrote (#2):

      You could try to get a rough guess at the language based on which Unicode "blocks":ftp://ftp.unicode.org/Public/6.0.0/ucd/Blocks.txt the characters are in. Of course this is only a wild guess: some blocks are used by several languages, a string may contain characters from more than one block, etc.

      In particular, telling English, German, French, etc. apart is mostly impossible with this approach, since they all share the Latin blocks. But even some East Asian scripts share Unicode code points, even though the actual glyphs are very different! For some of these languages you need to know the language used as well as the Unicode string to get readable output!
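The block-based guess described above could look roughly like the following sketch. It is only an illustration under stated assumptions: the `Script` enum and the functions `scriptOf` and `dominantScript` are hypothetical names, the input is assumed to be already decoded to code points, and only a handful of ranges from Blocks.txt are covered. It reports a script, not a language.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Coarse script classes; "Other" covers everything not listed below.
enum class Script { Latin, Kana, Han, Hangul, Other };

// Map one code point to a script, using a few ranges taken from Blocks.txt.
Script scriptOf(char32_t cp) {
    if ((cp >= U'A' && cp <= U'Z') || (cp >= U'a' && cp <= U'z'))
        return Script::Latin;                                  // letters in Basic Latin
    if (cp >= 0x3040 && cp <= 0x30FF) return Script::Kana;     // Hiragana, Katakana
    if (cp >= 0x4E00 && cp <= 0x9FFF) return Script::Han;      // CJK Unified Ideographs
    if (cp >= 0x20000 && cp <= 0x2A6DF) return Script::Han;    // CJK Unified Ideographs Extension B
    if (cp >= 0xAC00 && cp <= 0xD7AF) return Script::Hangul;   // Hangul Syllables
    return Script::Other;
}

// Return the most frequent script in the text, ignoring "Other"
// (spaces, digits, punctuation). Ties go to the script listed first in the enum.
Script dominantScript(const std::u32string& text) {
    std::array<std::size_t, 5> counts{};
    for (char32_t cp : text) {
        const auto idx = static_cast<std::size_t>(scriptOf(cp));
        assert(idx < counts.size());
        ++counts[idx];
    }
    Script best = Script::Other;
    std::size_t bestCount = 0;
    for (std::size_t i = 0; i < 4; ++i) {  // indices 0..3, i.e. everything except Script::Other
        if (counts[i] > bestCount) {
            bestCount = counts[i];
            best = static_cast<Script>(i);
        }
    }
    return best;
}
```

With this sketch, `dominantScript(U"\u4F60\u597D")` (你好) yields `Script::Han` and `dominantScript(U"\uC6B0\uB9AC")` (우리) yields `Script::Hangul`. A result of `Script::Han` still cannot tell Chinese from Japanese text written in kanji, which is exactly the limitation mentioned above.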

      • justforfun wrote (#3):

        Thank you for your reply!
        I know this is not easy to do, and that some languages use similar characters.
        In my program, I have:

        @ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@

        The result shown in infoEdit is:

        @Celina Jade___曾经心痛.mp3
        Successful to write data!
        43 65 6c 69 6e 61 20 4a 61 64 65 5f 5f 5f ffffffffffffffe6 ffffffffffffff9b ffffffffffffffbe ffffffffffffffe7 ffffffffffffffbb ffffffffffffff8f ffffffffffffffe5 ffffffffffffffbf ffffffffffffff83 ffffffffffffffe7 ffffffffffffff97 ffffffffffffff9b 2e 6d 70 33 0 3b @
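A side note on the dump above: the `ffffffffffffff` prefixes look like sign extension, which typically happens when bytes of 0x80 and above are held in a signed `char` and widened before being formatted as hex. The code that produced the dump is not shown, so this is an assumption. A minimal sketch of a dump that avoids the effect, with `hexDump` as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Format raw bytes as space-separated two-digit hex values.
std::string hexDump(const std::string& bytes) {
    std::string out;
    char buf[4];  // two hex digits, one space, terminating NUL
    for (char c : bytes) {
        // Go through unsigned char first: where char is signed, a byte
        // such as 0xE6 is negative and would be sign-extended when widened.
        const unsigned value = static_cast<unsigned char>(c);
        const int written = std::snprintf(buf, sizeof buf, "%02x ", value);
        assert(written == 3);
        out += buf;
    }
    if (!out.empty()) out.pop_back();  // drop the trailing space
    return out;
}
```

For the three bytes of 曾 this gives `e6 9b be`, matching the low bytes of the corresponding entries in the dump.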

        As far as I know (from the link below), UTF-8 uses 3 bytes to encode one Chinese character, so I should be able to recognize Chinese this way.

        "UTF-8":http://en.wikipedia.org/wiki/UTF-8
        Binary 11100000-11101111 | Hex E0-EF | Decimal 224-239 | Start of 3-byte sequence
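To make the lead-byte rule concrete, here is a small sketch that classifies a UTF-8 lead byte and decodes a byte string into code points. The names `utf8SequenceLength` and `decodeUtf8` are hypothetical, and the decoder is deliberately simplified: it does not check continuation bytes or reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence that starts with this lead byte,
// or 0 if the byte is a continuation byte or not a valid lead byte.
int utf8SequenceLength(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx (E0-EF)
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Decode UTF-8 bytes into code points. Invalid lead bytes and truncated
// sequences are replaced with U+FFFD, one replacement per skipped byte.
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const int len = utf8SequenceLength(lead);
        if (len == 0 || i + len > bytes.size()) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        assert(len >= 1 && len <= 4);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (int k = 1; k < len; ++k) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Note that a 3-byte sequence covers all of U+0800 to U+FFFF, not just Chinese: 你 is `E4 BD A0`, but the Korean 우 (U+C6B0) is `EC 9A B0`, also three bytes with a lead byte in E0-EF. The sequence length alone therefore cannot identify Chinese, and some Chinese characters (e.g. U+20000, encoded as `F0 A0 80 80`) need four bytes.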

        • tobias.hunger wrote (#4):

          Oh, I did not say it was not easy, I said it is impossible to do correctly;-)

          Why do you convert purefilename to utf-8 and then straight back from utf-8? That conversion is rather costly and completely unnecessary.

          Are you aware of surrogate pairs? They are an extension mechanism that allows for more than 64k characters in Unicode: in UTF-16, a character at a code point above 64k (beyond U+FFFF) is encoded as a sequence of two 16-bit code units. IIRC there are some Chinese characters in that "extension space", so you might not get away with ignoring this mechanism. Better to compare the Unicode code points against the blocks, not a sequence of UTF-8/UTF-16 code units.
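As an illustration of the surrogate mechanism, the following sketch combines UTF-16 code units into full code points, so that any block comparison happens on code points rather than on individual 16-bit units. The function names are hypothetical, and unpaired surrogates are simply passed through unchanged, which a stricter decoder would treat as an error.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool isLowSurrogate(char16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Convert UTF-16 code units to Unicode code points.
std::u32string decodeUtf16(const std::u16string& units) {
    std::u32string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char16_t u = units[i];
        if (isHighSurrogate(u) && i + 1 < units.size() && isLowSurrogate(units[i + 1])) {
            // Each surrogate carries 10 bits of the value above U+FFFF.
            const char32_t hi = u - 0xD800;
            const char32_t lo = units[i + 1] - 0xDC00;
            const char32_t cp = 0x10000 + (hi << 10) + lo;
            assert(cp >= 0x10000 && cp <= 0x10FFFF);
            out.push_back(cp);
            ++i;  // the low surrogate has been consumed as well
        } else {
            out.push_back(u);  // BMP character, or an unpaired surrogate passed through
        }
    }
    return out;
}
```

For example, the code units `D840 DC00` decode to the single code point U+20000, the first character of the CJK Unified Ideographs Extension B block. Treating the two units separately would place neither of them in any CJK block.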

          Check "unicode.org":http://www.unicode.org/ for specifications on all the unicode related stuff:-)

          • justforfun wrote (#5):


            @ui->infoEdit->append(QString::fromUtf8(purefilename.toUtf8().data()));@
            I used this code only because, in my tests, it displayed the Chinese characters correctly. I had not noticed that Qt has 'fromUcs4' for working with UCS-4 data (one Unicode code point per 4-byte unit); I will try it later.

            Thank you for the hints! I have noted them and will dig into this further.
