[SOLVED] Need help with regexp for Kanji
-
"Unicode Chapter 12":http://www.unicode.org/versions/Unicode5.0.0/ch12.pdf will help you a lot.
|CJK Unified Ideographs|4E00–9FFF|Common|
|CJK Unified Ideographs Extension A|3400–4DBF|Rare|
|CJK Unified Ideographs Extension B|20000–2A6DF|Rare, historic|
|CJK Unified Ideographs Extension C|2A700–2B73F|Rare, historic|
|CJK Unified Ideographs Extension D|2B740–2B81F|Uncommon, some in current use|
|CJK Compatibility Ideographs|F900–FAFF|Duplicates, unifiable variants, corporate characters|
|CJK Compatibility Ideographs Supplement|2F800–2FA1F|Unifiable variants|

So the Kanji (Han) ranges are, very roughly, U+3400-U+9FFF, U+F900-U+FAFF, and U+20000-U+2FFFF.
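QString stores text as UTF-16, so code points above U+FFFF show up as surrogate pairs rather than single characters. Here is a minimal sketch in plain C++ (no Qt) of how the ends of U+20000-U+2FFFF map to surrogates. The helper name @toSurrogates@ is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a supplementary code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
std::pair<std::uint16_t, std::uint16_t> toSurrogates(char32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);  // BMP code points have no surrogates
    const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));   // top 10 bits
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return {high, low};
}
```

U+20000 comes out as D840 DC00 and U+2FFFF as D87F DFFF, so the whole supplementary range is a high surrogate in D840-D87F followed by any low surrogate in DC00-DFFF.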
QRegExp:
@
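// Backslashes are doubled: the C++ compiler consumes one level, QRegExp then sees \x3400 etc.
QRegExp isHan("([\\x3400-\\x9FFF\\xF900-\\xFAFF]|[\\xD840-\\xD87F][\\xDC00-\\xDFFF])+");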
@

Note: This regexp (isHan) does not match CJK Symbols (U+3000-U+303F), Hiragana (U+3041-U+309F), or Katakana (U+30A0-U+30FF).
- "CJK Symbols and Punctuation":http://www.unicode.org/charts/PDF/U3000.pdf
- "Hiragana":http://www.unicode.org/charts/PDF/U3040.pdf
- "Katakana":http://www.unicode.org/charts/PDF/U30A0.pdf
If you want to match those as well, add their ranges to the regexp's character class.
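If you already have the text as code points rather than UTF-16, the same ranges can be checked without a regexp. This is only a rough sketch in plain C++. The function names (@isHan@, @isKanaOrCjkSymbol@, @isJapaneseText@) and the @allowKana@ switch are my own inventions, not anything from Qt.

```cpp
#include <string>

// Rough Han ranges from the table above ("very roughly", as noted).
bool isHan(char32_t cp)
{
    return (cp >= 0x3400 && cp <= 0x9FFF)     // Extension A through CJK Unified Ideographs
        || (cp >= 0xF900 && cp <= 0xFAFF)     // CJK Compatibility Ideographs
        || (cp >= 0x20000 && cp <= 0x2FFFF);  // supplementary ideographic range
}

// The three blocks the regexp leaves out.
bool isKanaOrCjkSymbol(char32_t cp)
{
    return (cp >= 0x3000 && cp <= 0x303F)     // CJK Symbols and Punctuation
        || (cp >= 0x3041 && cp <= 0x309F)     // Hiragana
        || (cp >= 0x30A0 && cp <= 0x30FF);    // Katakana
}

// True if text is non-empty and every code point is Han
// (or kana / CJK symbols when allowKana is set).
bool isJapaneseText(const std::u32string &text, bool allowKana)
{
    if (text.empty())
        return false;  // mirrors the "+" quantifier: at least one character
    for (char32_t cp : text) {
        if (isHan(cp))
            continue;
        if (allowKana && isKanaOrCjkSymbol(cp))
            continue;
        return false;
    }
    return true;
}
```

For example, @isJapaneseText(U"\u6F22\u5B57", false)@ accepts a Kanji-only string, while a string mixing Kanji and Hiragana only passes with @allowKana@ set to true.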