[SOLVED] Regular Expression and national letters

dangelog

Which encoding is your source file saved in? Are you using a proper encoding for it AND the matching QString decoding method, like QString::fromUtf8, to build the string you pass to the QRegExp ctor? For instance, you can save the source file as UTF-8, or save it as ASCII and write the UTF-8 bytes of those characters as escapes, like "\xc4\x85" for the literal "ą".
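To show where an escape like "\xc4\x85" comes from, here is a minimal sketch in plain C++ (no Qt, so it can be compiled on its own). The helper name `encodeUtf8` is made up for this illustration; in Qt code you would pass such bytes to QString::fromUtf8.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// It does not reject surrogate code points (U+D800..U+DFFF).
std::string encodeUtf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF); // outside the Unicode range
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For "ą" (U+0105) this gives the two bytes 0xC4 0x85, which is exactly the "\xc4\x85" escape above.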

goetz

If you want everything between two quotation marks, you can use a simpler regex:

@
QRegExp re("\".+\"");
re.setMinimal(true);
@

This matches if at least one character is between quotation marks.
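As a rough illustration of what minimal matching changes, here is a sketch that uses std::regex from the standard library instead of QRegExp, so it runs without Qt. The lazy quantifier `.+?` plays the role of setMinimal(true) here; `firstQuoted` is a name invented for the example.

```cpp
#include <regex>
#include <string>

// Return the first double-quoted span in `text`, quotes included,
// or an empty string if there is none.
std::string firstQuoted(const std::string &text)
{
    // In the C++ source, \" puts a plain quotation mark into the pattern.
    // ".+?" is the lazy (minimal) form of ".+": it stops at the first
    // closing quote instead of the last one.
    static const std::regex re("\".+?\"");
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str(0) : std::string();
}
```

With a greedy `.+`, the input `say "a" and "b"` would match from the first quotation mark to the last one; the lazy form stops after `"a"`. An empty pair `""` does not match, because at least one character is required in between.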

Also, what's the single backslash before your quotation mark in the regex for?

@

"\"[A" in C/C++

is actually

"[A

@
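A small sketch in plain C++ of the two escaping levels involved. The variable names are only for illustration.

```cpp
#include <string>

// One backslash in the source: the compiler consumes it, and the
// string (and therefore the regex) only contains the quotation mark.
const std::string plainQuote = "\"[A";        // stored as: "[A   (3 characters)

// To hand an actual backslash to the regex engine, the backslash
// itself has to be escaped in the C++ source.
const std::string escapedForRegex = "\\\"[A"; // stored as: \"[A  (4 characters)
```

Since a quotation mark has no special meaning in a regex, the first form is all that is needed to match one.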

BlackDante

Thank you again, Volker, your solution is perfect :)
[quote author="Volker" date="1294597497"]

Also, what's the single backslash before your quotation mark in the regex for?

@

"\"[A" in C/C++

is actually

"[A

@
[/quote]

When I looked at the QRegExp examples, most of them started with "\", so I thought it was required in my case too, and it works except for national letters ;)

Peppe, I had completely forgotten about the QString encoding, and that was the problem ;) Eh, I am still an amateur. Thanks for the answer, next time I will remember to take care of the QString encoding ;)

andre

Note that you can get around most encoding issues by using hex codes instead of literal symbols for anything outside the standard character range. That is less readable, but probably more reliable. The problem with text files (including source files) is that they carry no information on the encoding they are in. That means that trouble can arise as soon as somebody else, unaware of your encoding settings, starts editing your file.

BlackDante

Thanks for the advice, Andre :) But if text files don't carry information about their encoding, how can I get this information? Suppose that in my application the user can open any text file and its content is displayed in a QPlainTextEdit. So I don't have any way to find out the encoding?

andre

Nope, there is no reliable way. You can use some complicated routines that use statistics or other heuristics to determine the likely encoding, but that's not all that reliable. Just hope that UTF-8 will soon replace all other local encodings that are in use...
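One of the simpler heuristics is a structural UTF-8 check: text in an 8-bit legacy encoding that contains national letters rarely happens to form valid UTF-8 sequences. Below is a minimal sketch in plain C++; `looksLikeUtf8` is a hypothetical name. It can only say "this is probably UTF-8" or "this is definitely not UTF-8", and it cannot tell one legacy encoding from another.

```cpp
#include <cstddef>
#include <string>

// Hypothetical heuristic: true if `bytes` is structurally valid UTF-8.
// Rejects stray continuation bytes, truncated sequences and lead bytes
// that never occur in UTF-8 (0xC0, 0xC1, 0xF5..0xFF).
// Simplification: overlong 3/4-byte forms and surrogates are not checked.
bool looksLikeUtf8(const std::string &bytes)
{
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80) {
            extra = 0;                              // ASCII
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;                              // 2-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;                              // 3-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;                              // 4-byte sequence
        } else {
            return false;                           // invalid lead byte
        }
        if (extra > bytes.size() - i - 1) {
            return false;                           // sequence is cut off
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;                       // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Pure ASCII input passes as well, which is harmless because ASCII reads the same in UTF-8 and in the common legacy encodings. A single 0xB1 byte ("ą" in ISO 8859-2) is rejected.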

BlackDante

Oh, that's not good, but thanks for the answer :)
[quote author="Andre" date="1294655465"]Just hope that UTF-8 will soon replace all other local encodings that are in use...
[/quote]
Yes, I will pray for this :)

goetz

If we write hex codes in sources no vendor will care for proper UTF-8 support in their products. Hexcodes are not the solution, they are the source of all that evil.

If you are thorough, you can switch your entire code base to UTF-8 without problems in MS Visual Studio, Qt Creator and Xcode.

Add to your .pro file
@
CODECFORTR = UTF-8
CODECFORSRC = UTF-8
@

and to your main.cpp
@
QTextCodec::setCodecForCStrings( QTextCodec::codecForName( "UTF-8" ) );
QTextCodec::setCodecForTr( QTextCodec::codecForName( "UTF-8" ) );
@

This way you can just tell your code editors to open files in UTF-8 mode unless stated otherwise. It works like a charm here in our team, which involves different operating systems, programming languages and IDEs.

We are in the year 2k11, in times of mega-supercomputing and who knows what else, and I simply refuse to type hex codes in a file to get an 'ä' or 'ç'.

BlackDante

I am very grateful for this answer :) This will be very helpful in my little project :)

ixSci

[quote]If we write hex codes in sources no vendor will care for proper UTF-8 support in their products. Hexcodes are not the solution, they are the source of all that evil.[/quote]
While that is a correct statement in general, and I agree with you, it is not quite right with regard to regexps. The regexp notation \uXXXX is a standard way to represent a character by its exact Unicode code point. You have full control over what you are writing, so you won't get any unexpected results if you use the hex notation in regexps. No encoding issues will ever bother you. BTW, there is \p{L} in regexps, which is enough in most cases.
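A minimal sketch of the code point notation, using std::wregex from the C++ standard library rather than QRegExp, so that it compiles without Qt. The function name `containsAOgonek` is invented for the example. Two caveats: QRegExp documents the form \xhhhh for this purpose, and std::regex has no \p{L}, so only the code point escape is shown.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: does `text` contain the letter "ą" (U+0105)?
bool containsAOgonek(const std::wstring &text)
{
    // The pattern is pure ASCII: the regex engine, not the compiler,
    // turns \u0105 into the code point U+0105. The encoding of the
    // source file therefore cannot affect the pattern.
    static const std::wregex re(L"\\u0105");
    return std::regex_search(text, re);
}
```

Calling it with `L"m\u0105ka"` finds the letter, while `L"maka"` does not match.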