Is there an easy way to escape all non-ASCII characters of a QString?
-
@Chris-Kawa Thanks for the input!
I didn't want to use base64, as it will make each string longer by ~33%, no matter if it's pure ASCII or not. I want to keep the strings as short as possible, so that the error correction of the QR code is as high as possible for a given size.
But you're right, the Latin-1 stuff can't be used here. I'm not that well-versed in this low-level stuff ;-)
I reworked my quoted-printable quoting functions above, I hope they are better now?
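To put rough numbers on the size trade-off: Base64 always emits 4 characters per (started) group of 3 input bytes, whereas an escape-based encoding such as quoted-printable or percent-encoding emits 1 character per unescaped byte and 3 per escaped byte. Below is a minimal sketch in plain standard C++ (no Qt, so it compiles on its own). The function names base64Length and escapedLength are made up for illustration, and the sketch assumes that only bytes outside printable ASCII get escaped.

```cpp
#include <cstddef>
#include <string>

// Length of the padded Base64 text for n input bytes:
// 4 characters per (started) group of 3 bytes.
std::size_t base64Length(std::size_t n)
{
    return 4 * ((n + 2) / 3);
}

// Length of an escape-based encoding in which every byte outside printable
// ASCII (0x20..0x7E) becomes a 3-character escape such as "=C3" or "%C3".
// Assumption for this sketch: no printable character needs escaping.
std::size_t escapedLength(const std::string &utf8)
{
    std::size_t length = 0;
    for (unsigned char byte : utf8) {
        const bool printable = byte >= 0x20 && byte <= 0x7E;
        length += printable ? 1 : 3;
    }
    return length;
}
```

Under these assumptions, a pure ASCII string of 12 bytes stays at 12 characters when escaped but grows to 16 in Base64. The picture flips once roughly more than one byte in six needs an escape: a string like "abc, äöü!" (12 UTF-8 bytes, 6 of them non-ASCII) comes out at 24 characters escaped versus 16 in Base64. So escaping should pay off for mostly-ASCII input and not for text that is mostly non-ASCII.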
-
@l3u_ said:
I hope they are better now?
I'm afraid your encoding is ambiguous. Let's say I have a string <SOH>27, where <SOH> is the ASCII character 1. It's gonna be encoded as =127 and then decoded as <FF>7, where <FF> is the ASCII character 12.
You can invent a better encoding, e.g. you can add a separator instead of fixing the number size to 2, but keep in mind that you are still reinventing a very old wheel.
If Base64 is too big for you, maybe look into some existing lossless encodings instead. The percent encoding @Kent-Dorfman mentioned might be an option. Qt already supports it through QUrl::toPercentEncoding().
-
@Chris-Kawa I don't get it? <SOH>27 with <SOH> being 1 is just 127 and will stay 127?! It walks through the QByteArray byte by byte and checks if the byte represents an ASCII character in the defined range, and if not, it replaces it with = and the string representation of the hex value of that byte (which itself is ASCII again)? And if a = appears in the array, it means that the next two bytes represent the hex value of the byte in question (including the = itself)?
I mean, I didn't make this up, it's just Quoted-Printable – at least I hope so?! So I'm not re-inventing an old wheel, I'm just trying to implement it …
Well, QUrl::toPercentEncoding would possibly be an option, but it escapes spaces when it doesn't have to … one could replace %20 with a space before though and re-replace it again before decoding it …
-
@l3u_ said in Is there an easy way to escape all non-ASCII characters of a QString?:
<SOH>27 with <SOH> being 1 is just 127 and will stay 127?
Umm, no. I haven't followed the ins & outs of this discussion, so I may be mistaken about your context. But <SOH> is the ASCII/binary character with value 1, not the digit 1. The 27, however, are the digits 27 (right or not?), which is different. The 3-character sequence <SOH>27 is not the same as the 3 characters 127.
-
@JonB Ah okay. Thanks for the clarification. Can this actually be typed in using a QLineEdit?! This is only intended to be used for strings typed by a user …
-
@l3u_ As I say, I have not followed the discussion. But, no, a user will not be able to type the <SOH> character into a line edit. That would actually require Ctrl+A to be typed, and a line edit won't store that as a character; it will treat it as a control sequence (probably selecting the whole of the line edit contents if you press it).
-
@l3u_ <SOH> is 1 as in binary 00000001, not as in text "1". It's a non-printable character. Your range is 9, 32-60, 62-126, so 1 is below it and gets encoded as =1. The following 27 is text "27". The characters are in range, so they don't get translated, and you get =127. When decoding you don't know where the encoded part ends; you just assume a two-digit number, so you grab the 2 as part of the encoded character when really it's just text. You end up decoding =12 followed by text "7" instead of decoding =1 followed by text "27".
Can this actually be typed in using a QLineEdit?!
Haven't tried, but if not typed then probably copy/pasted from somewhere.
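The ambiguity described above can be reproduced in a few lines. This is a minimal sketch in plain standard C++ rather than Qt, so that it compiles on its own. The function names (isLiteral, hexValue, encode, decode) and the zeroPad switch are made up for illustration and are not the functions posted earlier in the thread. It assumes the escape is written in hex digits: with zeroPad off, byte 1 becomes =1 as in the example above; with it on, it becomes =01, which is how quoted-printable writes it (always exactly two hex digits).

```cpp
#include <cstddef>
#include <string>

// Bytes that pass through unchanged: TAB and printable ASCII except '='
// (the range 9, 32-60, 62-126 quoted above).
bool isLiteral(unsigned char byte)
{
    return byte == 9 || (byte >= 32 && byte <= 126 && byte != '=');
}

// Value of an uppercase hex digit, or -1 if c is not one.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Escapes every non-literal byte as '=' plus its hex value.
// zeroPad == false drops a leading zero (byte 1 -> "=1"), which is the
// ambiguous variant; zeroPad == true always writes two digits ("=01").
std::string encode(const std::string &input, bool zeroPad)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : input) {
        if (isLiteral(byte)) {
            out += static_cast<char>(byte);
            continue;
        }
        out += '=';
        if (zeroPad || byte >= 0x10)
            out += hexDigits[byte >> 4];
        out += hexDigits[byte & 0x0F];
    }
    return out;
}

// Always consumes exactly two hex digits after '='. Anything that does not
// match that pattern is copied through unchanged.
std::string decode(const std::string &encoded)
{
    std::string out;
    for (std::size_t i = 0; i < encoded.size(); ++i) {
        if (encoded[i] == '=' && i + 2 < encoded.size()) {
            const int high = hexValue(encoded[i + 1]);
            const int low = hexValue(encoded[i + 2]);
            if (high >= 0 && low >= 0) {
                out += static_cast<char>(high * 16 + low);
                i += 2;
                continue;
            }
        }
        out += encoded[i];
    }
    return out;
}
```

With these definitions, encoding <SOH>27 without padding gives =127, which the decoder turns into one byte (0x12 when the digits are read as hex) followed by 7, so the round trip fails. With padding the same input becomes =0127 and decodes back to the original three bytes, because the decoder always knows where the escape ends.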
-
@Chris-Kawa Okay. Here we go. I hoped using quoted-printable would be easy to implement … maybe using QUrl::toPercentEncoding and adding some characters that actually don't need to be escaped for this use-case (via the exclude byte array) will have fewer pitfalls ;-)
-
Okay, this seems to do the trick:

    const auto test = QStringLiteral("abc, äöü!");
    const auto exclude = QStringLiteral(" !\"#$&'()*+,/:;=?@[]").toUtf8();
    const auto escaped = QUrl::toPercentEncoding(test, exclude);
    const auto unEscaped = QUrl::fromPercentEncoding(escaped);
    qDebug() << test;
    qDebug() << escaped;
    qDebug() << unEscaped;

Output:

    "abc, äöü!"
    "abc, %C3%A4%C3%B6%C3%BC!"
    "abc, äöü!"

This seems to be what I intended to achieve with the quoted-printable encoding, and just as long (as %C3 is not longer than =C3). And as one can exclude characters that don't need escaping for my use-case, it's essentially the same, but without programming shortcomings from me ;-)
I think this is the correct way. Thanks for the input :-)
Edit: There's also QByteArray::toPercentEncoding, which is actually called by QUrl::toPercentEncoding, with only the input QString being converted to a QByteArray via QString::toUtf8. No need to use QUrl.
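For anyone wondering what the exclude list changes, here is a rough, Qt-free sketch of the rule as the Qt documentation describes it: letters, digits and - . _ ~ are left alone, everything else becomes % plus two hex digits, unless the byte is listed in exclude. This is an illustrative re-implementation in standard C++, not Qt's actual code, and the name percentEncode is made up.

```cpp
#include <string>

// Percent-encodes every byte of utf8 that is neither unreserved
// (ALPHA / DIGIT / '-' / '.' / '_' / '~') nor listed in exclude.
// Sketch only: mirrors the documented behaviour, not Qt's implementation.
std::string percentEncode(const std::string &utf8, const std::string &exclude)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        const bool unreserved =
            (byte >= 'a' && byte <= 'z') || (byte >= 'A' && byte <= 'Z') ||
            (byte >= '0' && byte <= '9') ||
            byte == '-' || byte == '.' || byte == '_' || byte == '~';
        const bool excluded =
            exclude.find(static_cast<char>(byte)) != std::string::npos;
        if (unreserved || excluded) {
            out += static_cast<char>(byte);
        } else {
            // Always two hex digits, so a decoder knows where the escape ends.
            out += '%';
            out += hexDigits[byte >> 4];
            out += hexDigits[byte & 0x0F];
        }
    }
    return out;
}
```

Fed the UTF-8 bytes of the test string and the exclude list from the post above, this sketch produces the same abc, %C3%A4%C3%B6%C3%BC! as the qDebug output. One thing to keep in mind when extending the exclude list: % itself should stay out of it, so that a literal % is written as %25 and cannot be mistaken for the start of an escape when decoding.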
l3u_ has marked this topic as solved.