Is there an easy way to escape all non-ASCII characters of a QString?
-
Hi all!
I'm currently working on a project that creates QR codes. I store the data in a QJsonObject and create a string via QJsonDocument, which is packed into the QR code.
I tested this with a hardware QR code scanner and found that it won't handle UTF-8 data. E.g. if some value contains an ä or similar, it's simply left out. I first thought about working around this by using Latin-1 (QString::fromUtf8(jsonByteArray).toLatin1()), but I wondered if there's an easy way to simply escape and unescape all the non-ASCII characters so that no data is lost.
Like what QDebug does:
    qDebug() << QStringLiteral("äöü").toUtf8();
    "\xC3\xA4\xC3\xB6\xC3\xBC"
Can I get those literal \x... strings and convert them back to their Unicode counterparts? That way I could use a pure-ASCII string for the QR code that escapes everything that's non-ASCII, and get my real data back after having read it.
Thanks for all help!
Edit:
QByteArray::toPercentEncoding does the trick. With additional permitted characters added, it essentially allows a quoted-printable-style encoding of a string, with only ASCII characters used. Here's what I ended up using:
    QString VendorDocumentsPrinter::escape(const QString &string) const
    {
        static const auto s_exclude = QStringLiteral(" !\"#$&'()*+,/:;=?@[]").toUtf8();
        const auto byteArray = string.toUtf8();
        return QString::fromUtf8(byteArray.toPercentEncoding(s_exclude));
    }
and its counterpart:
    QString ScanQrCodeWidget::unEscape(const QJsonValue &value) const
    {
        return QString::fromUtf8(QByteArray::fromPercentEncoding(value.toString().toUtf8()));
    }
-
URL/URI encoding? Non-printable characters are escaped to conform, and there are many libraries out there.
-
What I do right now is a – possibly a bit clumsy – quoted-printable escaping to get rid of the non-ASCII characters:
    QString quote(const QString &unQuoted)
    {
        QString quoted;
        const auto utf8 = unQuoted.toUtf8();
        for (int i = 0; i < utf8.size(); i++) {
            const auto value = static_cast<int>((unsigned char) utf8[i]);
            if (value == 9 || (value >= 32 && value <= 60) || (value >= 62 && value <= 126)) {
                quoted.append(QChar(value));
            } else {
                quoted.append(QStringLiteral("=%1").arg(
                    QString::number(value, 16).rightJustified(2, QChar::fromLatin1('0'))));
            }
        }
        return quoted;
    }

    QString unQuote(const QString &quoted)
    {
        QByteArray unQuoted;
        const auto utf8 = quoted.toUtf8();
        const auto size = utf8.size();
        for (int i = 0; i < size; i++) {
            const auto val = static_cast<int>((unsigned char) utf8[i]);
            if (val != 61) { // 61 is '='
                unQuoted.append(utf8[i]);
            } else {
                bool quotedValueConverted = false;
                uint quotedValue = 0;
                if (i + 2 < size) {
                    quotedValue = utf8.mid(i + 1, 2).toUInt(&quotedValueConverted, 16);
                }
                if (quotedValueConverted) {
                    i += 2;
                    unQuoted.append(static_cast<char>(quotedValue));
                } else {
                    // This should not happen
                    unQuoted.append(utf8[i]);
                }
            }
        }
        return QString::fromUtf8(unQuoted);
    }
-
@l3u_ You can't use toLatin1 for non Latin-1 characters. That's undefined. You need to use encoding that can represent arbitrary binary data. One such encoding is Base64, but first you'll need to convert the QString to a byte array. An easy way to do that is by converting it to UTF-8.
So for example:
    QString source = "äöü";
    // Convert UTF-16 QString to UTF-8 to get a byte array and then to Base64
    // to get an ASCII only text representation of the bytes.
    // You can put that in the QR code.
    QByteArray encoded = source.toUtf8().toBase64();
    // Decode the bytes from Base64 to UTF-8 and then convert it back to QString (UTF-16).
    QString decoded = QString::fromUtf8(QByteArray::fromBase64(encoded));
-
@Chris-Kawa Thanks for the input!
I didn't want to use base64, as it will make each string longer by ~30%, no matter if it's pure ASCII or not. I want to keep the strings as short as possible, so that the error correction of the QR code is as high as possible for a given size.
But you're right, the Latin-1 stuff can't be used here. I'm not so fit with this low-level stuff ;-)
I reworked my quoted-printable quoting functions above, I hope they are better now?
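To put rough numbers on the size concern: Base64 always emits 4 characters per 3 input bytes, while a percent-style escape only expands the bytes that actually need escaping (to 3 characters each). The following is a minimal, Qt-free sketch in plain C++. The helper names base64Length and escapedLength are made up for illustration, and the rule "keep printable ASCII except the escape character" is an assumption, not what any Qt function does.

```cpp
#include <cstddef>
#include <string>

// Length of the padded Base64 encoding of n bytes:
// every group of up to 3 input bytes becomes 4 output characters.
std::size_t base64Length(std::size_t n)
{
    return 4 * ((n + 2) / 3);
}

// Length of an escaped string where printable ASCII (32..126), except the
// escape character itself, is kept and every other byte becomes 3 characters
// (escape character plus two hex digits). Hypothetical rule, for illustration.
std::size_t escapedLength(const std::string &utf8, char escapeChar = '%')
{
    std::size_t length = 0;
    for (unsigned char byte : utf8) {
        const bool keep = byte >= 32 && byte <= 126
            && byte != static_cast<unsigned char>(escapeChar);
        length += keep ? 1 : 3;
    }
    return length;
}
```

For 12 bytes of pure ASCII this gives 16 characters for Base64 versus 12 for escaping. For "abc, äöü!" (also 12 bytes in UTF-8, six of them non-ASCII) it gives 16 versus 24, so escaping only pays off when most of the text is ASCII.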
-
@l3u_ said:
I hope they are better now?
I'm afraid your encoding is ambiguous. Let's say I have a string <SOH>27, where <SOH> is the ASCII character 1. It's gonna be encoded as =127 and then decoded as <FF>7, where <FF> is the ASCII character 12. You can invent a better encoding, e.g. you can add a separator instead of fixing the number size to 2, but keep in mind that you are still reinventing a very old wheel.
If Base64 is too big for you, maybe look into some existing lossless encodings instead. The percent encoding @Kent-Dorfman mentioned might be an option. Qt already supports it through QUrl::toPercentEncoding().
-
@Chris-Kawa I don't get it?
<SOH>27 with <SOH> being 1 is just 127 and will stay 127?! It walks through the QByteArray byte per byte and checks if the byte represents an ASCII character in the defined range, and if not, it replaces it with = and the string representation of the hex value of that byte (which itself is ASCII again)? And if a = appears in the array, it means that the next two bytes represent the hex value of the byte in question (including the = itself)?
I mean, I didn't make this up, it's just Quoted-Printable – at least I hope so?! So I'm not re-inventing an old wheel, I'm just trying to implement it …
Well, QUrl::toPercentEncoding would possibly be an option, but it escapes spaces when it doesn't have to … one could replace %20 with a space before though and re-replace it again before decoding it …
-
@l3u_ said in Is there an easy way to escape all non-ASCII characters of a QString?:
<SOH>27 with <SOH> being 1 is just 127 and will stay 127?
Umm, no. I haven't followed the ins & outs of this discussion, so I may be mistaken about your context. But <SOH> is the ASCII/binary character with value 1, not the digit 1. But the 27 are the digits 27 (right or not?), which is different. The 3-character sequence <SOH>27 is not the same as the 3 characters 127.
-
@l3u_
<SOH> is 1 as in binary 00000001, not as in text "1". It's a non-printable character. Your range is 9, 32-60, 62-126, so 1 is below it and gets encoded as =1. The following 27 is text "27". The characters are in range, so they don't get translated, so you get =127. When decoding you don't know where the encoded part ends, you just assume a two-digit number, so you grab 2 as part of the encoded character, when really it's just text, so you decode =12 followed by text "7" instead of decoding =1 followed by text "27".
Can this actually be typed in using a QLineEdit?!
Haven't tried, but if not typed then probably copy/pasted from somewhere.
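The ambiguity described above can be reproduced with a small, Qt-free sketch in plain C++. It assumes a hypothetical encoder that writes the hex value without zero-padding, paired with a decoder that always reads two hex digits after the = sign; the names quoteHex and unQuoteHex are invented for illustration. Because the decoder reads hex, the mis-decoded byte in this sketch is 0x12. Padding every escape to exactly two digits removes the ambiguity.

```cpp
#include <cstddef>
#include <string>

namespace {

const char kHexDigits[] = "0123456789ABCDEF";

// Same "plain" range as in the thread: tab, and printable ASCII except '='.
bool isPlain(unsigned char byte)
{
    return byte == 9 || (byte >= 32 && byte <= 126 && byte != '=');
}

// Returns 0..15 for a hex digit, -1 for anything else.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

} // namespace

// Encoder. With padded == false, a byte below 0x10 yields only ONE hex digit,
// which is what makes the output ambiguous.
std::string quoteHex(const std::string &bytes, bool padded)
{
    std::string out;
    for (unsigned char byte : bytes) {
        if (isPlain(byte)) {
            out += static_cast<char>(byte);
            continue;
        }
        out += '=';
        if (padded || byte >= 0x10)
            out += kHexDigits[byte >> 4];
        out += kHexDigits[byte & 0x0F];
    }
    return out;
}

// Decoder that always consumes exactly two hex digits after '='.
std::string unQuoteHex(const std::string &text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '=' && i + 2 < text.size()) {
            const int high = hexValue(text[i + 1]);
            const int low = hexValue(text[i + 2]);
            if (high >= 0 && low >= 0) {
                out += static_cast<char>(high * 16 + low);
                i += 2;
                continue;
            }
        }
        out += text[i]; // malformed escape: keep the byte as-is
    }
    return out;
}
```

With the input <SOH>27, the unpadded encoder produces =127, which decodes to the byte 0x12 followed by "7". The padded encoder produces =0127, which decodes back to the original three bytes.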
-
@JonB Ah okay. Thanks for the clarification. Can this actually be typed in using a QLineEdit?! This is only intended to be used for strings typed by a user …
-
@l3u_
As I say, I have not followed the discussion. But, no, the user will not be able to type the <SOH> character into a line edit. That would actually require Ctrl+A to be typed, and a line edit won't store that as a character, it will treat it as a control sequence (probably selecting the whole of the line edit contents if you press it).
-
@Chris-Kawa Okay. Here we go. I hoped using quoted-printable would be easy to implement … maybe using QUrl::toPercentEncoding and adding some characters that actually don't need to be escaped for this use-case (using the exclude byte array) will have fewer pitfalls ;-)
-
Okay, this seems to do the trick:
    const auto test = QStringLiteral("abc, äöü!");
    const auto exclude = QStringLiteral(" !\"#$&'()*+,/:;=?@[]").toUtf8();
    const auto escaped = QUrl::toPercentEncoding(test, exclude);
    const auto unEscaped = QUrl::fromPercentEncoding(escaped);
    qDebug() << test;
    qDebug() << escaped;
    qDebug() << unEscaped;
Output:
    "abc, äöü!"
    "abc, %C3%A4%C3%B6%C3%BC!"
    "abc, äöü!"
This seems to be what I intended to achieve with the quoted-printable encoding, and just as long (as %C3 is not longer than =C3). And as one can exclude characters that don't need escaping for my use-case, it's essentially the same, but without programming shortcomings from me ;-)
I think this is the correct way. Thanks for the input :-)
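For readers without Qt at hand, the same round trip can be approximated in plain C++. This is only a sketch of the idea, not Qt's implementation: it assumes that ASCII letters, digits and the characters -._~ are always kept, that bytes listed in a caller-supplied exclude string are kept as well, and that every other byte becomes %XX. The names percentEncode and percentDecode are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Escapes every byte as %XX unless it is an ASCII letter, a digit, one of
// "-._~", or listed in `exclude`. Assumed rule set, for illustration only.
// Note: '%' itself must not be put into `exclude`, or decoding is ambiguous.
std::string percentEncode(const std::string &utf8, const std::string &exclude)
{
    static const char hexDigits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char byte : utf8) {
        const bool unreserved = byte < 128
            && (std::isalnum(byte) || byte == '-' || byte == '.'
                || byte == '_' || byte == '~');
        if (unreserved || exclude.find(static_cast<char>(byte)) != std::string::npos) {
            out += static_cast<char>(byte);
        } else {
            out += '%';
            out += hexDigits[byte >> 4];
            out += hexDigits[byte & 0x0F];
        }
    }
    return out;
}

// Turns each well-formed %XX back into the byte it stands for and copies
// everything else through unchanged.
std::string percentDecode(const std::string &text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '%' && i + 2 < text.size()
            && std::isxdigit(static_cast<unsigned char>(text[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(text[i + 2]))) {
            out += static_cast<char>(std::stoi(text.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else {
            out += text[i];
        }
    }
    return out;
}
```

Because every escape is exactly two hex digits wide and a literal % is itself escaped as %25, the decoder never has to guess where an escape ends.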
Edit: There's also QByteArray::toPercentEncoding, which is actually called by QUrl::toPercentEncoding, with only the input QString being converted to a QByteArray via QString::toUtf8. No need to use QUrl.
-
l3u_ has marked this topic as solved.