Is there an easy way to escape all non-ASCII characters of a QString?

    l3u_
    wrote on last edited by l3u_
    #1

    Hi all!

    I'm currently messing with a project creating QR codes. I store the data in a QJsonObject, and create a string via QJsonDocument, which is packed into the QR code.

    Now I tested my stuff with a hardware QR code scanner and had to see that it won't handle UTF-8 data. E.g. if some value contains a ä or such, it's simply left out. I first thought about working around this by using Latin-1 (QString::fromUtf8(jsonByteArray).toLatin1()), but I wondered if there's an easy way to simply escape and unescape all the non-ASCII characters to not lose any data.

    Like what QDebug does:

    qDebug() << QStringLiteral("äöü").toUtf8();
    "\xC3\xA4\xC3\xB6\xC3\xBC"
    

    Can I get those literal \x... strings and convert them back to their Unicode counterparts? That way I could use a pure-ASCII string for the QR code that escapes everything non-ASCII, and get my real data back after reading it.

    Thanks for all help!

    Edit: QByteArray::toPercentEncoding does the trick. With additional permitted characters passed as the exclude list, it essentially gives a quoted-printable-style encoding of a string that uses only ASCII characters. Here's what I ended up using:

    QString VendorDocumentsPrinter::escape(const QString &string) const
    {
        static const auto s_exclude = QStringLiteral(" !\"#$&'()*+,/:;=?@[]").toUtf8();
        const auto byteArray = string.toUtf8();
        return QString::fromUtf8(byteArray.toPercentEncoding(s_exclude));
    }
    

    and its counterpart:

    QString ScanQrCodeWidget::unEscape(const QJsonValue &value) const
    {
        return QString::fromUtf8(QByteArray::fromPercentEncoding(value.toString().toUtf8()));
    }
    
        l3u_
        wrote on last edited by l3u_
        #2

        What I do right now is a – possibly a bit clumsy – quoted-printable escaping to get rid of the non-ASCII characters:

        QString quote(const QString &unQuoted)
        {
            QString quoted;
            const auto utf8 = unQuoted.toUtf8();
            for (int i = 0; i < utf8.size(); i++) {
                const auto value = static_cast<int>((unsigned char) utf8[i]);
                if (value == 9 || (value >= 32 && value <= 60) || (value >= 62 && value <= 126)) {
                    quoted.append(QChar(value));
                } else {
                    quoted.append(QStringLiteral("=%1").arg(
                        QString::number(value, 16).rightJustified(2, QChar::fromLatin1('0'))));
                }
            }
        
            return quoted;
        }
        
        QString unQuote(const QString &quoted)
        {
            QByteArray unQuoted;
            const auto utf8 = quoted.toUtf8();
            const auto size = utf8.size();
            for (int i = 0; i < size; i++) {
                const auto val = static_cast<int>((unsigned char) utf8[i]);
                if (val != 61) { // 61 is '='
                    unQuoted.append(utf8[i]);
                } else {
                    bool quotedValueConverted = false;
                    uint quotedValue = 0;
                    if (i + 2 < size) {
                        quotedValue = utf8.mid(i + 1, 2).toUInt(&quotedValueConverted, 16);
                    }
                    if (quotedValueConverted) {
                        i += 2;
                        unQuoted.append(static_cast<char>(quotedValue));
                    } else {
                        // This should not happen
                        unQuoted.append(utf8[i]);
                    }
                }
            }
        
            return QString::fromUtf8(unQuoted);
        }
        
          Chris Kawa
          Lifetime Qt Champion
          wrote on last edited by
          #3

          @l3u_ You can't use toLatin1 for non-Latin-1 characters. That's undefined. You need to use an encoding that can represent arbitrary binary data. One such encoding is Base64, but first you'll need to convert the QString to a byte array. An easy way to do that is by converting it to UTF-8.

          So for example:

          QString source = "äöü";
          
          // Convert UTF-16 QString to UTF-8 to get a byte array and then to Base64
          // to get an ASCII only text representation of the bytes.
          // You can put that in the QR code.
          QByteArray encoded = source.toUtf8().toBase64();
          
          // Decode the bytes from Base64 to UTF-8 and then convert it back to QString (UTF-16).
          QString decoded = QString::fromUtf8(QByteArray::fromBase64(encoded));
          
            l3u_
            wrote on last edited by
            #4

            @Chris-Kawa Thanks for the input!

            I didn't want to use base64, as it will make each string longer by ~30%, no matter if it's pure ASCII or not. I want to keep the strings as short as possible, so that the error correction of the QR code is as high as possible for a given size.

            But you're right, the Latin-1 stuff can't be used here. I'm not so fit with this low-level stuff ;-)

            I reworked my quoted-printable quoting functions above, I hope they are better now?
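For a rough feel of the size trade-off mentioned above, here is a minimal sketch in plain standard C++ (not Qt). The helper names `base64Length` and `escapedLength` are hypothetical, and the escape estimate assumes that only non-ASCII bytes are escaped, each as a three-character sequence such as `%C3` or `=C3`.

```cpp
#include <cstddef>
#include <string>

// Length of standard padded Base64 output for n input bytes: 4 * ceil(n / 3).
std::size_t base64Length(std::size_t n)
{
    return 4 * ((n + 2) / 3);
}

// Length of an escape-based encoding under the assumption that every
// non-ASCII byte becomes three characters and every ASCII byte stays as-is.
std::size_t escapedLength(const std::string &utf8)
{
    std::size_t length = 0;
    for (unsigned char byte : utf8) {
        length += (byte < 0x80) ? 1 : 3;
    }
    return length;
}
```

For the 12 UTF-8 bytes of "abc, äöü!", this gives 16 characters for Base64 and 24 for the escaped form, while pure ASCII input such as "abc" gives 4 versus 3. Under these assumptions, escaping is shorter only when non-ASCII bytes are rare, roughly fewer than one byte in six.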

              Kent-Dorfman
              wrote on last edited by Kent-Dorfman
              #5

              URL/URI encoding? non-printable characters are escaped to conform, and there are many libraries out there.

              https://en.wikipedia.org/wiki/Percent-encoding

                Chris Kawa
                Lifetime Qt Champion
                wrote on last edited by
                #6

                @l3u_ said:

                I hope they are better now?

                I'm afraid your encoding is ambiguous. Let's say I have a string <SOH>27, where <SOH> is the ASCII character 1. It's gonna be encoded as =127 and then decoded as <FF>7, where <FF> is the ASCII character 12. You can invent a better encoding, e.g. you can add a separator instead of fixing the number size to 2, but keep in mind that you are still reinventing a very old wheel.

                If Base64 is too big for you maybe look into some existing lossless encodings instead. The percent encoding @Kent-Dorfman mentioned might be an option. Qt already supports it through QUrl::toPercentEncoding().

                  l3u_
                  wrote on last edited by l3u_
                  #7

                  @Chris-Kawa I don't get it? <SOH>27 with <SOH> being 1 is just 127 and will stay 127?! It walks through the QByteArray byte per byte and checks if the byte represents an ASCII character in the defined range, and if not, it replaces it with = and the string representation of the hex value of that byte (which itself is ASCII again)? And if a = appears in the array, it means that the next two bytes represent the hex value of the byte in question (including the = itself)?

                  I mean, I didn't make this up, it's just Quoted-Printable – at least I hope so?! So I'm not re-inventing an old wheel, I'm just trying to implement it …

                  Well, QUrl::toPercentEncoding would possibly be an option, but it escapes spaces when it doesn't have to … one could replace %20 with a space before though and re-replace it again before decoding it …

                    JonB
                    wrote on last edited by
                    #8

                    @l3u_ said in Is there an easy way to escape all non-ASCII characters of a QString?:

                    <SOH>27 with <SOH> being 1 is just 127 and will stay 127?

                    Umm, no. I haven't followed the ins & outs of this discussion, so I may be mistaken about your context. But <SOH> is ASCII/binary character with value 1, not digit 1. But the 27 are digits 27 (right or not?), which is different. The 3 character sequence <SOH>27 is not the same as the 3 characters 127.

                      l3u_
                      wrote on last edited by
                      #9

                      @JonB Ah okay. Thanks for the clarification. Can this actually be typed in using a QLineEdit?! This is only intended to be used for strings typed by a user …

                        Chris Kawa
                        Lifetime Qt Champion
                        wrote on last edited by Chris Kawa
                        #10

                        @l3u_ <SOH> is 1 as in binary 00000001, not as in text "1". It's a non-printable character. Your range is 9, 32-60, 62-126, so 1 is below it and gets encoded as =1. The following 27 is the text "27". Those characters are in range, so they don't get translated, and you get =127. When decoding, you don't know where the encoded part ends and just assume a two-digit number, so you grab the 2 as part of the encoded character when really it's just text. You then decode =12 followed by the text "7" instead of =1 followed by the text "27".

                        Can this actually be typed in using a QLineEdit?!

                        Haven't tried, but if not typed then probably copy/pasted from somewhere.

                          JonB
                          wrote on last edited by
                          #11

                          @l3u_
                          As I say, I have not followed the discussion. But, no, the user will not be able to type the <SOH> character into a line edit. That would actually require Ctrl+A to be typed, and a line edit won't store that as a character; it will treat it as a control sequence (probably selecting the whole of the line edit contents if you press it).

                            l3u_
                            wrote on last edited by
                            #12

                            @Chris-Kawa Okay. Here we go. I hoped using quoted-printable would be easy to implement … maybe using QUrl::toPercentEncoding and adding some characters that actually don't need to be escaped for this use-case (using the exclude bytearray) will have fewer pitfalls ;-)

                              l3u_
                              wrote on last edited by
                              #13

                              @JonB Okay. Well, this is only intended to be used to escape non-ASCII (UTF-8) characters from user input and unescape them later on. The input is a single line from a QLineEdit.

                                l3u_
                                wrote on last edited by l3u_
                                #14

                                Okay, this seems to do the trick:

                                const auto test = QStringLiteral("abc, äöü!");
                                const auto exclude = QStringLiteral(" !\"#$&'()*+,/:;=?@[]").toUtf8();
                                const auto escaped = QUrl::toPercentEncoding(test, exclude);
                                const auto unEscaped = QUrl::fromPercentEncoding(escaped);
                                
                                qDebug() << test;
                                qDebug() << escaped;
                                qDebug() << unEscaped;
                                

                                Output:

                                "abc, äöü!"
                                "abc, %C3%A4%C3%B6%C3%BC!"
                                "abc, äöü!"
                                

                                This seems to be what I intended to achieve with the quoted-printable encoding, and it's just as long (as %C3 is not longer than =C3). And as one can exclude characters that don't need escaping for my use-case, it's essentially the same, but without programming shortcomings from me ;-)

                                I think this is the correct way. Thanks for the input :-)

                                Edit: There's also QByteArray::toPercentEncoding, which is actually called by QUrl::toPercentEncoding, with only the input QString being converted to a QByteArray via QString::toUtf8. No need to use QUrl.
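To illustrate what the percent-encoding step does at the byte level, here is a minimal sketch in plain standard C++ rather than Qt. It assumes the rule described in the Qt documentation for QByteArray::toPercentEncoding: unreserved characters (letters, digits, `-`, `.`, `_`, `~`) and anything in the exclude list are kept, and every other byte becomes `%` plus two uppercase hex digits. The function names are hypothetical and this is not Qt's implementation.

```cpp
#include <cstddef>
#include <string>

// True for the RFC 3986 "unreserved" characters.
bool isUnreserved(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
}

// Value of one hex digit, or -1 if the character is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Keep unreserved and excluded bytes; write every other byte as %XX.
std::string percentEncode(const std::string &bytes, const std::string &exclude)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (isUnreserved(c) || exclude.find(static_cast<char>(c)) != std::string::npos) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Turn each valid %XX sequence back into one byte; copy everything else.
std::string percentDecode(const std::string &text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '%' && i + 2 < text.size()) {
            const int high = hexValue(text[i + 1]);
            const int low = hexValue(text[i + 2]);
            if (high >= 0 && low >= 0) {
                out += static_cast<char>(high * 16 + low);
                i += 2;
                continue;
            }
        }
        out += text[i];
    }
    return out;
}
```

The round trip stays unambiguous as long as `%` itself is not put into the exclude list, because a literal `%` is then always written as `%25`.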

                                  l3u_
                                  wrote on last edited by
                                  #15

                                  @l3u_ However, it would be encoded as =01, not =1 ;-)
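The point about =01 can be checked with a small sketch in plain standard C++ (not Qt). It mirrors the idea of the quote/unQuote functions posted earlier, using `=` plus exactly two hex digits, but it is a simplified stand-in rather than the original code, and it writes uppercase hex digits as quoted-printable (RFC 2045) requires.

```cpp
#include <cstddef>
#include <string>

// Value of one hex digit, or -1 if the character is not a hex digit.
int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Keep tab and printable ASCII except '='; write every other byte as =XX.
std::string quote(const std::string &bytes)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        const bool literal = c == 9 || (c >= 32 && c <= 126 && c != '=');
        if (literal) {
            out += static_cast<char>(c);
        } else {
            out += '=';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Turn each valid =XX sequence back into one byte; copy everything else.
std::string unQuote(const std::string &text)
{
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '=' && i + 2 < text.size()) {
            const int high = hexValue(text[i + 1]);
            const int low = hexValue(text[i + 2]);
            if (high >= 0 && low >= 0) {
                out += static_cast<char>(high * 16 + low);
                i += 2;
                continue;
            }
        }
        out += text[i];
    }
    return out;
}
```

With this sketch, the three bytes <SOH>27 become "=0127" and decode back to <SOH>27. The escape always has a fixed width of two digits and a literal `=` is itself escaped as `=3D`, so the decoder never has to guess where an escape ends.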

                                  • l3u_L l3u_ has marked this topic as solved on
