How to determine data format in QByteArray (ASCII / HEX / Unicode)
-
wrote on 26 Sept 2022, 16:05 last edited by
I have software that analyzes the serial communication between a testing machine and its software, because I need to grab these values for my own software. Most of the components use ASCII formats for their communication, but some use binary data, such as Modbus. Since I use readAll(), I get the result in a QByteArray, and when I print it with qDebug() I can clearly see whether it's ASCII or HEX (HEX values are printed as \xdf\x01\xff...), but I did not find a way to determine in software what format it is. I think there must be a way to find this out...
-
wrote on 26 Sept 2022, 16:17 last edited by JonB
@hkottmann said in How to determine data format in QByteArray (ASCII / HEX / Unicode):
but I did not find a way to determine in software what format it is. I think there must be a way to find this out...
No, there can be no such thing. You receive bytes over serial into a QByteArray. Bytes are bytes! They could mean anything: arbitrary binary values, multibyte numbers, ASCII character values, or whatever. There is no foolproof way of knowing which, other than checking whether a lot of the bytes look like characters. You have to know who the sender is and what "format" it sends bytes in if you want to "interpret" them as such.
-
wrote on 26 Sept 2022, 16:23 last edited by
BTW, how does qDebug() know how to print it?
-
wrote on 26 Sept 2022, 16:50 last edited by
@hkottmann
You are handing qDebug() a QByteArray, so it shows the bytes in the array. It may or may not show them as characters if the byte happens to be in ASCII range, I don't know. But either way, neither qDebug() nor QByteArray knows anything about what the bytes "mean" or where they come from.
-
Looks like the OP wanted a second opinion because mine wasn't the right answer: https://stackoverflow.com/questions/73855456/how-to-determine-data-format-in-qbytearray-ascii-hex-unicode :)
-
wrote on 26 Sept 2022, 17:46 last edited by
@hkottmann said in How to determine data format in QByteArray (ASCII / HEX / Unicode):
BTW, how does qDebug() know how to print it?
The source is available. You can see how qDebug() does it. https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/io/qdebug.cpp#n26
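As a rough illustration of that kind of output (a simplified sketch, not Qt's actual implementation), the following prints printable ASCII as-is and everything else as \xNN. The helper name escapeBytes is made up for this example, and it takes a std::string so it compiles without Qt; with a QByteArray you could pass std::string(ba.constData(), ba.size()). Unlike qDebug(), it does not special-case \r, \n or \t.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not part of Qt): printable ASCII is copied through,
// every other byte, and the backslash itself, is written as \xNN.
std::string escapeBytes(const std::string &data)
{
    std::string out;
    for (unsigned char c : data) {
        if (c >= 0x20 && c < 0x7f && c != '\\') {
            out += static_cast<char>(c);
        } else {
            char buf[5];  // backslash, 'x', two hex digits, terminating NUL
            std::snprintf(buf, sizeof buf, "\\x%02x", static_cast<unsigned>(c));
            out += buf;
        }
    }
    return out;
}
```

With this sketch, the four bytes 'O', 'K', CR, LF come out as OK\x0d\x0a. Note that the function only decides how to display each byte; it still knows nothing about what the sender meant.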
-
wrote on 27 Sept 2022, 07:57 last edited by
Well, there are a few tricks you can try, though it will not be perfect.
First, I assume that when you say ASCII you mean true ASCII, i.e. only the 128 values of 7-bit ASCII and not the full 256 byte values. If this is not the case then you are (almost) out of luck. At least you could not distinguish between ASCII, HEX, and Unicode all three at the same time.
If you have 7-bit ASCII then just treat it as UTF-8 (I assume that when you say Unicode, you mean UTF-8). This range is the same for ASCII and UTF-8. Then, you only need to distinguish between UTF-8 and HEX.
The first 32 values in ASCII and Unicode are control characters. Most likely you'll only want to support a specific set of control characters, like \0, \n and \r (maybe \t), inside text blocks. If your QByteArray contains any other control characters, treat the whole QByteArray as HEX.
You should also have a quick look at UTF-8 on Wikipedia. If a byte has the bit pattern 0xxxxxxx, it is an ASCII character (including all control characters). 110xxxxx starts a 2-byte multibyte character in UTF-8 (two leading ones), 1110xxxx starts a 3-byte multibyte character, and 11110xxx starts a 4-byte multibyte character. The remaining bytes of a multibyte character have the pattern 10xxxxxx. If you don't recognize a QByteArray as UTF-8 (because it contains invalid multibyte sequences), treat it as HEX.
The last one can be sped up a little bit (i.e. you don't have to implement it yourself):
    #include <QTextCodec>

    // 'text' is the received data. QTextCodec is Qt 5 API (in Qt 6 it lives in the Core5Compat module).
    QTextCodec::ConverterState state;
    QTextCodec *codec = QTextCodec::codecForName("UTF-8");
    QByteArray byteArray(text);
    QString str = codec->toUnicode(byteArray.constData(), byteArray.size(), &state);
    if (state.invalidChars > 0)
    {
        // not UTF-8 -> treat as HEX
    }
This approach still has some room for errors. Any HEX sequence could look like UTF-8 by accident. How well it works depends on the length of the QByteArray. If you have longer sequences this approach will work better. If you receive the full protocol and somewhere in there is some transmitted text hidden, then you need to parse the protocol and can't just rely on my proposed heuristic.
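If you prefer to implement the two checks by hand, a minimal sketch could look like the following. It is written against std::string so it compiles without Qt (a QByteArray can be passed as std::string(ba.constData(), ba.size())). The names Guess, isUnwantedControl, isStructurallyUtf8 and guessFormat are made up for this example. The UTF-8 check only looks at the lead/continuation bit patterns described above; it does not reject overlong encodings or surrogates, so it is more permissive than a real decoder.

```cpp
#include <cstddef>
#include <string>

enum class Guess { Text, Binary };

// Control characters (values below 0x20) other than the allowed \0, \t, \n, \r.
bool isUnwantedControl(unsigned char c)
{
    if (c == 0x00 || c == '\t' || c == '\n' || c == '\r')
        return false;
    return c < 0x20;
}

// Checks only the UTF-8 bit patterns: a lead byte followed by the right
// number of 10xxxxxx continuation bytes.
bool isStructurallyUtf8(const std::string &data)
{
    std::size_t i = 0;
    while (i < data.size()) {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;
        if ((c & 0x80) == 0x00)      extra = 0;  // 0xxxxxxx: ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 110xxxxx: 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 1110xxxx: 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 11110xxx: 4-byte sequence
        else return false;                       // stray continuation or invalid lead byte

        if (i + extra >= data.size())
            return false;                        // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(data[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;                    // continuation must be 10xxxxxx
        }
        i += extra + 1;
    }
    return true;
}

// Heuristic only: any unwanted control character or broken UTF-8 means "binary".
Guess guessFormat(const std::string &data)
{
    for (unsigned char c : data) {
        if (isUnwantedControl(c))
            return Guess::Binary;
    }
    return isStructurallyUtf8(data) ? Guess::Text : Guess::Binary;
}
```

For example, guessFormat("TEMP=21.5\r\n") yields Guess::Text, while the three bytes 0xdf 0x01 0xff yield Guess::Binary. As said above, a short binary message can still pass as text by accident.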
-
wrote on 8 Oct 2022, 07:35 last edited by
Dear all
Thanks for your help. I went the way qDebug does. BTW, there is no foolproof way to easily determine the data format. The function isprint() can give you a hint, but in most cases you need further checks to determine the real data format, as QByteArray and QString don't carry any header information about their data format.
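For anyone finding this later, here is a minimal sketch of the isprint() hint, not my exact code. The function name printableRatio is just for this example, and it uses std::string so it builds without Qt (a QByteArray can be passed as std::string(ba.constData(), ba.size())). It assumes the default "C" locale, where isprint() accepts only 0x20 to 0x7e. Which ratio counts as "text" is something you have to pick for your own devices.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Fraction of bytes that are printable ASCII or common whitespace (\r, \n, \t).
// Returns 0.0 for empty input.
double printableRatio(const std::string &data)
{
    if (data.empty())
        return 0.0;
    std::size_t printable = 0;
    for (unsigned char c : data) {
        if (std::isprint(c) || c == '\r' || c == '\n' || c == '\t')
            ++printable;
    }
    return static_cast<double>(printable) / static_cast<double>(data.size());
}
```

A pure ASCII reply such as "OK\r\n" gives 1.0, while the bytes 0xdf 0x01 0xff give 0.0. Values in between are where the further checks come in.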