Unsolved The most appropriate way of detecting BOM in a UTF-8/16/32 file/text
-
Hello Qt guys. I stumbled upon a scenario where I wanted to detect whether a given UTF-x encoded text has a BOM. The documentation says QTextStream supports automatic detection of the BOM, but strangely it mentions only UTF-16 and UTF-32 BOMs, although UTF-8, being a multi-byte encoding, also has a BOM. Now, about the actual detection of the BOM. This is the only working answer I've found:
QFile file( "..." );
file.open( QIODevice::ReadOnly ); // peek() returns nothing unless the device is open
const bool hasByteOrderMark = QTextCodec::codecForUtfText( file.peek(4), Q_NULLPTR ) != Q_NULLPTR;
This seems to me a rather raw, hardcoded approach, since the length of the BOM is explicitly stated as 4. UTF BOM variants ( see here / "Byte order marks by encoding" ) actually range from 2 to 5 bytes if we also take the non-standard UTF-7 encoding into account. Thus, to make the implementation more elegant and Qt-conformant, I thought it would probably be better to serialize the BOM in all of its forms (UTF-8/16/32, big/little-endian) into a string and then check whether the start of the text/file at hand matches any of them. I tried serializing the BOM to a string using QTextStream's bom global function alone:
QString str;
QTextStream ts( & str );
ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
ts << bom;
qDebug() << QString( "text with bom alone: text(%1), size(%2)" ) .arg( str ) .arg( str.size() );
The output is an empty string and zero as size. The same goes for the following code as well:
QString str;
QTextStream ts( & str );
ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
ts.setGenerateByteOrderMark( true );
ts << 'a';
qDebug() << QString( "text with bom alone: text(%1), size(%2)" ) .arg( str ) .arg( str.size() );
The result is the text "a" and a size of 1.
Is there any more elegant and standard way of detecting the BOM without hardcoded knowledge of its size?
-
Hi,
AFAIK, no, there's not.
Did you also see that post about the same subject? The technique is slightly different from yours.
I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.
-
AFAIK, no, there's not.
Ok.
Did you also see that post about the same subject? The technique is slightly different from yours.
I cited exactly this approach in my post above, with just a few cosmetic modifications, I would say. See the first citation:
This is the only working answer I've found:
I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.
Thanks.