The most appropriate way of detecting BOM in a UTF-8/16/32 file/text



  • Hello Qt guys. I ran into a scenario where I need to detect whether text in some UTF-x encoding starts with a BOM. According to the documentation, QTextStream supports automatic detection of the BOM, but strangely the documentation says it detects only the UTF-16 and UTF-32 BOMs, even though UTF-8, being a multi-byte encoding, also has a BOM. Now, about the actual detection of the BOM. This is the only working answer I've found:

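    QFile file( "..." );
    // peek() only returns data once the file has been opened for reading
    const bool hasByteOrderMark = file.open( QIODevice::ReadOnly )
        && QTextCodec::codecForUtfText( file.peek(4), Q_NULLPTR ) != Q_NULLPTR;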
    

    This seems to me a rather raw, hardcoded approach, since the length of the BOM is explicitly stated as 4. BOM variants ( see here / "Byte order marks by encoding" ) actually range from 2 to 5 bytes if we also count the non-standard UTF-7 encoding: 3 bytes for UTF-8, 2 for UTF-16 and 4 for UTF-32. So, to make the implementation more elegant and Qt-conformant, I thought it would probably be better to serialize the BOM in all of its forms (UTF-8 and UTF-16/32 big/little-endian) into a string and then check whether the start of the text/file at hand matches any of them. I tried serializing the BOM to a string using QTextStream's bom global function alone:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts << bom;
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The output is an empty string and zero as size. The same goes for the following code as well:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts.setGenerateByteOrderMark( true );
    ts << 'a';
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The result is the text "a" and size of 1.

    Is there any more elegant and standard way of detecting the BOM without hardcoded knowledge of its size?
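
The table-matching idea described in the question (compare the start of the data against every known BOM) can be sketched in plain C++17 without Qt. This is only an illustration under stated assumptions: `BomEntry`, `kBoms` and `detectBom` are hypothetical names, not Qt API, and the table covers only the UTF-8/16/32 BOMs, not UTF-7 or other encodings.

```cpp
#include <array>
#include <string_view>

using namespace std::string_view_literals;

// Hypothetical helper type: an encoding label plus its raw BOM bytes.
struct BomEntry {
    std::string_view name;
    std::string_view bytes;
};

// Longer BOMs are listed before shorter ones that share a prefix:
// the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
// The sv suffix keeps the embedded zero bytes in the literal.
constexpr std::array<BomEntry, 5> kBoms{{
    {"UTF-32BE"sv, "\x00\x00\xFE\xFF"sv},
    {"UTF-32LE"sv, "\xFF\xFE\x00\x00"sv},
    {"UTF-8"sv,    "\xEF\xBB\xBF"sv},
    {"UTF-16BE"sv, "\xFE\xFF"sv},
    {"UTF-16LE"sv, "\xFF\xFE"sv},
}};

// Returns the label of the first table entry whose BOM is a prefix of
// `data`, or an empty view when no entry matches.
inline std::string_view detectBom(std::string_view data) {
    for (const BomEntry& entry : kBoms) {
        // substr() clamps to the available length, so short input is safe.
        if (data.substr(0, entry.bytes.size()) == entry.bytes) {
            return entry.name;
        }
    }
    return {};
}
```

The BOM lengths now live in the table rather than at the call site, although the caller still has to supply at least as many bytes as the longest entry (4 here). With Qt, the bytes returned by `file.peek(4)` on an opened QFile could presumably be passed in via `constData()` and `size()`. One caveat: UTF-16LE text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32LE BOM, so the result is a heuristic.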


  • Lifetime Qt Champion

    Hi,

    AFAIK, no, there's not.

    Did you also see that post about the same subject? The technique is slightly different from yours.

    I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.



  • @SGaist

    AFAIK, no there's not.

    Ok.

    @SGaist:

    Did you also see that post about the same subject? The technique is slightly different from yours.

    I cited exactly this approach in my post above, with just a few cosmetic modifications, I would say. See the first citation:

    @napajejenunedk0:

    This is the only working answer I've found:

    @SGaist:

    I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.

    Thanks.

