The most appropriate way of detecting BOM in a UTF-8/16/32 file/text



  • Hello Qt guys. I ran into a scenario where I need to detect whether a given UTF-x encoded text starts with a BOM. According to its documentation, QTextStream supports automatic BOM detection, but strangely the documentation mentions only UTF-16 and UTF-32 BOMs, even though UTF-8, being a multi-byte encoding, also has a BOM. Now, about the actual detection of the BOM. This is the only working answer I've found:

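    QFile file( "..." );
    file.open( QIODevice::ReadOnly ); // peek() only returns data from an open device
    // codecForUtfText() returns the default codec (null here) when no BOM is found
    const bool hasByteOrderMark = QTextCodec::codecForUtfText( file.peek(4), Q_NULLPTR ) != Q_NULLPTR;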
    

    This seems to me a rather raw, hardcoded approach, since the length of the BOM is explicitly stated as 4. BOM variants (see here / "Byte order marks by encoding") actually range from 2 to 5 bytes if we also take the non-standard UTF-7 encoding into account. So, to make the implementation more elegant and Qt-conformant, I thought it would probably be better to serialize the BOM in all of its forms (UTF-8, UTF-16 and UTF-32, big- and little-endian) into a string and then check whether the start of the text/file at hand matches any of them. I tried serializing the BOM to a string using QTextStream's global bom function alone:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts << bom;
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The output is an empty string with a size of zero. The following code does not produce a BOM either:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts.setGenerateByteOrderMark( true );
    ts << 'a';
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The result is the text "a" with a size of 1, so no BOM was written here either.

    Is there any more elegant and standard way of detecting the BOM without hardcoded knowledge of its size?
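
    The table-matching idea described above (list every BOM form, then check whether the data starts with one of them) could be sketched roughly as follows. This is only a minimal illustration in plain C++14 and not a Qt facility: `BomEntry`, `bomTable` and `detectBom` are hypothetical names, and the table covers only the UTF-8/16/32 marks, not UTF-7 or other encodings. Each mark's length comes from its string literal, so no size is hardcoded at the call site.

```cpp
#include <string>
#include <vector>

using namespace std::string_literals;

// Hypothetical helper type for this sketch; nothing here is a Qt API.
struct BomEntry {
    std::string name;   // encoding label
    std::string bytes;  // raw BOM bytes; the length comes from the literal itself
};

// Longest marks first: the UTF-32LE mark (FF FE 00 00) starts with the
// UTF-16LE mark (FF FE), so the order of the checks matters.
const std::vector<BomEntry>& bomTable() {
    static const std::vector<BomEntry> table = {
        {"UTF-32BE", "\x00\x00\xFE\xFF"s},
        {"UTF-32LE", "\xFF\xFE\x00\x00"s},
        {"UTF-8",    "\xEF\xBB\xBF"s},
        {"UTF-16BE", "\xFE\xFF"s},
        {"UTF-16LE", "\xFF\xFE"s},
    };
    return table;
}

// Returns the label of the BOM that `data` starts with, or an empty string
// when none of the known marks matches.
std::string detectBom(const std::string& data) {
    for (const BomEntry& bom : bomTable()) {
        // compare() clamps the length, so data shorter than the mark is safe.
        if (data.compare(0, bom.bytes.size(), bom.bytes) == 0) {
            return bom.name;
        }
    }
    return std::string();
}
```

    A caveat of this approach: UTF-16LE text whose first character is U+0000 is indistinguishable from a UTF-32LE BOM, so the longest-first order is a heuristic. In a Qt program, the leading bytes of a file could presumably be handed over via QByteArray::toStdString().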


  • Lifetime Qt Champion

    Hi,

    AFAIK, no, there's not.

    Did you also see that post about the same subject? The technique is slightly different from yours.

    I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.



  • @SGaist

    AFAIK, no, there's not.

    Ok.

    @SGaist:

    Did you also see that post about the same subject? The technique is slightly different from yours.

    I cited exactly this approach in my post above, with just a few cosmetic modifications. See the first quotation:

    @napajejenunedk0:

    This is the only working answer I've found:

    @SGaist:

    I'd also recommend taking a look at the interest mailing list archives. IIRC there was already discussion about the topic there.

    Thanks.

