The most appropriate way of detecting BOM in a UTF-8/16/32 file/text

  • napajejenunedk0 wrote (#1):

    Hello Qt guys. I ran into a scenario where I needed to detect whether a given UTF-x encoded text starts with a BOM. According to the documentation, QTextStream supports automatic BOM detection, but strangely the documentation mentions only the UTF-16 and UTF-32 BOMs, even though UTF-8 text can carry a BOM as well. As for the actual detection of the BOM, this is the only working answer I've found:

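    QFile file( "..." );
    file.open( QIODevice::ReadOnly ); // peek() returns nothing on an unopened device
    // codecForUtfText() returns the fallback (null here) when no UTF BOM is found
    const bool hasByteOrderMark = QTextCodec::codecForUtfText( file.peek(4), Q_NULLPTR ) != Q_NULLPTR;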
    

    This strikes me as a rather raw, hardcoded approach, since the length of the BOM is explicitly stated as 4. UTF BOM variants ( see here / "Byte order marks by encoding" ) actually range from 2 to 4 bytes for UTF-8/16/32 (2 for UTF-16, 3 for UTF-8, 4 for UTF-32), or up to 5 if we also count the non-standard UTF-7 encoding. So, to make the implementation more elegant and Qt-conformant, I thought it would be better to serialize the BOM in all of its forms (UTF-8 and big/little-endian UTF-16/32) and then check whether the start of the text/file at hand matches any of them. I tried serializing the BOM to a string using QTextStream's global bom function alone:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts << bom;
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The output is an empty string with a size of zero. The following code does not produce a BOM either:

    QString str;
    QTextStream ts( & str );
    ts.setCodec( QTextCodec::codecForName( "UTF-8" ) );
    ts.setGenerateByteOrderMark( true );
    ts << 'a';
    
    qDebug() << QString( "text with bom alone: text(%1), size(%2)" )
                .arg( str )
                .arg( str.size() );
    

    The result is the text "a" with a size of 1, so no BOM was written here either.
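
A possible explanation for both results, offered as an assumption rather than a statement about Qt internals: a BOM is the byte serialization of the code point U+FEFF, so it only comes into existence when text is encoded to bytes, for example when QTextStream writes to a device through a UTF codec. A QString target holds already decoded text, and the bom manipulator appears only to request BOM generation for later encoding. The Qt-free sketch below, using the hypothetical helpers bomUtf8 and bomFixedWidth, shows how the same code point produces the different BOM byte sequences:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helpers for illustration only; these are not Qt API.

// Encode U+FEFF as UTF-8. Code points in U+0800..U+FFFF use the
// three-byte form 1110xxxx 10xxxxxx 10xxxxxx.
std::string bomUtf8() {
    const std::uint32_t cp = 0xFEFF;
    std::string out;
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}

// Serialize U+FEFF as a single fixed-width code unit:
// width 2 corresponds to UTF-16, width 4 to UTF-32.
std::string bomFixedWidth(int width, bool bigEndian) {
    assert(width == 2 || width == 4);
    const std::uint32_t cp = 0xFEFF;
    std::string out;
    for (int i = 0; i < width; ++i) {
        // Big-endian emits the most significant byte first.
        const int shift = 8 * (bigEndian ? width - 1 - i : i);
        out += static_cast<char>((cp >> shift) & 0xFF);
    }
    return out;
}
```

For instance, bomUtf8() yields the bytes EF BB BF, bomFixedWidth(2, true) yields FE FF, and bomFixedWidth(4, false) yields FF FE 00 00.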

    Is there a more elegant and standard way of detecting the BOM without hardcoded knowledge of its size?
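
The table-matching idea from this post (compare the start of the data against every known BOM) can be sketched in plain C++ without Qt. BomEntry, knownBoms, detectBom and bomLength are hypothetical names chosen for illustration, not Qt API, and the table is assumed to cover only UTF-8/16/32. The BOM lengths live in the table, so the calling code needs no hardcoded size:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record, not a Qt type: an encoding label plus its BOM bytes.
struct BomEntry {
    const char *encoding;
    std::string bytes;
};

// Known BOMs for UTF-8/16/32. The UTF-32 entries come first because
// FF FE 00 00 (UTF-32LE) begins with FF FE (UTF-16LE).
const std::vector<BomEntry> &knownBoms() {
    static const std::vector<BomEntry> table = {
        {"UTF-32BE", std::string("\x00\x00\xFE\xFF", 4)},
        {"UTF-32LE", std::string("\xFF\xFE\x00\x00", 4)},
        {"UTF-8",    std::string("\xEF\xBB\xBF", 3)},
        {"UTF-16BE", std::string("\xFE\xFF", 2)},
        {"UTF-16LE", std::string("\xFF\xFE", 2)},
    };
    return table;
}

// Returns the first entry whose BOM is a prefix of `data`, or nullptr.
const BomEntry *detectBom(const std::string &data) {
    for (const BomEntry &entry : knownBoms()) {
        // An empty BOM would match everything; guard against a bad table edit.
        assert(!entry.bytes.empty());
        if (data.compare(0, entry.bytes.size(), entry.bytes) == 0) {
            return &entry;
        }
    }
    return nullptr;
}

// Number of leading bytes to skip before decoding; 0 when no BOM is present.
std::size_t bomLength(const std::string &data) {
    const BomEntry *entry = detectBom(data);
    return entry ? entry->bytes.size() : 0;
}
```

One caveat of any prefix-based check: UTF-16LE data whose first character after the BOM is U+0000 starts with the same four bytes as a UTF-32LE BOM, so this sketch would report it as UTF-32LE.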

    • SGaist (Lifetime Qt Champion) wrote (#2):

      Hi,

      AFAIK, no, there isn't.

      Did you also see that post about the same subject? The technique is slightly different from yours.

      I'd also recommend taking a look at the interest mailing list archives. IIRC there was already a discussion about the topic there.


      • napajejenunedk0 wrote (#3):

        @SGaist: "AFAIK, no, there isn't."

        Ok.

        @SGaist: "Did you also see that post about the same subject? The technique is slightly different from yours."

        I cited exactly this approach in my post above, with just a few cosmetic modifications, I would say. See the first citation:

        @napajejenunedk0: "This is the only working answer I've found:"

        @SGaist: "I'd also recommend taking a look at the interest mailing list archives. IIRC there was already a discussion about the topic there."

        Thanks.
