Does Qt detect which encoding a file uses?
-
Hi,
I have files with different encodings: UTF-8, UTF-8 with BOM, windows-1250, UTF-16LE, UTF-16BE.
I know that when a file is encoded as UTF-8 (without BOM) and I load it into a QString and then pass that QString to label->setText(), I see garbled characters. So I have to use:
QTextStream ts1(file1.readAll(), QIODevice::ReadOnly);
ts1.setCodec(QTextCodec::codecForName("UTF-8"));
And now it works.
When I use the other encodings (UTF-8 with BOM, windows-1250, UTF-16LE, UTF-16BE) I don't have this problem, so my question is: can Qt autodetect that a file is, for example, UTF-8 with BOM and automatically set the codec to UTF-8?
I think Qt can't detect plain UTF-8 because there is no BOM in the file, and ANSI is the default. UTF-8 with BOM, UTF-16LE and UTF-16BE do have a BOM, so Qt can check the first bytes of the file and find it. But maybe I'm wrong.
EDIT:
In the QTextStream docs I found:
Internally, QTextStream uses a Unicode based buffer, and QTextCodec is used by QTextStream to automatically support different character sets. By default, QTextCodec::codecForLocale() is used for reading and writing, but you can also set the codec by calling setCodec(). Automatic Unicode detection is also supported. When this feature is enabled (the default behavior), QTextStream will detect the UTF-16 or the UTF-32 BOM (Byte Order Mark) and switch to the appropriate UTF codec when reading. QTextStream does not write a BOM by default, but you can enable this by calling setGenerateByteOrderMark(true). When QTextStream operates on a QString directly, the codec is disabled.
So UTF-16 is detected, but what about UTF-8 with BOM?
I found this in the source:
https://code.woboq.org/qt5/qtbase/src/corelib/serialization/qtextstream.cpp.html
It reads:
QTextCodec *QTextStream::codec() const
{
    Q_D(const QTextStream);
    return d->codec;
}

/*!
    If \a enabled is true, QTextStream will attempt to detect Unicode encoding
    by peeking into the stream data to see if it can find the UTF-8, UTF-16, or
    UTF-32 Byte Order Mark (BOM). If this mark is found, QTextStream will
    replace the current codec with the UTF codec.

    This function can be used together with setCodec(). It is common to set the
    codec to UTF-8, and then enable UTF-16 detection.

    \sa autoDetectUnicode(), setCodec(), QTextCodec::codecForUtfText()
*/
void QTextStream::setAutoDetectUnicode(bool enabled)
{
    Q_D(QTextStream);
    d->autoDetectUnicode = enabled;
}

So... what about UTF-8 with BOM?
-
QTextStream, by default, will detect a UTF-32, UTF-16, or UTF-8 BOM and react accordingly. Any other sequence of bytes will be interpreted using whatever QTextCodec::codecForLocale() returns on the machine running the code (UTF-8 on my Linux box, probably a Windows-125x 8-bit encoding on Windows).
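To make the idea of BOM detection concrete, here is a rough sketch in plain C++ (no Qt) of what peeking at the first bytes of a file amounts to. This is not Qt's actual implementation; encodingFromBom is a hypothetical helper written for illustration, and the only facts it relies on are the standard BOM byte sequences.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Qt: returns the encoding implied by a
// leading byte order mark, or an empty string when no BOM is present.
std::string encodingFromBom(const std::string &bytes)
{
    // True when the data begins with the n bytes pointed to by bom.
    auto startsWith = [&bytes](const char *bom, std::size_t n) {
        return bytes.compare(0, n, bom, n) == 0;
    };

    // UTF-32 has to be tested before UTF-16, because the UTF-32LE BOM
    // (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    if (startsWith("\x00\x00\xFE\xFF", 4)) return "UTF-32BE";
    if (startsWith("\xFF\xFE\x00\x00", 4)) return "UTF-32LE";
    if (startsWith("\xEF\xBB\xBF", 3))     return "UTF-8";
    if (startsWith("\xFE\xFF", 2))         return "UTF-16BE";
    if (startsWith("\xFF\xFE", 2))         return "UTF-16LE";

    // No BOM: the bytes alone do not say whether this is plain UTF-8,
    // windows-1250, or something else.
    return "";
}
```

A file saved as plain UTF-8 or as windows-1250 falls through to the last return, which is why those cases need either an explicit setCodec() call or a locale codec that happens to match the file.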