[Solved] Localization problems with UTF8 Turkish text data

Alicemirror · wrote on 3 Oct 2011, 07:30

Hi to all,

I have an application that manages UTF-8 json files (formerly text files) that are included as internal resources. Some of these include special characters because they contain Turkish words.

Opening the files in the Qt Creator editor or any other editor I read the files correctly and all the special characters are shown in the right way. The following points summarize the scenario:

The json file is opened as a QString datastream by a C++ class exposed to the QML document. The console.log(class.datastream) shows the file with the correct characters. This means that the localization is managed in the right way: all the Turkish city names have the right special characters.

The datastream is converted to a json object using the eval js function. What I expect is that all the names are in the json dictionary that should be appended to a list for the user choice.

function:
@
// sourceString is the QString datastream (verified as correct)
function processJsonData(sourceString) {
    return eval('(' + sourceString + ')');
}
@
As a matter of fact, this function didn't work. I went crazy trying to track down the "parsing error" message returned by the function, searching for wrong data in the json text file. But everything is correct.
Opening the json file with a binary editor I see three unprintable bytes before the first json structure character (that is, the first "{" character). These are 0xEF, 0xBB, 0xBF, part of the UTF-8 encoding of the file.
For some unknown reason the js treats these characters as wrong data. This is the reason the QString can't be processed correctly.
The proof is in the change I made from the previous function to this new one:
@
function processJsonData(sourceString) {
    console.log("\n *** Remove invalid left " + sourceString.indexOf("{") + " characters ");
    return eval('('
            + sourceString.slice(sourceString.indexOf("{"))
            + ')');
}
@
The function above works as expected and the QString is parsed. This function is applied to all the json files used by the application, and only for the UTF-8 json files that include Turkish characters does the log message report
@
*** Remove invalid left 3 characters
@
which corresponds to the three bytes mentioned above.

When the list of data is shown in the application, the Turkish characters are rendered wrongly, as graphic symbols, strange characters etc. All the other parts of the file are shown correctly.

Is there some special procedure I should follow? I have no idea how to work around this problem.

dangelog · wrote on 3 Oct 2011, 13:24

[quote]Opening the json file with a binary editor I see three unprintable bytes before the first json structure character (that is the first “{” character). These are 0xEF, 0xBB, 0xBF part of the UTF-8 encoding of the file.[/quote]

Those bytes are just the "BOM":http://en.wikipedia.org/wiki/Byte_order_mark for UTF-8 encoding. How are you getting that string inside your program? For sure there's a fromUtf8 call missing somewhere.

Alicemirror · wrote on 3 Oct 2011, 15:38

Hi, peppe. I am sure that there is something missing; my difficulty is pinpointing the problem. Sigh.

dangelog · wrote on 3 Oct 2011, 15:58

Well: how do you open and read that JSON file?

Alicemirror · wrote on 3 Oct 2011, 16:07

@
if (!m_data.isNull()) {
    if (!QFile::exists(m_data))
        m_datastream.clear();
    else {
        QFile file(m_data);
        if (!file.open(QFile::ReadOnly))
            m_datastream.clear();
        else {
            QByteArray data = file.readAll();
            QTextCodec *codec = QTextCodec::codecForLocale();
            QString str = codec->toUnicode(data);
            m_datastream.append(str);

            qDebug() << "AppData::getJson() created datastream";

        }
    }
}

@

This is the core function. Once m_data has been checked (file content, correct path etc.), it creates the QString m_datastream, which is the string exposed to the QML code: it is the QString that is parsed in the js function.

dangelog · wrote on 3 Oct 2011, 16:23

Line 10 (the QTextCodec::codecForLocale() call) is suspicious. The file seems to be UTF-8 (it has a UTF-8 BOM), so you should be using QString::fromUtf8. What if the locale codec isn't UTF-8?
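
To illustrate why the locale codec matters, here is a minimal sketch in plain JavaScript (run under Node.js or a browser, not inside QML). It assumes the locale codec is a single-byte one and uses Latin-1 as a stand-in; the helper name `decodeAsLatin1` and the sample data are hypothetical. Decoding the UTF-8 bytes one byte per character reproduces both symptoms from the first post: three junk characters before the "{" and garbled Turkish letters.

```javascript
// UTF-8 bytes of a BOM followed by a small json document containing "Şırnak".
var utf8 = new TextEncoder().encode('\uFEFF{"il":"\u015E\u0131rnak"}');

// Hypothetical stand-in for a single-byte locale codec: one byte, one character.
function decodeAsLatin1(bytes) {
    var out = "";
    for (var i = 0; i < bytes.length; i++) {
        out += String.fromCharCode(bytes[i]);
    }
    return out;
}

var wrong = decodeAsLatin1(utf8);                   // bytes misread as Latin-1
var right = new TextDecoder("utf-8").decode(utf8);  // bytes read as UTF-8

console.log(wrong.indexOf("{")); // 3: the BOM became three separate characters
console.log(right.indexOf("{")); // 0: this decoder consumes the BOM by default
console.log(JSON.parse(right).il === "\u015E\u0131rnak"); // true
```

The same bytes give a parseable string only when the decoder matches the encoding the file was saved in.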

Alicemirror · wrote on 3 Oct 2011, 16:37

This looks suspicious to me too, because I don't know this part very well. So you suggest changing
@
QString str = codec->toUnicode(data)
@
to
@
QString str = QString::fromUtf8(data)
@

? All the files are UTF-8, both the Turkish ones and the others, and the BOM is present only in the files with Turkish characters.

Alicemirror · wrote on 3 Oct 2011, 16:49

@
QByteArray data = file.readAll();
QTextCodec *codec = QTextCodec::codecForLocale();
QString str = codec->toUnicode(data);
m_datastream.append(str);
@
This piece was changed according to your suggestion:
@
QByteArray data = file.readAll();
QString str = QString::fromUtf8(data); // Instead of the following two lines
// QTextCodec *codec = QTextCodec::codecForLocale();
// QString str = codec->toUnicode(data);
m_datastream.append(str);
@
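
Once the bytes are decoded as UTF-8, the three BOM bytes correspond to the single character U+FEFF. If the decoder leaves that character in the string, it can be dropped explicitly before parsing instead of searching for the first "{". The following is only a sketch: `stripBom` is a hypothetical helper, and it assumes `JSON.parse` is available in the script engine (otherwise the original `eval` call can be kept).

```javascript
// Hypothetical helper: drop a leading byte order mark (U+FEFF) if present.
function stripBom(text) {
    return text.charCodeAt(0) === 0xFEFF ? text.slice(1) : text;
}

// Same role as the processJsonData function shown earlier in the thread.
function processJsonData(sourceString) {
    return JSON.parse(stripBom(sourceString));
}

// Sample input: a BOM followed by a json object containing "İstanbul".
var sample = "\uFEFF" + '{"city": "\u0130stanbul"}';
var parsed = processJsonData(sample);
console.log(parsed.city === "\u0130stanbul"); // true
```

Files without a BOM pass through `stripBom` unchanged, so the same function can serve all the json files.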

The general question remains whether it is possible to detect the encoding of the input file, or whether it is better to add a function parameter that decides which encoding / decoding should be used.

dangelog · wrote on 3 Oct 2011, 17:55

No. You must know the encoding in advance, or apply heuristics (like what file(1) does).
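
As a rough illustration of what such a heuristic can look like, here is a sketch in plain JavaScript (Node.js or a browser, not QML). The function name `guessEncoding` and its return labels are hypothetical, and this is far simpler than what file(1) does: it checks for a UTF-8 BOM first, then tests whether the bytes form valid UTF-8. It assumes a `TextDecoder` that supports the `fatal` option. A positive answer is still only a guess, since text in another encoding can happen to be valid UTF-8 as well (pure ASCII, for example).

```javascript
// Hypothetical heuristic: returns "utf-8-bom", "utf-8" or "unknown".
function guessEncoding(bytes) {
    // A UTF-8 BOM is the byte sequence EF BB BF at the very start.
    if (bytes.length >= 3 &&
        bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
        return "utf-8-bom";
    }
    try {
        // With fatal: true, decode() throws on any invalid UTF-8 sequence.
        new TextDecoder("utf-8", { fatal: true }).decode(bytes);
        return "utf-8";
    } catch (e) {
        return "unknown";
    }
}

console.log(guessEncoding(Uint8Array.of(0xEF, 0xBB, 0xBF, 0x7B, 0x7D)));
console.log(guessEncoding(new TextEncoder().encode('{"il":"\u015E\u0131rnak"}')));
// 0xDE is "Ş" in windows-1254, but an incomplete sequence in UTF-8.
console.log(guessEncoding(Uint8Array.of(0x7B, 0xDE, 0x7D)));
```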

Alicemirror · wrote on 3 Oct 2011, 18:38

@peppe: many thanks for the support :)