QRegExp returning incorrect position due to codec ?

Aramir

Hello

I am working on a file in UTF-8 format (mixing Latin and Japanese characters) using QFile, parsing said file with QTextStream (codec set to UTF-8) and analysing each line with QRegExp.
More precisely, I'm parsing this file and modifying a single character (using QFile::putChar()) at a specific location given to me by QRegExp.

The problem I encounter is that QRegExp::pos() returns the correct position of said character within QTextStream, but this position totally differs in QFile (because it's not working with the same "codec").

Here's a replication of the problem https://pastebin.com/L6UiKu4K

Therefore my question is how do I make sure that QFile, QRegExp, QTextStream work using the same format/codec ? It seems there used to be a QTextCodec::codecForCStrings() that would have worked ? But not anymore.

PS: I'm trying to avoid the simplistic approach of "copy everything into RAM, work there, then rewrite everything". It's kinda overkill to read and rewrite the whole file for one single character.

Christian Ehrlicher

Don't write your data to the QFile directly; either write it with QTextStream, or modify the QString and write it out at the end.
And don't use QRegExp but QRegularExpression - QRegExp has been deprecated for ages.

Aramir

Thanks for the fast answer.

I'll use QRegularExpression indeed, thanks for pointing it out.

However I fail to see how using QTextStream is going to help me there. Because I don't want to modify anything except the targeted character. I don't want to append the modified line at the end of the file, I want to keep the order of the lines intact. Therefore I still need to reposition the QTextStream at the beginning of the line before writing. And my attempts at that have been unsuccessful so far: https://pastebin.com/3g9raWD3

Christian Ehrlicher

@Aramir said in QRegExp returning incorrect position due to codec ?:

However I fail to see how using QTextStream is going to help me there. Because I don't want to modify anything except the targeted character.

This won't work since it's UTF-8 encoded - so your source and target character may not have the same UTF-8 encoded byte length in your text file.
Read your whole file into memory, modify it, write it back.

Aramir

@Christian-Ehrlicher

Sorry but I don't get it. I'm replacing a digit [0-9] by another [0-9] (I failed to mention this, my bad). According to the UTF-8 table:
0 <=> (0x)30 <=> (0b)0011 0000
9 <=> (0x)39 <=> (0b)0011 1001
So my modification shouldn't alter the size of anything ?

Christian Ehrlicher

@Aramir said in QRegExp returning incorrect position due to codec ?:

So my modification shouldn't alter the size of anything ?

No but you did not mention it before.
You only know the position in the file when you've the QString, convert it to utf-8 and then search for your digit.

As I said - reading it into a QString, modify it and write it out afterwards is much easier and less error-prone.

Aramir

@Christian-Ehrlicher

There seems to be missing words in your last answer. But from what I understand, you're saying that I've got the incorrect "file relative" position of my character, because I'm iterating over a QString in UTF-8 whereas my file is in... well there's no codec for the file..

I've got multiple issues with that:
1/ My code (still the same https://pastebin.com/3g9raWD3) works just fine on the first line of the file. => I can replace the targeted char just fine using QRegularExpressionMatch::capturedStart(int n). (or QString or whatever)
2/ I'm not using QFile directly anymore to manipulate data, I'm using QTextStream for anything so I should be using the UTF-8 format for every operation ? So it shouldn't matter what position I'm using as they're all "Line relative" and not "File relative" (that's probably the incorrect assumption) ?
3/ It's not the targeted character position that is wrong (as this one is relative to the QString of the current line I'm reading) but the "start of the line" position. Explaining why my solution works for the first line but not the others.

So the only explanation that I can come up with is that, despite using QTextStream::seek() (instead of QFile::seek()) to position myself and QTextStream::pos() to find out where I am, I should be passing a raw QFile position to QTextStream::seek(). In other words, I should use raw byte offsets to position my UTF-8 QTextStream. I'm guessing that's why the documentation says "Seeks to the position pos in the device" and not "in the stream".

This is weird but ok. Sorry if all this seems condescending, I'm not trying to be. Just trying to understand things.

Anyway, for the time being I'll probably store the whole thing in RAM and rewrite the whole file instead of one single character --", wasting cycles and the SD card's life (I'm working on an embedded device here, so it kinda matters). Or I'll go back to using raw C(pp) std::getline() and avoid the whole codec business. Or I'll read everything char by char, fill a buffer and stop whenever I've got "=erocsgninrael[" in it. Or I'll have an epiphany and figure out what's wrong with this Qt code...

Christian Ehrlicher

@Aramir said in QRegExp returning incorrect position due to codec ?:

There seems to be missing words in your last answer.

Which?

over a QString in UTF-8 whereas my file is in... well there's no codec for the file..

No, you're wrong.
You told us in the first post that your file is utf-8 encoded.
This means a character in your file can be one to four bytes long. When you decode it into a QString, you've utf-16 encoded string. So yes - the position in the QString cannot be equal to the position in your file (unless your file only contains ASCII characters).

Therefore again - instead trying to figure out where in the file you have to replace a byte simply rewrite the whole file. You're reading it at all anyway so why this complicated stuff. Anything else will just produce work for nothing: to know the file position that corresponds to a QString index, you have to convert the whole content up to that index to UTF-8, and at that point you could just as well write the converted data to the file directly.
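
To make the index mismatch concrete, here is a minimal sketch in plain C++ (no Qt, so it compiles on its own). The function name utf8OffsetForUtf16Index is made up for this illustration, and it assumes well-formed UTF-16 input. In Qt, something along the lines of line.left(index).toUtf8().size() should give the same number. The result is relative to the start of the string, so for a line read from a file you would still have to add the byte offset at which that line starts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed to encode the first
// `index` UTF-16 code units of `text`. Assumes `text` is well-formed UTF-16
// and that `index` does not fall between the two halves of a surrogate pair.
std::size_t utf8OffsetForUtf16Index(const std::u16string &text, std::size_t index)
{
    assert(index <= text.size());
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < index; ++i) {
        const char16_t unit = text[i];
        if (unit < 0x80) {
            bytes += 1;   // ASCII: one byte in UTF-8
        } else if (unit < 0x800) {
            bytes += 2;   // e.g. accented Latin letters
        } else if (unit >= 0xD800 && unit <= 0xDBFF) {
            bytes += 4;   // high surrogate: the pair becomes one 4-byte sequence
            ++i;          // skip the low surrogate that follows
        } else {
            bytes += 3;   // rest of the BMP, e.g. kana and kanji
        }
    }
    return bytes;
}
```

For the string "[kanji=犬][learningScore=5]" the digit sits at UTF-16 index 24 but at UTF-8 byte offset 26, because 犬 is one code unit in UTF-16 and three bytes in UTF-8.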

Aramir

@Christian-Ehrlicher said in QRegExp returning incorrect position due to codec ?:

Which?

"when you've the QString," I'm guessing it's "when you've got the QString"

When you decode it into a QString, you've utf-16 encoded string

Ok, I didn't know QStrings were UTF-16. I did however manage to get the correct position on the first line

[fontType=hiragana][jp=いぬ][kanji=犬][trad=chien][learningScore=5]

But maybe it was dumb luck on this specific line and all the characters' lengths are the same in UTF-8 and UTF-16.

instead trying to figure out where in the file you have to replace a byte simply rewrite the whole file. You're reading it at all anyway

I've seen many SD cards/eMMC die in e-readers (the devices I'm aiming for), therefore I'm trying to lower the number of write operations in order not to reduce their lifetime. That's why I think it's worth the trouble to figure out a solution to this problem instead of creating more and more e-waste.

SimonSchroeder

If you really want to change individual bits/bytes in a file – even when it is UTF-8 encoded – you should handle the file as QByteArray or something similar. Good thing with UTF-8 is that all bytes starting with 0 as the first bit are actual ASCII characters. As long as you only want to modify numbers, just run over the bytes and change the ones you want to change.
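
A minimal sketch of that byte-level approach, in plain C++ rather than Qt so that it is self-contained. The function name replaceDigitAfterTag is invented for this example, and it only handles the first occurrence of the tag. It relies on the tag being pure ASCII, so a byte-wise search cannot match inside a multibyte sequence. The same idea should carry over to QFile and QByteArray.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: finds the first occurrence of the ASCII `tag` in the
// file at `path` and, if the byte right after it is an ASCII digit, overwrites
// that single byte with `newDigit`. Returns true when a byte was written.
bool replaceDigitAfterTag(const std::string &path, const std::string &tag, char newDigit)
{
    assert(newDigit >= '0' && newDigit <= '9');

    std::fstream file(path, std::ios::in | std::ios::out | std::ios::binary);
    if (!file)
        return false;

    // Read the raw bytes. Nothing is decoded, so string indices are file offsets.
    const std::string bytes((std::istreambuf_iterator<char>(file)),
                            std::istreambuf_iterator<char>());

    const std::size_t tagPos = bytes.find(tag);
    if (tagPos == std::string::npos)
        return false;

    const std::size_t digitPos = tagPos + tag.size();
    if (digitPos >= bytes.size() || bytes[digitPos] < '0' || bytes[digitPos] > '9')
        return false;

    file.clear();  // reset any stream state before switching to writing
    file.seekp(static_cast<std::streamoff>(digitPos), std::ios::beg);
    file.put(newDigit);  // one byte replaced by one byte: the file size is unchanged
    file.flush();
    return static_cast<bool>(file);
}
```

Calling replaceDigitAfterTag("words.txt", "[learningScore=", '9') would change only that one byte; the file name here is just a placeholder.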

JonB

@SimonSchroeder said in QRegExp returning incorrect position due to codec ?:

Good thing with UTF-8 is that all bytes starting with 0 as the first bit are actual ASCII characters

Hi Simon. This is an immensely useful statement for those of us (i.e. me!) who have always struggled to understand encoding, UTF-8! You inspired me to go look at https://en.wikipedia.org/wiki/UTF-8, the Encoding sub-section table. May I check with you the following rules, pertinent to doing what turns out now to be a simple scan of a UTF-8 file regarded as bytes:

Any time I encounter a byte without the top bit set I can be sure this is a standalone 7-bit "ASCII" character, never part of a UTF-8 multibyte sequence, for sure.
Any time I encounter a byte with the top bit set I can be sure this is part of a multibyte sequence. I cannot be sure whether it will be 2, 3 or 4 bytes unless I want to use that table to figure it out (byte 1 will tell me whether 1, 2, or 3 follow, and they will all be binary 10xxxxxx), but all of these bytes will have the top bit set and so cannot be "ASCII" characters. If I wish to skip all multibyte UTF-8 character sequences I can ignore all bytes with the top bit set, even if one multibyte sequence immediately follows another.

The above is correct, right?

So @Aramir you can forget about QTextStream and do nothing other than QFile scan byte-by-byte changing anything/everything (0b)0011 0000 -- (0b)0011 1001 and be sure that is only addressing genuine 1-byte digits.

SimonSchroeder

@JonB said in QRegExp returning incorrect position due to codec ?:

May I check with you the following rules
[...]
The above is correct, right?

I am not a Unicode expert, though I have read a lot about it (though I am still missing some experience). From my personal understanding I can confirm everything you have stated.

With the table from Wikipedia you can even extend this a little (though it is not important as long as you only look at ASCII characters). The number of 1's set at the beginning of the first byte of a multibyte character is the length of the multibyte character: 110... -> two 1's means two-byte character, 1110... -> three 1's means three-byte character and finally 11110... -> four 1's means four-byte character. If you encounter 10... (only one leading 1) you are in the middle of a multibyte character. Look to the left of the current byte to find the beginning of the character (you don't know in advance how many bytes to look to the left). UTF-8 is certainly one of the best Unicode encodings for programmers who access files directly.
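
These rules can be sketched in a few lines of plain C++. Both function names are made up for the example, and the code does no full validation: it does not reject the few lead-byte values that the standard forbids, such as 0xC0 and 0xC1. In well-formed UTF-8 the walk to the left never needs more than three steps.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, read off the
// number of leading 1 bits. Returns 0 for a continuation byte (10xxxxxx) or
// for a byte with five or more leading 1 bits.
int utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Moves `pos` to the left until it no longer points at a continuation byte,
// i.e. until it reaches the first byte of the current character.
std::size_t utf8SequenceStart(const std::string &bytes, std::size_t pos)
{
    assert(pos < bytes.size());
    while (pos > 0 && (static_cast<unsigned char>(bytes[pos]) & 0xC0) == 0x80)
        --pos;
    return pos;
}
```

For example, 犬 is encoded as the bytes E7 8A AC: utf8SequenceLength(0xE7) gives 3, while 0x8A and 0xAC are continuation bytes and give 0.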

JonB

@SimonSchroeder
Thank you for confirming, a veil of mystery has been lifted from my eyes! UTF-8 is really simple to understand, and its sequences are easy to recognize :)

Yes, I realize the top bits in the first byte tell you how many follow, hence my "unless I want to use that table to figure it out ". Even though that is pretty easy, the OP here simply does not need to bother since he can be sure to only and always look at bytes with top bit clear for his/any-"ASCII" purposes.

Christian Ehrlicher

@Aramir said in QRegExp returning incorrect position due to codec ?:

I've seen many SD cards/eMMC die in e-readers (the devices I'm aiming for), therefore I'm trying to lower the number of write operations in order not to reduce their lifetime. That's why I think it's worth the trouble to figure out a solution to this problem instead of creating more and more e-waste.

Again a useless optimization due to a feeling. Optimize only when you can prove it's a problem and needs to be optimized.

Your SD card likely has a sector size of 512 or 1024 bytes. On top of this, your filesystem may use a block size of up to 64 KB (NTFS and ext4 use 4 KB). So even when you change a single byte in your file, which is less than the sector size or block size, it will write the whole sector/block.
It's just plain stupid and an over-complicating of things for nothing.
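
The block arithmetic can be sketched as follows, in plain C++. The names are invented for the example and the block size is just a parameter. The model is deliberately naive: it ignores flash erase blocks, journaling, metadata updates and wear levelling, all of which affect what is physically written.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive byte range of one filesystem block.
struct BlockRange {
    std::uint64_t first;
    std::uint64_t last;
};

// Block-aligned range that contains `offset`: this whole range gets written
// even if only the single byte at `offset` changes.
BlockRange blockContaining(std::uint64_t offset, std::uint64_t blockSize)
{
    assert(blockSize > 0);
    const std::uint64_t first = (offset / blockSize) * blockSize;
    return {first, first + blockSize - 1};
}

// Number of blocks touched when a file of `fileSize` bytes is rewritten
// completely, starting at offset 0.
std::uint64_t blocksForWholeFile(std::uint64_t fileSize, std::uint64_t blockSize)
{
    assert(blockSize > 0);
    return (fileSize + blockSize - 1) / blockSize;
}
```

With 4096-byte blocks, changing the byte at offset 5000 writes bytes 4096 to 8191, and rewriting a 10000-byte file touches 3 blocks. For a file that fits into one block, the two approaches cost the same in this model.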