QTextDocument::toHtml() "encoding" parameter

JonB

@JKSH
I have now come across another related problem with encoding & decoding. Again, it's not to do with Qt itself.

Before I type it all in here to ask, would you be prepared to read & answer it if I did so? I don't want to type it all in if no-one will answer, I would quite understand, but can save myself the effort. Thanks.

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

Before I type it all in here to ask, would you be prepared to read & answer it if I did so?

Happy to answer :)

Still, you could try writing a TL;DR (summarized) version first. Perhaps it could lead to the answers you want without requiring an essay from you.

JonB

@JKSH
Thanks :)

TL;DR #1:
I still hate these flipping encodings, though maybe I understand them a touch better.

TL;DR #2:
The new question: I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1, and would therefore know to decode using encoding_2. Instead, the reading accepted the character from the other encoding but displayed it as "rubbish" in its encoding.

This is depressing and is making my brain ache... !

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1

This assumption doesn't work. Error detection is easy when encoding but hard when decoding.

Examples:

If you give an ASCII encoder the £ character, it can tell you straight up, "I don't support this character!"
If you give an ISO-8859-1 decoder the bytes 0xC2 0xA3 (which is UTF-8 for £), it will do this:
1. Convert the 0xC2 byte. The decoder is happy because 0xC2 is Â in ISO-8859-1.
2. Convert the 0xA3 byte. The decoder is happy because 0xA3 is £ in ISO-8859-1.

"Â£" is perfectly valid text, so how is the decoder meant to know that the human won't like it?

In summary, an encoder knows immediately when it's given rubbish, but a decoder can't always tell.
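
The asymmetry can be reproduced in a few lines of Python. This is only a small sketch using the standard `str.encode` and `bytes.decode` methods; the variable names are made up for illustration.

```python
# An encoder can refuse a character it has no byte for.
try:
    "£".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# An ISO-8859-1 decoder maps every possible byte to some character,
# so it has no way to signal that the input was really UTF-8.
utf8_bytes = "£".encode("utf-8")            # the two bytes 0xC2 0xA3
misread = utf8_bytes.decode("iso-8859-1")   # two characters, silently wrong

print(encode_failed, utf8_bytes, repr(misread))
```

The encode step raises, while the decode step quietly returns two characters where one was intended.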

JonB

@JKSH
Thanks for this. Unfortunately, it's the way I discovered it works (not surprisingly), but it's not what I want it to do! :(

By default, Python uses the "user's preferred encoding" when opening files.
Under Linux that's utf-8, but under Windows it's that damn cp1252.
A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
My code tries to encode during write with default cp1252, this fails on that character.
I fall back to encoding with utf-8, that works, I can save, great.
Later I come to read that file back in.
Instead of it failing decoding with default cp1252, so I'd know to try utf-8, it succeeds.
But I don't get the utf-8 zero-width space character, I get a few rubbish characters instead. Which don't look good.
But I have no way of knowing I should have decoded the file with utf-8....

Yuck!
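
A minimal sketch of that sequence, assuming the save code tries cp1252 first and falls back to UTF-8 as described. The helper name `encode_with_fallback` is hypothetical and the file is replaced by an in-memory bytes object to keep the example short.

```python
text = "pasted\u200btext"  # contains U+200B (zero-width space)

def encode_with_fallback(s):
    """Try cp1252 first, then fall back to UTF-8 (the scheme described above)."""
    try:
        return s.encode("cp1252"), "cp1252"
    except UnicodeEncodeError:
        return s.encode("utf-8"), "utf-8"

data, used = encode_with_fallback(text)

# Decoding with cp1252 raises nothing: the three UTF-8 bytes of U+200B
# (0xE2 0x80 0x8B) are each a valid cp1252 character in their own right.
read_back = data.decode("cp1252")

print(used, repr(read_back))
```

The read succeeds but returns three unrelated characters in place of the original one, which is why the fallback cannot be detected on the way back in.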

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

By default, Python uses the "user's preferred encoding" when opening files.

Under Linux that's utf-8, but under Windows it's that damn cp1252.

I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
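
In Python that could look roughly like the sketch below, which passes `encoding="utf-8"` explicitly to every `open()` call instead of relying on the platform default. The file name and sample text are invented for the demonstration.

```python
import os
import tempfile

text = "price: £5, pasted\u200btext"

# Hypothetical file name, created in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # Always name the encoding; never rely on the platform default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(path, "r", encoding="utf-8", newline="") as f:
        round_tripped = f.read()

print(round_tripped == text)
```

With the encoding fixed on both sides, the same file reads back identically on Windows and Linux.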

A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.

This is a different problem from the issue of juggling encodings. 99.9% of the time, people don't actually want \u200b in their documents: https://stackoverflow.com/questions/7055600/u200b-zero-width-space-characters-in-my-js-code-where-did-they-come-from

JonB

@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that character encodes OK to utf-8 but causes fatal error to cp1252 when I try to save, which is the default Python encoding under Windows.

I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.

Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.

I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:

We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??

Thank you so much for your kind time on this!

kshegunov

@JonB said in QTextDocument::toHtml() "encoding" parameter:

If you allow me ...

We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?

cp1252 is very similar to ISO-8859-1, a.k.a. Latin1, but not exactly the same, 'cause Microsoft. In any case there are a few codepoints that differ between Latin1 and cp1252 and that are going to give you trouble if you try to decode cp1252 text directly as utf8. These include the euro sign, curly apostrophes and quotation marks, and the per-mille sign, among a few others.

Note: I talk about differences between Latin1 and cp1252 only because Latin1 is a subset of utf8, thus you can decode Latin1 text directly as if it were encoded in utf8.

If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?

I would imagine it'd either use the local 8-bit encoding, which can be cp1252 or Latin1, or it can save it as utf8. There should be a way to specify that when saving the actual file.

What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??

HTML is quite similar to XML in that regard. In XML you have the preamble (which contains the encoding of the document) that is supposed to be always encoded in latin 8bit, so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.

JonB

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

There should be a way to specify that when saving the actual file.

My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.

so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.

I feel a bit like Alice, disappearing down a rabbit hole, "Curiouser and curiouser"....

So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for open():

As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.

OK, I get further. It turns out the text-open returns an object (class io.TextIOWrapper) which does have a reconfigure method allowing encoding to be respecified. However, I am not surprised to read:

It is not possible to change the encoding or newline if some data has already been read from the stream.

which is about what I would expect. What is going on here? This is getting crazy!
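
For reference, here is a small sketch of the one case where `reconfigure` can change the encoding: before anything has been read. It assumes Python 3.7 or later (where `io.TextIOWrapper.reconfigure` exists) and uses an in-memory `io.BytesIO` in place of a real file.

```python
import io

raw = io.BytesIO("£".encode("utf-8"))

# Opened with the "wrong" encoding, as a Windows default might be.
stream = io.TextIOWrapper(raw, encoding="cp1252")

# Allowed here only because nothing has been read from the stream yet.
stream.reconfigure(encoding="utf-8")
content = stream.read()

print(repr(content))
```

This does not help once part of the document has been consumed, which is exactly the situation with a charset declaration found partway through the content.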

I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?

kshegunov

@JonB said in QTextDocument::toHtml() "encoding" parameter:

So you're saying you expect to change the decoding while you're in the middle of reading a text file?!

Of course. Text display goes like this: bytes -> encoding -> font for locale -> display of glyphs
So the encoding, as suggested by its name, is the way a character (or rather a code point) is encoded into byte(s).
For example for XML the text declaration can specify an encoding differing from the default utf8.

I don't even know how to do that from my Python

Sorry, I'm completely clueless here.

I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?

You open the file, as usual. Then you start reading the data in unencoded form (i.e. in QByteArray); then parse it as if it were containing utf8 (the default for HTML5) and whenever you parse the meta tag and get the requested encoding, you attach a QTextCodec and start converting the raw bytes to unicode (i.e. the QString's internal utf16). Thereafter it's easy as you are working with QStrings.
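
The same idea can be sketched in Python by working on bytes first and decoding afterwards. The helper name `decode_html`, the 4096-byte search window and the regular expression are all assumptions made for illustration; a wrong declaration can still raise `UnicodeDecodeError`.

```python
import re

# Hypothetical helper: the name and the 4096-byte window are assumptions.
CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)

def decode_html(raw: bytes, default: str = "utf-8") -> str:
    """Decode raw HTML bytes using the charset declared near the top, if any."""
    match = CHARSET_RE.search(raw[:4096])
    encoding = match.group(1).decode("ascii") if match else default
    try:
        return raw.decode(encoding)
    except LookupError:
        # The declared name is not a codec Python knows; fall back.
        return raw.decode(default, errors="replace")

page = '<html><head><meta charset="windows-1252"></head><body>£5</body></html>'
decoded = decode_html(page.encode("cp1252"))

print(decoded == page)
```

Because it only needs the bytes, the same function could be fed from a file opened in `"rb"` mode or from a pipe such as `sys.stdin.buffer.read()`.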

JonB

@kshegunov
Yes, given your implementation I get it.

Trouble is, for Python (remember I'm a noob there too) it doesn't seem you handle files like that. We specify the desired encoding as a parameter to file open() (read or write, on a text file), and then when characters are read/written the decode/encode is auto-performed.

[Just BTW: I do understand I could (presumably) do the whole thing from Python via PyQt using Qt's QFile and maybe QTextCodec etc. But while the app heavily uses Qt it is still a Python program and there are good reasons why it uses Python file handling for all purposes. I do not have the luxury/choice of chucking that away in favour of Qt.]

P.S.
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!

kshegunov

@JonB said in QTextDocument::toHtml() "encoding" parameter:

So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!

Check QXmlStreamReader and/or QDomDocument and see if they do you any good.

JonB

@kshegunov
Thanks, but I think they're both going to want to find (well formed) XML and parse it. My input will be HTML (and not XHTML btw), plus all I want is the resulting content as a single QString for Python, so I don't think they'll help.

Maybe I need to go see somewhere if there is a Python skeleton for doing this decoding correctly. It seems like this code is needed any time you want to open an HTML document which might specify an encoding for reading the content, which ought to be a pretty standard thing that will be wanted, I'm surprised it's so tricky?

kshegunov

The QtWebKit module might be an option, however I've never used it ... People perfected parsing bad HTML over the last 20 years ... ;)

I'm surprised it's so tricky?

I guess you were mostly shielded from this whole process, judging by your default 8bit encoding ... :)

For me it used to be cp1251, and of course it was incompatible with KOI8-R which was what linuxes mostly stuck to. And of course cp1251 is compatible with cp1252, but then the latter was slightly different from ASCII. I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8?

Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.

This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.

Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?

Yes. Example: £
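
Both answers can be checked with a short Python sketch. It walks every byte value through Python's `cp1252` codec (which leaves a handful of bytes undefined) and re-encodes the result as UTF-8, then tries to decode the cp1252 form of £ as UTF-8.

```python
encodable = 0
for value in range(256):
    try:
        char = bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a byte that Python's cp1252 codec leaves undefined
    char.encode("utf-8")  # would raise if UTF-8 could not represent it
    encodable += 1

pound_cp1252 = "£".encode("cp1252")  # the single byte 0xA3
try:
    pound_cp1252.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

print(encodable, utf8_rejected)
```

Here the UTF-8 decoder does raise an error, because a lone 0xA3 byte is not a valid UTF-8 sequence. That is the helpful direction: cp1252 data read as UTF-8 often fails loudly, whereas UTF-8 data read as cp1252 usually does not.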

If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?

If the file already contains UTF-8 specific byte sequences, then Notepad will still re-save it as UTF-8.

If the file does not contain any UTF-8 specific byte sequences, then Notepad doesn't think it's UTF-8 so it won't re-save as UTF-8.

...I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file.

1. Open the HTML file using the UTF-8 decoder.
2. Check the charset field.
3. If the charset is UTF-8, GOTO Happy Ending.
4. If the charset is not UTF-8, close the file and re-open it using the decoder for the declared charset.

This is the underlying assumption: No matter what encoding the file is in, the charset metadata is legible to a UTF-8 decoder. <subliminal_message>Isn't UTF-8 wonderful?</subliminal_message>
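
A rough Python sketch of these steps follows. The function name `read_html`, the 4096-character window and the use of `errors="replace"` on the first pass are assumptions, not part of the steps above. It also always re-opens the file, rather than skipping the second pass when the charset turns out to be UTF-8.

```python
import os
import re
import tempfile

def read_html(path, first_guess="utf-8"):
    """Hypothetical helper: sniff the declared charset, then re-open with it."""
    # Pass 1: read only the start, replacing undecodable bytes so that a
    # non-UTF-8 file cannot raise before the declaration is found.
    with open(path, encoding=first_guess, errors="replace") as f:
        head = f.read(4096)
    match = re.search(r'charset\s*=\s*["\']?\s*([\w.:-]+)', head, re.IGNORECASE)
    declared = match.group(1) if match else first_guess
    # Pass 2: re-open with the declared encoding and read everything.
    # An unknown codec name would raise LookupError here.
    with open(path, encoding=declared) as f:
        return f.read()

page = '<html><head><meta charset="windows-1252"></head><body>£5</body></html>'
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "wb") as f:
        f.write(page.encode("cp1252"))
    result = read_html(path)

print(result == page)
```

The first pass works because the declaration itself consists of ASCII characters, which read the same under UTF-8 and cp1252.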

JKSH

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

Latin1 is a subset of utf8, thus you can decode Latin1 text directly as if it were encoded in utf8.

No.

"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...

What platform do you currently use?

Linux? Use iconv: https://stackoverflow.com/questions/15422753/iconv-convert-from-cp1252-to-utf-8
Windows? Use Notepad++: https://notepad-plus-plus.org/
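
If Python is at hand, a similar conversion can be sketched with the built-in codecs. The helper name `reencode` and the sample text are invented; `cp1251` and `utf-8` are standard Python codec names.

```python
def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes from one encoding and encode them in another."""
    return data.decode(source).encode(target)

old_log = "Привет".encode("cp1251")   # stand-in for an old log file's bytes
converted = reencode(old_log, "cp1251")

print(converted.decode("utf-8"))
```

For a real file, the bytes would come from `open(path, "rb").read()` and the result would be written back in binary mode.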

kshegunov

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

What platform do you currently use?

Linux. There wasn't really a serious need to reencode them, so that's why I didn't. In any case thanks for the links!

JonB

@JKSH

Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.

This is good news for me, thanks. But do you mean it will have the same codepoint, or do you mean it will be encodable but possibly by a different one? I suspect the latter? £ is doable in both, but is not the same in either, right? [EDIT: Looks like "codepoint" is the wrong word here, I clearly mean the "input/output bytes" here.]

This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.

Would suit me down to the ground.

Thanks for confirmation of approach to correct handling of reading HTML file, similar to @kshegunov. I like that you start by opening still as text file with utf-8 decoder, as opposed to binary opener, as this fits much better with Python file handling.

I hope I get a Happy Ending. If not, your close-and-re-open approach fits best with Python, as it's not possible to change the decoder during a read as per @kshegunov's suggestion. Though it's hideously inefficient :) And btw it won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(

Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me. What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)

JKSH

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.

'Twasn't my intention ^^;; You're welcome for the links!

Yeah, same code point, different output bytes. I don't often think in terms of code points -- As a programmer, I've found it most useful to think in terms of graphemes and raw bytes. People who design encodings or fonts would be more interested in the other concepts.

P.S. In this thread, whenever I've said "character", I really meant "grapheme".
P.P.S. If anyone's interested in the nuances between "grapheme", "code point", and other concepts, see https://stackoverflow.com/a/27331885/1144539

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

do you mean it will be encodable but possibly by a different one? … £ is doable in both, but is not the same in either, right?

Right. As per the table above, £...

...cannot be encoded in ASCII
...can be encoded in ISO-8859-1 and CP1252 (Windows-1252) as 0xA3
...can be encoded in UTF-8 as 0xC2A3
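
The list above can be confirmed with a short Python sketch that encodes the same code point with each codec and records the resulting bytes as hex, or `None` where the codec refuses.

```python
pound = "£"
results = {}
for name in ("ascii", "iso-8859-1", "cp1252", "utf-8"):
    try:
        results[name] = pound.encode(name).hex()
    except UnicodeEncodeError:
        results[name] = None  # this codec has no byte for the character

print(f"U+{ord(pound):04X}", results)
```

The code point is U+00A3 throughout; only the bytes produced by each encoding differ.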

Though it's hideously inefficient :)

It is. But it's what we need to put up with if we want to support arbitrarily-encoded HTML/XML (particularly in Python)

This is less of a headache in C++ because as @kshegunov said, we can do ASCII searches in binary data.

And btw won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(

This thread has been going for a while, but it's still not clear to me: When exactly does your app need to decode stuff? You've mentioned that it needs to re-open files produced by the app itself; does it also need to open user-created files? Does it decode files/data downloaded from the network?

Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me.

Sorry, I'll pass. Precisely because it's not a one-liner ;)

There's a few ways to do it:

Quick and dirty hack: Use textual searching
Proper: Use an HTML parser. Far more inefficient than what we've already discussed.

I did a simple quick and dirty hack before: https://github.com/JKSH/QtSdkRepoChooser/blob/master/src/downloader.cpp#L58 (Man, I wasn't expecting that tool to still be alive and kicking 4 years later)

What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)

Stopping at <body> seems like a good bet.

For a quick and dirty hack, you can grab all strings from the start of the file up till "<body" (no closing bracket) and then scan this substring for "charset". If you find it, regex should be able to finish the job.
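
A minimal Python sketch of that hack, assuming the text has already been read in some ASCII-compatible way. The function name `sniff_charset` and the exact regular expression are illustrative guesses, and this is not a substitute for a real HTML parser.

```python
import re

def sniff_charset(text):
    """Return the charset named before "<body", or None if none is found."""
    body = re.search(r"<body", text, re.IGNORECASE)
    head = text[:body.start()] if body else text
    match = re.search(r'charset\s*=\s*["\']?\s*([\w.:-]+)', head, re.IGNORECASE)
    return match.group(1) if match else None

html5 = '<!DOCTYPE html><html><head><meta charset="utf-8"/></head><body>hi</body>'
html4 = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><body>'
no_decl = "<html><body>charset=koi8-r appears only in the body</body></html>"

print(sniff_charset(html5), sniff_charset(html4), sniff_charset(no_decl))
```

The pattern covers both the short HTML5 form and the older `http-equiv` form. Anything after `<body` is ignored, so a stray "charset" in the page content is not mistaken for a declaration.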