QTextDocument::toHtml() "encoding" parameter

kshegunov

@JonB said in QTextDocument::toHtml() "encoding" parameter:

So you're saying you expect to change the decoding while you're in the middle of reading a text file?!

Of course. Text display goes like this: bytes -> encoding -> font for locale -> display of glyphs
So the encoding, as suggested by its name, is the way a character (or rather a code point) is encoded into byte(s).
For example, in XML the declaration at the top of the document can specify an encoding other than the default UTF-8.

I don't even know how to do that from my Python

Sorry, I'm completely clueless here.

I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?

You open the file as usual. Then you read the data in raw, undecoded form (i.e. into a QByteArray) and parse it as if it contained UTF-8 (the default for HTML5). When you reach the meta tag and get the requested encoding, you attach a QTextCodec and start converting the raw bytes to Unicode (i.e. QString's internal UTF-16). After that it's easy, as you are working with QStrings.
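For Python readers, the same idea (read raw bytes, look for the declaration with an ASCII search, then decode) might be sketched roughly as follows. This is not Qt code: `decode_markup`, `_DECL` and `scan_limit` are hypothetical names made up for this illustration, and the 1024-byte window follows the HTML standard's rule that the encoding declaration should appear within the first 1024 bytes.

```python
import codecs
import re

# Hypothetical pattern: charset=... in a <meta> tag, or encoding="..." in an
# XML declaration. Searched in the raw bytes, so no decoding is needed yet.
_DECL = re.compile(rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9._:-]+)""",
                   re.IGNORECASE)


def decode_markup(raw: bytes, default: str = "utf-8", scan_limit: int = 1024) -> str:
    """Decode raw HTML/XML bytes using the encoding the document declares, if any."""
    encoding = default
    match = _DECL.search(raw[:scan_limit])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)   # check that Python knows this codec
            encoding = name
        except LookupError:
            pass                  # unknown label: keep the default
    # errors="replace" is an assumption: bad bytes become U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")
```

Because it works on a `bytes` object, the same function could be fed data read once from a file opened in `"rb"` mode or from a pipe. The pattern is deliberately crude and could be fooled by an unrelated `encoding=` attribute near the top of the document.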

JonB

@kshegunov
Yes, given your implementation I get it.

Trouble is, for Python (remember I'm a noob there too) it doesn't seem you handle files like that. We specify the desired encoding as a parameter to file open() (read or write, on a text file), and then when characters are read/written the decode/encode is auto-performed.
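A minimal illustration of the Python behaviour described here, assuming a throwaway file in a temporary directory (the file name `sample.txt` is made up for the example):

```python
import os
import tempfile

text = "Price: \u00a35"  # "Price: £5"

# Write the text as CP1252; open() encodes each character on the way out.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="cp1252") as f:
    f.write(text)

# Reading with the same encoding decodes the bytes back automatically.
with open(path, "r", encoding="cp1252") as f:
    round_trip = f.read()

# Opening in binary mode bypasses decoding and exposes the stored bytes.
with open(path, "rb") as f:
    raw = f.read()
```

Here `round_trip` equals the original string, while `raw` holds the single byte 0xA3 where the pound sign was written.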

[Just BTW: I do understand I could (presumably) do the whole thing from Python via PyQt using Qt's QFile and maybe QTextCodec etc. But while the app heavily uses Qt it is still a Python program and there are good reasons why it uses Python file handling for all purposes. I do not have the luxury/choice of chucking that away in favour of Qt.]

P.S.
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!

kshegunov

@JonB said in QTextDocument::toHtml() "encoding" parameter:

So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!

Check QXmlStreamReader and/or QDomDocument and see if they do you any good.

JonB

@kshegunov
Thanks, but I think they're both going to want to find (well formed) XML and parse it. My input will be HTML (and not XHTML btw), plus all I want is the resulting content as a single QString for Python, so I don't think they'll help.

Maybe I need to go and see whether there is a Python skeleton somewhere for doing this decoding correctly. This code seems to be needed any time you want to open an HTML document that might specify an encoding for its content, which ought to be a pretty standard requirement. I'm surprised it's so tricky.

kshegunov

The QtWebKit module might be an option, however I've never used it ... People perfected parsing bad HTML over the last 20 years ... ;)

I'm surprised it's so tricky?

I guess you were mostly shielded from this whole process, judging by your default 8bit encoding ... :)

For me it used to be cp1251, and of course it was incompatible with KOI8-R which was what linuxes mostly stuck to. And of course cp1251 is compatible with cp1252, but then the latter was slightly different from ASCII. I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8?

Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.

This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.

Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?

Yes. Example: £
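A short Python check of the £ example, using only the standard codecs (the variable names are just for illustration):

```python
pound = "\u00a3"  # the pound sign, code point U+00A3

latin1_bytes = pound.encode("iso-8859-1")  # one byte: A3
cp1252_bytes = pound.encode("cp1252")      # the same single byte in CP1252
utf8_bytes = pound.encode("utf-8")         # two bytes: C2 A3

# Decoding the one-byte form as UTF-8 fails: a lone 0xA3 is a continuation
# byte with no lead byte in front of it.
try:
    cp1252_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# The opposite mistake does not fail, it silently produces rubbish ("Â£").
mojibake = utf8_bytes.decode("cp1252")
```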

If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?

If the file already contains UTF-8 specific byte sequences, then Notepad will still re-save it as UTF-8.

If the file does not contain any UTF-8 specific byte sequences, then Notepad doesn't think it's UTF-8 so it won't re-save as UTF-8.

...I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file.

Open the HTML file using the UTF-8 decoder.
Check the charset field.
If the charset is UTF-8, GOTO Happy Ending.
If the charset is not UTF-8, close the file and re-open it using the decoder for the declared charset.

This is the underlying assumption: No matter what encoding the file is in, the charset metadata is legible to a UTF-8 decoder. <subliminal_message>Isn't UTF-8 wonderful?</subliminal_message>
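The four steps above might look roughly like this in Python. It is only a sketch: `read_html` and `_CHARSET` are hypothetical names, the charset check is a crude regex over the whole text, and `errors="replace"` on the first pass is an assumption added so that stray non-UTF-8 bytes do not raise before the declaration has been found.

```python
import re

_CHARSET = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def read_html(path: str) -> str:
    """Read an HTML file, re-opening it if it declares a non-UTF-8 charset."""
    # First pass: open with the UTF-8 decoder.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()

    # Check the charset field; no declaration is treated as UTF-8.
    match = _CHARSET.search(text)
    declared = match.group(1).lower() if match else "utf-8"
    if declared in ("utf-8", "utf8"):
        return text  # Happy Ending

    # Second pass: close and re-open with the decoder for the declared charset.
    try:
        with open(path, "r", encoding=declared) as f:
            return f.read()
    except LookupError:
        return text  # Python does not know this label: keep the first pass
```

The second pass can still raise `UnicodeDecodeError` if the file contains bytes that are invalid in the declared charset, so a caller may want to handle that case too.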

JKSH

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

Latin1 is a subset of utf8, thus you can decode Latin1 text directly as if it were encoded in utf8.

No.

"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...

What platform do you currently use?

Linux? Use iconv: https://stackoverflow.com/questions/15422753/iconv-convert-from-cp1252-to-utf-8
Windows? Use Notepad++: https://notepad-plus-plus.org/
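If neither tool is to hand, a rough Python equivalent of `iconv -f CP1251 -t UTF-8` could be as small as the sketch below (`transcode` is a made-up helper name, and it assumes the whole log fits in memory):

```python
def transcode(raw: bytes, source: str = "cp1251", target: str = "utf-8") -> bytes:
    """Re-encode bytes from one character encoding to another."""
    # Decode to str (Unicode code points) first, then encode to the target.
    return raw.decode(source).encode(target)
```

Reading the old log in `"rb"` mode, passing the bytes through `transcode` and writing the result to a new file in `"wb"` mode leaves the original untouched.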

kshegunov

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

What platform do you currently use?

Linux. There wasn't really a serious need to reencode them, so that's why I didn't. In any case thanks for the links!

JonB

@JKSH

Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.

This is good news for me, thanks. But do you mean it will have the same codepoint, or do you mean it will be encodable but possibly by a different one? I suspect the latter? £ is doable in both, but is not the same in either, right? [EDIT: Looks like "codepoint" is the wrong word here, I clearly mean the "input/output bytes" here.]

This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.

Would suit me down to the ground.

Thanks for confirmation of approach to correct handling of reading HTML file, similar to @kshegunov. I like that you start by opening still as text file with utf-8 decoder, as opposed to binary opener, as this fits much better with Python file handling.

I hope I get a Happy Ending. If not, your close-and-re-open approach fits best with Python, as it's not possible to change the decoder during the read as per @kshegunov's suggestion. Though it's hideously inefficient :) And btw it won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(

Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me. What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)

JKSH

@kshegunov said in QTextDocument::toHtml() "encoding" parameter:

@JKSH said in QTextDocument::toHtml() "encoding" parameter:

No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.

You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.

'Twasn't my intention ^^;; You're welcome for the links!

Yeah, same code point, different output bytes. I don't often think in terms of code points -- As a programmer, I've found it most useful to think in terms of graphemes and raw bytes. People who design encodings or fonts would be more interested in the other concepts.

P.S. In this thread, whenever I've said "character", I really meant "grapheme".
P.P.S. If anyone's interested in the nuances between "grapheme", "code point", and other concepts, see https://stackoverflow.com/a/27331885/1144539

JKSH

@JonB said in QTextDocument::toHtml() "encoding" parameter:

do you mean it will be encodable but possibly by a different one? … £ is doable in both, but is not the same in either, right?

Right. As per the table above, £...

...cannot be encoded in ASCII
...can be encoded in ISO-8859-1 and Windows-1252 (CP1252) as 0xA3
...can be encoded in UTF-8 as 0xC2A3

Though it's hideously inefficient :)

It is. But it's what we need to put up with if we want to support arbitrarily-encoded HTML/XML (particularly in Python)

This is less of a headache in C++ because as @kshegunov said, we can do ASCII searches in binary data.

And btw won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(

This thread has been going for a while, but it's still not clear to me: When exactly does your app need to decode stuff? You've mentioned that it needs to re-open files produced by the app itself; does it also need to open user-created files? Does it decode files/data downloaded from the network?

Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me.

Sorry, I'll pass. Precisely because it's not a one-liner ;)

There are a few ways to do it:

Quick and dirty hack: Use textual searching
Proper: Use an HTML parser. Far less efficient than what we've already discussed.
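As one possible reading of the "proper" route in Python, the standard library's `html.parser.HTMLParser` can pick out the `<meta>` declaration. The class and function names below are hypothetical, and the sketch assumes the text has already been decoded well enough (for example as UTF-8) for the tags to be legible.

```python
from html.parser import HTMLParser
from typing import Optional


class _CharsetFinder(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.charset: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # <meta charset="...">
            self.charset = attributes["charset"].strip()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            content = attributes.get("content") or ""
            for part in content.split(";"):
                key, _, value = part.partition("=")
                if key.strip().lower() == "charset" and value.strip():
                    self.charset = value.strip()


def find_declared_charset(html_text: str) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if there is none."""
    finder = _CharsetFinder()
    finder.feed(html_text)
    return finder.charset
```

This walks the whole document rather than stopping at the end of the head, which is part of why it costs more than a plain text search.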

I did a simple quick and dirty hack before: https://github.com/JKSH/QtSdkRepoChooser/blob/master/src/downloader.cpp#L58 (Man, I wasn't expecting that tool to still be alive and kicking 4 years later)

What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)

Stopping at <body> seems like a good bet.

For a quick and dirty hack, you can grab all strings from the start of the file up till "<body" (no closing bracket) and then scan this substring for "charset". If you find it, regex should be able to finish the job.
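A Python sketch of that quick and dirty hack, assuming the text has already been read with some ASCII-compatible decoder; `sniff_charset` and the two patterns are illustrative names, not part of any library:

```python
import re
from typing import Optional

_BODY = re.compile(r"<body", re.IGNORECASE)  # no closing bracket on purpose
_CHARSET = re.compile(r"""charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def sniff_charset(html_text: str) -> Optional[str]:
    """Return the charset named before "<body", or None if none is found."""
    # Keep everything from the start of the text up to "<body".
    body = _BODY.search(html_text)
    head = html_text[:body.start()] if body else html_text
    # Scan that substring for "charset" and let the regex finish the job.
    match = _CHARSET.search(head)
    return match.group(1).lower() if match else None
```

Cutting at `<body` keeps a stray "charset=" in the page content from being mistaken for the declaration, although a comment or script in the head could still produce a false match.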