QTextDocument::toHtml() "encoding" parameter
-
Before I start, I realise this is (presumably) an HTML/text encoding question in general rather than Qt-specific. But you experts are so helpful here, so I thought I'd give it a go!
I create a QTextDocument and populate it through its structure methods. Finally I want it converted to HTML. Because I have never really understood encodings, I just go:

html = doc.toHtml()

omitting the encoding parameter. The resulting HTML seems to start with:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
<head>
<meta name="qrichtext" content="1" />
I have tried reading, say:
- https://forum.qt.io/topic/2167/solved-qtextdocument-tohtml-utf-8-and-arabic-on-windows
- https://asmaloney.com/2012/05/code/qtextdocument-html-and-unicode-its-all-greek-to-me/
but it's still beyond double-double Greek to me :)
All seems well, till I happen to save to file, run Firefox on it, and examine the F12 Console messages, where I see:
The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.
Would some kind soul (perhaps my friend @kshegunov, who is often patient with my questions and prepared to go back & forth explaining!) care to spend a bit of time attempting to convey to me how I know what should be passed as the encoding? I can supply any further relevant information if required, e.g. how documents might be generated/edited, or where they are intended to be used; I don't know what is relevant.
P.S.
If an administrator here feels this question should actually be moved to The Lounge, please feel free to do so!
-
Hi @JonB, you're trying to juggle multiple different issues and concepts at the same time.
Let's take a step back and look at them one-by-one.
1. Independent Issue No. 1: Understanding Encodings
Because I have never really understood encodings, ….

Why do you do that?

Is this terribly complicated for me because I use Python3/Qt, that already auto-translates all the Qt functions on QString & char *, to/from its str? I think its str is already "Unicode", does that make a difference?

Let's try to un-complicate things, shall we? This should help to clarify many other things.
In simple terms:
- "Encode": Convert something meaningful into a sequence of bytes.
- "Decode": Convert a sequence of bytes into something meaningful.
1.1. Basic study: Encodings and Strings
You are correct that strings in Qt (and most modern software) are represented by "Unicode". However, "Unicode" is just a system that assigns a unique "key" to every common "character" in the world. (This is a gross oversimplification, but it is enough for this discussion).
Suppose you have a string, "HELLO". Under Unicode, the sequence of "keys" for this string is:
U+0048 U+0045 U+004C U+004C U+004F
Note 1: These "keys" are NOT bytes! To store them in a file, you need to convert these "keys" into bytes. In other words, you must encode your string.
Examples of encodings:
- UTF-8 is one encoding for Unicode.
- UTF-16 is another encoding for Unicode.
- UTF-32 (a.k.a. UCS-4) is yet another encoding for Unicode.
- SHIFT-JIS is a non-Unicode encoding, made specifically for the Japanese language.
// Initialize the string
QString str = "HELLO";

// Encode the string
QByteArray str8 = str.toUtf8();         // Array of 8-bit values
QVector<quint32> str32 = str.toUcs4();  // Array of 32-bit values

// Write bytes to file
file8.write(str8);
file32.write( reinterpret_cast<char*>(str32.data()), str32.length()*sizeof(quint32) );
Same string, 2 different encodings:
- file8 contains 5 bytes:
4845 4c4c 4f
- file32 contains 20 bytes:
4800 0000 4500 0000 4c00 0000 4c00 0000 4f00 0000
(assuming little-endian)
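The same round trip can be checked from plain Python, with no Qt involved (the byte values match the C++ example above; `'utf-32-le'` is the little-endian, BOM-less flavour of UCS-4):

```python
# Encode the same string two ways, mirroring the C++ example
s = "HELLO"

str8 = s.encode("utf-8")       # 5 bytes
str32 = s.encode("utf-32-le")  # 20 bytes, little-endian, no BOM

print(str8.hex())   # 48454c4c4f
print(str32.hex())  # 48000000450000004c0000004c0000004f000000
```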
from Python/PyQt toHtml() returns a str and write() accepts a str.

In C++ Qt, QTextDocument::toHtml() returns a QString and QFile::write() accepts a byte array. This means we must explicitly choose an encoding when we write strings to a file. This is a bit more work for us, but it does make things unambiguous.

Python lets you write strings straight to a file, which means it automatically chooses an encoding for you. Nonetheless, you can also manually specify an encoding.
Note 2: In general, think of strings as un-encoded entities. It only gets encoded (converted into a byte sequence) when you manually perform the encoding and/or when you write the string to file.
1.2. Intermediate study: Encodings and HTML Documents
Converting rich text into an HTML document for storage/transmission is a 2-step process:

QTextDocument
  -- toHtml() -->
HTML string (not encoded)
  -- QString::toUtf8() OR fileObject.write() -->
Byte sequence of the HTML document (encoded)

Note the difference between the HTML string and the HTML byte sequence.
- In C++, toUtf8() produces a byte array in-memory. This array can then be written to file, or be transmitted over the Internet.
- In Python, write() implicitly encodes your string and then writes the encoded data to file. You don't get a copy of the encoded string in memory.
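A minimal Python sketch of the two steps. (The `html_str` literal stands in for whatever `toHtml()` would return, so the sketch runs without Qt; the file name is made up.)

```python
import os
import tempfile

# Step 1 would normally be: html_str = doc.toHtml("utf-8")
# Faked here with a literal so the sketch is self-contained:
html_str = '<meta charset="utf-8"/><p>Price: £9.99</p>'

path = os.path.join(tempfile.gettempdir(), "demo.html")

# Step 2, explicit (C++-style): encode first, then write raw bytes.
data = html_str.encode("utf-8")
with open(path, "wb") as f:   # binary mode: we supply the bytes ourselves
    f.write(data)

# Step 2, implicit (Python-style): text mode encodes for you at write time.
with open(path, "w", encoding="utf-8") as f:
    f.write(html_str)

# Reading back with the matching decoder recovers the identical string.
with open(path, encoding="utf-8") as f:
    assert f.read() == html_str
```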
2. Independent Issue No. 2: The Browser's Warning
@JonB said in QTextDocument::toHtml() "encoding" parameter:
The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.
Here, the browser is simply complaining, "Oi, you didn't tell me what your HTML document's encoding is!" This warning is NOT caused by a "wrong"/"bad"/"corrupted" encoding.
As you have discovered, the way to resolve this is to declare the document's encoding, by specifying an argument to QTextDocument::toHtml().

Examples:
- If you call doc->toHtml("utf-8"), your document will be stamped with the tag <meta charset="utf-8"/>
- If you call doc->toHtml("iso-8859-1"), your document will be stamped with the tag <meta charset="iso-8859-1"/>

Note 3a: When you do this, you simply declare that "My document is encoded in UTF-8" or "My document is encoded in ISO-8859-1". However, it does NOT perform any actual encoding! (See #1.2.)

Note 3b: It is possible to declare one encoding in toHtml() but then write out a different encoding. This essentially attaches a wrong tag to your document, which will probably confuse the browser, causing it to display gibberish.

3. Independent Issue No. 3: Verifying/Sanitizing Inputs
I don't know what encoding I already have my text files or QStrings or char *'s in, because I'm saying they come from a variety of sources. That's why I don't put encodings into my output code, 'coz I don't know what I've got so I cross my fingers and tend to say nothing and hope everything works without me specifying something that's actually wrong.

This is not good.
As the programmer, it is your job to set the rules for your app's users ("The only encodings allowed are ____ and ____!"), and ideally you should have mechanisms in place to check that the inputs comply with these rules.
(Source: https://www.xkcd.com/327/)

4. To be continued?
I have to run, so I can't talk about the £ issue right now. But perhaps after reading everything in this thread, you can understand the causes and identify solutions yourself?
-
@JKSH
Thank you for Part I of your explanation!

TBH, I think I do know all the above stuff. I think my problem is that I don't know what my various sources of strings are encoded as.

In C++ Qt, QTextDocument::toHtml() returns a QString and QFile::write() accepts a byte array. This means we must explicitly choose an encoding when we write strings to a file. This is a bit more work for us, but it does make things unambiguous.

Python lets you write strings straight to a file, which means it automatically chooses an encoding for you. Nonetheless, you can also manually specify an encoding.

Yes, I thought in C++ you'd be explicit. In Python I'm thinking it will (by default, in my case) be doing a .toUtf8() when writing to file.

Now we come to the practicalities of my situation! As I've said, my potential sources for input are many & various. For example, it would be great if every user-editable input file already included a BOM or an encoding declaration. But they don't. Users can edit a file of HTML in Windows Notepad, or not.

So... given that, how would you discover, for example, whether what you have is or is not UTF-8? You can only guess, and scan all the characters, I guess? How often do you want me to do that, when?

And finally: in practice, in the past I seem to have gotten away without worrying about encodings and, I guess, everything defaulted to UTF-8 and it worked. This time, my inputs may contain £ characters often, and may have been edited under Linux and/or Windows, with or without Notepad. Is my whole life a £ mess up situation?

[Also, I think I've seen different behaviour between browsers/displayers (if user happens to view HTML file there) as to whether they complain about HTML-encoding-declaration mismatch against actual content, or how they display certain "contentious" characters when they're not strictly in the encoding. So they may complain about £ encodings or not, display it as £ or ?, etc.]
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
in practice, in the past I seem to have gotten away without worrying about encodings and, I guess, everything defaulted to UTF-8 and it worked.
It's quite possible that, in the past, you've only had to deal with ASCII characters. Many of today's common encodings (in the English-speaking world anyway) are ASCII-compatible, so there is no noticeable consequence if you accidentally mixed encodings. See #4 below for more details.
In Python I'm thinking it will (by default, in my case) be doing a .toUtf8() when writing to file.

If I'm not mistaken, Python 2 defaults to ASCII. Python 3 defaults to UTF-8.

I think the pound sterling is often char 0xA3, but that might well be Windows-1252. And I don't like that because it sounds Windows-y and how do I know that will "be available" under Linux? (And then I started looking at ISO 8859-1 (Latin-1) encoding for this....)

Linux can read Windows-1252 data no problem. You just need to make sure you use the right decoder for the right input.
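For example, decoding Windows-1252 bytes in Python works identically on any platform; what matters is the codec you name, not the operating system (the file name in the comment is just an illustration):

```python
raw = b"Price: \xa39.99"  # 0xA3 is '£' in Windows-1252 (and in ISO-8859-1)

# Works the same on Linux, Windows, and macOS:
text = raw.decode("cp1252")
print(text)  # Price: £9.99

# File version: just name the codec when opening, e.g.
#   with open("legacy.txt", encoding="cp1252") as f:
#       text = f.read()
```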
4. Independent Issue No. 4: Why £ Seems More Troublesome than $
I think I've seen £ arrive to me as one byte 0xA3, two byte 0x00A3 and two byte 0xC2A3!

That just means you've seen £ encoded in 3 different ways:

Character | ASCII             | ISO-8859-1 or Windows-1252 | UTF-8  | UTF-16
A         | 0x41              | 0x41                       | 0x41   | 0x0041
B         | 0x42              | 0x42                       | 0x42   | 0x0042
C         | 0x43              | 0x43                       | 0x43   | 0x0043
$         | 0x24              | 0x24                       | 0x24   | 0x0024
£         | (Does not exist!) | 0xA3                       | 0xC2A3 | 0x00A3

Note: $ is an ASCII character. £ is not.

[Maybe a lot of my problems stem from this £ issue. It's really unfair that there never has been any problem with $ :( ]

The table above also illustrates why you seem to only have trouble with £ but not $: When you have simple English text where money is only specified in Dollars, ASCII, ISO-8859-1 and UTF-8 all encode your text in exactly the same way. Therefore, using a UTF-8 reader to parse an ISO-8859-1 file (or vice-versa) has no noticeable consequences. However, if your text contains money in Pounds, then there will be consequences:
- A UTF-8 reader wrongly parses an ISO-8859-1 file because 0xA3 is neither a valid single-byte UTF-8 character nor a valid first byte of a multi-byte sequence, so decoding fails outright.
- An ISO-8859-1 reader wrongly parses a UTF-8 file because it treats 0xC2A3 as 2 separate characters. Both bytes are valid in ISO-8859-1 (0xC2 is 'Â'), so decoding "succeeds", but it produces the mojibake "Â£" instead of "£".
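Both failure modes are easy to reproduce in plain Python, with no Qt involved:

```python
latin1_bytes = "£9.99".encode("iso-8859-1")  # pound sign becomes b'\xa3'
utf8_bytes = "£9.99".encode("utf-8")         # pound sign becomes b'\xc2\xa3'

# UTF-8 reader on ISO-8859-1 data: 0xA3 cannot start a UTF-8 sequence
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e)

# ISO-8859-1 reader on UTF-8 data: decodes "successfully" but to mojibake,
# because every byte value maps to some character in ISO-8859-1
print(utf8_bytes.decode("iso-8859-1"))  # Â£9.99
```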
Is my whole life a £ mess up situation?

Nah. If it were, then most software developers in the UK would be in agony ;)

Anyway, it all depends on what characters your app needs to deal with. Spare a thought for the users and developers who don't even use the Latin alphabet in their daily lives -- they have way more to deal with than £.

5. Independent Issue No. 5: How to Detect the Encoding of an Input
how would you discover, for example, whether what you have is or is not UTF-8? You can only guess, and scan all the characters, I guess?
This is quite an arduous and unreliable process. See: https://en.wikipedia.org/wiki/Charset_detection
It is impractical to implement this yourself, but you could use a 3rd-party library like chardet: https://chardet.readthedocs.io/
How often do you want me to do that, when?
Well, it's better if you don't do that. Encoding detection should only be used as a last resort, not the first: https://chardet.readthedocs.io/en/latest/faq.html#yippie-screw-the-standards-ill-just-auto-detect-everything
But if you must, see below.
[Also, I think I've seen different behaviour between browsers/displayers (if user happens to view HTML file there) as to whether they complain about HTML-encoding-declaration mismatch against actual content, or how they display certain "contentious" characters when they're not strictly in the encoding. So they may complain about £ encodings or not, display it as £ or ?, etc.]

This occurs because browsers do try to handle encoding mixups, but no silver bullet exists. Each project (and you yourself) must decide: What's most important?
- Maximizing successful detections?
- Minimizing development effort?
- Minimizing the end-user's processing power and wait time?
- Keeping the end-user blissfully unaware of the problem?
6. Independent Issue No. 6: How to Practically Handle the Pound Problem
I don't know what your app does, how it's used, or who your users are. So, I can only give generic tips here. (But since your users are hand-editing HTML files, I'll assume they are somewhat tech-literate). This is what I would consider.
The ultimate sources of the £ characters in the documents are multiple: from existing HTML "template" files, from database text fields, from literal Python source code, from editing by users including from within QWebEngineView and other editors. Etc.!

Are any of these in your control? If so, I would scour them all and ensure that all these sources are encoded in UTF-8. This act alone can make lots of your problems vanish without requiring your users to do anything different.

Note: If Notepad opens a file that contains characters that are uniquely encoded in UTF-8 (for example, if it opens a file that contains 0xC2A3), then it will automatically save any modifications in UTF-8 too.

I think my problem is that I don't know what my various sources of strings are encoded as.
...
Now we come to the practicalities of my situation! As I've said, my potential sources for input are many & various.... Users can edit a file of HTML in Windows Notepad, or not.

What I said in #3 still stands. It is important to explicitly decide what inputs are supported. Put in the effort to ensure that the supported inputs are always parsed properly. Beyond that, give no guarantees for unsupported inputs (but you can still try to handle them nicely, out of goodwill).
First, I would explicitly tell users to only use UTF-8 files in the app documentation. I'd also nudge them in that direction using GUI hints -- For example, in the file selection dialog, I'd explicitly say something like "Select UTF-8 HTML File". In my documentation, I'd say that only UTF-8 is officially supported; while I'd do my best to support other encodings, I won't guarantee anything. I'd also provide instructions on how to use Windows Notepad to save files in UTF-8.
If I know that my users strongly value the ability to provide inputs in other encodings though, then I'll start preparing myself for the long journey ahead.
6.1. If you really, truly, seriously want to support multiple input encodings...
-
If your inputs contain headers/metadata about the encoding (such as HTML's <meta charset="utf-8"> tag), use it. Avoid auto-detection if at all possible.
- If the header/metadata does not match the actual content, then the input is corrupt. It is not your job to fix it; go ahead and display gibberish! Do not give your users an incentive to keep corrupted files around.
-
If the input does not support headers/metadata, then provide a drop-down menu to specify the encoding when they select the input to import.
- You can provide an "Auto-detected" option here, and you can use the chardet library here.
- Do not run chardet if your user did not ask for it.
-
@JKSH
Thank you very much for taking the effort to type up this comprehensive reply!

In truth I find I did know all/most of this. But it's nice to see confirmation in a table of the 3 different encodings I seem to have encountered for £. I hope it's also a useful reference for other potential readers.

One thing I do not totally get: you say the encoding for £ in UTF-8 is 0xC2A3. Now, call me gullible, but I thought the point of UTF-8 was that all the characters it supports are represented in, well, 8 bits! What I'm now seeing is: UTF-16 always takes 16 bits, but UTF-8 seems to try to fit in 8 bits, but can go to 16 bits if it wants to. How does the decoder know? The answer must be that the leading 0xC2 byte tells it this is a 16-bit sequence? I certainly did not know that.

Just a few points in summary about my situation:
-
But since your users are hand-editing HTML files, I'll assume they are somewhat tech-literate
On the contrary, they are below "tech-literate"! However, this does not stop them typing in whatever they fancy wherever (from whatever other applications) they fancy and demanding it work :) "A little knowledge is dangerous."
- In my commercial programming I have no problems with telling end users what they need to do. If they get it wrong, I phone their boss and have them fired ;-) However, my Qt programming is a project used by non-commercial users, so I do not have this power :( I do not have any contact with the users, there is no manual, they wouldn't look at one if there were, and they would not know or act on "UTF-8" if it hit them in the face :) There are no metadata headers. They are not subject to any control or rules. This is what I mean by the "practicalities" (rather than what one might desire) of my situation. If it worked in whatever "Paradox" software they were using years ago (to which I have no access and no interest), it's just expected to "work" now!
Can't use auto-detection from chardet as I'm Python(3)-only. Have to do any work myself.
Having said that, the more I think about it the more I believe my only problem is £ sterling. Users speak English (i.e. they are not American), so it's not like I have to deal with Cyrillic or Chinese. The only currency used is UK. So far, I have put the following code in and seem to not be experiencing any problems:

def callbackToHtmlSaveHtml(self, data: str):
    self.html = data
    # remove the contenteditable="true" when saving
    self.removeContentEditable()
    # and also replace any literal £ characters with the &pound; HTML entity, which makes things easier later on
    # this is called from QWebEngineView
    # it's not helped by the fact that each time you edit visually in QWebEngineView,
    # QWebEngineView insists on turning any "&pound;" back into a literal "£"...
    self.html = self.html.replace("£", "&pound;")

def sendHtmlEmail(html: str, msg):  # msg is an email.message.Message
    # ensure encoded as UTF-8
    # this is at least required to get embedded "£" characters through correctly
    charset = 'utf-8'
    html = html.encode(charset)
    # set Content-Type for text/html
    msg.add_header('Content-Type', 'text/html; charset=' + charset)
I think this has reminded me: I may have gotten away with "funny"/"unspecified" encodings when putting HTML into file/browser, but sending in SMTP email is more rigorous about not guessing/complaining. Like I said, my input sources are various, and so are my output destinations!
I have found that the magic of (Python 3) html.replace("£", "&pound;") seems to make my HTML documents much more acceptable as UTF-8 HTML. I'm not sure what that literal Python code is doing in terms of each of the 3 cases you mention about possible £ encodings in the source to substitute, but it seems to work in practice....
-
You're welcome!
@JonB said in QTextDocument::toHtml() "encoding" parameter:
One thing I do not totally get: you say that the encoding for £ in UTF-8 is 0xC2A3. Now, call me gullible, but I thought the point of UTF-8 was that all the characters it supports are represented in, well, 8 bits!
I can see how the name "UTF-8" gives that impression.
8 bits can only encode 256 unique "characters" though, which is woefully inadequate for Unicode's goal of covering all common languages today. Unicode can encode over 1 000 000 unique "characters": https://en.wikipedia.org/wiki/Code_point
What I'm now seeing is: UTF-16 always takes 16 bits
Yes. [EDIT: Oops, actually UTF-16 is variable-width!]
UTF-8 seems to try to fit in 8 bits, but can go to 16 bits if it wants to.
No. The only 8-bit "characters" in UTF-8 are the ASCII characters. (128 in total, but not all of them are real text "characters". Some are control codes.) This design allows a UTF-8 decoder to read ASCII input.
Other "characters" can take up to 32 bits in UTF-8.
How does the decoder know? The answer must be that the leading 0xC2 byte tells it this is a 16-bit sequence? I certainly did not know that.
Yep, you got it.
The leading byte tells the decoder how many bytes this character takes. There are a few other rules too; see the 1st table at https://en.wikipedia.org/wiki/UTF-8#Description if you're interested.
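The byte patterns under discussion are easy to inspect from the Python REPL. A quick demonstration (not code from the thread) of £ in the three encodings, plus the leading-byte rule:

```python
# "£" in the three encodings discussed in this thread:
pound = "£"
print(pound.encode("latin-1"))    # b'\xa3'      -- 1 byte
print(pound.encode("utf-8"))      # b'\xc2\xa3'  -- 2 bytes
print(pound.encode("utf-16-be"))  # b'\x00\xa3'  -- 2 bytes

# The UTF-8 leading byte announces how long the sequence is:
lead = pound.encode("utf-8")[0]           # 0xC2
assert lead & 0b1110_0000 == 0b1100_0000  # 110xxxxx => a 2-byte sequence
```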
So far, I have put the following code in and seem to not be experiencing any problems:
...
I'm not sure what that literal Python code is doing in terms of each of the 3 cases you mention about possible £ encodings in the source to substitute, but it seems to work in practice....
To test it, get a copy of inputs that you know caused issues before and see if the new code handles it. Try it with both UTF-8 and ISO-8859-1 inputs.
Your code ensures that your app outputs UTF-8 compatible data. It doesn't do anything to inputs that come into your app, however.
Can't use auto-detection from chardet as I'm Python(3)-only. Have to do any work myself.
Having said that, the more I think about it the more I believe my only problem is £ sterling. Users speak English (i.e. they are not American), so it's not like I have to deal with Cyrillic or Chinese. The only currency used is UK.
For that level of "practicality", one possible approach for checking inputs is to search for 0xC2A3 in the raw input byte stream (remember to check before any decoding occurs). If it's found, treat the input as UTF-8. If not, treat it as ISO-8859-1.
P.S. Are you implying that "English-speaking" and "American" are mutually exclusive? ;-)
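That heuristic is a one-liner in Python. `guess_encoding` is a hypothetical name of my own, and of course the trick only makes sense in this narrow situation where £ is the sole non-ASCII character expected:

```python
def guess_encoding(raw: bytes) -> str:
    # Look for the UTF-8 byte sequence for "£" before decoding anything.
    return "utf-8" if b"\xc2\xa3" in raw else "iso-8859-1"

assert guess_encoding("Price: £9.99".encode("utf-8")) == "utf-8"
assert guess_encoding("Price: £9.99".encode("iso-8859-1")) == "iso-8859-1"
```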
I have found that the magic of (Python 3) html.replace("£", "&pound;") seems to make my HTML documents much more acceptable as UTF-8 HTML.
What you've actually done is produce ASCII outputs. &, p, o, u, n, d and ; are all ASCII characters. This allows lots of decoders (both UTF-8 and non-UTF-8) to read it.
I think this has reminded me: I may have gotten away with "funny"/"unspecified" encodings when putting HTML into file/browser, but sending in SMTP email is more rigorous about not guessing/complaining. Like I said, my input sources are various, and so are my output destinations!
Outputs are easier to deal with. Just spit out UTF-8 and most self-respecting software should be happy to accept it. You can deal with "It doesn't work!" complaints on a case-by-case basis.
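The ASCII point is easy to see in a couple of lines (a sketch, not code from the thread): once £ is replaced by the entity, the string encodes as pure ASCII, and the same bytes then decode identically under UTF-8 and cp1252.

```python
html = "Total: £42"
ascii_html = html.replace("£", "&pound;")
data = ascii_html.encode("ascii")  # would raise UnicodeEncodeError without the replace
assert data.decode("utf-8") == data.decode("cp1252") == "Total: &pound;42"
```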
-
@JKSH
I have now come across another related problem with encoding & decoding. Again, it's not to do with Qt itself.
Before I type it all in here to ask, would you be prepared to read & answer it if I did so? I don't want to type it all in if no-one will answer; I would quite understand, but can save myself the effort. Thanks.
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
Before I type it all in here to ask, would you be prepared to read & answer it if I did so?
Happy to answer :)
Still, you could try writing a TL;DR (summarized) version first. Perhaps it could lead to the answers you want without requiring an essay from you.
-
@JKSH
Thanks :)
TL;DR #1: I still hate these flipping encodings, though maybe I understand a touch better.
TL;DR #2: The new question: I had assumed that if I found I could not encode (got a character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, then upon reading back and decoding I would get a similar error when I first tried encoding_1, and would therefore know to decode using encoding_2. Instead, the reading accepted the character from the other encoding but displayed it as "rubbish" in its encoding.
This is depressing and is making my brain ache... !
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1
This assumption doesn't work. Error detection is easy when encoding but hard when decoding.
Examples:
- If you give an ASCII encoder the £ character, it can tell you straight up, "I don't support this character!"
- If you give an ISO-8859-1 decoder the bytes 0xC2A3 (which is UTF-8 for £), it will do this:
  - Convert the 0xC2 byte. The decoder is happy because 0xC2 is Â in ISO-8859-1.
  - Convert the 0xA3 byte. The decoder is happy because 0xA3 is £ in ISO-8859-1.
- "Â£" is perfectly valid text, so how is the decoder meant to know that the human won't like it?
In summary, an encoder knows immediately when it's given rubbish, but a decoder can't always tell.
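Both halves of that summary can be demonstrated in a few lines of Python (a sketch of the point above, not code from the thread):

```python
# The encoder complains immediately when given something it can't represent:
try:
    "£".encode("ascii")
except UnicodeEncodeError:
    print("ASCII encoder rejects £ straight away")

# The decoder does not: the UTF-8 bytes for "£", read as ISO-8859-1,
# decode without any error into valid-but-wrong text (mojibake).
mojibake = b"\xc2\xa3".decode("iso-8859-1")
print(mojibake)  # 'Â£'
```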
-
@JKSH
Thanks for this. Unfortunately, it's the way I discovered it works (not surprisingly), but it's not what I want it to do! :(
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
- My code tries to encode during write with default cp1252, and this fails on that character.
- I fall back to encoding with utf-8, that works, I can save, great.
- Later I come to read that file back in.
- Instead of it failing decoding with default cp1252, so I'd know to try utf-8, it succeeds.
- But I don't get the utf-8 non-breaking space character, I get a couple of rubbish characters instead. Which don't look good.
- But I have no way of knowing I should have decoded the file with utf-8....
Yuck!
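This exact trap reproduces in a few lines (a sketch; \u200b is U+200B, the zero-width space):

```python
text = "A\u200bB"  # contains a zero-width space pasted in by the user

# 1. Writing with Windows' default cp1252 fails loudly:
try:
    text.encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot encode U+200B")

# 2. Falling back to UTF-8 works:
data = text.encode("utf-8")  # b'A\xe2\x80\x8bB'

# 3. ...but reading those bytes back as cp1252 does NOT fail --
#    every byte happens to be a valid cp1252 character, so you
#    silently get rubbish instead of an error:
print(data.decode("cp1252"))
```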
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
This is a different problem from the issue of juggling encodings. 99.9% of the time, people don't actually want
\u200b
in their documents: https://stackoverflow.com/questions/7055600/u200b-zero-width-space-characters-in-my-js-code-where-did-they-come-from -
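The "always UTF-8" policy is mechanical to apply in Python: pass encoding="utf-8" to every open() and the platform default never matters. A minimal sketch (the helper names are mine):

```python
import os
import tempfile

def save_text(path, text):
    # Never rely on the platform default (cp1252 on Windows, utf-8 on Linux).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def load_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "note.txt")
    save_text(p, "£ and \u200b survive the round trip")
    assert load_text(p) == "£ and \u200b survive the round trip"
```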
@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that the character encodes OK to utf-8 but causes a fatal error to cp1252, which is the default Python encoding under Windows, when I try to save.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.
I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:
-
We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
-
If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
-
What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
Thank you so much for your kind time on this!
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
If you allow me ...
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
cp1252 is very similar to ISO-8859-1, a.k.a. Latin1, but not exactly the same, 'cause Microsoft. In any case there are a few codepoints that are different between Latin1 and cp1252 that are going to give you trouble if you directly try to decode cp1252 text as utf8. These include the euro sign, slanted apostrophes and quotation marks, and the permille sign, among a few others.
Note: I talk about differences between Latin1 and cp1252 only because Latin1 maps one-to-one onto the first 256 Unicode code points, so Latin1 text converts losslessly to Unicode (although bytes above 0x7F are still written differently in utf8).
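The 0x80-0x9F byte range is exactly where the two diverge, which is easy to check from Python (a quick demonstration, not from the thread):

```python
# cp1252 gives printable characters in 0x80-0x9F; Latin1 gives C1 control codes.
assert b"\x80".decode("cp1252") == "€"         # euro sign
assert b"\x80".decode("latin-1") == "\x80"     # C1 control code, not text
assert b"\x91\x92".decode("cp1252") == "‘’"    # slanted quotation marks
assert b"\x89".decode("cp1252") == "‰"         # permille sign

# And a lone 0x80 byte is simply invalid UTF-8:
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError:
    print("0x80 on its own is not legal UTF-8")
```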
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
I would imagine it'd either use the local 8-bit encoding, which can be cp1252 or Latin1, or it can save it as utf8. There should be a way to specify that when saving the actual file.
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
HTML is quite similar to XML in that regard. In XML you have the preamble (which contains the encoding of the document) that is supposed to be always encoded in latin 8bit, so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
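In Python the same two-pass trick works: read the file in binary mode, sniff the charset attribute out of the (ASCII-compatible) meta tag, then decode the whole buffer. `read_html` and its regex are my own sketch of the idea, not a real HTML parser:

```python
import re

def read_html(raw: bytes) -> str:
    # The meta tag itself is plain ASCII, so scanning the raw bytes is safe.
    m = re.search(rb'<meta\s+charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    encoding = m.group(1).decode("ascii") if m else "utf-8"  # HTML5 default
    return raw.decode(encoding)

doc = '<html><head><meta charset="iso-8859-1"/></head><body>£</body></html>'
assert read_html(doc.encode("iso-8859-1")) == doc
```

A real parser would also handle the older http-equiv form and restart parsing on a late declaration, but the sniff-then-decode shape is the same.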
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
There should be a way to specify that when saving the actual file.
My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.
so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
I feel a bit like Alice, disappearing down a rabbit hole, "Curiouser and curiouser"....
So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for open():
As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
OK, I get further. It turns out the text-open returns an object (class io.TextIOWrapper) which does have a reconfigure method allowing the encoding to be respecified. However, I am not surprised to read:
It is not possible to change the encoding or newline if some data has already been read from the stream.
which is about what I would expect. What is going on here? This is getting crazy!
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
There should be a way to specify that when saving the actual file.
My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.
so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
I feel a bit like Alice, disappearing down a rabbit hole, "Curioser and curiouser"....
So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for
open()
:As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
OK, I get further. It turns out the text-open returns an object (
class io.TextIOWrapper
) which does have a reconfigure
method allowing encoding to be respecified. However, I am not surprised to read:

It is not possible to change the encoding or newline if some data has already been read from the stream.
which is about what I would expect. What is going on here? This is getting crazy!
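For what it's worth, that restriction only bites after the first read; calling reconfigure before reading anything is allowed. A minimal sketch (Python 3.7+, using a throwaway temp file):

```python
import os
import tempfile

# Write a UTF-8 file containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write('£'.encode('utf-8'))

# Open in text mode with the "wrong" encoding, then switch
# before any data has been read -- this is permitted.
fh = open(path, encoding='cp1252')
fh.reconfigure(encoding='utf-8')  # OK: nothing read yet
text = fh.read()
fh.close()
print(text)  # → £
```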
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/
QFile
?

@JonB said in QTextDocument::toHtml() "encoding" parameter:
So you're saying you expect to change the decoding while you're in the middle of reading a text file?!
Of course. Text display goes like this: bytes -> encoding -> font for locale -> display of glyphs
So the encoding, as suggested by its name, is the way a character (or rather a code point) is encoded into byte(s).
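To make that concrete, here is a single code point rendered as different byte sequences by different encodings (a small illustrative sketch):

```python
# U+20AC EURO SIGN: one code point, three different byte encodings.
ch = '\u20ac'                   # '€'
print(ch.encode('utf-8'))       # b'\xe2\x82\xac' (three bytes)
print(ch.encode('cp1252'))      # b'\x80'         (one byte)
print(ch.encode('utf-16-le'))   # two bytes: 0xAC 0x20
```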
For example for XML the text declaration can specify an encoding differing from the default utf8.

I don't even know how to do that from my Python
Sorry, I'm completely clueless here.
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
You open the file, as usual. Then you start reading the data in unencoded form (i.e. in
QByteArray
); then parse it as if it contained utf8 (the default for HTML5), and whenever you parse the meta tag and get the requested encoding, you attach a QTextCodec and start converting the raw bytes to unicode (i.e. the QString
's internal utf16). Thereafter it's easy, as you are working with QString
s. -
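A rough Python sketch of that flow (read raw bytes, sniff the declared charset, then decode; the regex is a simplified stand-in for real HTML parsing, and decode_html_bytes is just an illustrative name):

```python
import re

def decode_html_bytes(raw: bytes) -> str:
    """Decode HTML bytes, honouring a charset declared in a <meta> tag."""
    # Scan the head as plain ASCII just to locate the declaration;
    # the charset name itself is always ASCII-safe.
    head = raw[:2048].decode('ascii', errors='replace')
    m = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    encoding = m.group(1) if m else 'utf-8'  # HTML5 default
    # Now convert the raw bytes to Unicode with the declared codec
    # (this is where the Qt version would attach a QTextCodec).
    return raw.decode(encoding)
```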
@kshegunov
Yes, given your implementation I get it.

Trouble is, for Python (remember I'm a noob there too) it doesn't seem you handle files like that. We specify the desired encoding as a parameter to file
open()
(read or write, on a text file), and then when characters are read/written the decode/encode is auto-performed.

[Just BTW: I do understand I could (presumably) do the whole thing from Python via PyQt using Qt's
QFile
and maybe QTextCodec
etc. But while the app heavily uses Qt it is still a Python program and there are good reasons why it uses Python file handling for all purposes. I do not have the luxury/choice of chucking that away in favour of Qt.]

P.S.
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString
of the content having been appropriately decoded as best it can? I could use that! -
@JonB said in QTextDocument::toHtml() "encoding" parameter:
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!
Check QXmlStreamReader and/or QDomDocument and see if they do you any good.
-
@kshegunov
Thanks, but I think they're both going to want to find (well-formed) XML and parse it. My input will be HTML (and not XHTML btw), plus all I want is the resulting content as a single QString
for Python, so I don't think they'll help.

Maybe I need to go see somewhere if there is a Python skeleton for doing this decoding correctly. It seems like this code is needed any time you want to open an HTML document which might specify an encoding for reading the content, which ought to be a pretty standard thing that will be wanted. I'm surprised it's so tricky?
-
The
QtWebKit
module might be an option, however I've never used it ... People perfected parsing bad HTML over the last 20 years ... ;)

I'm surprised it's so tricky?

I guess you were mostly shielded from this whole process, judging by your default 8bit encoding ... :)
For me it used to be cp1251, and of course it was incompatible with KOI8-R which was what linuxes mostly stuck to. And of course cp1251 is compatible with cp1252, but then the latter was slightly different from ASCII. I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...
-
@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that character encodes OK to utf-8
but causes fatal error to cp1252
when I try to save, which is the default Python encoding under Windows.

I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.
I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with
<head> <meta charset="utf-8"/>
. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open()
method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r')
given that it might encounter a charset
specification after a while when reading the content??
Thank you so much for your kind time on this!
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8?
Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.
This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.
Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
Yes. Example:
£
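Spelled out in Python (the CP1252 byte for £ is 0xA3, which is not a valid UTF-8 sequence on its own):

```python
raw = '£'.encode('cp1252')   # b'\xa3': a single byte in CP1252
print('£'.encode('utf-8'))   # b'\xc2\xa3': UTF-8 can encode it too

try:
    raw.decode('utf-8')      # a lone 0xA3 is not valid UTF-8
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)
```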
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
If the file already contains UTF-8 specific byte sequences, then Notepad will still re-save it as UTF-8.
If the file does not contain any UTF-8 specific byte sequences, then Notepad doesn't think it's UTF-8 so it won't re-save as UTF-8.
- ...I know I could start saving with
<head> <meta charset="utf-8"/>
. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open()
method. I don't think you can change your mind about the encoding once you have opened a file.
- Open the HTML file using the UTF-8 decoder.
- Check the charset
field.
- If the charset is UTF-8, GOTO Happy Ending.
- If the charset is not UTF-8, close the file and re-open it using the decoder for the declared charset.
This is the underlying assumption: No matter what encoding the file is in, the charset metadata is legible to a UTF-8 decoder. <subliminal_message>Isn't UTF-8 wonderful?</subliminal_message>
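Those four steps might look like this in Python (a sketch under the stated assumption that the charset declaration is legible to a UTF-8 decoder; the regex is simplified and read_html is a hypothetical helper name):

```python
import re

def read_html(path):
    """Open with UTF-8 first; re-open with the declared charset if it differs."""
    # Step 1: open the HTML file using the UTF-8 decoder.
    # errors='replace' keeps this first pass from failing before
    # we have had a chance to inspect the charset declaration.
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read()
    # Step 2: check the charset field (real HTML allows more forms than this).
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', text, re.IGNORECASE)
    declared = m.group(1).lower() if m else 'utf-8'
    # Step 3: if the charset is UTF-8, we are done.
    if declared in ('utf-8', 'utf8'):
        return text
    # Step 4: otherwise re-open using the decoder for the declared charset.
    with open(path, encoding=declared) as f:
        return f.read()
```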
-