QTextDocument::toHtml() "encoding" parameter
-
Before I start, I realise this is (presumably) an HTML/text encoding question in general rather than Qt-specific. But you experts are so helpful here, so I thought I'd give it a go!
I create a QTextDocument and populate it through its structure methods. Finally I want it converted to HTML. Because I have never really understood encodings, I just go:

html = doc.toHtml()

omitting the encoding parameter. The resulting HTML seems to start with:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN" "http://www.w3.org/TR/REC-html40/strict.dtd">
<html>
<head>
<meta name="qrichtext" content="1" />
I have tried reading, say:
- https://forum.qt.io/topic/2167/solved-qtextdocument-tohtml-utf-8-and-arabic-on-windows
- https://asmaloney.com/2012/05/code/qtextdocument-html-and-unicode-its-all-greek-to-me/
but it's still beyond double-double Greek to me :)
All seems well, till I happen to save to file, run Firefox on it, and examine the F12 Console messages, where I see:
The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.
Would some kind soul (perhaps my friend @kshegunov, who is often patient with my questions and prepared to go back & forth explaining!) care to spend a bit of time attempting to convey to me how I know what should be passed as the encoding? I can supply any further relevant information if required, e.g. how documents might be generated/edited, or where they are intended to be used; I don't know what is relevant.
P.S.
If an administrator here feels this question should actually be moved to The Lounge, please feel free to do so!
-
Hi @JonB, you're trying to juggle multiple different issues and concepts at the same time.
Let's take a step back and look at them one-by-one.
1. Independent Issue No. 1: Understanding Encodings
Because I have never really understood encodings, ….

Why do you do that?

Is this terribly complicated for me because I use Python3/Qt, that already auto-translates all the Qt functions on QString & char *, to/from its str? I think its str is already "Unicode", does that make a difference?

Let's try to un-complicate things, shall we? This should help to clarify many other things.
In simple terms:
- "Encode": Convert something meaningful into a sequence of bytes.
- "Decode": Convert a sequence of bytes into something meaningful.
1.1. Basic study: Encodings and Strings
You are correct that strings in Qt (and most modern software) are represented by "Unicode". However, "Unicode" is just a system that assigns a unique "key" to every common "character" in the world. (This is a gross oversimplification, but it is enough for this discussion).
Suppose you have a string, "HELLO". Under Unicode, the sequence of "keys" for this string is:
U+0048 U+0045 U+004C U+004C U+004F
Note 1: These "keys" are NOT bytes! To store them in a file, you need to convert these "keys" into bytes. In other words, you must encode your string.
Examples of encodings:
- UTF-8 is one encoding for Unicode.
- UTF-16 is another encoding for Unicode.
- UTF-32 (a.k.a. UCS-4) is yet another encoding for Unicode.
- SHIFT-JIS is a non-Unicode encoding, made specifically for the Japanese language.
// Initialize the string
QString str = "HELLO";

// Encode the string
QByteArray str8 = str.toUtf8();         // Array of 8-bit values
QVector<quint32> str32 = str.toUcs4();  // Array of 32-bit values

// Write bytes to file
file8.write(str8);
file32.write( reinterpret_cast<char*>(str32.data()), str32.length()*sizeof(quint32) );
Same string, 2 different encodings:
- file8 contains 5 bytes:
4845 4c4c 4f
- file32 contains 20 bytes:
4800 0000 4500 0000 4c00 0000 4c00 0000 4f00 0000
(assuming little-endian)
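The same round trip can be checked from plain Python, with no Qt involved (the byte values match the C++ example above; `'utf-32-le'` is the little-endian, BOM-less flavour of UCS-4):

```python
# Encode the same string two ways, mirroring the C++ example
s = "HELLO"

str8 = s.encode("utf-8")       # 5 bytes
str32 = s.encode("utf-32-le")  # 20 bytes, little-endian, no BOM

print(str8.hex())   # 48454c4c4f
print(str32.hex())  # 48000000450000004c0000004c0000004f000000
```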
from Python/PyQt toHtml() returns a str and write() accepts a str.

In C++ Qt, QTextDocument::toHtml() returns a QString and QFile::write() accepts a byte array. This means we must explicitly choose an encoding when we write strings to a file. This is a bit more work for us, but it does make things unambiguous.

Python lets you write strings straight to a file, which means it automatically chooses an encoding for you. Nonetheless, you can also manually specify an encoding.
Note 2: In general, think of strings as un-encoded entities. It only gets encoded (converted into a byte sequence) when you manually perform the encoding and/or when you write the string to file.
1.2. Intermediate study: Encodings and HTML Documents
Converting rich text into an HTML document for storage/transmission is a 2-step process:

QTextDocument
  -- toHtml() -->
HTML string (not encoded)
  -- QString::toUtf8() OR fileObject.write() -->
Byte sequence of the HTML document (encoded)

Note the difference between the HTML string and the HTML byte sequence.
- In C++, toUtf8() produces a byte array in-memory. This array can then be written to file, or be transmitted over the Internet.
- In Python, write() implicitly encodes your string and then writes the encoded data to file. You don't get a copy of the encoded string in memory.
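A minimal Python sketch of the two steps. (The `html_str` literal stands in for whatever `toHtml()` would return, so the sketch runs without Qt; the file name is made up.)

```python
import os
import tempfile

# Step 1 would normally be: html_str = doc.toHtml("utf-8")
# Faked here with a literal so the sketch is self-contained:
html_str = '<meta charset="utf-8"/><p>Price: £9.99</p>'

path = os.path.join(tempfile.gettempdir(), "demo.html")

# Step 2, explicit (C++-style): encode first, then write raw bytes.
data = html_str.encode("utf-8")
with open(path, "wb") as f:   # binary mode: we supply the bytes ourselves
    f.write(data)

# Step 2, implicit (Python-style): text mode encodes for you at write time.
with open(path, "w", encoding="utf-8") as f:
    f.write(html_str)

# Reading back with the matching decoder recovers the identical string.
with open(path, encoding="utf-8") as f:
    assert f.read() == html_str
```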
2. Independent Issue No. 2: The Browser's Warning
@JonB said in QTextDocument::toHtml() "encoding" parameter:
The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol.
Here, the browser is simply complaining, "Oi, you didn't tell me what your HTML document's encoding is!" This warning is NOT caused by a "wrong"/"bad"/"corrupted" encoding.
As you have discovered, the way to resolve this is to declare the document's encoding, by specifying an argument to QTextDocument::toHtml().

Examples:
- If you call doc->toHtml("utf-8"), your document will be stamped with the tag <meta charset="utf-8"/>
- If you call doc->toHtml("iso-8859-1"), your document will be stamped with the tag <meta charset="iso-8859-1"/>

Note 3a: When you do this, you simply declare that "My document is encoded in UTF-8" or "My document is encoded in ISO-8859-1". However, it does NOT perform any actual encoding! (See #1.2.)

Note 3b: It is possible to declare one encoding in toHtml() but then write out a different encoding. This essentially attaches a wrong tag to your document, which will probably confuse the browser, causing it to display gibberish.

3. Independent Issue No. 3: Verifying/Sanitizing Inputs
I don't know what encoding I already have my text files or QStrings or char *'s in, because I'm saying they come from a variety of sources. That's why I don't put encodings into my output code, 'coz I don't know what I've got so I cross my fingers and tend to say nothing and hope everything works without me specifying something that's actually wrong.

This is not good.
As the programmer, it is your job to set the rules for your app's users ("The only encodings allowed are ____ and ____!"), and ideally you should have mechanisms in place to check that the inputs comply with these rules.
(Source: https://www.xkcd.com/327/)

4. To be continued?
I have to run, so I can't talk about the £ issue right now. But perhaps after reading everything in this thread, you can understand the causes and identify solutions yourself?
-
@JKSH
Thank you for Part I of your explanation!

TBH, I think I do know all the above stuff. I think my problem is that I don't know what my various sources of strings are encoded as.

In C++ Qt, QTextDocument::toHtml() returns a QString and QFile::write() accepts a byte array. This means we must explicitly choose an encoding when we write strings to a file. This is a bit more work for us, but it does make things unambiguous.

Python lets you write strings straight to a file, which means it automatically chooses an encoding for you. Nonetheless, you can also manually specify an encoding.

Yes, I thought in C++ you'd be explicit. In Python I'm thinking it will (by default, in my case) be doing a .toUtf8() when writing to file.

Now we come to the practicalities of my situation! As I've said, my potential sources for input are many & various. For example, it would be great if every user-editable input file already included a BOM or an encoding declaration. But they don't. Users can edit a file of HTML in Windows Notepad, or not.

So... given that, how would you discover, for example, whether what you have is or is not UTF-8? You can only guess, and scan all the characters, I guess? How often do you want me to do that, when?

And finally: in practice, in the past I seem to have gotten away without worrying about encodings and, I guess, everything defaulted to UTF-8 and it worked. This time, my inputs may contain £ characters often, and may have been edited under Linux and/or Windows, with or without Notepad. Is my whole life a £ mess up situation?

[Also, I think I've seen different behaviour between browsers/displayers (if user happens to view HTML file there) as to whether they complain about HTML-encoding-declaration mismatch against actual content, or how they display certain "contentious" characters when they're not strictly in the encoding. So they may complain about £ encodings or not, display it as £ or ?, etc.]
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
in practice, in the past I seem to have gotten away without worrying about encodings and, I guess, everything defaulted to UTF-8 and it worked.
It's quite possible that, in the past, you've only had to deal with ASCII characters. Many of today's common encodings (in the English-speaking world anyway) are ASCII-compatible, so there is no noticeable consequence if you accidentally mixed encodings. See #4 below for more details.
In Python I'm thinking it will (by default, in my case) be doing a .toUtf8() when writing to file.

If I'm not mistaken, Python 2 defaults to ASCII. Python 3 defaults to UTF-8.

I think the pound sterling is often char 0xA3, but that might well be Windows-1252. And I don't like that because it sounds Windows-y and how do I know that will "be available" under Linux? (And then I started looking at ISO 8859-1 (Latin-1) encoding for this....)

Linux can read Windows-1252 data no problem. You just need to make sure you use the right decoder for the right input.
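For example, decoding Windows-1252 bytes in Python works identically on any platform; what matters is the codec you name, not the operating system (the file name in the comment is just an illustration):

```python
raw = b"Price: \xa39.99"  # 0xA3 is '£' in Windows-1252 (and in ISO-8859-1)

# Works the same on Linux, Windows, and macOS:
text = raw.decode("cp1252")
print(text)  # Price: £9.99

# File version: just name the codec when opening, e.g.
#   with open("legacy.txt", encoding="cp1252") as f:
#       text = f.read()
```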
4. Independent Issue No. 4: Why £ Seems More Troublesome than $
I think I've seen £ arrive to me as one byte 0xA3, two byte 0x00A3 and two byte 0xC2A3!

That just means you've seen £ encoded in 3 different ways:

Character | ASCII             | ISO-8859-1 or Windows-1252 | UTF-8  | UTF-16
A         | 0x41              | 0x41                       | 0x41   | 0x0041
B         | 0x42              | 0x42                       | 0x42   | 0x0042
C         | 0x43              | 0x43                       | 0x43   | 0x0043
$         | 0x24              | 0x24                       | 0x24   | 0x0024
£         | (Does not exist!) | 0xA3                       | 0xC2A3 | 0x00A3

Note: $ is an ASCII character. £ is not.

[Maybe a lot of my problems stem from this £ issue. It's really unfair that there never has been any problem with $ :( ]

The table above also illustrates why you seem to only have trouble with £ but not $: When you have simple English text where money is only specified in Dollars, ASCII, ISO-8859-1 and UTF-8 all encode your text in exactly the same way. Therefore, using a UTF-8 reader to parse an ISO-8859-1 file (or vice-versa) has no noticeable consequences. However, if your text contains money in Pounds, then there will be consequences:
- A UTF-8 reader wrongly parses an ISO-8859-1 file because 0xA3 is neither a valid single-byte UTF-8 character nor a valid first byte of a multi-byte sequence, so decoding fails outright.
- An ISO-8859-1 reader wrongly parses a UTF-8 file because it treats 0xC2A3 as 2 separate characters. Both bytes are valid in ISO-8859-1 (0xC2 is 'Â'), so decoding "succeeds", but it produces the mojibake "Â£" instead of "£".
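Both failure modes are easy to reproduce in plain Python, with no Qt involved:

```python
latin1_bytes = "£9.99".encode("iso-8859-1")  # pound sign becomes b'\xa3'
utf8_bytes = "£9.99".encode("utf-8")         # pound sign becomes b'\xc2\xa3'

# UTF-8 reader on ISO-8859-1 data: 0xA3 cannot start a UTF-8 sequence
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid UTF-8:", e)

# ISO-8859-1 reader on UTF-8 data: decodes "successfully" but to mojibake,
# because every byte value maps to some character in ISO-8859-1
print(utf8_bytes.decode("iso-8859-1"))  # Â£9.99
```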
Is my whole life a £ mess up situation?

Nah. If it were, then most software developers in the UK would be in agony ;)

Anyway, it all depends on what characters your app needs to deal with. Spare a thought for the users and developers who don't even use the Latin alphabet in their daily lives -- they have way more to deal with than £.

5. Independent Issue No. 5: How to Detect the Encoding of an Input
how would you discover, for example, whether what you have is or is not UTF-8? You can only guess, and scan all the characters, I guess?
This is quite an arduous and unreliable process. See: https://en.wikipedia.org/wiki/Charset_detection
It is impractical to implement this yourself, but you could use a 3rd-party library like chardet: https://chardet.readthedocs.io/
How often do you want me to do that, when?
Well, it's better if you don't do that. Encoding detection should only be used as a last resort, not the first: https://chardet.readthedocs.io/en/latest/faq.html#yippie-screw-the-standards-ill-just-auto-detect-everything
But if you must, see below.
[Also, I think I've seen different behaviour between browsers/displayers (if user happens to view HTML file there) as to whether they complain about HTML-encoding-declaration mismatch against actual content, or how they display certain "contentious" characters when they're not strictly in the encoding. So they may complain about £ encodings or not, display it as £ or ?, etc.]

This occurs because browsers do try to handle encoding mixups, but no silver bullet exists. Each project (and you yourself) must decide: What's most important?
- Maximizing successful detections?
- Minimizing development effort?
- Minimizing the end-user's processing power and wait time?
- Keeping the end-user blissfully unaware of the problem?
6. Independent Issue No. 6: How to Practically Handle the Pound Problem
I don't know what your app does, how it's used, or who your users are. So, I can only give generic tips here. (But since your users are hand-editing HTML files, I'll assume they are somewhat tech-literate). This is what I would consider.
The ultimate sources of the £ characters in the documents are multiple: from existing HTML "template" files, from database text fields, from literal Python source code, from editing by users including from within QWebEngineView and other editors. Etc.!

Are any of these in your control? If so, I would scour them all and ensure that all these sources are encoded in UTF-8. This act alone can make lots of your problems vanish without requiring your users to do anything different.

Note: If Notepad opens a file that contains characters that are uniquely encoded in UTF-8 (for example, if it opens a file that contains 0xC2A3), then it will automatically save any modifications in UTF-8 too.

I think my problem is that I don't know what my various sources of strings are encoded as.
...
Now we come to the practicalities of my situation! As I've said, my potential sources for input are many & various.... Users can edit a file of HTML in Windows Notepad, or not.

What I said in #3 still stands. It is important to explicitly decide what inputs are supported. Put in the effort to ensure that the supported inputs are always parsed properly. Beyond that, give no guarantees for unsupported inputs (but you can still try to handle them nicely, out of goodwill).
First, I would explicitly tell users to only use UTF-8 files in the app documentation. I'd also nudge them in that direction using GUI hints -- For example, in the file selection dialog, I'd explicitly say something like "Select UTF-8 HTML File". In my documentation, I'd say that only UTF-8 is officially supported; while I'd do my best to support other encodings, I won't guarantee anything. I'd also provide instructions on how to use Windows Notepad to save files in UTF-8.
If I know that my users strongly value the ability to provide inputs in other encodings though, then I'll start preparing myself for the long journey ahead.
6.1. If you really, truly, seriously want to support multiple input encodings...
-
If your inputs contain headers/metadata about the encoding (such as HTML's <meta charset="utf-8"> tag), use it. Avoid auto-detection if at all possible.
- If the header/metadata does not match the actual content, then the input is corrupt. It is not your job to fix it; go ahead and display gibberish! Do not give your users an incentive to keep corrupted files around.
-
If the input does not support headers/metadata, then provide a drop-down menu to specify the encoding when they select the input to import.
- You can provide an "Auto-detected" option here, and you can use the chardet library here.
- Do not run chardet if your user did not ask for it.
-
@JKSH
Thank you very much for taking the effort to type up this comprehensive reply!

In truth I find I did know all/most of this. But it's nice to see confirmation in a table of the 3 different encodings I seem to have encountered for £. I hope it's also a useful reference for other potential readers.

One thing I do not totally get: you say the encoding for £ in UTF-8 is 0xC2A3. Now, call me gullible, but I thought the point of UTF-8 was that all the characters it supports are represented in, well, 8 bits! What I'm now seeing is: UTF-16 always takes 16 bits, but UTF-8 seems to try to fit in 8 bits, but can go to 16 bits if it wants to. How does the decoder know? The answer must be that the leading 0xC2 byte tells it this is a 16-bit sequence? I certainly did not know that.

Just a few points in summary about my situation:
-
But since your users are hand-editing HTML files, I'll assume they are somewhat tech-literate
On the contrary, they are below "tech-literate"! However, this does not stop them typing in whatever they fancy wherever (from whatever other applications) they fancy and demanding it work :) "A little knowledge is dangerous."
- In my commercial programming I have no problems with telling end users what they need to do. If they get it wrong, I phone their boss and have them fired ;-) However, my Qt programming is a project used by non-commercial users, so I do not have this power :( I do not have any contact with the users, there is no manual, they wouldn't look at one if there were, and they would not know or act on "UTF-8" if it hit them in the face :) There are no metadata headers. They are not subject to any control or rules. This is what I mean by the "practicalities" (rather than what one might desire) of my situation. If it worked in whatever "Paradox" software they were using years ago (to which I have no access and no interest), it's just expected to "work" now!
Can't use auto-detection from chardet as I'm Python(3)-only. Have to do any work myself.
Having said that, the more I think about it the more I believe my only problem is £ sterling. Users speak English (i.e. they are not American), so it's not like I have to deal with Cyrillic or Chinese. The only currency used is UK. So far, I have put the following code in and seem to not be experiencing any problems:

def callbackToHtmlSaveHtml(self, data: str):
    self.html = data
    # remove the contenteditable="true" when saving
    self.removeContentEditable()
    # and also replace any literal £ characters with the &pound; HTML entity, which makes things easier later on
    # this is called from QWebEngineView
    # it's not helped by the fact that each time you edit visually in QWebEngineView,
    # QWebEngineView insists on turning any "&pound;" back into a literal "£"...
    self.html = self.html.replace("£", "&pound;")

def sendHtmlEmail(html: str, msg):  # msg is an email.message.Message
    # ensure encoded as UTF-8
    # this is at least required to get embedded "£" characters through correctly
    charset = 'utf-8'
    html = html.encode(charset)
    # set Content-Type for text/html
    msg.add_header('Content-Type', 'text/html; charset=' + charset)
I think this has reminded me: I may have gotten away with "funny"/"unspecified" encodings when putting HTML into file/browser, but sending in SMTP email is more rigorous about not guessing/complaining. Like I said, my input sources are various, and so are my output destinations!
I have found that the magic of (Python 3) html.replace("£", "&pound;") seems to make my HTML documents much more acceptable as UTF-8 HTML. I'm not sure what that literal Python code is doing in terms of each of the 3 cases you mention about possible £ encodings in the source to substitute, but it seems to work in practice....
-
You're welcome!
@JonB said in QTextDocument::toHtml() "encoding" parameter:
One thing I do not totally get: you say that the encoding for £ in UTF-8 is 0xC2A3. Now, call me gullible, but I thought the point of UTF-8 was that all the characters it supports are represented in, well, 8 bits!
I can see how the name "UTF-8" gives that impression.
8 bits can only encode 256 unique "characters" though, which is woefully inadequate for Unicode's goal of covering all common languages today. Unicode can encode over 1 000 000 unique "characters": https://en.wikipedia.org/wiki/Code_point
What I'm now seeing is: UTF-16 always takes 16 bits
Yes. [EDIT: Oops, actually UTF-16 is variable-width!]
UTF-8 seems to try to fit in 8 bits, but can go to 16 bits if it wants to.
No. The only 8-bit "characters" in UTF-8 are the ASCII characters. (128 in total, but not all of them are real text "characters". Some are control codes.) This design allows a UTF-8 decoder to read ASCII input.
Other "characters" can take up to 32 bits in UTF-8.
How does the decoder know? The answer must be that the leading 0xC2 byte tells it this is a 16-bit sequence? I certainly did not know that.
Yep, you got it.
The leading byte tells the decoder how many bytes this character takes. There are a few other rules too; see the 1st table at https://en.wikipedia.org/wiki/UTF-8#Description if you're interested.
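The byte patterns under discussion are easy to inspect from the Python REPL. A quick demonstration (not code from the thread) of £ in the three encodings, plus the leading-byte rule:

```python
# "£" in the three encodings discussed in this thread:
pound = "£"
print(pound.encode("latin-1"))    # b'\xa3'      -- 1 byte
print(pound.encode("utf-8"))      # b'\xc2\xa3'  -- 2 bytes
print(pound.encode("utf-16-be"))  # b'\x00\xa3'  -- 2 bytes

# The UTF-8 leading byte announces how long the sequence is:
lead = pound.encode("utf-8")[0]           # 0xC2
assert lead & 0b1110_0000 == 0b1100_0000  # 110xxxxx => a 2-byte sequence
```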
So far, I have put the following code in and seem to not be experiencing any problems:
...
I'm not sure what that literal Python code is doing in terms of each of the 3 cases you mention about possible £ encodings in the source to substitute, but it seems to work in practice....
To test it, get a copy of inputs that you know caused issues before and see if the new code handles it. Try it with both UTF-8 and ISO-8859-1 inputs.
Your code ensures that your app outputs UTF-8 compatible data. It doesn't do anything to inputs that come into your app, however.
Can't use auto-detection from chardet as I'm Python(3)-only. Have to do any work myself.
Having said that, the more I think about it the more I believe my only problem is £ sterling. Users speak English (i.e. they are not American), so it's not like I have to deal with Cyrillic or Chinese. The only currency used is UK.
For that level of "practicality", one possible approach for checking inputs is to search for 0xC2A3 in the raw input byte stream (remember to check before any decoding occurs). If it's found, treat the input as UTF-8. If not, treat it as ISO-8859-1.
P.S. Are you implying that "English-speaking" and "American" are mutually exclusive? ;-)
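That heuristic is a one-liner in Python. `guess_encoding` is a hypothetical name of my own, and of course the trick only makes sense in this narrow situation where £ is the sole non-ASCII character expected:

```python
def guess_encoding(raw: bytes) -> str:
    # Look for the UTF-8 byte sequence for "£" before decoding anything.
    return "utf-8" if b"\xc2\xa3" in raw else "iso-8859-1"

assert guess_encoding("Price: £9.99".encode("utf-8")) == "utf-8"
assert guess_encoding("Price: £9.99".encode("iso-8859-1")) == "iso-8859-1"
```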
I have found that the magic of (Python 3) html.replace("£", "&pound;") seems to make my HTML documents much more acceptable as UTF-8 HTML.
What you've actually done is produce ASCII outputs. &, p, o, u, n, d and ; are all ASCII characters. This allows lots of decoders (both UTF-8 and non-UTF-8) to read it.
I think this has reminded me: I may have gotten away with "funny"/"unspecified" encodings when putting HTML into file/browser, but sending in SMTP email is more rigorous about not guessing/complaining. Like I said, my input sources are various, and so are my output destinations!
Outputs are easier to deal with. Just spit out UTF-8 and most self-respecting software should be happy to accept it. You can deal with "It doesn't work!" complaints on a case-by-case basis.
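The ASCII point is easy to see in a couple of lines (a sketch, not code from the thread): once £ is replaced by the entity, the string encodes as pure ASCII, and the same bytes then decode identically under UTF-8 and cp1252.

```python
html = "Total: £42"
ascii_html = html.replace("£", "&pound;")
data = ascii_html.encode("ascii")  # would raise UnicodeEncodeError without the replace
assert data.decode("utf-8") == data.decode("cp1252") == "Total: &pound;42"
```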
-
@JKSH
I have now come across another related problem with encoding & decoding. Again, it's not to do with Qt itself.
Before I type it all in here to ask, would you be prepared to read & answer it if I did so? I don't want to type it all in if no-one will answer; I would quite understand, but can save myself the effort. Thanks.
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
Before I type it all in here to ask, would you be prepared to read & answer it if I did so?
Happy to answer :)
Still, you could try writing a TL;DR (summarized) version first. Perhaps it could lead to the answers you want without requiring an essay from you.
-
@JKSH
Thanks :)
TL;DR #1: I still hate these flipping encodings, though maybe I understand a touch better.
TL;DR #2: The new question: I had assumed that if I found I could not encode (got a character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, then upon reading back and decoding I would get a similar error when I first tried encoding_1, and would therefore know to decode using encoding_2. Instead, the reading accepted the character from the other encoding but displayed it as "rubbish" in its encoding.
This is depressing and is making my brain ache... !
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1
This assumption doesn't work. Error detection is easy when encoding but hard when decoding.
Examples:
- If you give an ASCII encoder the £ character, it can tell you straight up, "I don't support this character!"
- If you give an ISO-8859-1 decoder the bytes 0xC2A3 (which is UTF-8 for £), it will do this:
  - Convert the 0xC2 byte. The decoder is happy because 0xC2 is Â in ISO-8859-1.
  - Convert the 0xA3 byte. The decoder is happy because 0xA3 is £ in ISO-8859-1.
- "Â£" is perfectly valid text, so how is the decoder meant to know that the human won't like it?
In summary, an encoder knows immediately when it's given rubbish, but a decoder can't always tell.
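Both halves of that summary can be demonstrated in a few lines of Python (a sketch of the point above, not code from the thread):

```python
# The encoder complains immediately when given something it can't represent:
try:
    "£".encode("ascii")
except UnicodeEncodeError:
    print("ASCII encoder rejects £ straight away")

# The decoder does not: the UTF-8 bytes for "£", read as ISO-8859-1,
# decode without any error into valid-but-wrong text (mojibake).
mojibake = b"\xc2\xa3".decode("iso-8859-1")
print(mojibake)  # 'Â£'
```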
-
@JKSH
Thanks for this. Unfortunately, it's the way I discovered it works (not surprisingly), but it's not what I want it to do! :(
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
- My code tries to encode during write with default cp1252, and this fails on that character.
- I fall back to encoding with utf-8, that works, I can save, great.
- Later I come to read that file back in.
- Instead of it failing decoding with default cp1252, so I'd know to try utf-8, it succeeds.
- But I don't get the utf-8 non-breaking space character, I get a couple of rubbish characters instead. Which don't look good.
- But I have no way of knowing I should have decoded the file with utf-8....
Yuck!
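This exact trap reproduces in a few lines (a sketch; \u200b is U+200B, the zero-width space):

```python
text = "A\u200bB"  # contains a zero-width space pasted in by the user

# 1. Writing with Windows' default cp1252 fails loudly:
try:
    text.encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 cannot encode U+200B")

# 2. Falling back to UTF-8 works:
data = text.encode("utf-8")  # b'A\xe2\x80\x8bB'

# 3. ...but reading those bytes back as cp1252 does NOT fail --
#    every byte happens to be a valid cp1252 character, so you
#    silently get rubbish instead of an error:
print(data.decode("cp1252"))
```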
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
This is a different problem from the issue of juggling encodings. 99.9% of the time, people don't actually want
\u200b
in their documents: https://stackoverflow.com/questions/7055600/u200b-zero-width-space-characters-in-my-js-code-where-did-they-come-from -
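The "always UTF-8" policy is mechanical to apply in Python: pass encoding="utf-8" to every open() and the platform default never matters. A minimal sketch (the helper names are mine):

```python
import os
import tempfile

def save_text(path, text):
    # Never rely on the platform default (cp1252 on Windows, utf-8 on Linux).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

def load_text(path):
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "note.txt")
    save_text(p, "£ and \u200b survive the round trip")
    assert load_text(p) == "£ and \u200b survive the round trip"
```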
@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that the character encodes OK to utf-8 but causes a fatal error to cp1252, which is the default Python encoding under Windows, when I try to save.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.
I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:
-
We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
-
If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
-
What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
Thank you so much for your kind time on this!
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
If you allow me ...
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
cp1252 is very similar to ISO-8859-1, a.k.a. Latin1, but not exactly the same, 'cause Microsoft. In any case there are a few codepoints that are different between Latin1 and cp1252 that are going to give you trouble if you directly try to decode cp1252 text as utf8. These include the euro sign, slanted apostrophes and quotation marks, and the permille sign, among a few others.
Note: I talk about differences between Latin1 and cp1252 only because Latin1 maps one-to-one onto the first 256 Unicode code points, so Latin1 text converts losslessly to Unicode (although bytes above 0x7F are still written differently in utf8).
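The 0x80-0x9F byte range is exactly where the two diverge, which is easy to check from Python (a quick demonstration, not from the thread):

```python
# cp1252 gives printable characters in 0x80-0x9F; Latin1 gives C1 control codes.
assert b"\x80".decode("cp1252") == "€"         # euro sign
assert b"\x80".decode("latin-1") == "\x80"     # C1 control code, not text
assert b"\x91\x92".decode("cp1252") == "‘’"    # slanted quotation marks
assert b"\x89".decode("cp1252") == "‰"         # permille sign

# And a lone 0x80 byte is simply invalid UTF-8:
try:
    b"\x80".decode("utf-8")
except UnicodeDecodeError:
    print("0x80 on its own is not legal UTF-8")
```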
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
I would imagine it'd either use the local 8-bit encoding, which can be cp1252 or Latin1, or it can save it as utf8. There should be a way to specify that when saving the actual file.
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
HTML is quite similar to XML in that regard. In XML you have the preamble (which contains the encoding of the document) that is supposed to be always encoded in latin 8bit, so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
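In Python the same two-pass trick works: read the file in binary mode, sniff the charset attribute out of the (ASCII-compatible) meta tag, then decode the whole buffer. `read_html` and its regex are my own sketch of the idea, not a real HTML parser:

```python
import re

def read_html(raw: bytes) -> str:
    # The meta tag itself is plain ASCII, so scanning the raw bytes is safe.
    m = re.search(rb'<meta\s+charset=["\']?([\w-]+)', raw, re.IGNORECASE)
    encoding = m.group(1).decode("ascii") if m else "utf-8"  # HTML5 default
    return raw.decode(encoding)

doc = '<html><head><meta charset="iso-8859-1"/></head><body>£</body></html>'
assert read_html(doc.encode("iso-8859-1")) == doc
```

A real parser would also handle the older http-equiv form and restart parsing on a late declaration, but the sniff-then-decode shape is the same.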
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
There should be a way to specify that when saving the actual file.
My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.
so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
I feel a bit like Alice, disappearing down a rabbit hole, "Curiouser and curiouser"....
So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for open():
As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
OK, I get further. It turns out the text-open returns an object (class io.TextIOWrapper) which does have a reconfigure method allowing the encoding to be respecified. However, I am not surprised to read:
It is not possible to change the encoding or newline if some data has already been read from the stream.
which is about what I would expect. What is going on here? This is getting crazy!
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
There should be a way to specify that when saving the actual file.
My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.
so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
I feel a bit like Alice, disappearing down a rabbit hole, "Curioser and curiouser"....
So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for
open()
:As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
OK, I get further. It turns out the text-open returns an object (
class io.TextIOWrapper
) which does have a reconfigure
method allowing encoding to be respecified. However, I am not surprised to read:

It is not possible to change the encoding or newline if some data has already been read from the stream.
which is about what I would expect. What is going on here? This is getting crazy!
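For what it's worth, that restriction only bites after the first read; calling reconfigure before reading anything is allowed. A minimal sketch (Python 3.7+, using a throwaway temp file):

```python
import os
import tempfile

# Write a UTF-8 file containing a non-ASCII character.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write('£'.encode('utf-8'))

# Open in text mode with the "wrong" encoding, then switch
# before any data has been read -- this is permitted.
fh = open(path, encoding='cp1252')
fh.reconfigure(encoding='utf-8')  # OK: nothing read yet
text = fh.read()
fh.close()
print(text)  # → £
```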
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/
QFile
?

@JonB said in QTextDocument::toHtml() "encoding" parameter:
So you're saying you expect to change the decoding while you're in the middle of reading a text file?!
Of course. Text display goes like this: bytes -> encoding -> font for locale -> display of glyphs
So the encoding, as suggested by its name, is the way a character (or rather a code point) is encoded into byte(s).
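To make that concrete, here is a single code point rendered as different byte sequences by different encodings (a small illustrative sketch):

```python
# U+20AC EURO SIGN: one code point, three different byte encodings.
ch = '\u20ac'                   # '€'
print(ch.encode('utf-8'))       # b'\xe2\x82\xac' (three bytes)
print(ch.encode('cp1252'))      # b'\x80'         (one byte)
print(ch.encode('utf-16-le'))   # two bytes: 0xAC 0x20
```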
For example for XML the text declaration can specify an encoding differing from the default utf8.

I don't even know how to do that from my Python
Sorry, I'm completely clueless here.
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
You open the file, as usual. Then you start reading the data in unencoded form (i.e. in
QByteArray
); then parse it as if it contained utf8 (the default for HTML5), and whenever you parse the meta tag and get the requested encoding, you attach a QTextCodec and start converting the raw bytes to unicode (i.e. the QString
's internal utf16). Thereafter it's easy, as you are working with QString
s. -
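A rough Python sketch of that flow (read raw bytes, sniff the declared charset, then decode; the regex is a simplified stand-in for real HTML parsing, and decode_html_bytes is just an illustrative name):

```python
import re

def decode_html_bytes(raw: bytes) -> str:
    """Decode HTML bytes, honouring a charset declared in a <meta> tag."""
    # Scan the head as plain ASCII just to locate the declaration;
    # the charset name itself is always ASCII-safe.
    head = raw[:2048].decode('ascii', errors='replace')
    m = re.search(r'charset=["\']?([\w-]+)', head, re.IGNORECASE)
    encoding = m.group(1) if m else 'utf-8'  # HTML5 default
    # Now convert the raw bytes to Unicode with the declared codec
    # (this is where the Qt version would attach a QTextCodec).
    return raw.decode(encoding)
```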
@kshegunov
Yes, given your implementation I get it.

Trouble is, for Python (remember I'm a noob there too) it doesn't seem you handle files like that. We specify the desired encoding as a parameter to file
open()
(read or write, on a text file), and then when characters are read/written the decode/encode is auto-performed.

[Just BTW: I do understand I could (presumably) do the whole thing from Python via PyQt using Qt's
QFile
and maybe QTextCodec
etc. But while the app heavily uses Qt it is still a Python program and there are good reasons why it uses Python file handling for all purposes. I do not have the luxury/choice of chucking that away in favour of Qt.]

P.S.
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString
of the content having been appropriately decoded as best it can? I could use that! -
@JonB said in QTextDocument::toHtml() "encoding" parameter:
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!
Check QXmlStreamReader and/or QDomDocument and see if they do you any good.
-
@kshegunov
Thanks, but I think they're both going to want to find (well-formed) XML and parse it. My input will be HTML (and not XHTML btw), plus all I want is the resulting content as a single QString
for Python, so I don't think they'll help.

Maybe I need to go see somewhere if there is a Python skeleton for doing this decoding correctly. It seems like this code is needed any time you want to open an HTML document which might specify an encoding for reading the content, which ought to be a pretty standard thing that will be wanted. I'm surprised it's so tricky?
-
The
QtWebKit
module might be an option, however I've never used it ... People perfected parsing bad HTML over the last 20 years ... ;)

I'm surprised it's so tricky?

I guess you were mostly shielded from this whole process, judging by your default 8bit encoding ... :)
For me it used to be cp1251, and of course it was incompatible with KOI8-R which was what linuxes mostly stuck to. And of course cp1251 is compatible with cp1252, but then the latter was slightly different from ASCII. I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...
-
@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that character encodes OK to utf-8
but causes fatal error to cp1252
when I try to save, which is the default Python encoding under Windows.

I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.
I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with
<head> <meta charset="utf-8"/>
. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open()
method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r')
given that it might encounter a charset
specification after a while when reading the content??
Thank you so much for your kind time on this!
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8?
Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.
This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.
Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
Yes. Example:
£
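Spelled out in Python (the CP1252 byte for £ is 0xA3, which is not a valid UTF-8 sequence on its own):

```python
raw = '£'.encode('cp1252')   # b'\xa3': a single byte in CP1252
print('£'.encode('utf-8'))   # b'\xc2\xa3': UTF-8 can encode it too

try:
    raw.decode('utf-8')      # a lone 0xA3 is not valid UTF-8
except UnicodeDecodeError as exc:
    print('decode failed:', exc.reason)
```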
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
If the file already contains UTF-8 specific byte sequences, then Notepad will still re-save it as UTF-8.
If the file does not contain any UTF-8 specific byte sequences, then Notepad doesn't think it's UTF-8 so it won't re-save as UTF-8.
- ...I know I could start saving with
<head> <meta charset="utf-8"/>
. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open()
method. I don't think you can change your mind about the encoding once you have opened a file.
- Open the HTML file using the UTF-8 decoder.
- Check the charset
field.
- If the charset is UTF-8, GOTO Happy Ending.
- If the charset is not UTF-8, close the file and re-open it using the decoder for the declared charset.
This is the underlying assumption: No matter what encoding the file is in, the charset metadata is legible to a UTF-8 decoder. <subliminal_message>Isn't UTF-8 wonderful?</subliminal_message>
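Those four steps might look like this in Python (a sketch under the stated assumption that the charset declaration is legible to a UTF-8 decoder; the regex is simplified and read_html is a hypothetical helper name):

```python
import re

def read_html(path):
    """Open with UTF-8 first; re-open with the declared charset if it differs."""
    # Step 1: open the HTML file using the UTF-8 decoder.
    # errors='replace' keeps this first pass from failing before
    # we have had a chance to inspect the charset declaration.
    with open(path, encoding='utf-8', errors='replace') as f:
        text = f.read()
    # Step 2: check the charset field (real HTML allows more forms than this).
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', text, re.IGNORECASE)
    declared = m.group(1).lower() if m else 'utf-8'
    # Step 3: if the charset is UTF-8, we are done.
    if declared in ('utf-8', 'utf8'):
        return text
    # Step 4: otherwise re-open using the decoder for the declared charset.
    with open(path, encoding=declared) as f:
        return f.read()
```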
-