QTextDocument::toHtml() "encoding" parameter
-
You're welcome!
@JonB said in QTextDocument::toHtml() "encoding" parameter:
One thing I do not totally get: you say the encoding for £ in UTF-8 is 0xC2A3. Now, call me gullible, but I thought the point of UTF-8 was that all the characters it supports are represented in, well, 8 bits!
I can see how the name "UTF-8" gives that impression.
8 bits can only encode 256 unique "characters" though, which is woefully inadequate for Unicode's goal of covering all common languages today. Unicode can encode over 1 000 000 unique "characters": https://en.wikipedia.org/wiki/Code_point
What I'm now seeing is: UTF-16 always takes 16 bits
Yes. [EDIT: Oops, actually UTF-16 is variable-width!]
UTF-8 seems to try to fit in 8 bits, but can go to 16 bits if it wants to.
No. The only 8-bit "characters" in UTF-8 are the ASCII characters. (128 in total, but not all of them are real text "characters". Some are control codes.) This design allows a UTF-8 decoder to read ASCII input.
Other "characters" can take up to 32 bits in UTF-8.
How does decoder know? Answer must be that the leading 0xC2 byte tells it this is a 16-bit sequence? I certainly did not know that.
Yep, you got it.
The leading byte tells the decoder how many bytes this character takes. There are a few other rules too; see the 1st table at https://en.wikipedia.org/wiki/UTF-8#Description if you're interested.
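As a rough illustration of the leading-byte rule, here is a minimal Python sketch. The function name utf8_sequence_length is made up for this example, and it only looks at the bit patterns from the table linked above; it ignores the extra validity rules (a few leading bytes that match these patterns are still forbidden in real UTF-8).

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judging by its first byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx: 2 bytes
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx: 3 bytes
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx: 4 bytes
        return 4
    # 10xxxxxx is a continuation byte, so it cannot start a sequence
    raise ValueError(f"0x{lead_byte:02X} cannot start a UTF-8 sequence")


for ch in "A£€😀":
    encoded = ch.encode("utf-8")
    # Compare the length predicted from the first byte with the real length
    print(ch, encoded.hex(), utf8_sequence_length(encoded[0]), len(encoded))
```

For £ the first byte is 0xC2, which matches the 110xxxxx pattern, so the decoder expects exactly one more byte (0xA3).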
So far, I have put the following code in and seem to not be experiencing any problems:
...
I'm not sure what that literal Python code is doing in terms of each of the 3 cases you mention about possible £ encodings in the source to substitute, but it seems to work in practice....
To test it, get a copy of inputs that you know caused issues before and see if the new code handles it. Try it with both UTF-8 and ISO-8859-1 inputs.
Your code ensures that your app outputs UTF-8 compatible data. It doesn't do anything to inputs that come into your app, however.
Can't use auto-detection from chardet as I'm Python(3)-only. Have to do any work myself.
Having said that, the more I think about it the more I believe my only problem is £ sterling. Users speak English (i.e. they are not American), so it's not like I have to deal with Cyrillic or Chinese. The only currency used is UK.
For that level of "practicality", one possible approach for checking inputs is to search for 0xC2A3 in the raw input byte stream (remember to check before any decoding occurs). If it's found, treat the input as UTF-8. If not, treat it as ISO-8859-1.
P.S. Are you implying that "English-speaking" and "American" are mutually-exclusive? ;-)
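A minimal Python sketch of that check might look like the following. The names POUND_UTF8 and decode_sterling_input are invented for this example. Note the limits of the heuristic: genuine ISO-8859-1 text that happens to contain "Â£" would be misdetected, and the UTF-8 branch can still raise if the rest of the input is not valid UTF-8.

```python
POUND_UTF8 = b"\xc2\xa3"  # the two bytes that UTF-8 uses for the pound sign


def decode_sterling_input(raw: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 by looking for the UTF-8 pound sign."""
    # The check runs on the raw bytes, before any decoding happens
    if POUND_UTF8 in raw:
        return raw.decode("utf-8")
    return raw.decode("iso-8859-1")


print(decode_sterling_input("Total: £5".encode("utf-8")))
print(decode_sterling_input("Total: £5".encode("iso-8859-1")))
```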
I have found that the magic of (Python 3) html.replace("£", "&pound;") seems to make my HTML documents much more acceptable as UTF-8 HTML.
What you've actually done is produce ASCII outputs. &, p, o, u, n, d, and ; are all ASCII characters. This allows lots of decoders (both UTF-8 and non-UTF-8) to read it.
I think this has reminded me: I may have gotten away with "funny"/"unspecified" encodings when putting HTML into file/browser, but sending in SMTP email is more rigorous about not guessing/complaining. Like I said, my input sources are various, and so are my output destinations!
Outputs are easier to deal with. Just spit out UTF-8 and most self-respecting software should be happy to accept it. You can deal with "It doesn't work!" complaints on a case-by-case basis.
@JKSH
I have now come across another related problem with encoding & decoding. Again, it's not to do with Qt itself.
Before I type it all in here to ask, would you be prepared to read & answer it if I did so? I don't want to type it all in if no-one will answer; I would quite understand, but can save myself the effort. Thanks.
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
Before I type it all in here to ask, would you be prepared to read & answer it if I did so?
Happy to answer :)
Still, you could try writing a TL;DR (summarized) version first. Perhaps it could lead to the answers you want without requiring an essay from you.
-
@JKSH
Thanks :)
TL;DR #1:
I still hate this flipping encodings, though maybe I understand a touch better.
TL;DR #2:
The new question: I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1, and would therefore know to decode using encoding_2. Instead, the reading accepted the character from the other encoding but displayed it as "rubbish" in its encoding.
This is depressing and is making my brain ache... !
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
I had assumed that if I found I could not encode (got character encoding error) with encoding_1, and had then saved back to file using encoding_2 which did work, upon reading back and decoding I would get similar error when I tried first with encoding_1
This assumption doesn't work. Error detection is easy when encoding but hard when decoding.
Examples:
- If you give an ASCII encoder the £ character, it can tell you straight up, "I don't support this character!"
- If you give an ISO-8859-1 decoder the bytes 0xC2A3 (which is UTF-8 for £), it will do this:
  - Convert the 0xC2 byte. The decoder is happy because 0xC2 is Â in ISO-8859-1.
  - Convert the 0xA3 byte. The decoder is happy because 0xA3 is £ in ISO-8859-1.
"Â£" is perfectly valid text, so how is the decoder meant to know that the human won't like it?
In summary, an encoder knows immediately when it's given rubbish, but a decoder can't always tell.
-
@JKSH
Thanks for this. Unfortunately, it's the way I discovered it works (not surprisingly), but it's not what I want it to do! :(
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
- My code tries to encode during write with default cp1252, this fails on that character.
- I fall back to encoding with utf-8, that works, I can save, great.
- Later I come to read that file back in.
- Instead of it failing decoding with default cp1252, so I'd know to try utf-8, it succeeds.
- But I don't get the utf-8 zero-width space character, I get a few rubbish characters instead. Which don't look good.
- But I have no way of knowing I should have decoded the file with utf-8....
Yuck!
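For reference, the sequence described above can be reproduced with in-memory bytes. This is a sketch only: the sample text is made up, and explicit encode/decode calls stand in for the file writes and reads.

```python
text = "clause\u200b1"  # sample text with U+200B pasted in by the user

# Writing with the Windows default fails on that character...
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# ...so the fallback stores UTF-8 bytes instead.
stored = text.encode("utf-8")

# Reading back with cp1252 raises nothing; it silently yields different text.
read_back = stored.decode("cp1252")
print(cp1252_ok, repr(read_back), read_back == text)
```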
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- By default, Python uses the "user's preferred encoding" when opening files.
- Under Linux that's utf-8, but under Windows it's that damn cp1252.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
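In Python that means passing encoding="utf-8" to every open() call instead of relying on the platform default. A small sketch, using a temporary file as a stand-in for the application's real path:

```python
import os
import tempfile

text = "Price: £5\u200b"

# A temporary directory stands in for wherever the application keeps its files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "note.txt")

    # An explicit encoding overrides the platform default (often cp1252 on Windows).
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back with the same explicit encoding.
    with open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == text)
```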
- A Windows user pastes in some text from elsewhere that happens to contain \u200b, which is a "zero-width space" character, apparently.
This is a different problem from the issue of juggling encodings. 99.9% of the time, people don't actually want \u200b in their documents: https://stackoverflow.com/questions/7055600/u200b-zero-width-space-characters-in-my-js-code-where-did-they-come-from
-
@JKSH
For the pasting, users can paste whatever they like from wherever they like and I have to record this verbatim, for legal reasons. I take your point about that particular character, but once I start stripping things out I wouldn't know where to stop. Although you say it's "different from juggling encodings", the issue is that the character encodes OK to utf-8 but causes a fatal error to cp1252 when I try to save, which is the default Python encoding under Windows.
I recommend always saving (and hence reading) in UTF-8, no matter what platform you're on.
Now that is really interesting! Clearly you can see that I'm in a mess, and am looking for some way out. A solution whereby I always knew what encoding to use unconditionally would be a huge boon. I could track down all the Python file "open"s and change them all over to UTF-8, and hopefully then be a happy bunny in all circumstances. Furthermore that would ensure interoperability with Linux (where default is already UTF-8), which would also be nice.
I need to press you a bit more on this solution, if I may, and you'd be kind enough to stick with me. Do you know any of the following:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
Thank you so much for your kind time on this!
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
If you allow me ...
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
cp1252 is very similar to ISO-8859-1, a.k.a. Latin1, but not exactly the same, 'cause Microsoft. In any case there are a few codepoints that are different between Latin1 and cp1252 that are going to give you trouble if you directly try to decode a cp1252 text through utf8. These include the euro sign, slanted apostrophes and quotation marks, and the permille sign, among a few others.
Note: I talk about differences between Latin1 and cp1252 only because the 256 Latin1 characters are the first 256 Unicode code points. Only the ASCII half (0x00-0x7F) is byte-for-byte identical in utf8, though; a byte from 0x80 upwards, whether Latin1 or cp1252, is not valid utf8 on its own.
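To see how the three encodings treat the same byte, here is a short Python check using the euro sign. It is only a sketch with arbitrary variable names, but the codecs are the standard ones that ship with Python.

```python
euro_cp1252 = "€".encode("cp1252")  # a single byte, 0x80

# ISO-8859-1 maps 0x80 to a C1 control character, not to the euro sign.
as_latin1 = euro_cp1252.decode("iso-8859-1")

# A strict UTF-8 decoder rejects the lone byte outright.
try:
    euro_cp1252.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Pure ASCII text is byte-for-byte identical in all three encodings.
ascii_same = (
    "GBP".encode("cp1252") == "GBP".encode("iso-8859-1") == "GBP".encode("utf-8")
)
print(euro_cp1252, repr(as_latin1), utf8_ok, ascii_same)
```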
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
I would imagine it'd either use the local 8-bit encoding, which can be cp1252 or Latin1, or it can save it as utf8. There should be a way to specify that when saving the actual file.
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
HTML is quite similar to XML in that regard. In XML you have the preamble (which contains the encoding of the document) that is supposed to be always encoded in latin 8bit, so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
There should be a way to specify that when saving the actual file.
My end-users are quite beyond my control. There is no chance of getting them to specify some encoding to save as if they use Notepad, they will use whatever the default is, period.
so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
I feel a bit like Alice, disappearing down a rabbit hole, "Curiouser and curiouser"....
So you're saying you expect to change the decoding while you're in the middle of reading a text file?! I don't even know how to do that from my Python: when I open a file for text-read I specify an encoding, which it uses as it reads lines. I don't think I can change that halfway along.... E.g. from the Python docs for open():
As mentioned in the Overview, Python distinguishes between binary and text I/O. Files opened in binary mode (including 'b' in the mode argument) return contents as bytes objects without any decoding. In text mode (the default, or when 't' is included in the mode argument), the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given.
OK, I get further. It turns out the text-open returns an object (class io.TextIOWrapper) which does have a reconfigure method allowing encoding to be respecified. However, I am not surprised to read:
It is not possible to change the encoding or newline if some data has already been read from the stream.
which is about what I would expect. What is going on here? This is getting crazy!
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
So you're saying you expect to change the decoding while you're in the middle of reading a text file?!
Of course. Text display goes like this: bytes -> encoding -> font for locale -> display of glyphs
So the encoding, as suggested by its name, is the way a character (or rather a code point) is encoded into byte(s).
For example for XML the text declaration can specify an encoding differing from the default utf8.
I don't even know how to do that from my Python
Sorry, I'm completely clueless here.
I know that's not your problem, so just in outline how would you expect to achieve reading such an HTML from, say, Qt/QFile?
You open the file, as usual. Then you start reading the data in unencoded form (i.e. in QByteArray); then parse it as if it were containing utf8 (the default for HTML5) and whenever you parse the meta tag and get the requested encoding, you attach a QTextCodec and start converting the raw bytes to unicode (i.e. the QString's internal utf16). Thereafter it's easy as you are working with QStrings.
-
@kshegunov
Yes, given your implementation I get it.
Trouble is, for Python (remember I'm a noob there too) it doesn't seem you handle files like that. We specify the desired encoding as a parameter to file open() (read or write, on a text file), and then when characters are read/written the decode/encode is auto-performed.
[Just BTW: I do understand I could (presumably) do the whole thing from Python via PyQt using Qt's QFile and maybe QTextCodec etc. But while the app heavily uses Qt it is still a Python program and there are good reasons why it uses Python file handling for all purposes. I do not have the luxury/choice of chucking that away in favour of Qt.]
P.S.
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
So among all Qt's various useful classes, there isn't one which will open an HTML/XML file, do whatever work to parse the correct encoding if present in the header, and then return you a final QString of the content having been appropriately decoded as best it can? I could use that!
Check QXmlStreamReader and/or QDomDocument and see if they do you any good.
-
@kshegunov
Thanks, but I think they're both going to want to find (well formed) XML and parse it. My input will be HTML (and not XHTML btw), plus all I want is the resulting content as a single QString for Python, so I don't think they'll help.
Maybe I need to go see somewhere if there is a Python skeleton for doing this decoding correctly. It seems like this code is needed any time you want to open an HTML document which might specify an encoding for reading the content, which ought to be a pretty standard thing that will be wanted. I'm surprised it's so tricky?
-
The QtWebKit module might be an option, however I've never used it ... People perfected parsing bad HTML over the last 20 years ... ;)
I'm surprised it's so tricky?
I guess you were mostly shielded from this whole process, judging by your default 8bit encoding ... :)
For me it used to be cp1251, and of course it was incompatible with KOI8-R which was what linuxes mostly stuck to. And of course cp1251 is compatible with cp1252, but then the latter was slightly different from ASCII. I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8?
Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.
This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.
Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
Yes. Example:
£
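To make the £ example concrete, here is a minimal Python sketch. The helper name `try_utf8` is made up for illustration; the only assumption is that the file was saved as CP1252, where £ is the single byte 0xA3.

```python
# "£" saved as CP1252 is the single byte 0xA3.
cp1252_bytes = "£".encode("cp1252")


def try_utf8(data: bytes) -> str:
    """Decode bytes as UTF-8, returning a marker string if decoding fails."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return f"error: {exc.reason}"


# Strict decoding raises an error, which try_utf8 turns into a marker string.
strict_result = try_utf8(cp1252_bytes)

# With errors="replace" there is no exception, but the £ is lost and
# replaced by U+FFFD (the "rubbish" case).
lenient_result = cp1252_bytes.decode("utf-8", errors="replace")
```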
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
If the file already contains UTF-8 specific byte sequences, then Notepad will still re-save it as UTF-8.
If the file does not contain any UTF-8 specific byte sequences, then Notepad doesn't think it's UTF-8 so it won't re-save as UTF-8.
- ...I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file.
- Open the HTML file using the UTF-8 decoder.
- Check the charset field.
- If the charset is UTF-8, GOTO Happy Ending.
- If the charset is not UTF-8, close the file and re-open it using the decoder for the declared charset.
This is the underlying assumption: No matter what encoding the file is in, the charset metadata is legible to a UTF-8 decoder. <subliminal_message>Isn't UTF-8 wonderful?</subliminal_message>
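The steps above could be sketched in Python roughly as follows. This is only an illustration: `read_html` and `CHARSET_RE` are hypothetical names, the regex is a rough match rather than a real HTML parser, and the first pass uses `errors="replace"` (my assumption, not part of the steps above) so that non-UTF-8 bytes elsewhere in the file don't abort the scan.

```python
import re

# Rough pattern for charset=... inside a <meta> tag (illustrative only).
CHARSET_RE = re.compile(r"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", re.IGNORECASE)


def read_html(path):
    """Read an HTML file, honouring a declared charset if one is found."""
    # Step 1: open with the UTF-8 decoder; undecodable bytes become U+FFFD.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()

    # Step 2: check the charset field.
    match = CHARSET_RE.search(text)
    declared = match.group(1).lower() if match else "utf-8"

    # Step 3: UTF-8 (or nothing declared) -> Happy Ending.
    if declared in ("utf-8", "utf8"):
        return text

    # Step 4: re-open with the declared charset.
    try:
        with open(path, "r", encoding=declared) as f:
            return f.read()
    except LookupError:
        # Python doesn't know the declared name; fall back to the first pass.
        return text
```

Note that this reads the whole file twice in the non-UTF-8 case, which is the cost of staying with text-mode `open()`.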
-
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
If you allow me ...
- We know there are certainly characters which encode to UTF-8 but not to CP1252. Are there any (not too obscure! I only care about English!) characters which encode to CP1252 but not to UTF-8? Are there any characters which decode correctly from (a file saved in) CP1252 but "generate rubbish"/error if decoded via UTF-8?
cp1252 is very similar to ISO-8859-1, a.k.a. Latin1, but not exactly the same, 'cause Microsoft. In any case there are a few codepoints that are different between Latin1 and cp1252 that are going to give you trouble if you directly try to decode a cp1252 text through utf8. These include the euro sign, slanted apostrophes and quotation marks, and the per mille sign, among a few others.
Note: I talk about differences between Latin1 and cp1252 only because Latin1 is a subset of utf8, thus you can decode Latin1 text directly as if it were encoded in utf8.
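For anyone who wants to see the cp1252/Latin1 differences mentioned above, a small Python sketch follows. The byte values are the standard CP1252 assignments for the euro sign, per mille sign, and curly quotes; the variable names are just for illustration.

```python
# Bytes in the 0x80-0x9F range are where CP1252 and ISO-8859-1 disagree.
samples = bytes([0x80, 0x89, 0x91, 0x92, 0x93, 0x94])

# CP1252 maps these bytes to printable punctuation.
as_cp1252 = samples.decode("cp1252")

# ISO-8859-1 maps the same bytes to C1 control characters.
as_latin1 = samples.decode("latin-1")

# None of these bytes is valid UTF-8 on its own, so strict decoding fails.
try:
    samples.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```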
- If I save this text file in UTF-8, and user goes into stinky Notepad on it under Windows and saves back, does Notepad save as CP1252?
I would imagine it'd either use the local 8-bit encoding, which can be cp1252 or Latin1, or it can save it as utf8. There should be a way to specify that when saving the actual file.
- What goes on with encoding declarations in HTML? Some, but not all, of these files are HTML. I know I could start saving with <head> <meta charset="utf-8"/>. What I don't get is: this is a declaration inside the text file. From Python I must pass an encoding to the file open() method. I don't think you can change your mind about the encoding once you have opened a file. So how does this work when reading the HTML file --- which encoding should I pass to open('r') given that it might encounter a charset specification after a while when reading the content??
HTML is quite similar to XML in that regard. In XML you have the preamble (which contains the encoding of the document) that is supposed to be always encoded in latin 8bit, so whatever is used in the rest of the document can be read by the parser. For HTML this is the meta-tag, so the parser is supposed to switch to the indicated encoding whenever it reaches the charset meta tag.
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
Latin1 is a subset of utf8, thus you can decode Latin1 text directly as if it were encoded in utf8.
No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
I have 15 year old IRC logs that are completely inaccessible to me as I don't currently have anything that can read windows-1251 ...
What platform do you currently use?
- Linux? Use iconv: https://stackoverflow.com/questions/15422753/iconv-convert-from-cp1252-to-utf-8
- Windows? Use Notepad++: https://notepad-plus-plus.org/
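If neither tool is handy, the same kind of conversion can be sketched in Python. This is a minimal illustration, not a replacement for iconv: `reencode` is a hypothetical helper name, and `cp1251` is Python's codec name for windows-1251.

```python
def reencode(src_path, dst_path, src_encoding="cp1251", dst_encoding="utf-8"):
    """Copy a text file, decoding with src_encoding and writing dst_encoding."""
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
```

Calling `reencode("old.log", "new.log")` (both file names are placeholders) would write a UTF-8 copy and leave the original file unchanged.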
-
@JKSH said in QTextDocument::toHtml() "encoding" parameter:
No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.
You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.
@JKSH said in QTextDocument::toHtml() "encoding" parameter:
What platform do you currently use?
Linux. There wasn't really a serious need to reencode them, so that's why I didn't. In any case thanks for the links!
-
Nope! If a character can be encoded in CP1252, then it can also be encoded in UTF-8.
This is good news for me, thanks. But do you mean it will have the same codepoint, or do you mean it will be encodable but possibly by a different one? I suspect the latter? £ is doable in both, but is not the same in either, right? [EDIT: Looks like "codepoint" is the wrong word here, I clearly mean the "input/output bytes" here.]
This is one reason why folks are pushing for UTF-8 to be the default, the One Encoding to Rule Them All.
Would suit me down to the ground.
Thanks for confirmation of approach to correct handling of reading HTML file, similar to @kshegunov. I like that you start by opening still as text file with utf-8 decoder, as opposed to binary opener, as this fits much better with Python file handling.
I hope I get a Happy Ending. If not, your close-and-re-open approach fits best with Python, as it's not possible to change decoder during read as per @kshegunov's suggestion. Though it's hideously inefficient :) And btw it won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(
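One possible way around both the re-read cost and the pipe limitation is to read the raw bytes once and decode them in memory. The sketch below assumes the charset declaration sits in the first 4096 bytes (an arbitrary cut-off) and is written in ASCII-compatible bytes; `decode_html_bytes` is a hypothetical name.

```python
import re


def decode_html_bytes(raw: bytes) -> str:
    """Decode HTML bytes using a declared charset, defaulting to UTF-8."""
    # Sniff only a prefix; non-ASCII bytes are replaced and don't matter here.
    head = raw[:4096].decode("ascii", errors="replace")
    match = re.search(r"""charset\s*=\s*["']?([A-Za-z0-9._:-]+)""", head, re.IGNORECASE)
    encoding = match.group(1) if match else "utf-8"
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or wrong declaration: fall back to lenient UTF-8.
        return raw.decode("utf-8", errors="replace")
```

For piped input the bytes could come from `sys.stdin.buffer.read()`; for a file, from `open(path, "rb").read()`. This does use a binary read rather than a text-mode open, so it trades some Python convenience for a single pass.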
Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me. What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)
-
@kshegunov said in QTextDocument::toHtml() "encoding" parameter:
@JKSH said in QTextDocument::toHtml() "encoding" parameter:
No.
"£" is 0xA3 in ISO-8859-1 but 0xC2A3 in UTF-8.
You're making me look bad!
Apparently I'm wrong. Only the codepoint stays the same: U+00A3.
T'wasn't my intention ^^;; You're welcome for the links!
Yeah, same code point, different output bytes. I don't often think in terms of code points -- As a programmer, I've found it most useful to think in terms of graphemes and raw bytes. People who design encodings or fonts would be more interested in the other concepts.
P.S. In this thread, whenever I've said "character", I really meant "grapheme".
P.P.S. If anyone's interested in the nuances between "grapheme", "code point", and other concepts, see https://stackoverflow.com/a/27331885/1144539
-
@JonB said in QTextDocument::toHtml() "encoding" parameter:
do you mean it will be encodable but possibly by a different one? ... £ is doable in both, but is not the same in either, right?
Right. As per the table above, £ ...
- ...cannot be encoded in ASCII
- ...can be encoded in CP1252 and Windows-1252 as 0xA3
- ...can be encoded in UTF-8 as 0xC2A3
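These byte values are easy to check from Python. A minimal sketch (the helper name `encoded_hex` is made up for illustration):

```python
def encoded_hex(ch, encoding):
    """Return the encoded bytes of ch as upper-case hex, or None if not encodable."""
    try:
        return ch.encode(encoding).hex().upper()
    except UnicodeEncodeError:
        return None


# Same character (code point U+00A3), different output bytes per encoding.
pound = {enc: encoded_hex("£", enc) for enc in ("ascii", "latin-1", "cp1252", "utf-8")}
```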
Though it's hideously inefficient :)
It is. But it's what we need to put up with if we want to support arbitrarily-encoded HTML/XML (particularly in Python)
This is less of a headache in C++ because as @kshegunov said, we can do ASCII searches in binary data.
And btw won't be doable if the HTML text is arriving via a pipe instead of a file, which is a bit of a limitation :(
This thread has been going for a while, but it's still not clear to me: When exactly does your app need to decode stuff? You've mentioned that it needs to re-open files produced by the app itself; does it also need to open user-created files? Does it decode files/data downloaded from the network?
Do you feel like offering some code to achieve the "Check the charset field."? This doesn't look like a "one-liner" to me.
Sorry, I'll pass. Precisely because it's not a one-liner ;)
There's a few ways to do it:
- Quick and dirty hack: Use textual searching
- Proper: Use an HTML parser. Far more inefficient than what we've already discussed.
I did a simple quick and dirty hack before: https://github.com/JKSH/QtSdkRepoChooser/blob/master/src/downloader.cpp#L58 (Man, I wasn't expecting that tool to still be alive and kicking 4 years later)
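For the "proper" route in Python, the standard library's `html.parser.HTMLParser` might be enough. The sketch below is an assumption about how one could wire it up, not tested against real-world pages; `CharsetFinder` and `declared_charset` are hypothetical names. It looks for either `<meta charset=...>` or the older `http-equiv="Content-Type"` form.

```python
from html.parser import HTMLParser


class CharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> also end up here by default.
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            self.charset = attrs["charset"].strip().lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            content = (attrs.get("content") or "").lower()
            _, _, value = content.partition("charset=")
            if value:
                self.charset = value.split(";")[0].strip()


def declared_charset(text):
    """Return the declared charset in lower case, or None if none is found."""
    finder = CharsetFinder()
    finder.feed(text)
    return finder.charset
```

This parses the whole document, which is the inefficiency mentioned above.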
What can come before any <head>? (e.g. DOCTYPE, comments, whitespace, blank lines, other stuff?) <head> is optional, isn't it? When do you stop if it's not going to be present? (e.g. if you hit <html> or <body> or something?)
Stopping at <body> seems like a good bet.
For a quick and dirty hack, you can grab all strings from the start of the file up till "<body" (no closing bracket) and then scan this substring for "charset". If you find it, regex should be able to finish the job.
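A rough Python version of that quick and dirty hack might look like this. It is only a sketch: `find_declared_charset` is a hypothetical name, the regex is deliberately loose, and it would also match the word "charset" inside a comment or script that appears before `<body`.

```python
import re


def find_declared_charset(text):
    """Scan everything before "<body" for charset=...; return it lower-cased or None."""
    end = text.lower().find("<body")
    head = text if end == -1 else text[:end]
    match = re.search(r"""charset\s*=\s*["']?\s*([^\s"'/>;]+)""", head, re.IGNORECASE)
    return match.group(1).lower() if match else None
```

If there is no `<body` at all, the whole text is scanned, so for large inputs it may be worth capping how much is read.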