Is there a way to display characters from a different charset?
-
Hello,
I am trying to make a subtitle editor that mimics the style editor in Aegisub. The thing is: inside a styled subtitle file (.ass file), there is an entry specifying the encoding and charset of the subtitle text according to the CharacterSet Enumeration (which is part of WinAPI, I believe). However, Qt seems to use only the Unicode charset.
So is there a way to display characters from another charset like SYMBOL_CHARSET (I haven't found the standard name, so I just call it 'SYMBOL_CHARSET')?
Inside the .ass file, the Encoding entry only accepts values in the CharacterSet Enumeration:
{
ANSI_CHARSET = 0x00000000,
DEFAULT_CHARSET = 0x00000001,
SYMBOL_CHARSET = 0x00000002,
MAC_CHARSET = 0x0000004D,
SHIFTJIS_CHARSET = 0x00000080,
HANGUL_CHARSET = 0x00000081,
JOHAB_CHARSET = 0x00000082,
GB2312_CHARSET = 0x00000086,
CHINESEBIG5_CHARSET = 0x00000088,
GREEK_CHARSET = 0x000000A1,
TURKISH_CHARSET = 0x000000A2,
VIETNAMESE_CHARSET = 0x000000A3,
HEBREW_CHARSET = 0x000000B1,
ARABIC_CHARSET = 0x000000B2,
BALTIC_CHARSET = 0x000000BA,
RUSSIAN_CHARSET = 0x000000CC,
THAI_CHARSET = 0x000000DE,
EASTEUROPE_CHARSET = 0x000000EE,
OEM_CHARSET = 0x000000FF
}
Judging by the name, I think this is a charset problem rather than an encoding problem. Here is an example. When:
FontName is set to 'Arial'
Encoding is set to 'SYMBOL_CHARSET'
'This is the SYMBOL_CHARSET.' is the preview text I typed in.
Below is the display I got:
It seems the display is using 'Wingdings' font, but 'Arial' is chosen in the FontName entry.
Here is the whole style panel:
Actually, Unicode has already included some of the Wingdings symbols according to this page. I think it would be better to stick to Unicode and have the Encoding entry always set to DEFAULT_CHARSET.
-
Yeah, the problem is that text is such a vast topic, and there are so many different apps, documentation sources, APIs and libraries that can't get their vocabulary in sync, that it's sometimes hard to even talk about it without confusion. What is called encoding in one place is called charset in another, and it's the same with a lot of other names.
In any case, whatever a particular piece of internet calls a particular thing it boils down to the same thing:
- There's a table that maps an integer number to a particular letter, accent, sign or any other thingie. We'll call this encoding. This can be ASCII, ANSI, UTF-8, WIN-1250 or any other of the hundreds if not thousands of mappings.
- A font specifies a set of glyphs corresponding to numbers.
- A font can have one or more of these sets, e.g. decimal number 84 can be displayed as a T or a ❄ or something else. We'll call this character set (or charset for short).
So to display text you have to have a string, which is an array of numbers. You have to know which encoding they are using. You have to have a font that is designed to display glyphs for the number range in that encoding and, if the font has more than one character set, you have to choose which one to use. As I mentioned earlier, fonts usually only support one.
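To make the "you have to know which encoding" part concrete, here is a minimal Python sketch (standard library only, not Qt). It takes one stored number and looks it up in four different Windows code page tables. The byte value 230 is just an arbitrary pick for illustration.

```python
# One stored number (decimal 230, hex 0xE6) looked up in four different
# encoding tables. The number never changes; only the table does.
raw = bytes([0xE6])

CODECS = ("cp1250", "cp1251", "cp1252", "cp1253")

# Map each codec name to the character it assigns to that number.
decoded = {codec: raw.decode(codec) for codec in CODECS}

for codec, char in decoded.items():
    print(f"{codec}: {char} (U+{ord(char):04X})")
```

Same number, four different letters, which is why a player that guesses the wrong table shows gibberish.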
What you have on the screenshot shown as Encoding is the character set. There doesn't seem to be an actual selector for text encoding in that dialog. You're saying the presented picture is using the Wingdings font although Arial is selected. My guess is that Arial doesn't actually support the symbol charset, so the preview is using a known fallback that does.
That's also kinda what I suggested if you want to reflect that in Qt. Since it doesn't have an option to select a charset from a font that has multiple character sets, have a set of fonts that supports each of these charsets and switch between them. In other words, treat charset as a hint for font selection, and for DEFAULT_CHARSET use the actually specified font.
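A rough sketch of that "charset as a font hint" idea, in plain Python so it stays toolkit-independent. The numeric values come from the enumeration in the question; the `FALLBACK_FONTS` table and the `pick_font` helper are hypothetical names made up for this example, and the font families in the table are only assumptions about what might be installed.

```python
# Charset values as listed in the CharacterSet enumeration above.
ANSI_CHARSET = 0x00
DEFAULT_CHARSET = 0x01
SYMBOL_CHARSET = 0x02
SHIFTJIS_CHARSET = 0x80

# Hypothetical table: one font per charset that we assume covers it.
# The families are placeholders; use whatever your target systems ship.
FALLBACK_FONTS = {
    SYMBOL_CHARSET: "Wingdings",
    SHIFTJIS_CHARSET: "MS Gothic",
}


def pick_font(font_name: str, charset: int) -> str:
    """Return the font family to render with (hypothetical helper).

    DEFAULT_CHARSET, and any charset without an entry in the table,
    keeps the font named in the style. Otherwise the charset acts as
    a hint and selects the fallback font registered for it.
    """
    if charset == DEFAULT_CHARSET:
        return font_name
    return FALLBACK_FONTS.get(charset, font_name)


print(pick_font("Arial", DEFAULT_CHARSET))
print(pick_font("Arial", SYMBOL_CHARSET))
```

The returned family name is what you would then hand to the toolkit's font object. This sketch always overrides the font for a listed charset; checking first whether the requested font already covers that charset is left out here.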
As for Wingdings and Unicode - Wingdings is a font that has glyphs for numbers 0-255. Unicode is a multi-byte encoding that assigns meaning to a particular number, but does not dictate how a font should display it (it's a strong suggestion, but only a suggestion nonetheless). For example, Unicode defines a character that is suggested to be a snowflake (U+2744), but the Wingdings font can't display it, because it's out of its range (0-255). The thing that Wingdings displays as a snowflake is number 84, which in Unicode is a capital letter T. That's why you see a snowflake in the preview when you type T. It's actually encoded in Unicode as 84 and displayed in a symbols charset by the Wingdings font.
I know it's confusing, but those two snowflakes are different things - one is a representation of the character T in the symbols character set, and the other is a legitimate snowflake character as defined by Unicode, which can be viewed in any character set.
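If it helps, the difference between the two snowflakes can be checked with a few lines of Python (standard library only). This is just an illustration of the numbers involved, not anything Qt-specific.

```python
import unicodedata

letter = "T"            # what you typed in the preview box
snowflake = "\u2744"    # the real Unicode snowflake character

for ch in (letter, snowflake):
    # Print the number behind each character and its official Unicode name.
    print(ord(ch), hex(ord(ch)), unicodedata.name(ch))

# The stored text is still the letter T (number 84); only a symbol font
# draws it as a snowflake. The Unicode snowflake is a different number.
print(ord(letter) == ord(snowflake))
```

So the text in the file stays "T" no matter which font draws it, while U+2744 is a snowflake by definition.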
"I think it would be better to stick to Unicode and have the Encoding entry always set to DEFAULT_CHARSET."
I agree. The Unicode specification kinda supersedes the need for character sets, which were yet another way to get around the limitation of single-byte encodings. Just keep in mind that to get the snowflake in the default charset you have to use the actual Unicode snowflake character, as it has a different numerical value from the letters.
-
Encoding and charset are two different things. Unicode is an encoding, not a charset. Qt supports many encodings and conversion between them (see QTextCodec).
A charset, on the other hand, is a set of glyphs that characters in a particular encoding map to. AFAIK Qt does not support them directly. Charsets are a property of a font. A single font can have multiple charsets, although usually there's just one. To "emulate" support for charsets in Qt you'd just have multiple fonts, each with a different single charset, and you'd switch the font on the fly.
-
Thank you for the reply.
Inside an .ass file there is a separate entry called "FontName" that specifies fonts, which is different from the "Encoding" entry I mentioned.
The picture I show is under the setting:
FontName is set to 'Arial'
Encoding is set to 'SYMBOL_CHARSET'
And 'This is the SYMBOL_CHARSET.' is the text I typed in, which got processed and displayed as those symbols.
Since I don't know anything about C++, I cannot understand what's going on behind the scenes, even though Aegisub is open source.
Also, Unicode seems to be a valid charset name when I look around for info about encodings and charsets, and names like 'utf-8' and 'utf-16' are its corresponding encodings.
I will update my question to make it clearer.
-
I just tested some subtitle files (.ass files) output with the Encoding set to SYMBOL_CHARSET, and different media players give different results: some display gibberish, some display normal text and few display the Wingdings symbols shown in the preview correctly.
That's enough reason for me to always stick to Unicode.
Thank you for your profound answer, Chris!
-
@IvanIsLearning Yeah, like I said - it's confusing, and media player creators often don't handle it correctly. In fact it's sometimes impossible to handle it correctly, e.g. when the file specifies a combination of a charset and a font that doesn't support it. It's the fault of the author, but the player needs to make some assumptions/fallbacks in such a case:
- "some display gibberish" - they probably assume the wrong encoding
- "some display normal text" - they probably ignore the character set and just use whatever the specified font offers
- "few display the Wingdings symbols" - they probably ignore the specified font and use a fallback that supports the charset