Need advice on simple item/view design

Robert Hairgrove

The challenge (and this is not homework!):
Present the user with a table of the 8-bit code points of any supported text encoding, as preparation for running algorithms that recognize the encoding of arbitrary text files which are not in Unicode but in some 8-bit encoding.

The idea is that the code points, or byte values, are displayed in items arranged like the tiles of a rectangular table, and the user can simply click on an item to change its recognition status: green for OK, yellow for warning, and red for invalid in that encoding. It's basically like a chess board, except that it is not actually square, and each click cycles through the three states.
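The click-to-cycle behavior described above can be sketched without any GUI code. This is only an illustration: the names `Verdict`, `VerdictTable`, `nextVerdict` and `clickTile` are hypothetical and do not come from the thread, and it assumes one state per byte value with everything starting out green.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names; the thread does not define any types.
enum class Verdict : std::uint8_t { Ok, Warning, Invalid };  // green, yellow, red

// One verdict per byte value 0x00..0xFF.
using VerdictTable = std::array<Verdict, 256>;

// Advance to the next state: Ok -> Warning -> Invalid -> Ok.
inline Verdict nextVerdict(Verdict v) {
    switch (v) {
        case Verdict::Ok:      return Verdict::Warning;
        case Verdict::Warning: return Verdict::Invalid;
        case Verdict::Invalid: return Verdict::Ok;
    }
    return Verdict::Ok;  // not reached for valid enum values
}

// What a click on the tile for `byte` would do; returns the new state.
inline Verdict clickTile(VerdictTable& table, std::uint8_t byte) {
    table[byte] = nextVerdict(table[byte]);
    return table[byte];
}
```

A table declared as `VerdictTable table{};` is value-initialized, so every tile starts as `Verdict::Ok`. Whether the tiles end up as labels or as cells of a table view, the same function could back the click handler.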

For example:
Most (or all?) of the ISO-8859-x encodings assign no printable characters to the code points from 0x7F through 0x9F. If a file contains any bytes within this range, the ISO encoding can be excluded. Such tiles would be colored red.

There are other OEM encodings such as CP437 where essentially all code points are defined. However, some of those are highly improbable in certain circumstances, such as bytes from 0xB0 upward in CP437, many of which are MS-DOS box-drawing characters. These are technically legal, but in my particular use case (determining the encoding of CSV files) they should at least trigger a warning. The user can decide whether to treat these as errors or warnings. If the latter, they can be colored yellow.
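The two rules above could serve as starting presets that the user then adjusts by clicking. A minimal sketch, assuming a 256-entry table of states; the type and function names are hypothetical, and the byte ranges are taken directly from the description above rather than from any standard's text.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class Verdict : std::uint8_t { Ok, Warning, Invalid };
using VerdictTable = std::array<Verdict, 256>;

// Preset for the ISO-8859-x family as described above:
// any byte in 0x7F..0x9F rules the encoding out (red).
inline VerdictTable makeIso8859Preset() {
    VerdictTable t{};  // value-initialized: everything starts as Ok
    for (std::size_t b = 0x7F; b <= 0x9F; ++b) t[b] = Verdict::Invalid;
    return t;
}

// Preset for CP437 in the CSV use case: bytes from 0xB0 upward are
// legal but unlikely, so they start out as warnings (yellow).
inline VerdictTable makeCp437CsvPreset() {
    VerdictTable t{};
    for (std::size_t b = 0xB0; b <= 0xFF; ++b) t[b] = Verdict::Warning;
    return t;
}
```

Starting from a preset keeps the user's work down to the handful of tiles where their data differs from the default assumption.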

My first impulse was to set up custom item views, models, and delegates. OTOH, it seems like overkill, and I could do this just as quickly using a fixed GUI with QLabel widgets which can handle the mouse clicks and set their background color.

What do you think? Should I try to implement the model/view paradigm for something this simple? Or do you agree that it would be overkill?

mzimmers

IMO this seems like a natural application for the model/view/delegate design pattern. Your model could be a map of the 8-bit values and their respective display statuses, your view would probably be something like a GridView, and the delegate would be the tile. Your tile coloring could be handled by properties within the delegate.

IgKh

@Robert-Hairgrove Widgets, yes?

If you want to go the model/view way, I don't think you need a custom view and delegate. A QTableView with the default delegate and a custom QAbstractTableModel-based model should be sufficient.

What I'd do is have the model present itself as a 16x16 table and have its data() override respond to Qt::BackgroundRole with a colored brush chosen from whatever internal data structure you like. To update the model, give it a slot and connect that to the appropriate signal of the table view (don't forget to emit dataChanged).
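The Qt-independent core of such a model might look like the sketch below. It is deliberately not a QAbstractTableModel subclass, so it compiles without Qt; the helper names are hypothetical. The assumption is that a real model would return 16 from rowCount() and columnCount(), return `cellText()` for Qt::DisplayRole, and wrap the color for Qt::BackgroundRole in a brush.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical names, for illustration only.
enum class Verdict : std::uint8_t { Ok, Warning, Invalid };

constexpr int kRows = 16;
constexpr int kColumns = 16;

// Cell (row, column) shows byte row * 16 + column,
// so row 0xC, column 0x6 is byte 0xC6.
inline std::uint8_t byteAt(int row, int column) {
    assert(row >= 0 && row < kRows && column >= 0 && column < kColumns);
    return static_cast<std::uint8_t>(row * kColumns + column);
}

// Text a model could return for the display role, e.g. "C6".
inline std::string cellText(int row, int column) {
    static const char digits[] = "0123456789ABCDEF";
    const std::uint8_t b = byteAt(row, column);
    return std::string{digits[b >> 4], digits[b & 0x0F]};
}

// Color name a model could turn into a brush for the background role.
inline const char* backgroundColorName(Verdict v) {
    switch (v) {
        case Verdict::Ok:      return "green";
        case Verdict::Warning: return "yellow";
        case Verdict::Invalid: return "red";
    }
    return "green";  // not reached for valid enum values
}
```

Keeping this mapping separate from the Qt class means the same functions would work unchanged with the grid-of-labels alternative.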

OR you can just make do with a bunch of labels in a grid layout. It is more about your architecture preference; this use case as described doesn't really make a lot of use of the strong points of the model/view classes.

Pl45m4

@Robert-Hairgrove

Are you, by any chance, trying to build a HexEditor?

I think the delegate approach might be the best fit for color-coding your bytes.

I'm working on something similar as part of a larger project which will not be public, but I'm planning to extract the view and share it as a standalone Qt lib. Unfortunately it hasn't progressed very far yet, so I don't have anything ready to share.

JonB

@Robert-Hairgrove said in Need advice on simple item/view design:

and I could do this just as quickly using a fixed GUI with QLabel widgets which can handle the mouse clicks and set their background color.

Initially I thought you meant a single QLabel holding multiple characters, recognising which character the user clicks on and using rich text/HTML to set the color of each character. That would be tricky IMHO.

Now since you say "widgets" plural I assume you mean one QLabel per character. But you should not do this: tables (many instances) of widgets are a no-no for efficiency/memory. Don't use a gridview with loads of widgets either. Unless the number of cells is very limited, which I don't think it will be. As the others have said you must/should use a QTableView, with a delegate if desired but not with each cell holding any widget, for display, and hence a model to back it. If all you need is to set the background/foreground color of a bit of text you can just use the default delegate and roles without needing to write your own delegate, as @IgKh has written.

Robert Hairgrove

@Pl45m4 Actually, I am using a hex editor widget in another part of this application, but I am using this: https://github.com/Simsys/qhexedit2

There's also Gnome GHex which I like as a stand-alone editor, but haven't tried to use as a library. QHexEdit fits in nicely since it is also Qt-based (but doesn't use a model/view design).

What I am doing is to allow the user to configure how the application will try to recognize CSV files. I have discovered that except for Unicode, which is fairly easy to detect (especially if a BOM is present), most libraries I have tried such as ICU and uchardet fail miserably when presented with CSV files. They do very well for text files because they use a lot of linguistic heuristics and filters. CSV files, OTOH, can have almost anything as content.

My aim is not just to find a particular encoding, but also to help find "broken" CSV files. These can arise when, e.g., someone loads a perfectly fine Latin1-encoded file into one particularly popular spreadsheet program, pastes in text from somewhere else containing characters not supported by some 8-bit character set, and re-exports the file using that 8-bit codec. For example: import a file using some other encoding, insert lines containing the character "ã", then export the file as cp437. (The code point 0xC6 corresponds to the box-drawing character "╞" in cp437. I'm not sure which codec has that byte defined as "ã"; ISO-8859-1 has it at 0xE3.) Or take a file originally encoded as Windows-1252 which contains names with something like "ž", and export it as ISO-8859-1 (ISO files are not valid if they contain any code points between 0x7F and 0x9F, and Windows-1252 does use that range for some extended characters).

So while almost all code points above 0x1F correspond to legal bytes in cp437, the ones from 0xB0 upward would be highly unlikely to occur in a CSV file which is supposed to contain only text and numbers, maybe a currency sign or three, and standard punctuation. So the user could mark the code point corresponding to the "╞" as an error, although technically speaking it might warrant only a warning.
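Once the user has marked the tiles, checking a file against them comes down to one table lookup per byte. A minimal sketch under the same assumption of a 256-entry table of states; `ScanResult` and `scanBytes` are hypothetical names, and reading the file into the byte vector is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical names, for illustration only.
enum class Verdict : std::uint8_t { Ok, Warning, Invalid };
using VerdictTable = std::array<Verdict, 256>;

struct ScanResult {
    Verdict worst = Verdict::Ok;  // most severe verdict seen
    std::size_t warnings = 0;     // number of bytes marked Warning
    std::size_t errors = 0;       // number of bytes marked Invalid
};

// Look up every byte of `data` in the user's table and tally the outcome.
inline ScanResult scanBytes(const std::vector<std::uint8_t>& data,
                            const VerdictTable& table) {
    ScanResult r;
    for (std::uint8_t b : data) {
        const Verdict v = table[b];
        if (v == Verdict::Warning) ++r.warnings;
        if (v == Verdict::Invalid) ++r.errors;
        if (v > r.worst) r.worst = v;  // relies on Ok < Warning < Invalid
    }
    return r;
}
```

With 0xC6 marked as an error for cp437, a file containing that byte would come back with `worst == Verdict::Invalid`, while other bytes from 0xB0 upward would only raise the warning count.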

Robert Hairgrove

@IgKh Thanks!

I know pretty much how I would set it up using just a grid and some QLabel widgets, but am a bit worried that if I do have to scale it up a bit, that I will eventually get stuck because that approach may not be flexible enough.

I think I will try both ways you suggest; if I can keep it simple, without having to do too much customization, I would prefer the model/view approach.

Robert Hairgrove

ChatGPT gave me this:

Which 8-bit text encoding has the letter 'ã' (small 'a' with tilde)
assigned to the code point 0xC6?

You're referring to DOS Code Page 850 (also known as CP 850, “DOS Latin‑1”). In this encoding, the small 'a' with tilde (ã) is indeed assigned to the byte value 0xC6 ([utf8-chartable.de][1], [Android Git Repositories][2], [Debian Sources][3]).

To summarize:

In ISO‑8859‑1 (Latin‑1) and Unicode‑based encodings like UTF‑8, ã is not at 0xC6—it appears at 0xE3 ([technojeeves.com][4], [utf8-chartable.de][5]).

But in CP 850, the mapping is different: 0xC6 maps to Unicode U+00E3, which is ã ([Android Git Repositories][2], [Debian Sources][3], [utf8-chartable.de][1], [Scribd][6]).

Therefore, the 8‑bit encoding you're asking about is Code Page 850 (CP 850).

It looks like the file I was examining might be legal cp850, but neither ICU nor uchardet was able to recognize it as such.