Qt's MOC and UTF-8

MasterQ

Hello,

I tried to use identifiers with UTF-8 encoded characters like German Umlauts (äöüß).

I get the following message from MOC while processing an enum:

AutoMoc subprocess error

The moc process failed to compile
  "SRC:/src/BankBusiness/AccountView.h"

...

Output
src/BankBusiness/AccountView.h:51:1: error: Parse error at "Gl"

ninja: build stopped: subcommand failed.

The next character after 'Gl' is 'ä'.

int äeiöü = 0; // <= accepted

enum CSVField {
    GläubigerId, // <= MOC stops here
};

This issue only happens (so far) for a definition of enum entries, not for regular identifiers.

I am wondering if Qt would have issues with UTF-8 in general.

Why is an Umlaut rejected in an enum but accepted in other identifiers? My source code encoding is fully UTF-8.

Christian Ehrlicher

Don't use anything but ascii for variables or similar.
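
One way to follow that advice and still show the umlaut to the user is to keep the enumerator ASCII and map it to a UTF-8 display label. This is only a minimal sketch: the transliterated name GlaeubigerId and the helper csvFieldLabel are made up for illustration and are not part of the original code, and it has not been run through moc here.

```cpp
#include <cassert>
#include <string>

// ASCII-only enumerator names, so a parser only ever sees plain ASCII identifiers.
enum CSVField {
    GlaeubigerId, // hypothetical transliteration of "GläubigerId"
};

// Hypothetical helper: returns the UTF-8 encoded display label for a field.
inline std::string csvFieldLabel(CSVField field)
{
    switch (field) {
    case GlaeubigerId:
        // "ä" is written as its UTF-8 bytes (C3 A4) so the result does not
        // depend on the source or execution character set of the compiler.
        return "Gl" "\xC3\xA4" "ubigerId";
    }
    return {};
}

// Small self-check: the label is 12 bytes long because "ä" takes two bytes.
inline void checkCsvFieldLabel()
{
    assert(csvFieldLabel(GlaeubigerId).size() == 12);
}
```

The umlaut then lives only in string data, which is a far older and better supported place for non-ASCII text than identifiers.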

MasterQ

@Christian-Ehrlicher said in Qt's MOC and UTF-8:

Don't use anything but ascii for variables or similar.

That's advice, not an explanation. ;-)

Would you label the issue I mentioned as a feature or a bug? Compilers should not have issues with non-ascii identifiers these days.

I am only wondering why the behaviour is different for the two examples. Non-ascii should be accepted or not. But not this mixture!

SimonSchroeder

@MasterQ said in Qt's MOC and UTF-8:

Compilers should not have issues with non-ascii identifiers these days.

Unicode in identifiers is still a fairly recent thing. And moc is not a real compiler. It just tries to parse the important bits of information. You cannot fully expect it to support every new feature of the standard. It has been mentioned several times that Qt is closely following the development of reflection in C++. C++26 will take its first steps towards reflection, but it will not be enough for Qt yet. Hopefully Qt will ditch moc very soon (maybe C++29 will get all the features we need) and switch over to reflection. It does not make a lot of sense to make moc a lot more usable when a true solution is just around the corner. So, in the future you might be able to use umlauts everywhere.

I'm also not fully sure which Unicode characters are actually allowed (https://en.cppreference.com/w/cpp/language/identifiers). There is a mention of XID_Start and XID_Continue. There might be a difference between ä as a single precomposed code point and a followed by a combining ¨ (two code points). I personally try to stick to English identifier names because you never know where your application might end up in the future. Your company might grow and become international. German identifiers will then make the code hard to understand for foreign developers. And you'll never run into the problem you've mentioned...
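
To make the point about the two forms of ä concrete, here is a small sketch that only looks at the UTF-8 bytes. The function names are made up for illustration. It says nothing about what moc or any particular compiler does with either form; it only shows that the two spellings are different byte sequences, so any tool that compares names byte by byte will see two different names.

```cpp
#include <cassert>
#include <string>

// "ä" as one precomposed code point, U+00E4 (UTF-8 bytes C3 A4).
inline std::string precomposedAUmlaut() { return "\xC3\xA4"; }

// "a" (U+0061) followed by U+0308 COMBINING DIAERESIS (UTF-8 bytes 61 CC 88).
inline std::string decomposedAUmlaut() { return std::string("a") + "\xCC\x88"; }

// Byte-wise comparison: no Unicode normalization is applied.
inline bool sameBytes(const std::string& lhs, const std::string& rhs)
{
    return lhs == rhs;
}

// Small self-check of the byte lengths and the comparison result.
inline void checkUmlautForms()
{
    assert(precomposedAUmlaut().size() == 2);
    assert(decomposedAUmlaut().size() == 3);
    assert(!sameBytes(precomposedAUmlaut(), decomposedAUmlaut()));
}
```

Both strings render as the same glyph in most editors, which is exactly why such a mismatch would be hard to spot in source code.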

MasterQ

Thank you for the info

Pl45m4

@SimonSchroeder said in Qt's MOC and UTF-8:

Unicode in identifiers is still a fairly recent thing

I cannot wait to debug foreign code where every variable and symbol is completely in Chinese/Japanese/Korean/Hebrew letters... :D

It might be convenient for some, but with things like this you can quickly overengineer and make things worse.

MasterQ

@Pl45m4 said in Qt's MOC and UTF-8:

It might be convenient for some, but with things like this you can quickly overengineer and make things worse.

This depends on the point of view. Ask non-English-speaking Chinese, Japanese, Korean, or Hebrew readers.

Why exclude young Chinese guys from coding? ...

But I got your points, and my question was not about "does it make sense" but more about "is it possible to do so, if you wish".

Cheers

Pl45m4

@MasterQ said in Qt's MOC and UTF-8:

This depends on the point of view. Ask non-English-speaking Chinese, Japanese, Korean, or Hebrew readers.
Why exclude young Chinese guys from coding?

That was not intended to go in your direction :)

Over the years code conventions have developed, for good reasons.
As a German myself, I would never post code like (now I did here, LOL):

(I actually googled for words with Ä and ß... everything that came to my mind would not have made any sense)

void ÄußereKlasse::holeÖffentlichesMaß(int maß)
{
    std::string ßÄäÖÄöÜ = "Hello";
    std::string scheiße = "World"; // classic :D
    // ...
}

and then ask for assistance in any case other than code syntax. When there is nothing obviously wrong in C++ standard terms, it's a pain to figure out what is going on if you can't read sh*t...

It's like reverse engineering obfuscated code, except it's actually clear text, but you still need to figure out the hard way what this is all about...
If there's even the slightest chance that somebody else other than yourself will ever read your code or you even ask for help over the Internet... you should stick to those standards.

Just because you probably can (in the future), doesn't mean you should spam language specific characters from now on :))

MasterQ

I agree, no doubt.

But I can remember some FORTRAN code, maybe 30 years ago, where all variables were like x1, x2, ... I only had a chance to understand it because I knew what the coder intended to calculate, =8-0.

Even if 'x' is an ASCII character, the code was terribly unreadable. But that's another chapter of the lore.

TGIF

have a nice weekend

SimonSchroeder

@MasterQ said in Qt's MOC and UTF-8:

But I can remember some FORTRAN code, maybe 30 years ago, where all variables were like x1, x2

Well, short variable names back then had a couple of reasons. For one, memory was at a premium and shorter identifiers mean less memory. This is further compounded if you consider punch cards with only up to 80 columns (and you had to start your code at column 7). I still have to work with some old 80-column FORTRAN code. It is annoying when you need to split an equation over several lines. Shorter names help you to fit everything into one line. Not to forget that identifiers were restricted to 6 characters in standard FORTRAN 77. There are only so many meaningful identifiers with only 6 characters. And the first letter (initially) would define if your variable is integer or floating point. (This is why still to this day the most common loop variables are i, j, k, l, m, n, as those were defined to be integers.)

@MasterQ said in Qt's MOC and UTF-8:

Why exclude young Chinese guys from coding? ...

Everyone coding in (proper) C++ has to code in English. Keywords are English. So, you can either stick to English or mix languages, but you cannot write entirely in a language different from English. You could try with macros, but it is certainly not a good solution. Further, it would still restrict yourself to languages written left to right. I would claim that any programmer needs to know English in order to stay up to date. So, let's just agree to English as the common language to make code portable between different nations.
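
Just to show what the macro idea would look like, here is a rough sketch. The aliases wenn, sonst and gib_zurueck and the function groesser are invented for this example, and they are deliberately kept ASCII because non-ASCII macro names run into the same tooling questions as the original enum.

```cpp
#include <cassert>

// Hypothetical German aliases for C++ keywords (illustration only, not recommended).
#define wenn if
#define sonst else
#define gib_zurueck return

// Returns the larger of two values, written with the aliased keywords.
inline int groesser(int a, int b)
{
    wenn (a > b) {
        gib_zurueck a;
    } sonst {
        gib_zurueck b;
    }
}

// Small self-check of both branches.
inline void checkGroesser()
{
    assert(groesser(2, 5) == 5);
    assert(groesser(7, 3) == 7);
}
```

It compiles, but every tool and every reader now has to know the macro table first, and the standard library and all error messages stay in English anyway.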

jsulm

@MasterQ said in Qt's MOC and UTF-8:

Why exclude young Chinese guys from coding?

I'm quite confident young Chinese guys speak English well enough.
How should this work in a project where people from different countries are involved? If everyone involved in such a project starts to use his/her native language in code you can dump the project. In our company such code would never pass code review. It is not about excluding anybody, it is about having a common language everybody understands.

JonB

@jsulm So we should use Esperanto, which everyone understands, instead of English :)

jsulm

@JonB I'm sure more people speak Latin than Esperanto :-D

Pl45m4

@JonB said in Qt's MOC and UTF-8:

So we should use Esperanto, which everyone understands, instead of English :)

Mi ŝatas tion :D

@jsulm said in Qt's MOC and UTF-8:

I'm quite confident young Chinese guys speak English well enough.

That's why we have an International category for every major language, right ;-)

Even though it's not that helpful, I think there are a lot of "programmers" in every region of the world who speak only their native language, while their English knowledge is limited to the few "keywords" of C++ (or whatever language they are using).