QString, Unicode, and qWarning() Concerns

Rondog

Using OS X 10.10.5, XCode 7.2.1, Qt 5.6.0

I recently noticed that an incorrect assignment of a QString did not create a compiler error (or even a warning) and the end result was unexpected.

This is an example that shows the problem with the QString assignment:

	QString test_string;	
	test_string = 0.000;  // this was a mistake as the enclosing quotes were omitted.  The intention was to assign "0.000" to the QString.
	
	qDebug() << test_string;
	qDebug() << test_string.length();
	qDebug() << test_string.isNull();
	qDebug() << test_string.isEmpty();

// output
"\u0000"
1
false
false

The other odd thing I noticed while troubleshooting was that this seemed to break the qWarning() function. The output was truncated at the null character, as shown in the following example, even though the string had a non-zero length:

	test_string = QStringLiteral("text with \0 null embedded");

	qWarning(test_string.toLatin1());
	qDebug() << test_string;

// output
"text with " 
"text with \u0000 null embedded"

So, my questions are:

How did the QString::operator=() function manage to turn a floating point value into a unicode null character?
Is it normal behaviour for qWarning to truncate text when dealing with null characters embedded in the string?
Referring to the first example, where the first character of the QString is a null character "\u0000": it makes sense that the length is non-zero, but shouldn't QString::isEmpty() return true?

kshegunov

@Rondog
Hi,

How did the QString::operator=() function manage to turn a floating point value into a unicode null character?

0.000 => 0 (implicitly, allowed by overload rules) => QChar (implicit constructor) => QString::operator = (QChar)

The compiler should've given you a warning about double/float truncation though.
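The chain above can be reproduced without Qt. The sketch below uses two hypothetical stand-ins, `Ch` for QChar and `Text` for QString, and assumes (as the Qt 5 documentation lists) that the string class offers assignment from both a character class and a plain `char`. Under that assumption the `char` overload would be the one picked, because it needs only a built-in floating-to-integral conversion, while the `Ch` overload needs a user-defined one. Either way, the string ends up holding a single character with code 0.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for QChar: implicitly constructible from an integer.
struct Ch {
    Ch(int c) : code(static_cast<unsigned short>(c)) {}
    unsigned short code;
};

// Records which assignment overload was selected.
enum class Overload { None, FromCh, FromChar };

// Hypothetical stand-in for QString.
struct Text {
    std::vector<unsigned short> chars;
    Overload used = Overload::None;

    Text &operator=(Ch c) {
        chars.assign(1, c.code);
        used = Overload::FromCh;
        return *this;
    }
    Text &operator=(char c) {
        chars.assign(1, static_cast<unsigned char>(c));
        used = Overload::FromChar;
        return *this;
    }
    std::size_t length() const { return chars.size(); }
    bool isEmpty() const { return chars.empty(); }
};

// Mirrors "test_string = 0.000;" from the first post.
Text assign_from_double(double value) {
    Text t;
    t = value; // double -> char is a standard conversion, so operator=(char) wins
    return t;
}
```

Calling `assign_from_double(0.000)` gives a `Text` of length 1 whose only character has code 0, matching the `"\u0000"`, `1`, `false` output in the question.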

Is it normal behaviour for qWarning to truncate text when dealing with null characters embedded in the string?

My guess is the latin support stops at the '\0' character, which I think it shouldn't.

Referring to the first example, where the first character of the QString is a null character "\u0000": it makes sense that the length is non-zero, but shouldn't QString::isEmpty() return true?

Nope. '\0' is a character like any other, and a valid one. So the string is not empty (although there's nothing to see in it).

Kind regards.

Rondog

I can see how implicit conversion could be used but in this case the compiler would need to know that the assigned value is equivalent to integer 0 and go from there. Occasionally I have seen constants decorated so the compiler knows the type of value (i.e. L"unicode text" or 0.0f, which override the defaults of 8-bit char and double) but I don't think the compiler considers what the actual value is.

The function QString::isEmpty() is not clearly defined. When comparing to how char arrays are handled in C/C++ this seems to be wrong:

 char text[] = "text";
// strlen() would return 4 (doesn't include null character)
// sizeof() would return 5 (includes null character)

I guess it all depends on what your definition of empty is. I would think that a string only containing one character, a null character, would be considered an empty string. It will have a length of one element (same idea as sizeof() ) but has no text (same as strlen() ).

For the qWarning thing it must be more than the conversion to latin (utf8), as both have a null character. It probably means it is using legacy code that can't deal with embedded null characters, whereas qDebug() can (?).

mrjj

Hi
Doc says
"A QString can embed '\0' characters (QChar::Null). The size() function always returns the size of the whole string, including embedded '\0' characters."

So I guess that is why it's not empty: \0 is just another char.
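The same distinction can be seen with std::string, which also stores its length separately. This is only an analogy in standard C++, not Qt code, and the helper names are made up for the illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// A std::string holding exactly one character: the null character.
std::string single_null() {
    return std::string(1, '\0');
}

// Length as the string class reports it (stored separately from the data).
std::size_t stored_length(const std::string &s) {
    return s.size();
}

// Length as C sees it: counts characters up to the first '\0'.
std::size_t c_length(const std::string &s) {
    return std::strlen(s.c_str());
}
```

For the single-null string, `stored_length` is 1 and `empty()` is false, while `c_length` is 0. That is the strlen-style view of "empty" raised earlier in the thread, and it is not the one the string class uses.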

kshegunov

@Rondog said:

I can see how implicit conversion could be used but in this case the compiler would need to know that the assigned value is equivalent to integer 0 and go from there.

C++ has a pretty crappy type discipline. So things like:

double x = 0.1;
int z = x; //< This is valid, you may not even get a warning; result is 0 as expected.

are correct. Most compilers will raise a truncation warning, but not all.
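A minimal, compilable version of that snippet, with `to_int` as a hypothetical helper name. As far as I know, GCC and Clang stay silent here by default and only warn when a flag such as `-Wconversion` is enabled.

```cpp
// Implicit double -> int conversion discards the fractional part
// (truncation toward zero); no cast is required.
int to_int(double x) {
    int z = x; // compiles as-is
    return z;
}
```

So `to_int(0.1)` is 0 and `to_int(-2.9)` is -2. Brace initialization (`int z{x};`) is the one spelling where the language itself rejects this narrowing.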

Occasionally I have seen constants decorated so the compiler knows the type of value (i.e. L"unicode text" or 0.0f, which override the defaults of 8-bit char and double) but I don't think the compiler considers what the actual value is.

Yes, there are a handful of suffixes to force a behavior - u for unsigned, l for long, f for single-precision floating point (float). L"" for wide-char strings. In any case, I don't think this relates to the subject at hand.

When comparing to how char arrays are handled in C/C++

const char * const text = "xxx";
// strlen() returns 3, because it's specifically documented it doesn't count the \0 character
// sizeof() would return 8 on my machine (the size of void *), so don't use sizeof for this kind of task.

I would think that a string only containing one character, a null character, would be considered an empty string.

As with most string implementations the length of the string is kept separately, to allow \0 characters inside the array. Suppose you're working with encoded string data that uses \0 for something more than just "end of string" marker. If you enforce the behavior of strlen, then your string is not binary safe (i.e. it can't contain non-text/non-printable parts).

For the qWarning thing it must be more than conversion to latin (utf8) as both have a null character.

Latin1 is different from utf8. The former doesn't standardize any non-printable characters and is using fixed character size, while the latter is using much more complicated character encoding.

It probably means it is using legacy code that can't deal with embedded null characters, whereas qDebug() can (?).

Possibly, but it may also be working as it's designed.
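One likely explanation, assuming the call resolves to the printf-style `qWarning(const char *message, ...)` form: a C string ends at its first '\0' by definition. The QByteArray returned by toLatin1() still holds every byte, but once it is passed on as a `const char *` the length information is gone. The sketch below reproduces the effect with std::snprintf and no Qt at all; `format_like_printf` and `message_with_null` are hypothetical helpers.

```cpp
#include <cstdio>
#include <string>

// Formats a message the way a printf-style function sees it:
// through a const char * that ends at the first '\0'.
std::string format_like_printf(const std::string &message) {
    char buffer[128];
    std::snprintf(buffer, sizeof buffer, "%s", message.c_str());
    return std::string(buffer);
}

// The message from the question, built with an explicit length so the
// embedded '\0' is kept.
std::string message_with_null() {
    const char raw[] = "text with \0 null embedded";
    return std::string(raw, sizeof raw - 1);
}
```

The full message is 25 characters long, yet `format_like_printf` returns only `"text with "`, the same truncation seen with qWarning(). A stream-style call such as `qWarning() << test_string` would probably avoid it, since it does not go through a C string.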

Kind regards.

Rondog

@kshegunov ,

C++ has a pretty crappy type discipline.

I don't really agree with this statement. One of the features of C++ is that it is more type-safe than its predecessor. For cases where implicit conversion is a concern you can prevent this by specifying the keyword 'explicit' to the constructors or member functions. You can't do this with built in types (that I am aware of at least) but you can do this with your own classes.
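A short sketch of what `explicit` changes, using two hypothetical classes (`LooseChar` and `StrictChar`, neither of them a Qt type):

```cpp
#include <type_traits>

// Hypothetical character class whose constructor allows implicit conversion.
struct LooseChar {
    LooseChar(int c) : code(c) {}
    int code;
};

// Same class, but the constructor is marked explicit.
struct StrictChar {
    explicit StrictChar(int c) : code(c) {}
    int code;
};

// A double converts silently: double -> int -> LooseChar.
static_assert(std::is_convertible<double, LooseChar>::value,
              "implicit conversion allowed");
// With explicit, the conversion must be spelled out as StrictChar(0).
static_assert(!std::is_convertible<double, StrictChar>::value,
              "implicit conversion blocked by explicit");
```

With `StrictChar`, a line like `StrictChar c = 0.000;` is rejected at compile time, which is the protection described above.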

Yes, there are a handful of suffixes to force a behavior - u for unsigned, l for long, f for single-precision floating point (float). L"" for wide-char strings. In any case, I don't think this relates to the subject at hand.

It does relate to the subject. When the compiler parses your program it identifies keywords, operators, and constants among other things. For a modern compiler I believe that all floating point values are assumed to be double precision float unless you specify otherwise (i.e. x = 1.23f which would be treated as single precision float). This is what I am using to base the assumption that the assignment was from a double precision floating number to a QString.
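The literal types mentioned here can be checked at compile time in standard C++:

```cpp
#include <type_traits>

// An undecorated floating-point literal has type double.
static_assert(std::is_same<decltype(0.000), double>::value, "0.000 is a double");
// The f suffix makes it a float, the L suffix a long double.
static_assert(std::is_same<decltype(0.0f), float>::value, "0.0f is a float");
static_assert(std::is_same<decltype(0.0L), long double>::value, "0.0L is a long double");
// A plain string literal is an array of const char; the L prefix gives const wchar_t.
static_assert(std::is_same<decltype("ab"), const char (&)[3]>::value, "narrow literal");
static_assert(std::is_same<decltype(L"ab"), const wchar_t (&)[3]>::value, "wide literal");
```

The type is fixed by the spelling of the literal, not by its value: `0.000` is a double even though it happens to equal integer 0.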

How the value is understood by the compiler is important. If, for example, you had this it wouldn't compile at all:

	char test_char[] = L"unicode test";

Unlike integer to double or other reasonable implicit conversions there is no conversion from 'wchar_t' to 'char'. The compiler could probably turn this into 'char' quite easily but it won't do this as the mechanism to go from one type to another must be clear somehow.

Going back to the problem assignment:

QString test_string = 0.000;

The value of 0.000 must be identified as a double precision floating point number and, somehow, this number was implicitly converted into a QChar() null value (?). It was a typo on my part but it really threw me for a loop tracking this down (I use qWarning() a lot and this broke in this case).

const char * const text = "xxx";
// strlen() returns 3, because it's specifically documented it doesn't count the \0 character
// sizeof() would return 8 on my machine (the size of void *), so don't use sizeof for this kind of task.

In your example you are correct but you are using a pointer and not a char array. If your version looked like the following example then sizeof() would return the length of the text and the terminating null character:

 char text[] = "text";
// strlen() would return 4 (doesn't include null character)
// sizeof() would return 5 (includes null character)
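Both cases side by side, as a small sketch (the function names are made up for the illustration):

```cpp
#include <cstddef>
#include <cstring>

char text_array[] = "text";             // array of 5 chars: 't','e','x','t','\0'
const char *const text_pointer = "xxx"; // pointer to a 4-char array

// sizeof on an array yields the array's size in bytes, terminator included.
std::size_t array_bytes() { return sizeof text_array; }

// sizeof on a pointer yields the pointer's size, whatever it points to.
std::size_t pointer_bytes() { return sizeof text_pointer; }

// strlen counts characters up to, but not including, the first '\0'.
std::size_t array_length() { return std::strlen(text_array); }
std::size_t pointer_length() { return std::strlen(text_pointer); }
```

`array_bytes()` is 5 and `array_length()` is 4, while `pointer_bytes()` depends on the platform (8 on a typical 64-bit machine) and says nothing about the text.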

As with most string implementations the length of the string is kept separately, to allow \0 characters inside the array. Suppose you're working with encoded string data that uses \0 for something more than just "end of string" marker. If you enforce the behavior of strlen, then your string is not binary safe (i.e. it can't contain non-text/non-printable parts).

You are right that newer string classes use a container of some sort to store all the characters with a separate value for length. You can dump anything into this (including null characters if you want). Using a string class for binary data is kind of a stretch though and there are other ways to handle this kind of data. I doubt this was a design consideration when developing a string class.

JKSH

@Rondog said:

Using OS X 10.10.5, XCode 7.2.1, Qt 5.6.0

Using Windows 10, MSVC 2013, Qt 5.6.0:

main.cpp:18: error: C2440: 'initializing' : cannot convert from 'double' to 'QString'
No constructor could take the source type, or constructor overload resolution was ambiguous

Using Windows 10, MinGW (GCC) 4.9.2, Qt 5.5.1:

main.cpp:18: error: conversion from 'double' to non-scalar type 'QString' requested
  QString str = 0.000;
                ^

Looks like XCode is the silly one.
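One caveat when comparing these results: both error messages come from an initialization (`QString str = 0.000;`), whereas the first post assigned to an existing QString. C++ resolves the two differently. Copy-initialization needs an implicit conversion to the class type, while assignment only needs a viable `operator=`. The sketch below uses hypothetical stand-ins (`Ch`, `Text`) and assumes a string class with a converting constructor from a character class plus assignment from a plain `char`. Whether this fully accounts for the Xcode result was not verified here.

```cpp
#include <type_traits>

// Hypothetical stand-in for a character class.
struct Ch {
    Ch(int c) : code(c) {}
    int code;
};

// Hypothetical stand-in for a string class.
struct Text {
    Text() : first(0), size(0) {}
    Text(Ch c) : first(c.code), size(1) {}  // construct from a character
    Text &operator=(char c) {               // assign a plain char
        first = c;
        size = 1;
        return *this;
    }
    int first;
    int size;
};

// "Text t = 0.000;" would need double -> Ch -> Text, i.e. two user-defined
// conversions in a row, which implicit conversion never performs.
static_assert(!std::is_convertible<double, Text>::value,
              "copy-initialization from double is rejected");

// "t = 0.000;" only needs double -> char, a built-in conversion.
static_assert(std::is_assignable<Text &, double>::value,
              "assignment from double is accepted");
```

If QString behaves like `Text` in this respect, the assignment form would be expected to compile on MSVC and GCC as well, so the tests above may not show a difference between compilers.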

kshegunov

@Rondog said:

I don't really agree with this statement.

Well, let's agree to disagree then.

For cases where implicit conversion is a concern you can prevent this by specifying the keyword 'explicit' to the constructors or member functions. You can't do this with built in types (that I am aware of at least) but you can do this with your own classes.

All this proves C++ (through backwards compatibility with C) is very weakly typed.

For a modern compiler I believe that all floating point values are assumed to be double precision float unless you specify otherwise

I believe that too.

This is what I am using to base the assumption that the assignment was from a double precision floating number to a QString.

You misunderstood me. The reason the expression works is that your compiler allowed double to be truncated to int silently. Once you got the integer it's pretty obvious how you arrive at the QString.

How the value is understood by the compiler is important. If, for example, you had this it wouldn't compile at all

Would depend on the compiler and its configuration (as proven by @JKSH's test).

Unlike integer to double or other reasonable implicit conversions there is no conversion from 'wchar_t' to 'char'.

My point is that double to integer is also "reasonable" in the sense that you get a truncated value. (Same with converting from const char * const to const wchar_t * const). The compiler can do this, and would do it readily were it not for the warning/error safeguards put in place by the compiler devs. Older or non-compliant compilers just eat that stuff up, no problem.

Using a string class for binary data is kind of a stretch though and there are other ways to handle this kind of data.

There's always another way. But consider you're talking with a device/PC in text mode and you're using utf8 (i.e. a complex encoding) and you have non-printables/null characters in your protocol. How's that a stretch?
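As an illustration of that scenario, here is a sketch with std::string standing in for the string class. The frame layout (three text fields separated by '\0') and the helper names are invented for the example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a frame into fields separated by '\0'. This only works because
// std::string tracks its length instead of stopping at the first null.
std::vector<std::string> split_on_null(const std::string &frame) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = frame.find('\0', start);
        if (pos == std::string::npos) {
            fields.push_back(frame.substr(start));
            break;
        }
        fields.push_back(frame.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}

// Hypothetical frame: three text fields with '\0' as the separator.
std::string sample_frame() {
    const char raw[] = "ID\0temp=21\0ok";
    return std::string(raw, sizeof raw - 1);
}
```

`split_on_null(sample_frame())` yields the three fields `ID`, `temp=21` and `ok`. With strlen-style semantics only `ID` would survive.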

I doubt this was a design consideration when developing a string class.

I'm pretty sure it was. Also the trivial not-counting optimization, but that's a somewhat weak argument to keep the length separately.

Rondog

@JKSH,

Looks like XCode is the silly one.

That makes the most sense (a compiler bug). The other version where double -> integer -> QChar -> QString was a bit alarming. I should have tried this myself on another compiler but it is not something you normally consider.

@kshegunov,

There's always another way. But consider you're talking with a device/PC in text mode and you're using utf8 (i.e. a complex encoding) and you have non-printables/null characters in your protocol. How's that a stretch?

One of my favorite books on C++ is The Design and Evolution of C++ by Bjarne Stroustrup. There is a section that talks about what was considered when adding the std::string class; things that would make this class useful but not trying to create a bear. I don't recall reading anything favorable about having features that covered every possible fringe use of this class. I am sure the std::string class doesn't agree with everyone but what was done was a good compromise of features and goals to problems and shortfalls.

The string class is arguably one of the most important classes you can have. The Qt QString class is used by everything so even a simple Hello World application written in Qt will likely have many instances of this class. There are features in QString that make this less of a performance hit (such as the copy on write feature for this class) and I am sure there are other things internal to this class that make it as efficient as possible considering that it is so frequently used.

I am of the opinion that a string class (or any class for that matter) should have a narrow purpose and that it is a mistake to try and make something that covers every possible contingency. If you want to have a string class that can be used to handle binary data or something else unrelated to strings then you will end up with a monster sooner or later. My idea of a good string class is something that can handle strings efficiently and doesn't force the memory footprint of the application to explode.

[I doubt this was a design consideration when developing a string class.]
I'm pretty sure it was. Also the trivial not-counting optimization, but that's a somewhat weak argument to keep the length separately.

I was not suggesting that keeping the length of the string as a separate variable is a bad idea as there are advantages to this approach. One advantage to keeping the length as a separate variable, as you pointed out, is that you can have things like embedded null characters. One disadvantage is that it requires more memory for each instance of a string class which can be a problem if you have many instances of this class. The extra memory requirement is likely the main reason C character arrays (strings) are so primitive.

If your argument is simply that it allows for the capability to have an embedded null character, but this is a feature that is really not needed for string data, then it is not a good point. You may be right that having the ability to have embedded null characters in a QString was a design goal but more likely this was simply something that became possible from using a container for the string data (?).