Will the function unicode() always return the same address in the lifetime of a QString object?

okellogg

Hi,

If we define an object of type QString, is it safe to do pointer arithmetic on the return value of QString::unicode()?

As an example, see
https://github.com/KDE/kdevelop/blob/3.5/lib/cppparser/lexer.h#245

inline const CHARTYPE* offset( int offset ) const {
    return m_source.unicode() + offset;
}

inline int getOffset( const QChar* p ) const {
    return int(p - (m_source.unicode()));
}

Let's say that usrPtr is initially assigned from m_source.unicode(), then many modifying operations are done on m_source, and later getOffset(usrPtr) is called. Will the value returned still be "compatible" with usrPtr? Or is it possible that unicode() then returns a different buffer (i.e. a different start address)?

jeremy_k

From the documentation https://doc.qt.io/qt-6/qstring.html#unicode:

The result remains valid until the string is modified.

If the operations only involve const member functions, storing the pointer is fine. Otherwise, there is no guarantee. It's worth noting that there are some functions, such as operator[], that have both const and non-const versions.
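One way to stay clear of this is to keep an offset instead of the pointer and re-derive the pointer after each modification. A minimal sketch, assuming std::u16string as a stand-in for QString (so it builds without Qt), with data() in the role of unicode(); toOffset, fromOffset and charAfterAppend are hypothetical names, not part of Qt or the kdevelop lexer:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: std::u16string stands in for QString, data() for unicode().

// Hypothetical helper: convert a pointer into the current buffer to an index.
// Only meaningful while p still points into s's current buffer.
std::size_t toOffset(const std::u16string& s, const char16_t* p) {
    assert(p >= s.data() && p <= s.data() + s.size());
    return static_cast<std::size_t>(p - s.data());
}

// Hypothetical helper: re-derive a pointer from an index on demand.
const char16_t* fromOffset(const std::u16string& s, std::size_t offset) {
    assert(offset <= s.size());
    return s.data() + offset;
}

// Remember a position as an index, modify the string, then look it up again.
char16_t charAfterAppend(std::u16string s, std::size_t offset) {
    s.append(1000, u'x');           // may reallocate; old pointers could dangle
    return *fromOffset(s, offset);  // the index is unaffected by reallocation
}
```

An index survives a reallocation; it still has to be adjusted by hand if characters are inserted or removed before it.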

okellogg

@jeremy_k Thanks for your reply.
I had written

[...] is it possible that a different buffer (i.e. different start address) may be returned?

I found that on recent Qt versions, two calls to unicode() on the same variable, one near the start of lifetime and another after many string manipulations such as "insert", may in fact return different buffers.
This change appears to have happened in some version after Qt 5.9 - I had tested with up to Qt 5.9 and did not have this problem, even going back to Qt4.

jeremy_k

@okellogg said in Will the function unicode() always return the same address in the lifetime of a QString object?:

I found that on recent Qt versions, two calls to unicode() on the same variable, one near the start of lifetime and another after many string manipulations such as "insert", may in fact return different buffers.
This change appears to have happened in some version after Qt 5.9 - I had tested with up to Qt 5.9 and did not have this problem, even going back to Qt4.

Changes in the implementation (and in the operating system, the malloc implementation, other memory allocations in the process, etc.) may alter when two calls to QString::unicode() separated by a string modification return different pointers, but the possibility itself is not new.

They all say the same thing: The result remains valid until the string is modified.
Code that fails to take this into account risks encountering C++ undefined behavior.

From an implementation standpoint, this should not surprise anybody familiar with realloc() or memory management in general.
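To see the effect without Qt, here is a rough sketch that again assumes std::u16string as a stand-in for QString. bufferMoved and demo are hypothetical names. The addresses are compared as integers, so the stale pointer is never dereferenced:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumption: std::u16string stands in for QString, data() for unicode().

// Appends `extra` characters and reports whether the buffer start address changed.
bool bufferMoved(std::u16string& s, std::size_t extra) {
    const auto before = reinterpret_cast<std::uintptr_t>(s.data());
    s.append(extra, u'x');  // grows the string; may allocate a new buffer
    const auto after = reinterpret_cast<std::uintptr_t>(s.data());
    return before != after;
}

// The content survives the append even if the buffer start address does not.
void demo() {
    std::u16string s = u"abc";
    bufferMoved(s, 4096);
    assert(s.compare(0, 3, u"abc") == 0);
}
```

Whether bufferMoved returns true for a given call depends on the current capacity and on the allocator, so code that happened to work with one library version can start failing with another without any change in the documented contract.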

okellogg

Thanks again @jeremy_k for your explanations. They prompted me to make the changes in commit 3041141.

kshegunov

Out of curiosity, any specific reason you're trying to patch up a version that's 12 years old?
I'd have gone with the upstream if I were you, and just by glancing through that original piece of code, it looks very fragile ...

On an unrelated note, I'd have kept the tokens as a mirror of the original text and would have updated them/changed them based on the input coming in. Keeping a string that gets modified mid-way is a perfect recipe to hit all kinds of nastiness ...

okellogg

@kshegunov For discussion, see https://bugs.kde.org/show_bug.cgi?id=338649#c6
(I have not yet found the time to analyze whether the kdevelop clang plugin could be usable for umbrello.)