Using QSet<QString> to create a poor mans string pool?

alterratz

Hello all,

I am writing an application with lots of (repeating) string data. Some of the strings are to be stored in Objects. I could just use "QString m_member;" but I dislike this, since this neglects QTs nice ability of implicit string data sharing.
Therefore I cam up with this solution:
@
QSet<QString> stringpool;

void takeFromPool_solution1(QString& target, QString value)
{
QSet<QString>::iterator item = stringpool.insert(value);
target = *item;
}

void takeFromPool_solution2(QString& target, QString value)
{
QSet<QString>::iterator item = stringpool.find(value);
if (item == stringpool.end()) {
item = stringpool.insert(value);
}
target = *item;
}
......
QString x;
takeFromPool_solution1(x, QString("some common string"));
......
@

However, there are a few issues I'm not quite sure:
Is QSet the most efficient container class to do this? At least it has to calculate some hash value and who knows what else is happening inside there, but however it should be faster to find Strings than using an Array or some other container, and when using a map reference counting and auto deletion could also be implemented (which I do not require for my purpose) since the strings live till the end of the application.

Also I do not know if it is better to use find/insert or just to use insert instead, since the description says that there is only an actual insertion performed when there was no element with the searched value.

Thanks for your help and an interesting discussion in advance.

Best regards,
Bernhard

JKSH

Hi,

[quote]I dislike this, since this neglects QTs nice ability of implicit string data sharing.[/quote]What do you mean? All copies of a QString share common memory (until you modify one of the copies).

If you pass the same string to multiple objects, they will all share the string's internal data. Even if the original QString is destroyed, its data will persist until all other copies are destroyed.

alterratz

Hi again,

Thank you for your answer. I think i have to clarify the question:

Yes, if you copy them or assign them to each other they share the same data. However, when you parse a string and use independent strings to for instantiation the created QString objects cannot share the data.

i.e.
@
QString a("test");
QString b(a);
@ => same data used for a and b, nice

But
@
QString t("Hello World, Hello Qt");
QString x(t.midRef(0,5).toString());
QString y(t.midRef(13,5).toString());
@ => x and y have the same content but do of course not share the data

Therefore my idea is to create a collection of different strings, look the string up with some value and use the QString from the collection object to instantiate a new string. This takes some overhead but hopefully gives me shared data in return. If collection does not contain the given string, a new String is added to the collection and used to create other Strings with the same value and shared data.

Creating some map of signals has to be done online since the symbols may change. Also the data is quite big so that this approach might be necessary.

Best Regards,
Bernhard

Asperamanca

If you work with substrings a lot, you might consider using QStringRef.

alterratz

I am using QStringRef as far as possible, but ultimately it does not solve the problem. I am reading files of about 16M, sucking them into a QString as whole, extracting some Information with QRegularExpression, using capturedRef(), storing it, and then throwing away the string. Now I want to store the extracted Information as efficient as possible and therefore I need something like a string pool. I had hoped someone on the Forum has done this before and checked the efficiency of the proposed Solution.

NicuPopescu

I think you should use contains() which has less overhead for checking a set's node presence, since it does not do a copy of QSet's iterator class as in your solution and then proceeds with insertion, even though insertions will do a find (calculate the hash function) again ... this method should be faster when there are dupliacted info substrings

JKSH

Ah, I see. I misunderstood your original question.

Here's an alternative solution, which bypasses the QSet layer.
@
void takeFromPool_solution3(QString& target, const QString& value)
{
QString& item = hash[value];
if (item.isNull()) {
item = value;
}
target = item;
}
@

We know that

QHash::operator returns a default-constructed value if it doesn't contain the key.

A default-constructed QString is null.

All of your parsed strings are not null.

Therefore, you get a null string in line 3 if and only if you have never inserted this value before.

I believe this is less computationally expensive than using QSet and iterators, but I can't say for sure. Profile them all and compare them :)

P.S. Use a const ref for your 2nd argument. Otherwise, a copy is made unnecessarily.

[quote][QSet] should be faster to find Strings than using an Array or some other container[/quote]Agreed.

[quote]when using a map reference counting and auto deletion could also be implemented (which I do not require for my purpose)[/quote]Sounds like an overkill.

SGaist

Hi,

IIRC

@hash[value];@

also insert a default constructed value in the hash so its size will grow for nothing.

Using QHash::value(const QString &key) would make more sense, wouldn't it ?

NicuPopescu

QSet uses already a QHash internally by pairing keys with dummy values

@struct QHashDummyValue
{
};@

so it should be as fast as it could be a similar construct with QHash: for values there is nothing to be stored ... imho there is no sense to replace QSet with QHash and in fact to reinvent what set already does :)

JKSH

[quote author="SGaist" date="1386866581"]Hi,

IIRC

@hash[value];@

also insert a default constructed value in the hash so its size will grow for nothing.

Using QHash::value(const QString &key) would make more sense, wouldn't it ?[/quote]Hi,

If value isn't in hash, then...

Both hash[value] and hash.value(value) cause a null string to be constructed.
A new element needs to be added into hash anyway.

The QHash::value() approach will require QHash::insert(), which calls find(). The idea behind QHash::operator[] is to do the insertion via "item = value;", which doesn't need another find().

[quote author="NicuPopescu" date="1386868141"]so it should be as fast as it could be a similar construct with QHash: for values there is nothing to be stored ... imho there is no sense to replace QSet with QHash and in fact to reinvent what set already does :) [/quote]I'm wondering if there's any difference between direct references vs. iterators.

It's an academic exercise for me too. alterratz is looking for extreme optimizations for large data sets, so I think this problem is a good candidate for trying different little tweaks that usually aren't worth investigating ;)

alterratz

Thank your for an interesting Discussion. I hope I will find the time to run a few tests in the near future and report back the results.

alterratz

And the results are in:

I tested the following implementations:

@
QHash<QString,QString> stringpool1;
QMap<QString,QString> stringpool2;
QStringList stringList1;
QStringList stringList2;
QStringList stringList3;

void takeFromPool1 (QString& target, const QString& value)
{
QString& item = stringpool1[value];
if (item.isNull()) {
item = value;
}
target = item;
}

void takeFromPool2 (QString& target, const QString& value)
{
QString& item = stringpool2[value];
if (item.isNull()) {
item = value;
}
target = item;
}

..... (test example)

foreach (s, a2ldata.split(QRegular[removed]"\s+"), QString::SkipEmptyParts)) {
++i;
takeFromPool1(t, s);
stringList1.append(t);
}

@

Where I used a Hash and a Set, both with operator[]. In the first two tests I took a large text file (10+ times war and peace with approx. 30MB, removed all interpunctation) and appended each word to a string list, the first using the HASH, the second using the map. But look at the memory usage for the third run, where I appended the string directly to a QStringList. The memory usage goes through the roof while the execution time is faster than for both pool-based methods (no search of course but lots of memory allocations). Test cases 4 and 5 are just re-inserting (or merely lookups to be exactly) of all strings in the hash respectively map.

However the Hash performs considerably better regarding to execution time as well as memory use, so thank you all again for your help. However, i find it interesting than about 46 Bytes have been used for the second execution using the hash while no memory has been used for the set.

And here are the Results:
@
Starting /home/bernhard/work/qt_projects/0_build/QStringPoolTest/QStringPoolTest...
clock resolution: 1 ns

"data size is 30.324669 MB, loading took 175 ms"

QHash
"added 5662950 items in 5503 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 2.510651"

QMap
"added 11325900 items in 8010 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 2.266266"

QStringList
"added 16988850 items in 4620 ms"
"items stored: stringpool3: n.a. stringlist1: 5662950"
"total MB used: 270.009277"

QHash
"re-added 22651800 items to pool only in 5092 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 0.000046"

QMap
"re-added 28314750 items to pool only in 7597 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 0.000000"
@

P.S. For QStringList it should of course read stringlist3. This is just a typo in the qDebug string. The element count of the correct QStringList is printed.

If anyone is interested in playing around with it, everything, including a nice transcript of war and peace, can be found here:
http://rapidshare.com/share/7988DF3A0053D9F30284B48C207DE628

However, unlike QT, it is not cross-platform and works only in linux, since I do not know how to measure execution time and memory in windows.

An interesting Problem I have fond is that in some cases a QStringRef in combination with a QRegularExpression is a real performance killer (~700s instead of 800ms) , but I'll post this in a new thread, including some performance measurements, after I verify that it has not been reported as a bug already, and of course generate some test data run the measurements.

Best regards,
bernhard

NicuPopescu

bq. Where I used a Hash and a Set

where QSet in your test? QMap does ordering, QHash doesn't ... we have discussed QSet vs QHash as far as I see

alterratz

Late & Drunk programming => bad Idea ;)

@
QHash<QString,QString> stringpool1;
QHash<QString,char> stringpool2;
QSet<QString> stringpool3;

void takeFromPool1 (QString& target, const QString& value)
{
QString& item = stringpool1[value];
if (item.isNull()) {
item = value;
}
target = item;
}

void takeFromPool2 (QString& target, const QString& value)
{
QHash<QString,char>::const_iterator item = stringpool2.find(value);
if (item == stringpool2.constEnd()) {
item = stringpool2.insert(value, 0);
}
target = item.key();
}

void takeFromPool3 (QString& target, const QString& value)
{
QSet<QString>::const_iterator item = stringpool3.find(value);
if (item == stringpool3.constEnd()) {
item = stringpool3.insert(value);
}
target = *item;
}
@

and TADA the results:
@
clock resolution: 1 ns

"data size is 30.324669 MB, loading took 173 ms"

QHash operator[]
"added 5662950 items in 5616 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 2.510666"

QHash (find, insert)
"added 5662950 items in 5700 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 2.516357"

QSet
"added 5662950 items in 5818 ms"
"items stored: stringpool3: 22788 stringlist3: 5662950"
"total MB used: 2.171280"

QStringList
"added 5662950 items in 4761 ms"
"items stored: stringpool4: n.a. stringlist4: 5662950"
"total MB used: 270.059021"

QHash operator[]
"re-added 5662950 items to pool only in 5671 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 0.000046"

QHash (find, insert)
"re-added 5662950 items to pool only in 5265 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 0.000000"

QSet
"re-added 5662950 items to pool only in 5563 ms"
"items stored: stringpool3: 22788 stringlist3: 5662950"
"total MB used: 0.000000"
@

However, when reviewing the results I cannot belive the memory measurements. 2.5 MB for 5*10^6 Words is far too small. This cannot be correct. However, I belive that the memory usage of the pool is considerably lower than with the normal StringList. I used the mallinfo call to check the memory consumption but it seels than something different is required here. Maybe valgrind ....

UPDATE:
the (hopefully) corrected memory sizes:
"total MB used: 65.84" for both QHash implementations
"total MB used: 65.67" for QSet
"total MB used: 252.76" for QString
using "this tool":http://panthema.net/2013/malloc_count/ instead of mallinfo.