Using QSet<QString> to create a poor mans string pool?
-
I am using QStringRef as far as possible, but ultimately it does not solve the problem. I am reading files of about 16M, sucking them into a QString as whole, extracting some Information with QRegularExpression, using capturedRef(), storing it, and then throwing away the string. Now I want to store the extracted Information as efficient as possible and therefore I need something like a string pool. I had hoped someone on the Forum has done this before and checked the efficiency of the proposed Solution.
-
I think you should use contains() which has less overhead for checking a set's node presence, since it does not do a copy of QSet's iterator class as in your solution and then proceeds with insertion, even though insertions will do a find (calculate the hash function) again ... this method should be faster when there are dupliacted info substrings
-
Ah, I see. I misunderstood your original question.
Here's an alternative solution, which bypasses the QSet layer.
@
void takeFromPool_solution3(QString& target, const QString& value)
{
QString& item = hash[value];
if (item.isNull()) {
item = value;
}
target = item;
}
@We know that
QHash::operator returns a default-constructed value if it doesn't contain the key.
A default-constructed QString is null.
All of your parsed strings are not null.
Therefore, you get a null string in line 3 if and only if you have never inserted this value before.
I believe this is less computationally expensive than using QSet and iterators, but I can't say for sure. Profile them all and compare them :)
P.S. Use a const ref for your 2nd argument. Otherwise, a copy is made unnecessarily.
[quote][QSet] should be faster to find Strings than using an Array or some other container[/quote]Agreed.
[quote]when using a map reference counting and auto deletion could also be implemented (which I do not require for my purpose)[/quote]Sounds like an overkill.
-
Hi,
IIRC
@hash[value];@
also insert a default constructed value in the hash so its size will grow for nothing.
Using QHash::value(const QString &key) would make more sense, wouldn't it ?
-
QSet uses already a QHash internally by pairing keys with dummy values
@struct QHashDummyValue
{
};@so it should be as fast as it could be a similar construct with QHash: for values there is nothing to be stored ... imho there is no sense to replace QSet with QHash and in fact to reinvent what set already does :)
-
[quote author="SGaist" date="1386866581"]Hi,
IIRC
@hash[value];@
also insert a default constructed value in the hash so its size will grow for nothing.
Using QHash::value(const QString &key) would make more sense, wouldn't it ?[/quote]Hi,
If value isn't in hash, then...
- Both hash[value] and hash.value(value) cause a null string to be constructed.
- A new element needs to be added into hash anyway.
The QHash::value() approach will require QHash::insert(), which calls find(). The idea behind QHash::operator[] is to do the insertion via "item = value;", which doesn't need another find().
[quote author="NicuPopescu" date="1386868141"]so it should be as fast as it could be a similar construct with QHash: for values there is nothing to be stored ... imho there is no sense to replace QSet with QHash and in fact to reinvent what set already does :) [/quote]I'm wondering if there's any difference between direct references vs. iterators.
It's an academic exercise for me too. alterratz is looking for extreme optimizations for large data sets, so I think this problem is a good candidate for trying different little tweaks that usually aren't worth investigating ;)
-
Thank your for an interesting Discussion. I hope I will find the time to run a few tests in the near future and report back the results.
-
And the results are in:
I tested the following implementations:
@
QHash<QString,QString> stringpool1;
QMap<QString,QString> stringpool2;
QStringList stringList1;
QStringList stringList2;
QStringList stringList3;void takeFromPool1 (QString& target, const QString& value)
{
QString& item = stringpool1[value];
if (item.isNull()) {
item = value;
}
target = item;
}void takeFromPool2 (QString& target, const QString& value)
{
QString& item = stringpool2[value];
if (item.isNull()) {
item = value;
}
target = item;
}..... (test example)
foreach (s, a2ldata.split(QRegular[removed]"\s+"), QString::SkipEmptyParts)) {
++i;
takeFromPool1(t, s);
stringList1.append(t);
}@
Where I used a Hash and a Set, both with operator[]. In the first two tests I took a large text file (10+ times war and peace with approx. 30MB, removed all interpunctation) and appended each word to a string list, the first using the HASH, the second using the map. But look at the memory usage for the third run, where I appended the string directly to a QStringList. The memory usage goes through the roof while the execution time is faster than for both pool-based methods (no search of course but lots of memory allocations). Test cases 4 and 5 are just re-inserting (or merely lookups to be exactly) of all strings in the hash respectively map.
However the Hash performs considerably better regarding to execution time as well as memory use, so thank you all again for your help. However, i find it interesting than about 46 Bytes have been used for the second execution using the hash while no memory has been used for the set.
And here are the Results:
@
Starting /home/bernhard/work/qt_projects/0_build/QStringPoolTest/QStringPoolTest...
clock resolution: 1 ns"data size is 30.324669 MB, loading took 175 ms"
QHash
"added 5662950 items in 5503 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 2.510651"QMap
"added 11325900 items in 8010 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 2.266266"QStringList
"added 16988850 items in 4620 ms"
"items stored: stringpool3: n.a. stringlist1: 5662950"
"total MB used: 270.009277"QHash
"re-added 22651800 items to pool only in 5092 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 0.000046"QMap
"re-added 28314750 items to pool only in 7597 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 0.000000"
@P.S. For QStringList it should of course read stringlist3. This is just a typo in the qDebug string. The element count of the correct QStringList is printed.
If anyone is interested in playing around with it, everything, including a nice transcript of war and peace, can be found here:
http://rapidshare.com/share/7988DF3A0053D9F30284B48C207DE628However, unlike QT, it is not cross-platform and works only in linux, since I do not know how to measure execution time and memory in windows.
An interesting Problem I have fond is that in some cases a QStringRef in combination with a QRegularExpression is a real performance killer (~700s instead of 800ms) , but I'll post this in a new thread, including some performance measurements, after I verify that it has not been reported as a bug already, and of course generate some test data run the measurements.
Best regards,
bernhard -
bq. Where I used a Hash and a Set
where QSet in your test? QMap does ordering, QHash doesn't ... we have discussed QSet vs QHash as far as I see
-
Late & Drunk programming => bad Idea ;)
@
QHash<QString,QString> stringpool1;
QHash<QString,char> stringpool2;
QSet<QString> stringpool3;void takeFromPool1 (QString& target, const QString& value)
{
QString& item = stringpool1[value];
if (item.isNull()) {
item = value;
}
target = item;
}void takeFromPool2 (QString& target, const QString& value)
{
QHash<QString,char>::const_iterator item = stringpool2.find(value);
if (item == stringpool2.constEnd()) {
item = stringpool2.insert(value, 0);
}
target = item.key();
}void takeFromPool3 (QString& target, const QString& value)
{
QSet<QString>::const_iterator item = stringpool3.find(value);
if (item == stringpool3.constEnd()) {
item = stringpool3.insert(value);
}
target = *item;
}
@and TADA the results:
@
clock resolution: 1 ns"data size is 30.324669 MB, loading took 173 ms"
QHash operator[]
"added 5662950 items in 5616 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 2.510666"QHash (find, insert)
"added 5662950 items in 5700 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 2.516357"QSet
"added 5662950 items in 5818 ms"
"items stored: stringpool3: 22788 stringlist3: 5662950"
"total MB used: 2.171280"QStringList
"added 5662950 items in 4761 ms"
"items stored: stringpool4: n.a. stringlist4: 5662950"
"total MB used: 270.059021"QHash operator[]
"re-added 5662950 items to pool only in 5671 ms"
"items stored: stringpool1: 22788 stringlist1: 5662950"
"total MB used: 0.000046"QHash (find, insert)
"re-added 5662950 items to pool only in 5265 ms"
"items stored: stringpool2: 22788 stringlist2: 5662950"
"total MB used: 0.000000"QSet
"re-added 5662950 items to pool only in 5563 ms"
"items stored: stringpool3: 22788 stringlist3: 5662950"
"total MB used: 0.000000"
@However, when reviewing the results I cannot belive the memory measurements. 2.5 MB for 5*10^6 Words is far too small. This cannot be correct. However, I belive that the memory usage of the pool is considerably lower than with the normal StringList. I used the mallinfo call to check the memory consumption but it seels than something different is required here. Maybe valgrind ....
UPDATE:
the (hopefully) corrected memory sizes:
"total MB used: 65.84" for both QHash implementations
"total MB used: 65.67" for QSet
"total MB used: 252.76" for QString
using "this tool":http://panthema.net/2013/malloc_count/ instead of mallinfo.