"Getting an issue fixed" and "QHash result intersection performance"

silicomancer

@SGaist: I intentionally avoided posting the link here. Imagine more people would do this and would flood the forum with copies of bug tracker issues.

Anyway if you think it's the right place, here it is: https://bugreports.qt-project.org/browse/QTBUG-27013

It is about an algorithm I use that calls QHash::values(key) on multiple related QHash objects and intersects the results. The latter is not possible in accetable speed because QHash::values(key) returns an unordered QList<value> instead of QSet<value> which does not provide a fast intersect() method. QList::toSet() is my current workaround but slows down the code significantly. I have no idea why they preferred QList<value> for this since the data is unordered anyway.

@JKSH: No, actually I decided not to use self-build Qt. I instead invested some time into testing ANGLE on VMs and decided to switch to the ANGLE builds (which are provided in both x86 and x64 for VS2012).
The mailing list is a good idea. I will subscribe to it.

I wonder if there is any website collecting bounties for missing Qt features or bugs?!

SGaist

Are you thinking of something like:

@
QHash<int, QString> myHash;
QSet<QString> valuesSet = myHash.valuesSet(myKey);
@

?

Edit:Indeed it was QString I had in mind

silicomancer

Yes (providing you intended to write "QSet<QString> valuesSet"). That's exactly what I need. But I suppose there are a few more functions that could be improved in the same way.

Do you think that bounty idea is a good one?

NicuPopescu

bq. QList::toSet() is my current workaround but slows down the code significantly

if you investigated so much the issue tbh I wonder why you have not done your own "maptoset" or "hashtoset" templated functions? with the code similar like in toList qt's src a code as bellow is better than your workaround:

@template <class Key, class T>
QSet<T> maptoset(QMap<Key,T> map)
{

QSet<T> res;
res.reserve(map.size());
typename QMap<Key,T>::const_iterator i = map.begin();
while (i != map.end()) {
    res.insert(i.value());
    ++i;
}
return res;

}

template <class Key, class T>
QSet<T> hashtoset(QHash<Key,T> map)
{

QSet<T> res;
res.reserve(map.size());
typename QHash<Key,T>::const_iterator i = map.begin();
while (i != map.end()) {
    res.insert(i.value());
    ++i;
}
return res;

}
...

QHash<int,QString> myHash1;
QHash<int,QString> myHash2;
QHash<int,QString> myHash3;

myHash1[1] = "one";
myHash1[2] = "two";
myHash1[3] = "three";

myHash2[1] = "one";
myHash2[2] = "three";
myHash2[3] = "two";


myHash3[1] = "one";
myHash3[2] = "two";
myHash3[3] = "five";

QSet<QString> set1 = hashtoset(myHash1);
QSet<QString> set2 = hashtoset(myHash2);
QSet<QString> set3 = hashtoset(myHash3);

QSet<QString> setIntersect = set1.intersect(set2.intersect(set3));

qDebug()<<"interesection"<<setIntersect;

@

silicomancer

@NicuPopescu: Thanks for you suggestion. But that's not what I want to do. I am afraid you missed the point.

SGaist

Could you show an example of what you want to do ?

NicuPopescu

it was an answer to the unhappy workaround :) ... as regarding the main topic I let the other advanced people here to respond

ChrisW67

QMap::values() and QHash::values() can only sensibly return a QList. Both functions return all the values associated with the key specified in reverse insertion order (or all keys if no key is given). This list can, and in the general case will, contain duplicates. What you are proposing would have values() return a QSet where the values are the keys: this cannot contain duplicates. Your proposal might fit your particular data set but not the general case.

Your problem is how to efficiently find the intersection of several unordered lists. You convert each list to a QSet and then use intersect(), but that involves hashing every value several times. Have you tried the generic std::set_intersection approach?

@
#include <algorithm>

QList<int> first;
first << 35 << 10 << 15 << 20 << 15 << 35 << 30 << 35;
QList<int> second;
second << 10 << 45 << 35 << 35 << 35 << 35;

QList<int>::iterator it;
std::sort(first.begin(), first.end());
it = std::unique(first.begin(), first.end());
first.erase(it, first.end());

std::sort(second.begin(), second.end());
it = std::unique(second.begin(), second.end());
second.erase(it, second.end());

QList<int> result;
std::set_intersection (
first.begin(), first.end(),
second.begin(), second.end(),
std::back_inserter(result));
qDebug() <<"Result =" << result;
@

silicomancer

@ChrisW67: Your algorithm is much better than my workaround. It's about twice as fast. I will use it instead. Thank you very much!

But unfortunately it still is too slow for my application.

I would like to discuss my proposal. I think it would give this kind of data analysis a major speedup and is worth being considered. Additionally, using QHash::valuesSet(key) + QSet::intersect() would be much more easy to use and to understand.

I think having a QHash (or a QMultiHash in my case) with n:1 relation isn't unusual. Actually that is what was QMultiHash made for. There are a lot of applications where a fast n:1 lookup is needed. Simply most application that process large data amounts without involving a fully featured database. In these applications, valuesSet(key) would VERY useful. And even for a 1:n hash it could be useful to get a set of unique values. So I see no reason why valuesSet(key) (or similar) would be a bad idea. Specifically I don't think it is too specialized. Actually it would be probably more useful than the QList variant since getting the values in reverse insertion order is needed not very often on a QHash. But speed is a neck-breaker, especially for QHash

You say that creating a set involves expensive hashing. You are right. This is probably killing my code. But that hashing is only necessary using my workaround. QHash and QSet use the same hashing algorithm (I think QSet internally is a QHash), right? A Q(Multi)Hash::valuesSet(key) method could simply copy the found hashes from the source QHash to the result QSet. This would be surely quick. Then that would allow doing the desired fast QSet::intersect() without separate (expensive) std:unique and sort calls. I suppose that would be even much faster than your algorithm.

Are there any technical facts I am missing why this wouldn't work?

ChrisW67

Hashes are designed for rapid retrieval by key, i.e, 1 to n lookup. The keys are hashed to achieve the speed on large data sets (for small key counts it's not worth the effort). Whether there is one or many values for a given key is largely irrelevant: the values are simply stored in the correct hash bucket.

While QSet is based on a QHash what you are expecting as a result is a set built from the values for a given key in the source hash. The values are not hashed in the source QHash, so they will have to hashed in order to become the keys in the QSet internal hash table. Whether you do it internally or externally to QHash will make little performance difference.

The problem you described is finding all the values that are common to all the keys you started with. This tells you nothing about the keys to which these values belong that you did not already know at the outset. You are not determining the keys that contain a certain value, a task for a which a second, reversed hash would be suitable. Perhaps I do not have a grasp on the problem.

If you only ever need this structure for this query, or need the multiple values to be unique, then you could make the structure a map from key to QSet<valueType> so you build and store the sets as you go. You still have to hash all the values, but you move the cost from query to insertion time.

silicomancer

You are right, those are the wrong hashes. They would need to be calculated from the scratch. This would not work.

Actually the entire data structures used by me are pretty much optimized to speed up read accesses by moving the cost to insertion. This is ok and desired. So your map-of-sets idea is pretty good. I will definitely try this! I am not sure if it would be better to first merge those sub-sets to intersect that with the main result or to immediately intersect each of the sub-sets with the main result.

I think this solves my problem. But please allow me explore one more thought.

The problem with my intersection idea is the fact that QSet needs hashes for data storage. This is expensive. But QMap does not suffer from this drawback. What if someone would build something like a tree-based set, lets call it QTree. How would such a tree perform on my cascaded lookup-and-intersect problem? QMap and QHash do not replace each other - they both have their drawbacks and so they both exist. I would expect the same for QSet and something like QTree. Why does no such a container exist?

NicuPopescu

Hi,

first I missed your point, then I thought how to do hash tables as hash lists (sets with duplicates) from one key's multiple values and then a fast intersect function on them:

@template <class Key, class T>
QHash<T,T> HashToHashList(QHash<Key,T> hash)
{

QHash<T,T> res;
res.reserve(hash.size());
typename QHash<Key,T>::const_iterator i = hash.begin();
while (i != hash.end()) {
    res.insertMulti(i.value(),i.value());//difference from QSet which uses

//insert() ... now res is equivalent to QSet with duplicate values!
++i;
}
return res;
}

template <class T>
QHash<T,T> intersectHashLists(const QHash<T,T> &hash1,const QHash<T,T> &hash2)
{
QHash<T,T> copy1(hash1);
QHash<T,T> copy2(hash2);
typename QHash<T,T>::const_iterator i = hash1.constEnd();
while (i != hash1.constBegin()) {
--i;
if (!copy2.contains(*i))//fast look up by hash function
copy1.remove(*i);
}
return copy1;
}

...

QHash<int,QString> myHash1;
QHash<int,QString> myHash2;
QHash<int,QString> myHash3;

myHash1[1] = "one";
myHash1[1] = "one";
myHash1[2] = "two";
myHash1[3] = "three";
myHash1.insertMulti(1,"four");
myHash1.insertMulti(1,"eight");
myHash1.insertMulti(1,"five");
myHash1.insertMulti(1,"six");
myHash1.insertMulti(1,"one");

qDebug()<<"myHash1"<<myHash1;

myHash2[1] = "one";
myHash2[2] = "three";
myHash2[3] = "two";
myHash2.insertMulti(1,"four");
myHash2.insertMulti(1,"seven");
myHash2.insertMulti(1,"five");
myHash2.insertMulti(1,"six");

QHash<QString,QString> hashList1 = HashToHashList(myHash1);
QHash<QString,QString> hashList2 = HashToHashList(myHash2);

qDebug()<<"hashList1"<<hashList1;
qDebug()<<"hashList2"<<hashList2;

QHash<QString,QString> intersectionHashLists = intersectHashLists(hashList1,hashList2);

qDebug()<<"intersectionHashLists"<<intersectionHashLists;

...
@

QSet intert() source:
@ inline iterator insert(const T &value)
{ return static_cast<typename Hash::iterator>(q_hash.insert(value, QHashDummyValue()));//does not allow duplicates@

hope I have not missed the point again :)

L.E. brr ... I missed the key finding again but I'll follow up :)

silicomancer

@NicuPopescu: I think your new algorithm is more or less the same as I requested in my issue. The line "res.insertMulti(i.value(),i.value());" does pretty much the same as set.insert() since it also calculates a hash for each value (which is the responsible for making my requested algorithm slow too).

EDIT: And I missed your edit :-)

NicuPopescu

except that for copying from source to QHash or QSet involves calls to qhash function, there is an important difference in my approach: you can have duplicates, entire source will be found in new copy; I think the solution works well if you don't need to filter by a key which I missed to address in pursuing how to convert a container(I used QHash with mulitple values but it could be other) to a hash table for fast intersections ... edit: I was so sure that I found a possible solution and I realized after posting that I forgot to treat finding values by a key

to filter by a key is not so big deal with little condition: either to have access to hash data or at least to have access to hash iterator encapsulated QHashNode

now I have other wonder: since templated code in general is placed in headers, as it is the case for stl or qt container classes, why it couldn't be a solution to work with a local/project working copy slightly modified to meet your needs or even to do changes directly in qt sources, no need of qt rebuild ... all my efforts so far was to find a solution keeping qt source untouched

NicuPopescu

I slightly modified ..\Qt5.1.1\5.1.1\mingw48_32\include\QtCore\qhash.h

@template <class Key, class T>
Q_OUTOFLINE_TEMPLATE QHash<T,T> QHash<Key, T>::valuesHash(const Key &akey) const
{
QHash<T,T> res;
Node *node = *findNode(akey);
if (node != e) {
do {
res.insertMulti(node->value,node->value);
} while ((node = node->next) != e && node->key == akey);
}
return res;
}@

and added as member in QHash class and it works as expected ... hope so :)