Performance improvements in Qt5?
-
I tested a simple wordcount app with Qt5 and noticed a significant performance improvement.
The app parses text, divides the text into unique words and their usage count using user determined separator characters.
The file I used for benchmarking is a whooping 77.7 MB, and here is the result:
Qt4
!http://i48.tinypic.com/ofndzb.png(qt4)!Qt5
!http://i48.tinypic.com/2coj3ih.png(qt5)!As you can see, the Qt5 version is more than twice as fast.
So what is this due to? Move semantics? The compiler (Qt5 is MSVC2012 x64, Qt4 is MinGW x86)? Or some other improvement in Qt5?
On a side note, I am not using all that much of Qt in this app - besides the GUI stuff there is only QString and QMap.
-
Hi!
Difference between x64 and x86 is gigantic. x64 is much much faster. So here is your main reason for speed difference. Also compilers are different. I find MSVC compiler slightly faster than MinGWs.
Also there is disk cache. I/O from disk is extremely slow. And if you first ran Qt 4 app and then Qt 5, that also helped Qt 5 a little bit, cause file was cached ( it was a tiny difference here).
But it'd be interesting to see, what are the results under same circumstances :) .
-
How (often) do you update the table view? This might boil down to an improvement in the painting system. If you really want to get to the root of the improvement (which would surely be of interest to everyone here), you should benchmark specific parts with QElapsedTimer. It can achieve microsecond accuracy or better. At least separate the disk I/O, the processing algorithm and the GUI updates. As Jake said, disk caches can significantly influence the outcome of your benchmark, that's why I suggest separating disk I/O (i.e. load all data into RAM before processing, not while). Processor caches are another issue but since you're comparing two different compilations (thus not running within one process lifetime), there should be no bias there.
Easiest solution: use valgrind. But since you mention MSVC, I don't think it's available to you.
-
DerManu - I only time the text processing stage, the table view, item models and all that stuff is done afterwards. Also, I have eliminated disk IO by copying everything to the RAM and reading from there.
Some pseudo code to the part that I am timing:
@void processChar(const QChar ch)
if ch is not contained within the separators QString
if there is no current word, say there is
append ch to the current word
else (if ch is a separator)
if there is a current word
if the current word is not contained within the map
add to the map with value 1
else (if word already exists)
find the key and increment its value by 1
else (skip)@So the performance improvement comes entirely from within this small function, which does nothing but append to a pre-allocated QString, search another QString for specific characters and search the map for a specific key.
Jake007 - the primary motivation for x64 was to address more memory. Direct performance improvements from porting a 32bit app to 64bit are miniscule at best. You can expect serious performance increase only in cases you really need much memory or you extensively process 64bit primitives.
I suspect there is a much higher performance delta between MinGW and MSVC than between 32 and 64 bit. But still, most performance differences come from vendor specific implementations of the STL rather than the efficiency of the generated binary. And I am not using STL at all.
-
On the other hand, I keep forgetting it is not a matter of MinGW vs MSVC, but a matter of a pretty much ANCIENT MinGW that ships with Qt.
I doubt the SDK compiler supports AVX, while the MSVC2012 supports those for sure.
So my uneducated guess is the old GCC compiler does auto vectorization to 128bit SSE, while the MSVC2012 vectorizes to 256bit AVX - which perfectly explains the 2x increase of performance.
In short, this unexpected performance boost is not due to Qt5 but due to no longer using the outdated compiler from Qt4.
-
That might be true. GCC for the Microsoft® Windows® Qt SDK crowd is almost four years old. Poor guys. That's GCC 4.4, but AVX is supported since 4.6 (two years). I wonder why the Qt developers would let something like this happen. It really makes Qt applications on Microsoft® Windows® appear in a bad light very unnecessarily. On GNU/Linux you get GCC 4.7.2, with up-to-date distros.
Anyhow, yeah, now that's indeed one of the few acceptable reasons to stick with msvc for some time on that OS. However, compiling Qt yourself with a more modern version of MinGW should work smoothly, right?
Some comments on your code (you might have implemented these techniques already, but stripped them off the pseudocode for brevity):
- To check whether a character is a separator, use a bool lookup array: bool separatorLookup[ 256]. In the initialization phase of your algorithm, you once fill that array with corresponding true/false values corresponding to the integer representation of every possible char value. Then, all you need to do, is use a char directly as index of that array, to check whether it's a separator or not – this is lightning fast and that's how (good) parsers do it.
- Use a QHash instead of a QMap for your word histogram. Judging from your pseudocode, it will be a better fit for the job, and thus faster. If you want to have it sorted afterwards by key (as the QMap does by design), export the hash's contents to a QMap or other sortable container once in the end. (But I'm guessing you'd want it sorted by value anyway.)
-
You are right, it is much faster with QHash, and since I do sorting using the standard item model and table view I don't really need the map. Just switching to a QHash reduced the time to 4.2 sec, so it is safe to assume in my average case, the lookup for QHash is at least twice as fast.
As for using a 256 byte array for the separators - I will have to investigate that further. I intend to support wider than 8bit characters, QString itself is 16bit, and so far so good, since the array would be 64k and neatly fit inside a single cache line, but for potential 32bit characters I think keeping only separators confined in a string will be the more cache friendly and less cache polluting method, considering a 32bit lookup array of bools is like 4 GB...
Overall I am quite happy with the performance, for a single thread I currently get about 17 million characters per second, and I'll make the whole thing multithreaded.
-
[quote author="utcenter" date="1356787694"]As for using a 256 byte array for the separators - I will have to investigate that further. I intend to support wider than 8bit characters[/quote]
Then you might also want to try a QHash for this. The key to speed here is calculating the appropriate lookup index directly from the character. The most direct calculation is just transforming the character itself to an int (that's the bool array solution). another, less memory intensive way is hashing the character and using the hash (itself just an integer) as index. For your 32-bit example, this means not using that 32-bit number as an index to a huge bool array, but using that 32-bit number modulo x. The hashing function here is the modulo function (very fast and good enough for the job). where x is a fitting number to make the lookup table small enough to be reasonable, and on the other hand large enough to keep collisions at a minimum. This is what QHash does for you behind the scenes, and why it's so extremely fast.
So to summarize: Use @QHash<QChar, bool> separatorLookup;@
-
Now I "cache" the characters from the separators string into a 16bit lookup array - performance increased to 25-27 million characters per second, plus it is no longer bound to the number of separator characters since parsing the string is done only once.
-
Nice. These were the obvious flaws in your algorithm. Now it gets a bit more interesting, if you want more performance. (My gut feeling tells me you can push this below one second, but probably not without ditching Qt and going to low level C.)
Some thoughts:
- Make your function inline nicely. Something that gets called so often as "processChar", shouldn't need a context switch.
- remove that const infront of the QChar. (Seems crazy but for such small datatypes, this helps, sometimes – a bit of compiler magic here. You'd need to look at the asm output to find out systematically.)
- The part:
@ if ch is not contained within the separators QString
if there is no current word, say there is
append ch to the current word@might be faster when just saving integer values of the positions of the separator, and then extracting the word (once found) via yourString.mid(...). If you do this cleverly, you will save one intermediate value copy of each word (memory or even cache access is expensive compared to CPU registers. That integer would probably be held in a register for the whole time). In high performance code, appending to a string in the inner loop is insane.
-
I think the compiler does a fairly good job at inlining on its own, since doing it explicitly results in nothing, either that, or the method is too complex to inline.
bq. In high performance code, appending to a string in the inner loop is insane.
I reserve plenty of space to ensure I avoid potential reallocation.
What I found to be odd is that passing a QChar reference is actually faster than passing a copy, even thou I am using a 64bit build and QChar is only 16bit, worst case scenario performance should be the same but it takes a small hit when passing by value.
I compared this algorithm to the performance of word count in MS Word. Keeping in mind word only counts characters and words and does not even go into the details of unique words and word usage count, analyzing the same text in MS Word takes about 4-5 seconds, for my algorithm it takes only 230 msec, and I still have the option to make it multithreaded.
But hey, I am pretty happy with having "ONLY" 25x times better performance than a top tier professional text editor that doesn't even do as much work :)