In this specific case, the multi-threading overhead (semaphore management) is noticeable because the tasks are short (under 50 ms each). One way to improve the code is, of course, to merge tasks into batches of 10 to 1000 jobs, which reduces the impact of the overhead at the cost of extra code complexity.
You definitely need that, no matter which threading technology you choose. Such small tasks are inefficient for many reasons, not only because of the synchronization primitives. Batch them up contiguously in memory in a vector/array (prefer std::vector here, to avoid Qt's d-ptr indirection), so you don't get constant cache misses and you stay away from heap (re)allocations. Aside from that, you may want to explore lock-free queues [1, 2], although I really doubt the benefit will be significant, especially if you batch the data first.
If I showed you the algorithm, I don't think you would spot the bottlenecks at first sight.
Don't be so sure; you don't know what I do. In any case the point is moot, since you can't disclose any code.