[SOLVED] Multi-threading slower than single thread
I've implemented a neural network in which each neuron is an individual QRunnable. Each neuron only performs a simple calculation (a sigmoid function), but the neurons are executed a couple of million times. I'm running a VM with 4 cores. When I execute the neurons sequentially, the entire process takes 4 minutes. I start the neurons as follows:
@for(int i = 0; i < mNeurons.size(); ++i)
Then I tried to make my neural network multi-threaded by executing the neurons in parallel. Note that all neurons are created on startup. When using threads, my process needs about 45 minutes. This is how I run them:
@QThreadPool *pool = QThreadPool::globalInstance();
for(int i = 0; i < mNeurons.size(); ++i)
I thought there was something wrong with QThreadPool, so I decided to implement all neurons as separate threads and remove the pool. However, this didn't change the execution time: still 45 minutes. Does anyone know why it takes so much longer to execute when calling start() instead of run()? Is that much work done in start() before run() is executed?
There are many cases where multithreading will cost you performance instead of gaining you any. Your workload is probably too finely grained. A metaphor for this: 4 carpenters making tables will be 4 times faster if they each make their own table, but if they squeeze in to work simultaneously on one single table, they will obstruct each other and waste more time than a single carpenter would need to complete the table.
Having multiple workers is no guarantee of a performance increase if you don't assign them appropriate tasks.
To say more about how to speed up your implementation, I would need to see the design of your neurons, the run() method, and how the neurons relate to each other in the network. I've had performance drops from multithreading at too fine a granularity in the past, but nothing even close to the drop you are experiencing.
Thanks, I think that is exactly what happens. My threads all access the same synapses, so I had to implement a mutex. So (probably) constantly locking/unlocking a mutex to handle cross-thread access to the same object creates a huge overhead.
Well, you should try to go for a lock-free implementation. That will surely gain some performance, although I am not sure it will be enough to make multithreading feasible in this scenario.
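One way to get rid of the per-access mutex is to give each thread its own private accumulator and combine the results once at the end. The following is only a minimal sketch under assumptions: it uses plain std::thread instead of QThreadPool, and sumContributions and the idea of summing "synapse contributions" are hypothetical stand-ins, not the original poster's code. Strictly speaking this avoids sharing rather than being a lock-free algorithm, but it removes the lock from the hot loop.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum `values` (hypothetical synapse contributions) without a shared mutex:
// each thread accumulates privately and writes only to its own slot.
double sumContributions(const std::vector<double>& values, unsigned numThreads)
{
    assert(numThreads > 0);
    std::vector<double> partial(numThreads, 0.0);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&values, &partial, t, numThreads] {
            double local = 0.0; // thread-private, so no locking is needed
            for (std::size_t i = t; i < values.size(); i += numThreads)
                local += values[i];
            partial[t] = local; // a single write per thread
        });
    }

    // Wait for all workers before reading their results.
    for (std::thread& w : workers) w.join();

    // Combine the partial sums on the calling thread.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Whether this pays off depends on how much work each thread gets between merges; with only a handful of values per thread, the cost of creating the threads will still dominate.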
Maybe it will be better to divide the different neurons into different groups based on their structure and dependency on each other, if this is a possibility anyway. Then, with multiple groups you can run each group in a thread and only synchronize on group level.
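The grouping idea could look roughly like the sketch below. It assumes that the neurons of one layer do not depend on each other, so a layer can be split into contiguous groups, one per thread, with the join at the end as the only synchronization point. It uses plain std::thread rather than QThreadPool, and evaluateLayer and sigmoid are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one neuron's work: a sigmoid activation.
inline double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Evaluate one layer: each thread handles a contiguous group of neurons and
// writes only to its own slice of `outputs`, so no mutex is required.
std::vector<double> evaluateLayer(const std::vector<double>& inputs,
                                  unsigned numThreads)
{
    assert(numThreads > 0);
    std::vector<double> outputs(inputs.size());
    // Group size, rounded up so every neuron belongs to some group.
    const std::size_t chunk = (inputs.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const std::size_t begin = std::min(inputs.size(), t * chunk);
        const std::size_t end = std::min(inputs.size(), begin + chunk);
        if (begin == end) break; // fewer neurons than threads
        workers.emplace_back([&inputs, &outputs, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                outputs[i] = sigmoid(inputs[i]);
        });
    }

    // The only synchronization point: wait for every group to finish.
    for (std::thread& w : workers) w.join();
    return outputs;
}
```

With 4 cores you would call something like evaluateLayer(inputs, 4) once per layer, so the threading overhead is paid per group instead of per neuron.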