OpenMP optimization
-
Hi,
I'm confused about how performance scales with the number of OMP threads.
With 12 threads the runtime is 3 min 4 s.
With a single thread it is 4 min 22 s.
The code does a lot of 32-bit memory reads and writes at random locations spread over many gigabytes of RAM.
I have optimized to bring the access points closer together, but cache misses should still be the main bottleneck. However, if I run 2 or more copies of the same app as separate processes, each with fewer OMP threads, they don't slow each other down significantly and together finish more work in the same time.
Before bringing the memory access points closer, speed scaled quite well with OMP, but after many optimizations that improved overall performance, adding more threads no longer seems to help much (MinGW 7.3 and Qt 5.12).
Yet the computer appears to have spare capacity. How else could 2 separate processes get more done?
The OMP loops are placed as far out as possible and there should be no memory collisions. Mostly using:
#pragma omp parallel for schedule (dynamic)
What might I be doing wrong that prevents threading performance in a single app from improving in this scenario?
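For reference, a simplified sketch of the kind of loop in question (not the real code; the container, index table and numbers are placeholders):

#include <cstddef>
#include <cstdint>
#include <vector>

// Not the real code: an outer parallel loop doing 32-bit reads/writes at
// pseudo-random offsets into a multi-gigabyte buffer. The index table is
// assumed to contain unique positions, so no two threads touch the same element.
void process(std::vector<uint32_t> &data, const std::vector<std::size_t> &indices)
{
    #pragma omp parallel for schedule (dynamic)
    for (long long i = 0; i < (long long)indices.size(); ++i)
        data[indices[i]] += 1u;   // nearly every access is a cache miss
}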
-
Hi,
Adding a big number of threads is usually not the correct answer. There is a sweet spot after which the benefit plateaus and may even go backward. Where that spot lies depends partly on your hardware, the operations you are doing, and how you access your data.
-
When adding a lot of threads to your computation, the memory access patterns become the most crucial component to tune.
Even if you're not accessing the same specific addresses from multiple threads, remember that CPU cores work with cache lines. If you read and write addresses that are packed closely together, you might be hitting a cache-line synchronization problem (false sharing), where the more threads you add the slower it gets, because keeping the cache lines in sync becomes the dominant cost.
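A minimal illustration of that effect (hypothetical example, assuming 64-byte cache lines): each thread only increments its own counter, but the counters sit next to each other in one array, so they share cache lines and every write forces the line to bounce between cores.

#include <cstdint>
#include <omp.h>

// counts[] has one slot per thread; adjacent 8-byte slots share a 64-byte
// cache line, so writes from different threads invalidate each other's copy.
void count_large_values(const uint8_t *data, long long n, long long *counts)
{
    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        #pragma omp for
        for (long long i = 0; i < n; ++i)
            if (data[i] > 127)
                ++counts[t];   // false sharing: the line ping-pongs between cores
    }
}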
On the other hand, if you have a lot of random reads/writes across a large address space with many threads, you could simply be running out of cache faster than it can be refilled.
In general it's not as simple as throwing more threads at a problem. You have to tune cache utilization very carefully. Get to know your target hardware: how big the cache lines are, how much cache there is at each level, exactly how much of it your computation uses, and how cache lines are synchronized between cores and core clusters. With that knowledge, tune your data structures and access patterns so that threads don't consume more cache lines than they absolutely need and don't share cache lines unless they absolutely have to. At high thread counts this can be counterintuitive: bringing data closer together can result in worse performance because of cache-line sharing and sync congestion.
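For example, one way to apply that to the counter sketch above (again assuming 64-byte cache lines) is to pad each thread's slot out to a full cache line, so no two threads ever write to the same line:

#include <cstdint>
#include <vector>
#include <omp.h>

// Padding each counter to 64 bytes keeps every thread's slot on its own
// cache line, so the writes no longer interfere with each other.
struct PaddedCounter {
    long long value = 0;
    char pad[64 - sizeof(long long)];
};

long long count_large_values_padded(const uint8_t *data, long long n)
{
    std::vector<PaddedCounter> counts(omp_get_max_threads());
    #pragma omp parallel
    {
        PaddedCounter &mine = counts[omp_get_thread_num()];
        #pragma omp for
        for (long long i = 0; i < n; ++i)
            if (data[i] > 127)
                ++mine.value;   // slots are 64 bytes apart: no shared lines
    }
    long long total = 0;
    for (const PaddedCounter &c : counts)
        total += c.value;
    return total;
}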
As to when multiple processes can perform better than more threads - one common cause is hidden access to some shared state. This is not always obvious. It might be hiding under a system or library call that does some synchronization you don't even know or think about. It can be hidden in memory allocation, when threads are congested on the memory manager. It might be a mutex or other shared-memory synchronization. If you allocate a lot of small chunks, memory or address space fragmentation also becomes a factor as each new becomes more and more expensive, and so on.
In any case, tooling is very important - multithreading is hard and often defies intuition. Don't guess. Arm yourself with a good profiler for your platform, one that understands and helps you observe cache effects.
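As a sketch of the allocation case (hypothetical code): declaring a reusable scratch buffer once per thread, inside the parallel region but outside the loop, avoids hammering the memory manager from every iteration.

#include <vector>

// Hypothetical example: if the scratch vector were declared inside the loop,
// every iteration would go through the global memory manager and the threads
// would serialize on it. One buffer per thread removes that hidden shared state.
void transform_all(const std::vector<std::vector<float>> &inputs,
                   std::vector<float> &results)
{
    #pragma omp parallel
    {
        std::vector<float> scratch;   // one buffer per thread, allocated once
        scratch.reserve(4096);

        #pragma omp for schedule (dynamic)
        for (long long i = 0; i < (long long)inputs.size(); ++i) {
            scratch.assign(inputs[i].begin(), inputs[i].end());
            // ... per-item work on scratch ...
            results[i] = scratch.empty() ? 0.0f : scratch.front();
        }
    }
}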
-
@Chris-Kawa said in OpenMP optimization:
Even if you're not accessing the same specific addresses from multiple threads you have to remember that CPU cores work with cache lines. If you read and write to addresses that are closely packed together you might be hitting a cache line sync issue, where the more threads you add the slower it becomes, because cache line sync becomes prevalent.
Thanks. Optimization appears to be more complex than I thought.
Running 4 copies of the program with 2 threads each (same kind of config) gets >3x the work done compared to a single run that uses 8 cores.
Before the speed optimizations and code changes, OMP scaled more linearly, but overall performance was worse than the current single-threaded version. Optimization flags used: -O3, -march=native, -mavx.
The optimizations so far may have made the memory access patterns more linear.
But some data and configurations run faster with the outer loops parallelized (or with that data accessed more linearly), while others run faster with the inner loops parallelized.
Could there be an auto-tuning library that optimizes the threading of multiple loops at runtime to gain performance with these memory access patterns?
-
@Q139 said:
Could there be an auto-tuning library
No. There are tools that can help identify problems, but there's no magic button. You have to do the work.
-
Most likely you need to have a look at the different schedules OpenMP provides. My guess is that a single loop iteration does not have enough work. For a dynamic schedule the default chunk size is 1, which means each thread grabs a single iteration of the loop at a time. If one iteration does not do enough work, say it takes 8 ns, then the synchronization between threads fighting over the next chunk can take over, maybe 36 ns, and that overhead eats up all your speed gains. Adding to that, if every loop iteration accesses the same array at index i, the neighbouring addresses i-1 and i+1 are being accessed by other threads, which gives massive problems with caching.
So, the first thing to try is to increase the chunk size for the schedule. I also suggest using a static schedule instead of a dynamic one to reduce communication overhead between threads. If the work within every iteration of the loop is (almost) the same, all threads will take the same time to process their chunks, so a static schedule will (most likely) distribute the work just as evenly as the dynamic one.
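As a sketch (the chunk size of 4096 is just a starting value to tune, not a recommendation):

#include <cstdint>

// Larger, contiguous chunks: less scheduling overhead per iteration, and
// neighbouring indices stay within the same thread.
void scale_all(const uint32_t *in, uint32_t *out, long long n)
{
    #pragma omp parallel for schedule (static, 4096)
    for (long long i = 0; i < n; ++i)
        out[i] = 2u * in[i];   // stand-in for the real per-iteration work
}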
-
The multiple apps getting more work done seems to have been related to less false sharing.
Running several copies of the app on an identical task with fewer threads each seems like a good trick for testing performance with reduced false sharing versus memory bandwidth.
I have now placed the for loops and data so that, for the portions that write data, the memory access distance between threads is maximized, while the reading portions keep accesses close together and avoid extra copies (roughly as in the sketch below).
Hopefully that is the right philosophy for multithreaded optimization on hardware with multiple cache levels.
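Simplified sketch of that layout (not the real code; names and the per-thread block split are placeholders): reads stay sequential over the shared input, while each thread writes only inside its own large contiguous block of the output, so write addresses of different threads stay far apart.

#include <vector>
#include <omp.h>

void process_blocks(const std::vector<float> &input, std::vector<float> &output)
{
    const int nthreads = omp_get_max_threads();
    const long long block = (long long)output.size() / nthreads;

    #pragma omp parallel
    {
        const int t = omp_get_thread_num();
        const long long begin = (long long)t * block;
        const long long end = (t == nthreads - 1) ? (long long)output.size()
                                                  : begin + block;
        // Reads are sequential over shared data; writes are confined to this
        // thread's own contiguous block, far from other threads' writes.
        for (long long i = begin; i < end; ++i)
            output[i] = input[i] * 0.5f;
    }
}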