How to vectorize operation on struct of data
-
Also note that SIMD works on aligned data. Your struct has no alignment specification, so the compiler has to generate slower code for unaligned data. Notice the movdqu instruction. The u stands for "unaligned" and basically means it's slower, because the CPU must first align the data to an address the SIMD unit can work with. If you specify alignment for your struct to be "SIMD friendly", like this: struct alignas(128) str { ... then that instruction turns into movdqa, where a stands for "aligned", and the processor doesn't have to do the extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile.
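For illustration, a minimal sketch of the difference (the member layout here is just an example):

```cpp
#include <cstdint>

// No alignment specified: the compiler must assume an arbitrary address,
// so 128-bit loads come out as movdqu (unaligned).
struct str_plain {
    int32_t a, b, c, d, e, f, g, h;
};

// SIMD-friendly alignment: movdqa (aligned) becomes possible.
// alignas(16) matches SSE registers; AVX prefers 32, AVX-512 prefers 64.
struct alignas(16) str_simd {
    int32_t a, b, c, d, e, f, g, h;
};

static_assert(alignof(str_simd) == 16, "aligned for 128-bit SIMD loads");
```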
-
@Chris-Kawa Can you recommend good learning materials on SIMD-optimized C++ coding, or about optimized coding in general?
-
Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.
-
Addition does not seem like such a simple operation anymore.
When padding, its effect on memory usage and cache-line alignment has to be considered. One side of the operation is a struct from a long vector that is accessed quite randomly, with a low probability of contiguous memory accesses; the other side is a single struct instance in the function.
The random RAM access patterns are probably the main reason it runs slower. Do you know if the compiler at -O3 adds padding/alignment to a single struct instance in a function, or would the programmer need to specify it?
Since the accesses from RAM are quite random, would it speed things up if everything in the vector were padded, in your opinion, or are shift operations cheap?
-
@Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
Compiler Explorer is a nice tool to learn ASM and to peek under the hood. Using -O3:
If the struct consists of 4 items, it does the SIMD copy and add in fewer instructions.
If the struct consists of 6, it does SIMD on 4 and then 2 copy-and-add operations separately.
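Roughly what I was comparing, with simplified structs just for illustration (gcc -O3):

```cpp
struct vec4 { int a, b, c, d; };        // 16 bytes: fits one SSE register
struct vec6 { int a, b, c, d, e, f; };  // 24 bytes: one register + 8 bytes

void add4(vec4 &x, const vec4 &y) {
    // gcc -O3 turns this into a single vector add (movdqu + paddd)
    x.a += y.a; x.b += y.b; x.c += y.c; x.d += y.d;
}

void add6(vec6 &x, const vec6 &y) {
    // the first four lanes get vectorized, the last two stay scalar
    x.a += y.a; x.b += y.b; x.c += y.c;
    x.d += y.d; x.e += y.e; x.f += y.f;
}
```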
Is that probably the reason why it does 4+1+1, or does it just not use wider instructions for backward compatibility?
Is there some magic SIMD compiler flag I am missing?
-
@Q139 said in How to vectorize operation on struct of data:
but += is an addition operation, not only a copy.
then use '=' ... really that hard??
-
@Q139 said in How to vectorize operation on struct of data:
Do you know if the compiler at -O3 adds padding/alignment to a single struct instance in a function, or would the programmer need to specify it?
That's not allowed, since then you would not be able to mix it with other libs which do not align (due to a missing -On).
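A small illustration of the problem (hypothetical struct): alignment is part of the type's layout, so code compiled with and without it would disagree about sizes and array strides:

```cpp
#include <cstdio>

struct plain              { int a, b, c, d, e, f; }; // sizeof == 24
struct alignas(16) padded { int a, b, c, d, e, f; }; // rounded up to 32

int main() {
    // A library compiled against the unpadded layout would index an
    // array of these structs with the wrong stride.
    std::printf("%zu vs %zu\n", sizeof(plain), sizeof(padded));
}
```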
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
@Q139 said in How to vectorize operation on struct of data:
but += is an addition operation, not only a copy.
then use '=' ... really that hard??
I don't understand.
a.a += b.a; a.b += b.b; a.c += b.c; ... is the same as a.a = a.a + b.a; a.b = a.b + b.b; a.c = a.c + b.c; ...
-
@Christian-Ehrlicher said:
then use '=' ... really that hard??
I think you miss the point. OP is doing a += b. It's not the same as a = b, so he can't just replace that.
@Q139 said:
The random RAM access patterns are probably the main reason it runs slower.
If possible, you can try to prefetch the "random" data into a more local variable beforehand, but that's very case specific and there's no general solution for it. The ideal is linear access, always. It's sort of an unachievable goal that you use as a general direction, and the closer you get the better it will run. But there's always a margin close to the optimum in which it runs "good enough" and there's no point in further optimization, because it's masked by bigger inefficiencies in other parts of the system. E.g. optimizing a function to run in 0.01ms instead of 0.02ms is pointless when the next thing you do is I/O with >100ms latency. You must always choose your battles to seek the biggest overall gains and not obsess about a single detail.
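A sketch of that gather-then-process idea (the names are made up, and whether the extra copy pays off is exactly the kind of thing to measure):

```cpp
#include <cstddef>
#include <vector>

struct str { int a, b, c, d, e, f, g, h; };

// Gather the randomly indexed structs into a contiguous scratch buffer,
// then run the hot loop over linear memory only.
void accumulate(const std::vector<str> &data,
                const std::vector<std::size_t> &indices, str &acc)
{
    std::vector<str> scratch;
    scratch.reserve(indices.size());
    for (std::size_t i : indices)   // the scattered reads happen here...
        scratch.push_back(data[i]);
    for (const str &s : scratch) {  // ...so this loop is purely linear
        acc.a += s.a; acc.b += s.b; acc.c += s.c; acc.d += s.d;
        acc.e += s.e; acc.f += s.f; acc.g += s.g; acc.h += s.h;
    }
}
```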
Do you know if the compiler at -O3 adds padding/alignment to a single struct instance in a function, or would the programmer need to specify it?
The compiler can't mess with alignment. As @Christian-Ehrlicher said, it would generate incompatible data layouts from the same code.
would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?
It doesn't matter what I think. I can be right on one machine/architecture and wrong on another. When optimizing, test and measure on as many of your target machines as you can. Intuition is usually working against you when doing low-level stuff. Assemble a bag of tools and trust the numbers.
-
@Chris-Kawa said in How to vectorize operation on struct of data:
I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.
Ok, but '+=' is replaced with AVX instructions at -O3, so all is fine (esp. since there is still no validation that this is a bottleneck in his code).
/edit: and since it's a plain structure one can overload operator+=() and use a simple memcpy, which is optimized even at -O1.
-
@Christian-Ehrlicher I was running -O2 for a few months; yesterday I switched back to -O3 and got a ~20% speed increase.
Some time ago I did not notice a big difference between -O2 and -O3, but lately I have increased (doubled) the variable counts in multiple data structures.
The idea was that I could get back, with vectorization, a speed similar to what I had with fewer variables.
But cache misses still increase with more data. Scattered RAM access patterns are probably the biggest bottleneck.
Could a GPU architecture be good for speeding up an algorithm that accesses memory in a scattered fashion from all cores?
Also, thread locking is a problem if 2 or more threads take the same values, and fighting against it on 1000 cores seems like a bad idea.
-
To achieve similar code with -O2 instead of -O3 in this loop you can e.g. use
```cpp
// Assuming 'data' holds eight 32-bit ints (32 bytes = two 128-bit lanes).
// Requires <emmintrin.h> (SSE2); unaligned load/store intrinsics are used
// because the struct carries no alignment guarantee.
data &operator+=(const data &o) {
    __m128i *a1 = (__m128i *)this;
    const __m128i *a2 = (const __m128i *)&o;
    for (int i = 0; i < 2; ++i) {
        __m128i sum = _mm_add_epi32(_mm_loadu_si128(a1 + i),
                                    _mm_loadu_si128(a2 + i));
        _mm_storeu_si128(a1 + i, sum);
    }
    return *this;
}
```
But I would let the compiler do this job, and again would do some performance analysis before thinking about such stuff, to see where the hot spots really are.
-
@Q139 said:
Could a GPU architecture be good for speeding up an algorithm that accesses memory in a scattered fashion from all cores?
The problem with GPUs is that they usually have their own RAM, so you have to transfer the data over the bus to and from the GPU, and that can be a bottleneck too. GPUs are good if you can make the data stay on the GPU for most or all of the time. It's problematic if you need to juggle between CPU and GPU. There are architectures that use a unified memory structure (modern game consoles are one such example) where all RAM is shared by the CPU and GPU, but that's again a specific hardware situation.
Another consideration is that GPU memory is usually organized and accessed very differently from what the CPU does. GPUs work in waves, warps, clusters, tiles or whatever a given vendor calls it. The point being that the caching, synchronization and access patterns are more tightly constrained on GPUs, not less.
Also, thread locking is a problem if 2 or more threads take the same values, and fighting against it on 1000 cores seems like a bad idea.
Reading the same value is not a problem. Writing is. If you do accumulation, for example, the usual solution is to use partial accumulation on each thread and then do a sync point and an extra pass to accumulate the partial results. There are standard facilities added in recent C++ standards for such parallelism, e.g. the std::reduce algorithm, which can parallelize and vectorize accumulation. There are also many libraries that do this, like PPL or OpenMP. It's just a lot of testing and figuring out what works best for your case.
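A minimal sketch of that with the standard library (assumes C++17 parallel algorithms are available; the struct itself is made up):

```cpp
#include <execution>
#include <numeric>
#include <vector>

struct str {
    int a, b, c, d;
    str operator+(const str &o) const {
        return { a + o.a, b + o.b, c + o.c, d + o.d };
    }
};

str sum_all(const std::vector<str> &v)
{
    // std::reduce may split the range across threads and SIMD lanes,
    // accumulate partial sums and combine them at the end - which is
    // why operator+ has to be associative and commutative here.
    return std::reduce(std::execution::par_unseq,
                       v.begin(), v.end(), str{0, 0, 0, 0});
}
```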
-
@Chris-Kawa said in How to vectorize operation on struct of data:
The problem with GPUs is that they usually have their own RAM, so you have to transfer the data over the bus to and from the GPU, and that can be a bottleneck too
Transfers could be done rarely, to unload and refill GPU memory, or when blocks of the task are finished.
Also I'd assume that GPU threads can work on some memory while other memory locations are being transferred.
@Chris-Kawa said in How to vectorize operation on struct of data:
The point being that the caching, synchronization and access patterns are more tightly constrained on GPUs, not less.
My assumptions about GPU memory access might be wrong.
I don't know the GPU architecture limitations well, but better access to scattered memory locations with 1000+ cores would be the only reason to port to the GPU.
@Chris-Kawa said in How to vectorize operation on struct of data:
Reading the same value is not a problem. Writing is.
Even if GPU memory access were superior in this case, avoiding write conflicts without sacrificing performance or extra memory seems harder with 1000+ threads.
Can you recommend something to read on GPU memory access constraints when working with scattered data?
-
@Q139 said:
Avoiding write conflicts without sacrificing performance seems harder with 1000+ threads.
Yup, it is. There are atomics on GPUs, but if you're gonna do an atomic operation on a single variable from 1000+ threads then those are gonna get serialized anyway and, depending on your target setup, you can run into TDR issues. When multithreading you often need to use dedicated algorithms, like the partial sum I mentioned. One such example is bitonic sort, which is a very wasteful algorithm and terrible on a single thread, but maps well to GPU hardware and is well suited for parallelism.
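For reference, the CPU-side shape of that partial-accumulation pattern as a sketch (on a GPU the combine step would typically be another kernel or a tree reduction):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Each thread writes only to its own slot - no locks, no atomics -
// and one extra pass combines the partial results afterwards.
long long parallel_sum(const std::vector<int> &v, unsigned nthreads)
{
    std::vector<long long> partial(nthreads, 0);
    std::vector<std::thread> pool;
    const std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(v.size(), lo + chunk);
            long long local = 0;                // private accumulator
            for (std::size_t i = lo; i < hi; ++i)
                local += v[i];
            partial[t] = local;
        });
    for (auto &th : pool) th.join();            // the sync point
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```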
Can you recommend something to read on GPU memory access constraints when working with scattered data?
It's kinda problematic, because GPU technology is very diverse and good practices will vary even from chip to chip. There are some umbrella technologies that try to abstract some details away so you can program for a wider range of hardware, but they sometimes come at the cost of not fully leveraging the given hardware's capabilities. CUDA is one such technology that is popular and has an extensive documentation/example base. OpenCL is another one, and NVidia has a lot of samples and interesting papers for it.