Solved How to vectorize operation on struct of data
-
@Christian-Ehrlicher
Integer math on a struct seems like a good start for learning vectorization, if there are any speed advantages... What would be best to use, Intel intrinsics or some SIMD library?
What tools are best for profiling bottlenecks/function times with Qt?
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
Why not try it out? See https://godbolt.org/z/WbKjT1
Knowing little ASM, should I just look for shorter ASM code in comparisons?
-
movdqu xmm0, XMMWORD PTR [rsp+32]
movdqu xmm1, XMMWORD PTR [rsp+48]
As you can see here, only two moves are done instead of the 8 one would expect. And when you take a look at the context-menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."
-
@Christian-Ehrlicher
Looking for problems where there are none.
Compiler engineers have solved a lot. About profiling: how do you profile?
-
@Q139 said in How to vectorize operation on struct of data:
How do you profile?
callgrind or gperf or similar tools.
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
movdqu xmm0, XMMWORD PTR [rsp+32]
movdqu xmm1, XMMWORD PTR [rsp+48]
I think this code sample serves different purposes than
intA += intB
operations.
-O2 flag:
mov rax, QWORD PTR aaPtr[rip]
mov edx, DWORD PTR bb[rip]
add DWORD PTR [rax], edx
mov edx, DWORD PTR bb[rip+4]
add DWORD PTR [rax+4], edx
mov edx, DWORD PTR bb[rip+8]
add DWORD PTR [rax+8], edx
mov edx, DWORD PTR bb[rip+12]
add DWORD PTR [rax+12], edx
mov edx, DWORD PTR bb[rip+16]
add DWORD PTR [rax+16], edx
mov edx, DWORD PTR bb[rip+20]
add DWORD PTR [rax+20], edx
-O3 flag:
mov rax, QWORD PTR aaPtr[rip]
movdqu xmm0, XMMWORD PTR [rax]
paddd xmm0, XMMWORD PTR bb[rip]
movups XMMWORD PTR [rax], xmm0
mov edx, DWORD PTR bb[rip+16]
add DWORD PTR [rax+16], edx
mov edx, DWORD PTR bb[rip+20]
add DWORD PTR [rax+20], edx
I am a poorer coder, but even I get this. https://godbolt.org/z/8h41br
-
Don't copy the values one by one but the complete struct.
-
@Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
CompilerExplorer is a nice tool to learn ASM and peek more under the hood. Using -O3:
If the struct consists of 4 items it does a SIMD copy and add in fewer instructions.
If the struct consists of 6 it does SIMD on 4 and then 2 copy & add operations separately.
Is the reason it does 4+1+1 that it just won't use wider instructions, for backward compatibility?
Is there some magic SIMD compiler flag I am missing? -
SIMD instructions work on 128-bit (or 256-, or 512-bit) data sets. Ints are 32 bits, so 4 32-bit operations are done in one instruction. If your struct has 6 integers the first four can be processed using SIMD, but there's no 64-bit SIMD addition, so to use SIMD on the 2 remaining values the compiler would have to generate code that allocates a temporary 128 bits, copies the two values into the first half of it, does the SIMD addition and then copies the two values back to the original location. That would be slower than just doing the addition without SIMD.
If you want it to use SIMD for the entire struct you can add two dummy values at the end of your struct, but since that would make your code less readable, you need to measure whether the increased amount of memory needed for the dummy values is justified by the increased computation speed. I'm guessing it's not, but that's something to check with a profiler.
-
Also note that SIMD works best on aligned data. Your struct has no alignment specification, so the compiler has to generate slower code for unaligned data. Notice the movdqu instruction. The "u" stands for "unaligned" and basically means it's slower, because the CPU must first align the data to an address the SIMD unit can work with. If you specify alignment for your struct to be "SIMD friendly", like this:
struct alignas(128) str {
then that instruction turns into movdqa, where "a" stands for "aligned" and the processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile. -
@Chris-Kawa
Can you recommend good learning materials on SIMD-optimized C++ coding, or about optimized coding in general? -
Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.
-
Addition does not seem like such a simple operation anymore.
When padding, its effect on memory usage and cache-line alignment has to be considered. One side of the operation is a struct from a long vector that is accessed quite randomly, with quite a low probability of contiguous memory accesses; the other side is a single struct instance in the function.
The random RAM access patterns are probably the main reason it runs slower. Do you know if the compiler at -O3 optimization adds padding/aligns a single instance of a struct in a function, or would the programmer need to specify it?
Since the accesses from RAM are quite random, would it speed up that operation if everything in the vector were padded, in your opinion, or are shift operations cheap?
-
@Q139 said in How to vectorize operation on struct of data:
but += is an addition operation, not only a copy.
then use '=' ... really that hard??
-
@Q139 said in How to vectorize operation on struct of data:
Do you know if the compiler at -O3 optimization adds padding/aligns a single instance of a struct in a function, or would the programmer need to specify it?
That's not allowed, since then you would not be able to mix it with other libs which do not align (due to a missing -On).
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
@Q139 said in How to vectorize operation on struct of data:
but += is an addition operation, not only a copy.
then use '=' ... really that hard??
I don't understand.
a.a += b.a; a.b += b.b; a.c += b.c; ...
a.a = a.a + b.a; a.b = a.b + b.b; a.c = a.c + b.c; ...
-
@Christian-Ehrlicher said:
then use '=' ... really that hard??
I think you miss the point. OP is doing a += b. It's not the same as a = b, so he can't just replace that.
@Q139 said:
The random RAM access patterns probably are main reason it runs slower.
If possible, you can try to prefetch the "random" data into a more local variable beforehand, but that's very case-specific and there's no general solution for it. The ideal is linear access, always. It's sort of an unachievable goal that you use as a general direction: the closer you get, the better it will run. But there's always a margin close to the optimum in which it runs "good enough", and there's no point in further optimization because it's masked by bigger inefficiencies in other parts of the system. E.g. optimizing a function to run in 0.01 ms instead of 0.02 ms is pointless when the next thing you do is I/O with >100 ms latency. You must always choose your battles to seek the biggest overall gains, not obsess over a single detail.
Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?
Compiler can't mess with alignment. As @Christian-Ehrlicher said it would generate incompatible data layouts from the same code.
would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?
It doesn't matter what I think. I can be right on one machine/architecture and wrong on another. When optimizing test and measure on as many of your target machines as you can. Intuition is usually working against you when doing low level stuff. Assemble a bag of tools and trust the numbers.
-
@Chris-Kawa said in How to vectorize operation on struct of data:
I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.
ok, but '+=' is replaced with AVX instructions at -O3, so all is fine (esp. since there is still no validation that this is a bottleneck in his code).
/edit: and since it's a plain structure one can overload operator+=() and use a simple memcpy, which is optimized even at -O1 -
@Christian-Ehrlicher Was running -O2 for a few months; yesterday I switched back to -O3 and got a ~20% speed increase.
Some time ago I didn't notice a big difference between -O2 and -O3, but I have increased (doubled) the variable counts in multiple data structures lately.
The idea was that I could get similar speed back with vectorization as with fewer variables.
But cache misses still increase with more data. Scattered RAM access patterns are probably the biggest bottleneck.
Could a GPU architecture be good for speeding up an algorithm that accesses memory in a scattered fashion from all cores?
Also, locking is a problem if 2 or more threads take the same values, and fighting against that on 1000 cores seems like a bad idea. -
To achieve similar code with -O2 instead of -O3 in this loop you can e.g. use
data &operator+=(const data &o) {
    // assumes sizeof(data) == 32, i.e. two 128-bit lanes of four ints each;
    // needs <emmintrin.h>. _mm_add_epi32 adds 32-bit lanes, which is the
    // intended int addition (a plain `*a1 += *a2` on __m128i would add
    // 64-bit lanes with GCC's vector extension)
    __m128i       *a1 = (__m128i *)this;
    const __m128i *a2 = (const __m128i *)&o;
    for (int i = 0; i < 2; ++i)
        _mm_storeu_si128(a1 + i,
            _mm_add_epi32(_mm_loadu_si128(a1 + i), _mm_loadu_si128(a2 + i)));
    return *this;
}
But I would let the compiler do this job, and again would do some performance analysis before thinking about such stuff, to see where the hot spots really are.