How to vectorize operation on struct of data
-
```cpp
struct data {
    qint32 bPos = 0;
    qint32 bNeg = 0;
    qint32 sPos = 0;
    qint32 sNeg = 0;
    qint32 dPos = 0;
    qint32 dNeg = 0;
    qint32 rPos = 0;
    qint32 rNeg = 0;
};

do {
    data *sta = getData(runtime, d);
    sta->bPos += r.bPos;
    sta->bNeg += r.bNeg;
    sta->sPos += r.sPos;
    sta->sNeg += r.sNeg;
    sta->dPos += r.dPos;
    sta->dNeg += r.dNeg; // nb: vectorize this
} while (getNextCombo(ranges, d, combo));
```
How can an operation like this be vectorized on AVX2/AVX512-capable CPUs so it is done in fewer CPU cycles? Compiler flags would be good, but I don't know if that is possible. -
I don't see what you want to vectorize here. When sta and r are of the same struct type the compiler may do a memcpy if it is intelligent enough, but nothing more is possible with this piece of code. And why do you want to optimize it at all - did you measure that this is a bottleneck in your application?
-
I try to go over similar parts like this that loop many times.
It is not profiled as the most time-consuming portion, but I did add notes for many portions of code that I thought might be vectorizable. It would be good to have all 4-8 struct += operations done in one go.
Maybe the compiler already optimizes for this... The only solution I know for time-based profiling is using hardcoded timers.
I have used/tested valgrind/callgrind on Ubuntu over the years, but I don't know if it also allows time-based function profiling, or if better tools are available. -
@Q139 said in How to vectorize operation on struct of data:
Maybe compiler already optimizes for this...
Why not try it out? See https://godbolt.org/z/WbKjT1
But even if it doesn't, the CPU can fetch the data in a good order since the memory is contiguous.
If you did not measure it I would not worry about such special stuff at all. -
@Christian-Ehrlicher
Integer math on a struct seems like a good start for learning vectorization, if there are any speed advantages... What would be best to use, Intel intrinsics or some SIMD library?
What tools are best for profiling bottlenecks/function times with Qt?
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
Why not try it out? See https://godbolt.org/z/WbKjT1
Knowing little ASM, should I just look for shorter ASM code in comparisons?
-
```asm
movdqu  xmm0, XMMWORD PTR [rsp+32]
movdqu  xmm1, XMMWORD PTR [rsp+48]
```
As you can see here, only two moves are done instead of the 8 one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."
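The whole-struct copy that produces those two moves can be sketched like this (a minimal reconstruction, not the exact Godbolt source; the struct mirrors the 8-field one from the question):

```cpp
#include <cstdint>

// 8 x 32-bit fields = 256 bits, mirroring the struct in the question
struct data {
    int32_t bPos, bNeg, sPos, sNeg, dPos, dNeg, rPos, rNeg;
};

// A plain struct assignment; at -O2/-O3 compilers typically lower this
// to two 128-bit moves (movdqu/movups), or one 256-bit move with AVX.
void copyData(data& dst, const data& src) {
    dst = src;
}
```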
-
@Christian-Ehrlicher
Looking for problems where there are none.
Compiler engineers have solved a lot. About profiling: how do you profile?
-
@Q139 said in How to vectorize operation on struct of data:
How do you profile?
callgrind or gperf or similar tools.
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
```asm
movdqu  xmm0, XMMWORD PTR [rsp+32]
movdqu  xmm1, XMMWORD PTR [rsp+48]
```
I think this code sample serves different purposes than `intA += intB` operations.
-O2 flag:
```asm
mov     rax, QWORD PTR aaPtr[rip]
mov     edx, DWORD PTR bb[rip]
add     DWORD PTR [rax], edx
mov     edx, DWORD PTR bb[rip+4]
add     DWORD PTR [rax+4], edx
mov     edx, DWORD PTR bb[rip+8]
add     DWORD PTR [rax+8], edx
mov     edx, DWORD PTR bb[rip+12]
add     DWORD PTR [rax+12], edx
mov     edx, DWORD PTR bb[rip+16]
add     DWORD PTR [rax+16], edx
mov     edx, DWORD PTR bb[rip+20]
add     DWORD PTR [rax+20], edx
```
-O3 flag:
```asm
mov     rax, QWORD PTR aaPtr[rip]
movdqu  xmm0, XMMWORD PTR [rax]
paddd   xmm0, XMMWORD PTR bb[rip]
movups  XMMWORD PTR [rax], xmm0
mov     edx, DWORD PTR bb[rip+16]
add     DWORD PTR [rax+16], edx
mov     edx, DWORD PTR bb[rip+20]
add     DWORD PTR [rax+20], edx
```
I am a poorer coder, but I get this. https://godbolt.org/z/8h41br
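For reference, a source that produces listings of this shape might look like the following (a hypothetical reconstruction, not the linked Godbolt code; `aaPtr` and `bb` are the global names visible in the assembly, and the 6-field struct matches the question):

```cpp
#include <cstdint>

// 6 x 32-bit ints, as in the question (without rPos/rNeg)
struct data {
    int32_t bPos, bNeg, sPos, sNeg, dPos, dNeg;
};

data  bb;      // global source operand ("bb" in the listing)
data* aaPtr;   // global destination pointer ("aaPtr" in the listing)

// At -O2 this compiles to six scalar mov/add pairs; at -O3 the first
// four adds become one movdqu/paddd/movups sequence and the last two
// stay scalar, matching the listings above.
void accumulate() {
    aaPtr->bPos += bb.bPos;
    aaPtr->bNeg += bb.bNeg;
    aaPtr->sPos += bb.sPos;
    aaPtr->sNeg += bb.sNeg;
    aaPtr->dPos += bb.dPos;
    aaPtr->dNeg += bb.dNeg;
}
```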
-
Don't copy the values one by one but the complete struct.
-
@Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
CompilerExplorer is a nice tool to learn ASM and look more under the hood. Using -O3:
If the struct consists of 4 items it does the SIMD copy and add in fewer instructions.
If the struct consists of 6 it does SIMD on 4 and then 2 copy & add operations separately.
Is that probably the reason why it does 4+1+1, or does it just not use better instructions for backward compatibility?
Is there some magic SIMD compiler flag I am missing? -
SIMD instructions work on 128-bit (or 256, or 512) data sets. Ints are 32 bit, so 4 32-bit operations are done in one instruction. If your struct has 6 integers the first four can be processed using SIMD, but there's no 64-bit-wide SIMD addition, so to use SIMD on those 2 remaining values the compiler would have to generate code that allocates a temporary 128 bits, copies the two values into the first half of it, does the SIMD addition, and then copies the two values back to the original location. That would be slower than just doing the addition without SIMD. If you want it to use SIMD for the entire struct you can add two dummy values at the end of your struct, but since that would make your code less readable you need to measure whether the increased amount of memory needed for dummy values is justified by the increased computation speed. I'm guessing it's not, but that's something to check with a profiler.
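The two-dummy-values idea can be sketched like this (the `pad` fields are the hypothetical dummies; with 8 x 32-bit fields the whole struct fits one 256-bit AVX2 add):

```cpp
#include <cstdint>

// Padded from 6 to 8 ints (256 bits) so SIMD can cover the whole struct.
struct data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;
    int32_t pad0 = 0, pad1 = 0;   // dummy values, never read
};

void add(data& a, const data& b) {
    a.bPos += b.bPos; a.bNeg += b.bNeg;
    a.sPos += b.sPos; a.sNeg += b.sNeg;
    a.dPos += b.dPos; a.dNeg += b.dNeg;
    a.pad0 += b.pad0; a.pad1 += b.pad1;  // dead work that keeps the adds uniform
}
```

Whether the extra 8 bytes per element pay off is exactly what a profiler has to answer.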
-
Also note that SIMD works on aligned data. Your struct has no alignment specification so the compiler has to generate slower code for unaligned data. Notice the `movdqu` instruction. The `u` stands for "unaligned" and basically means it's slower because the cpu must first align the data to an address the SIMD processor can work with. If you specify alignment for your struct to be "SIMD friendly" like this: `struct alignas(128) str {` then that instruction turns into `movdqa`, where `a` stands for "aligned" and the processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile. -
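A minimal sketch of the alignment hint (16 bytes is what `movdqa` on an xmm register requires; the 128 used above is a larger power of two that also satisfies it, at the cost of more padding):

```cpp
#include <cstdint>

// 16-byte alignment is the requirement for movdqa on an xmm register;
// use alignas(32) if you target 256-bit AVX2 ymm loads.
struct alignas(16) data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;
};

static_assert(alignof(data) == 16, "SIMD-friendly alignment");
static_assert(sizeof(data) == 32, "6 x 4 bytes, rounded up to a multiple of 16");
```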
@Chris-Kawa
Can you recommend good learning materials on SIMD-optimized C++ coding, or on optimized coding in general? -
Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.
-
Addition does not seem like such a simple operation anymore.
When padding, its effect on memory usage and cache line alignment has to be considered. One side of the operation is a struct from a long vector that is accessed quite randomly in memory, with a quite low probability of contiguous memory accesses; the other side is a single struct instance in the function.
The random RAM access patterns are probably the main reason it runs slower. Do you know if the compiler at -O3 optimization adds padding to / aligns a single instance of a struct in a function, or would the programmer need to specify it?
Since the accesses from RAM are quite random, would it speed up that operation if everything in the vector were padded, in your opinion, or are shift operations cheap?
-
@Q139 said in How to vectorize operation on struct of data:
but += is addition operation, not only copy.
then use '=' ... really that hard??
-
@Q139 said in How to vectorize operation on struct of data:
Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?
That's not allowed, since then you would not be able to mix it with other libs which do not align (due to a missing -On).
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
@Q139 said in How to vectorize operation on struct of data:
but += is addition operation, not only copy.
then use '=' ... really that hard??
I don't understand.
`a.a += b.a; a.b += b.b; a.c += b.c; ...` is the same as `a.a = a.a + b.a; a.b = a.b + b.b; a.c = a.c + b.c; ...`
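One way to reconcile the two posts is to move the member-wise addition into an `operator+=` on the struct, so the call site works on the complete struct while the compiler still sees the contiguous member-wise adds it can vectorize (a sketch, not code from the thread):

```cpp
#include <cstdint>

struct data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;

    // Member-wise addition in one place; the compiler can vectorize
    // these contiguous adds just as it would the inlined version.
    data& operator+=(const data& r) {
        bPos += r.bPos; bNeg += r.bNeg;
        sPos += r.sPos; sNeg += r.sNeg;
        dPos += r.dPos; dNeg += r.dNeg;
        return *this;
    }
};
```

The loop body from the question then shrinks to `*sta += r;`, and the codegen question stays the same.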