How to vectorize operation on struct of data
-
```cpp
struct data {
    qint32 bPos = 0;
    qint32 bNeg = 0;
    qint32 sPos = 0;
    qint32 sNeg = 0;
    qint32 dPos = 0;
    qint32 dNeg = 0;
    qint32 rPos = 0;
    qint32 rNeg = 0;
};

do {
    data *sta = getData(runtime, d);
    sta->bPos += r.bPos;
    sta->bNeg += r.bNeg;
    sta->sPos += r.sPos;
    sta->sNeg += r.sNeg;
    sta->dPos += r.dPos;
    sta->dNeg += r.dNeg; // nb: vectorize this
} while (getNextCombo(ranges, d, combo));
```
How can an operation like this be vectorized on AVX2/AVX512-capable CPUs so it is done in fewer CPU cycles? Compiler flags would be good, but I don't know if that is possible. -
I don't see what you want to vectorize here. When sta and r are of the same struct type the compiler may do a memcpy if it is intelligent enough, but nothing more is possible with this piece of code. And why do you want to optimize it at all - did you measure that this is a bottleneck in your application?
-
I try to go over similar parts like this that loop many times.
It is not profiled as the most time-consuming portion, but I did add notes for many portions of code that I thought might be vectorizable. It would be good to have all 4-8 struct += operations done in one go.
Maybe the compiler already optimizes for this... The only solution I know for time-based profiling is using hardcoded timers.
I have used/tested valgrind/callgrind on Ubuntu over the years, but I don't know if it also allows time-based function profiling, or if better tools are available. -
@Q139 said in How to vectorize operation on struct of data:
Maybe compiler already optimizes for this...
Why not try it out? See https://godbolt.org/z/WbKjT1
But even if it doesn't, the CPU can fetch the data in a good order since the memory is contiguous.
If you did not measure it I would not worry about such special stuff at all. -
@Christian-Ehrlicher
Integer math on a struct seems like a good start for learning vectorization, if there are any speed advantages... What would be best to use, Intel intrinsics or some SIMD library?
What tools are best for profiling bottlenecks/function times with Qt?
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
Why not try it out? See https://godbolt.org/z/WbKjT1
Knowing little ASM, should I just look for shorter ASM code in comparisons?
-
```asm
movdqu  xmm0, XMMWORD PTR [rsp+32]
movdqu  xmm1, XMMWORD PTR [rsp+48]
```
As you can see here, only two moves are done instead of the 8 one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."
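The whole-struct copy that produces those two moves can be sketched like this (a minimal reconstruction, not the exact Godbolt source; the struct mirrors the 8-field one from the question):

```cpp
#include <cstdint>

// 8 x 32-bit fields = 256 bits, mirroring the struct in the question
struct data {
    int32_t bPos, bNeg, sPos, sNeg, dPos, dNeg, rPos, rNeg;
};

// A plain struct assignment; at -O2/-O3 compilers typically lower this
// to two 128-bit moves (movdqu/movups), or one 256-bit move with AVX.
void copyData(data& dst, const data& src) {
    dst = src;
}
```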
-
@Christian-Ehrlicher
Looking for problems where there are none.
Compiler engineers have solved a lot. About profiling: how do you profile?
-
@Q139 said in How to vectorize operation on struct of data:
How do you profile?
callgrind or gperf or similar tools.
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
```asm
movdqu  xmm0, XMMWORD PTR [rsp+32]
movdqu  xmm1, XMMWORD PTR [rsp+48]
```
I think this code sample serves different purposes than `intA += intB` operations.
-O2 flag:
```asm
mov     rax, QWORD PTR aaPtr[rip]
mov     edx, DWORD PTR bb[rip]
add     DWORD PTR [rax], edx
mov     edx, DWORD PTR bb[rip+4]
add     DWORD PTR [rax+4], edx
mov     edx, DWORD PTR bb[rip+8]
add     DWORD PTR [rax+8], edx
mov     edx, DWORD PTR bb[rip+12]
add     DWORD PTR [rax+12], edx
mov     edx, DWORD PTR bb[rip+16]
add     DWORD PTR [rax+16], edx
mov     edx, DWORD PTR bb[rip+20]
add     DWORD PTR [rax+20], edx
```
-O3 flag:
```asm
mov     rax, QWORD PTR aaPtr[rip]
movdqu  xmm0, XMMWORD PTR [rax]
paddd   xmm0, XMMWORD PTR bb[rip]
movups  XMMWORD PTR [rax], xmm0
mov     edx, DWORD PTR bb[rip+16]
add     DWORD PTR [rax+16], edx
mov     edx, DWORD PTR bb[rip+20]
add     DWORD PTR [rax+20], edx
```
I am a poorer coder, but I get this. https://godbolt.org/z/8h41br
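For reference, a source that produces listings of this shape might look like the following (a hypothetical reconstruction, not the linked Godbolt code; `aaPtr` and `bb` are the global names visible in the assembly, and the 6-field struct matches the question):

```cpp
#include <cstdint>

// 6 x 32-bit ints, as in the question (without rPos/rNeg)
struct data {
    int32_t bPos, bNeg, sPos, sNeg, dPos, dNeg;
};

data  bb;      // global source operand ("bb" in the listing)
data* aaPtr;   // global destination pointer ("aaPtr" in the listing)

// At -O2 this compiles to six scalar mov/add pairs; at -O3 the first
// four adds become one movdqu/paddd/movups sequence and the last two
// stay scalar, matching the listings above.
void accumulate() {
    aaPtr->bPos += bb.bPos;
    aaPtr->bNeg += bb.bNeg;
    aaPtr->sPos += bb.sPos;
    aaPtr->sNeg += bb.sNeg;
    aaPtr->dPos += bb.dPos;
    aaPtr->dNeg += bb.dNeg;
}
```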
-
Don't copy the values one by one but the complete struct.
-
@Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
CompilerExplorer is a nice tool to learn ASM and look more under the hood. Using -O3:
If the struct consists of 4 items it does the SIMD copy and add in fewer instructions.
If the struct consists of 6 it does SIMD on 4 and then 2 copy & add operations separately.
Is that probably the reason why it does 4+1+1, or does it just not use better instructions for backward compatibility?
Is there some magic SIMD compiler flag I am missing? -
SIMD instructions work on 128-bit (or 256, or 512) data sets. Ints are 32 bit, so 4 32-bit operations are done in one instruction. If your struct has 6 integers the first four can be processed using SIMD, but there's no 64-bit-wide SIMD addition, so to use SIMD on those 2 remaining values the compiler would have to generate code that allocates a temporary 128 bits, copies the two values into the first half of it, does the SIMD addition, and then copies the two values back to the original location. That would be slower than just doing the addition without SIMD. If you want it to use SIMD for the entire struct you can add two dummy values at the end of your struct, but since that would make your code less readable you need to measure whether the increased amount of memory needed for dummy values is justified by the increased computation speed. I'm guessing it's not, but that's something to check with a profiler.
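The two-dummy-values idea can be sketched like this (the `pad` fields are the hypothetical dummies; with 8 x 32-bit fields the whole struct fits one 256-bit AVX2 add):

```cpp
#include <cstdint>

// Padded from 6 to 8 ints (256 bits) so SIMD can cover the whole struct.
struct data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;
    int32_t pad0 = 0, pad1 = 0;   // dummy values, never read
};

void add(data& a, const data& b) {
    a.bPos += b.bPos; a.bNeg += b.bNeg;
    a.sPos += b.sPos; a.sNeg += b.sNeg;
    a.dPos += b.dPos; a.dNeg += b.dNeg;
    a.pad0 += b.pad0; a.pad1 += b.pad1;  // dead work that keeps the adds uniform
}
```

Whether the extra 8 bytes per element pay off is exactly what a profiler has to answer.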
-
Also note that SIMD works on aligned data. Your struct has no alignment specification so the compiler has to generate slower code for unaligned data. Notice the `movdqu` instruction. The `u` stands for "unaligned" and basically means it's slower because the cpu must first align the data to an address the SIMD processor can work with. If you specify alignment for your struct to be "SIMD friendly" like this: `struct alignas(128) str {` then that instruction turns into `movdqa`, where `a` stands for "aligned" and the processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile. -
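A minimal sketch of the alignment hint (16 bytes is what `movdqa` on an xmm register requires; the 128 used above is a larger power of two that also satisfies it, at the cost of more padding):

```cpp
#include <cstdint>

// 16-byte alignment is the requirement for movdqa on an xmm register;
// use alignas(32) if you target 256-bit AVX2 ymm loads.
struct alignas(16) data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;
};

static_assert(alignof(data) == 16, "SIMD-friendly alignment");
static_assert(sizeof(data) == 32, "6 x 4 bytes, rounded up to a multiple of 16");
```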
@Chris-Kawa
Can you recommend good learning materials on SIMD-optimized C++ coding, or on optimized coding in general? -
Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.
-
Addition does not seem like such a simple operation anymore.
When padding, its effect on memory usage and cache line alignment has to be considered. One side of the operation is a struct from a long vector that is accessed quite randomly in memory, with a quite low probability of contiguous memory accesses; the other side is a single struct instance in the function.
The random RAM access patterns are probably the main reason it runs slower. Do you know if the compiler at -O3 optimization adds padding to / aligns a single instance of a struct in a function, or would the programmer need to specify it?
Since the accesses from RAM are quite random, would it speed up that operation if everything in the vector were padded, in your opinion, or are shift operations cheap?
-
@Q139 said in How to vectorize operation on struct of data:
but += is addition operation, not only copy.
then use '=' ... really that hard??
-
@Q139 said in How to vectorize operation on struct of data:
Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?
That's not allowed, since then you would not be able to mix it with other libs which do not align (due to a missing -On).
-
@Christian-Ehrlicher said in How to vectorize operation on struct of data:
@Q139 said in How to vectorize operation on struct of data:
but += is addition operation, not only copy.
then use '=' ... really that hard??
I don't understand.
`a.a += b.a; a.b += b.b; a.c += b.c; ...` is the same as `a.a = a.a + b.a; a.b = a.b + b.b; a.c = a.c + b.c; ...`
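One way to reconcile the two posts is to move the member-wise addition into an `operator+=` on the struct, so the call site works on the complete struct while the compiler still sees the contiguous member-wise adds it can vectorize (a sketch, not code from the thread):

```cpp
#include <cstdint>

struct data {
    int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    int32_t dPos = 0, dNeg = 0;

    // Member-wise addition in one place; the compiler can vectorize
    // these contiguous adds just as it would the inlined version.
    data& operator+=(const data& r) {
        bPos += r.bPos; bNeg += r.bNeg;
        sPos += r.sPos; sNeg += r.sNeg;
        dPos += r.dPos; dNeg += r.dNeg;
        return *this;
    }
};
```

The loop body from the question then shrinks to `*sta += r;`, and the codegen question stays the same.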