How to vectorize operation on struct of data

    A Former User
    wrote on last edited by A Former User
    #1
    ```
    struct data {
        qint32 bPos = 0;
        qint32 bNeg = 0;
        qint32 sPos = 0;
        qint32 sNeg = 0;
        qint32 dPos = 0;
        qint32 dNeg = 0;
        qint32 rPos = 0;
        qint32 rNeg = 0;
    };

    do {
        data *sta = getData(runtime, d);

        sta->bPos += r.bPos;
        sta->bNeg += r.bNeg;
        sta->sPos += r.sPos;
        sta->sNeg += r.sNeg;
        sta->dPos += r.dPos;
        sta->dNeg += r.dNeg; // NB: vectorize this
    } while (getNextCombo(ranges, d, combo));
    ```
    
    How can I vectorize an operation like this on CPUs that support AVX2/AVX-512, so that it takes fewer CPU cycles?
    
    Compiler flags would be ideal, but I don't know if that is possible.
      Christian Ehrlicher
      Lifetime Qt Champion
      wrote on last edited by
      #2

      I don't see what you want to vectorize here. If sta and r are of the same struct type, the compiler may do a memcpy if it is intelligent enough, but nothing more is possible with this piece of code. And why do you want to optimize it at all? Did you measure that this is a bottleneck in your application?

      Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
      Visit the Qt Academy at https://academy.qt.io/catalog

        A Former User
        wrote on last edited by
        #3

        @Christian-Ehrlicher

        I am going over parts like this that loop many times.
        It was not profiled as the most time-consuming portion, but I added notes to many portions of the code that I thought might be vectorizable.

        It would be good to have all 4-8 struct += operations done in one go.
        Maybe the compiler already optimizes for this...

        The only solution I know for time-based profiling is hardcoded timers.
        I have used/tested valgrind/callgrind on Ubuntu over the years, but I don't know whether it also allows time-based function profiling or whether better tools are available.
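
For readers wondering what a "hardcoded timer" could look like, here is a minimal sketch using only the standard library. The helper name `elapsedNanoseconds` is made up for this illustration and does not come from the thread; in a Qt project, `QElapsedTimer` would serve the same purpose.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the thread): runs `work` once and returns
// the elapsed time in nanoseconds, measured with a monotonic clock.
template <typename Work>
std::int64_t elapsedNanoseconds(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

A single measurement like this is noisy. Running the code under test many times and comparing medians gives a more trustworthy picture, and a sampling profiler is still the better tool for finding where time goes in a whole application.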

          Christian Ehrlicher
          Lifetime Qt Champion
          wrote on last edited by Christian Ehrlicher
          #4

          @Q139 said in How to vectorize operation on struct of data:

          Maybe compiler already optimizes for this...

          Why not try it out? See https://godbolt.org/z/WbKjT1
          But even if it doesn't, the CPU can fetch the data in a good order since the memory is contiguous.
          If you did not measure it, I would not worry about such special stuff at all.

            A Former User
            wrote on last edited by
            #5

            @Christian-Ehrlicher
            Integer math on a struct seems like a good start for learning vectorization.

            If there are any speed advantages, what would be best to use: Intel intrinsics or some SIMD library?

            What tools are best for profiling bottlenecks/function times with Qt?
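
To make the intrinsics option concrete, here is a rough sketch of a hand-written SSE2 version. It makes several assumptions: `Data` is a stand-in for the struct from the first post with `std::int32_t` in place of `qint32`, the function name `addAllFields` is invented here, and all eight fields are added (the loop in the first post skips `rPos` and `rNeg`). When SSE2 is not available, the code falls back to plain scalar additions.

```cpp
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h> // SSE2: _mm_loadu_si128, _mm_add_epi32, _mm_storeu_si128
#endif

// Stand-in for the struct from the first post; std::int32_t replaces qint32.
struct Data {
    std::int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    std::int32_t dPos = 0, dNeg = 0, rPos = 0, rNeg = 0;
};
static_assert(sizeof(Data) == 8 * sizeof(std::int32_t), "Data must have no padding");

// Hypothetical helper: dst.field += src.field for all eight fields.
void addAllFields(Data& dst, const Data& src)
{
#if defined(__SSE2__)
    // View both structs as two 128-bit lanes of four 32-bit integers each.
    auto* d = reinterpret_cast<__m128i*>(&dst);
    const auto* s = reinterpret_cast<const __m128i*>(&src);
    for (int i = 0; i < 2; ++i) {
        // Unaligned loads/stores, because Data has no alignment guarantee beyond 4 bytes.
        const __m128i sum = _mm_add_epi32(_mm_loadu_si128(d + i), _mm_loadu_si128(s + i));
        _mm_storeu_si128(d + i, sum);
    }
#else
    // Portable fallback with the same result.
    dst.bPos += src.bPos; dst.bNeg += src.bNeg;
    dst.sPos += src.sPos; dst.sNeg += src.sNeg;
    dst.dPos += src.dPos; dst.dNeg += src.dNeg;
    dst.rPos += src.rPos; dst.rNeg += src.rNeg;
#endif
}
```

Whether this beats what the compiler generates on its own is something only a measurement can tell.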

              A Former User
              wrote on last edited by A Former User
              #6

              @Christian-Ehrlicher said in How to vectorize operation on struct of data:

              Why not try it out? See https://godbolt.org/z/WbKjT1

              Knowing little about ASM, should I just look for the shorter ASM code when comparing?

                Christian Ehrlicher
                Lifetime Qt Champion
                wrote on last edited by
                #7
                movdqu  xmm0, XMMWORD PTR [rsp+32]
                movdqu  xmm1, XMMWORD PTR [rsp+48]
                

                As you can see here, only two moves are done instead of the 8 one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."
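
The source behind the linked Compiler Explorer page is not reproduced in the thread, so the following is only a guess at the kind of code that leads to such moves: a plain assignment of a whole struct. `Data` mirrors the struct from the first post with `std::int32_t` in place of `qint32`, and both function names are invented for this sketch.

```cpp
#include <cstdint>

// Stand-in for the struct from the first post; std::int32_t replaces qint32.
struct Data {
    std::int32_t bPos = 0, bNeg = 0, sPos = 0, sNeg = 0;
    std::int32_t dPos = 0, dNeg = 0, rPos = 0, rNeg = 0;
};

// Hypothetical helper: copies field by field, eight separate assignments.
void copyFieldByField(Data& dst, const Data& src)
{
    dst.bPos = src.bPos; dst.bNeg = src.bNeg;
    dst.sPos = src.sPos; dst.sNeg = src.sNeg;
    dst.dPos = src.dPos; dst.dNeg = src.dNeg;
    dst.rPos = src.rPos; dst.rNeg = src.rNeg;
}

// Hypothetical helper: copies the whole 32-byte object in one assignment.
// An optimizing compiler is free to lower this to a couple of wide moves.
void copyWhole(Data& dst, const Data& src)
{
    dst = src;
}
```

Both functions produce the same result. The difference, if any, is in the generated code, which can be compared by pasting both into Compiler Explorer.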

                  A Former User
                  wrote on last edited by A Former User
                  #8

                  @Christian-Ehrlicher
                  I was looking for problems where there are none.
                  Compiler engineers have solved a lot.

                  About profiling: how do you profile?

                    Christian Ehrlicher
                    Lifetime Qt Champion
                    wrote on last edited by
                    #9

                    @Q139 said in How to vectorize operation on struct of data:

                    How do you profile?

                    callgrind, gprof, or similar tools.

                      A Former User
                      wrote on last edited by A Former User
                      #10

                      @Christian-Ehrlicher said in How to vectorize operation on struct of data:

                      movdqu xmm0, XMMWORD PTR [rsp+32]
                      movdqu xmm1, XMMWORD PTR [rsp+48]

                      I think this code sample serves a different purpose than intA += intB operations.

                      -O2 flag

                              mov     rax, QWORD PTR aaPtr[rip]
                              mov     edx, DWORD PTR bb[rip]
                              add     DWORD PTR [rax], edx
                              mov     edx, DWORD PTR bb[rip+4]
                              add     DWORD PTR [rax+4], edx
                              mov     edx, DWORD PTR bb[rip+8]
                              add     DWORD PTR [rax+8], edx
                              mov     edx, DWORD PTR bb[rip+12]
                              add     DWORD PTR [rax+12], edx
                              mov     edx, DWORD PTR bb[rip+16]
                              add     DWORD PTR [rax+16], edx
                              mov     edx, DWORD PTR bb[rip+20]
                              add     DWORD PTR [rax+20], edx
                      

                      -O3 flag

                              mov     rax, QWORD PTR aaPtr[rip]
                              movdqu  xmm0, XMMWORD PTR [rax]
                              paddd   xmm0, XMMWORD PTR bb[rip]
                              movups  XMMWORD PTR [rax], xmm0
                              mov     edx, DWORD PTR bb[rip+16]
                              add     DWORD PTR [rax+16], edx
                              mov     edx, DWORD PTR bb[rip+20]
                              add     DWORD PTR [rax+20], edx
                      

                      I am a weaker coder, but this is what I get: https://godbolt.org/z/8h41br
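
The C++ source for the listings above is only available behind the link. Judging from the symbols `aaPtr` and `bb` and the six 4-byte offsets in the assembly, it plausibly resembles the following sketch. The struct name `Str`, its field names and the function name `addSix` are assumptions made for illustration.

```cpp
#include <cstdint>

// Assumed layout: six 32-bit integers, matching offsets 0..20 in the listings.
struct Str {
    std::int32_t a = 0, b = 0, c = 0, d = 0, e = 0, f = 0;
};

Str bb;               // global source operand, referenced as bb[rip] in the listings
Str* aaPtr = nullptr; // global pointer to the destination, loaded from aaPtr[rip]

// Hypothetical function: six independent += operations, as in the first post.
void addSix()
{
    aaPtr->a += bb.a;
    aaPtr->b += bb.b;
    aaPtr->c += bb.c;
    aaPtr->d += bb.d;
    aaPtr->e += bb.e;
    aaPtr->f += bb.f;
}
```

With a function like this, the -O2 listing corresponds to six scalar load/add pairs, while the -O3 listing handles the first four fields with one `paddd` and the last two with scalar adds.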

                        Christian Ehrlicher
                        Lifetime Qt Champion
                        wrote on last edited by
                        #11

                        Don't copy the values one by one; copy the complete struct.

                          A Former User
                          wrote on last edited by A Former User
                          #12

                          @Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
                          Compiler Explorer is a nice tool to learn ASM and look under the hood.

                          Using -O3:
                          If the struct consists of 4 items, it does the SIMD copy and add in fewer instructions.
                          If the struct consists of 6, it does SIMD on 4 and then 2 copy & add operations separately.
                          Is there a reason why it does 4+1+1, or does it just not use better instructions for the sake of backward compatibility?
                          Is there some magic SIMD compiler flag I am missing?

                            Chris Kawa
                            Lifetime Qt Champion
                            wrote on last edited by Chris Kawa
                            #13

                            SIMD instructions work on 128-bit (or 256-bit, or 512-bit) data sets. Ints are 32 bits, so four 32-bit operations are done in one instruction. If your struct has 6 integers, the first four can be processed using SIMD, but there's no 64 bit SIMD addition, so to use SIMD on the 2 remaining values the compiler would have to generate code that allocates a temporary 128 bits, copies the two values into the first half of it, does the SIMD addition and then copies the two values back to the original location. That would be slower than just doing the addition without SIMD. If you want it to use SIMD for the entire struct, you can add two dummy values at the end of your struct, but since that would make your code less readable, you need to measure whether the increased amount of memory needed for the dummy values is justified by increased computation speed. I'm guessing it's not, but that's something to check with a profiler.
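
The "two dummy values" idea could be sketched as follows. The struct name `PaddedStr`, the field names and the function name `accumulate` are invented for this example. Whether the compiler then emits one or two packed additions depends on the compiler, its version and the flags used, so the generated code should be checked rather than assumed.

```cpp
#include <cstdint>

// Six payload integers plus two dummy integers, so the struct fills exactly
// two 128-bit lanes (or one 256-bit lane).
struct PaddedStr {
    std::int32_t a = 0, b = 0, c = 0, d = 0;
    std::int32_t e = 0, f = 0;
    std::int32_t pad0 = 0, pad1 = 0; // dummy values, never read by the application
};
static_assert(sizeof(PaddedStr) == 32, "expected 8 x 4 bytes without padding");

// Hypothetical helper: adds all eight slots, including the dummies, which
// gives the compiler a regular shape to vectorize.
inline void accumulate(PaddedStr& dst, const PaddedStr& src)
{
    dst.a += src.a; dst.b += src.b; dst.c += src.c; dst.d += src.d;
    dst.e += src.e; dst.f += src.f;
    dst.pad0 += src.pad0; dst.pad1 += src.pad1;
}
```

The cost is 8 extra bytes per element, which matters when the structs live in a long vector that is accessed randomly.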

                              Chris Kawa
                              Lifetime Qt Champion
                              wrote on last edited by
                              #14

                              Also note that SIMD works best on aligned data. Your struct has no alignment specification, so the compiler has to generate code for unaligned data. Notice the movdqu instruction. The u stands for "unaligned" and basically means it can be slower, because the CPU must first align the data to an address the SIMD unit can work with. If you specify alignment for your struct to be "SIMD friendly" like this: struct alignas(16) str { (alignas takes a size in bytes, and 16 bytes is 128 bits) then that instruction turns into movdqa, where a stands for "aligned", and the processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile.
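
A small sketch of the alignment suggestion, with the caveat that `alignas` takes a byte count, so 16 is the value that corresponds to a 128-bit SSE register. The names `AlignedStr` and `isAligned16` are made up for this example.

```cpp
#include <cstdint>

// Six 32-bit integers, forced onto a 16-byte boundary.
struct alignas(16) AlignedStr {
    std::int32_t a = 0, b = 0, c = 0, d = 0, e = 0, f = 0;
};

static_assert(alignof(AlignedStr) == 16, "struct starts on a 16-byte boundary");
// The footprint increase mentioned above: 24 bytes of data occupy 32 bytes,
// because sizeof is rounded up to a multiple of the alignment.
static_assert(sizeof(AlignedStr) == 32, "24 bytes are rounded up to 32");

// Hypothetical helper: checks at run time that an address is 16-byte aligned.
inline bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

Because `sizeof(AlignedStr)` is a multiple of 16, every element of an array of `AlignedStr` is aligned as well, not only the first one.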

                                A Former User
                                wrote on last edited by
                                #15

                                @Chris-Kawa
                                Can you recommend good learning materials on SIMD optimized C++ coding or about optimized coding in general?

                                  Chris Kawa
                                  Lifetime Qt Champion
                                  wrote on last edited by
                                  #16

                                  Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.

                                    A Former User
                                    wrote on last edited by A Former User
                                    #17

                                    Addition does not seem like such a simple operation anymore,
                                    once padding and its effects on memory usage and cache line alignment are considered.

                                    One side of the operation is a struct from a long vector that is read from memory quite randomly, with a low probability of contiguous memory accesses; the other side is a single struct instance in the function.
                                    The random RAM access patterns are probably the main reason it runs slower.

                                    Do you know whether the compiler at -O3 adds padding to or aligns a single instance of a struct in a function, or would the programmer need to specify it?

                                    Since the accesses to RAM are quite random, would it speed up the operation if every element in the vector were padded, in your opinion, or are shift operations cheap?

                                      Christian Ehrlicher
                                      Lifetime Qt Champion
                                      wrote on last edited by
                                      #18

                                      @Q139 said in How to vectorize operation on struct of data:

                                      but += is addition operation, not only copy.

                                      then use '=' ... really that hard??

                                        Christian Ehrlicher
                                        Lifetime Qt Champion
                                        wrote on last edited by
                                        #19

                                        @Q139 said in How to vectorize operation on struct of data:

                                        Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                                        That's not allowed, since then you would not be able to mix it with other libs which do not align (due to missing -On).

                                          A Former User
                                          wrote on last edited by A Former User
                                          #20

                                          @Christian-Ehrlicher said in How to vectorize operation on struct of data:

                                          @Q139 said in How to vectorize operation on struct of data:

                                          but += is addition operation, not only copy.

                                          then use '=' ... really that hard??

                                          I don't understand.

                                          a.a+=b.a
                                          a.b+=b.b
                                          a.c+=b.c
                                          ...
                                          
                                          a.a=a.a+b.a
                                          a.b=a.b+b.b
                                          a.c=a.c+b.c
                                          ...
                                          