Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • Users
  • Groups
  • Search
  • Get Qt Extensions
  • Unsolved
Collapse
Brand Logo
  1. Home
  2. Qt Development
  3. General and Desktop
  4. How to vectorize operation on struct of data
Forum Updated to NodeBB v4.3 + New Features

How to vectorize operation on struct of data

Scheduled Pinned Locked Moved Solved General and Desktop
28 Posts 3 Posters 3.4k Views 2 Watching
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • Christian EhrlicherC Christian Ehrlicher

    @Q139 said in How to vectorize operation on struct of data:

    Maybe compiler already optimizes for this...

    Why not try it out? See https://godbolt.org/z/WbKjT1
    But even if it doesn't the cpu can fetch the data in a good order since the memory is continuous.
    If you did not measure it I would not take care about such special stuff at all -

    ? Offline
    ? Offline
    A Former User
    wrote on last edited by
    #5

    @Christian-Ehrlicher
    Integer math on struct seems like good start on learning vectorization.

    If there is any speed advantages...What would be best to use , intel intrinsics or some SIMD library?

    What tools are best for profiling bottlenecks/function times with Qt?

    1 Reply Last reply
    0
    • Christian EhrlicherC Christian Ehrlicher

      @Q139 said in How to vectorize operation on struct of data:

      Maybe compiler already optimizes for this...

      Why not try it out? See https://godbolt.org/z/WbKjT1
      But even if it doesn't the cpu can fetch the data in a good order since the memory is continuous.
      If you did not measure it I would not take care about such special stuff at all -

      ? Offline
      ? Offline
      A Former User
      wrote on last edited by A Former User
      #6

      @Christian-Ehrlicher said in How to vectorize operation on struct of data:

      Why not try it out? See https://godbolt.org/z/WbKjT1

      Knowling little on ASM should i just look for shorter ASM code in comparisons?

      Christian EhrlicherC 1 Reply Last reply
      0
      • ? A Former User

        @Christian-Ehrlicher said in How to vectorize operation on struct of data:

        Why not try it out? See https://godbolt.org/z/WbKjT1

        Knowling little on ASM should i just look for shorter ASM code in comparisons?

        Christian EhrlicherC Offline
        Christian EhrlicherC Offline
        Christian Ehrlicher
        Lifetime Qt Champion
        wrote on last edited by
        #7
        movdqu  xmm0, XMMWORD PTR [rsp+32]
        movdqu  xmm1, XMMWORD PTR [rsp+48]
        

        As you can see here, only two moves are done instead 8 which one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."

        Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
        Visit the Qt Academy at https://academy.qt.io/catalog

        ? 2 Replies Last reply
        4
        • Christian EhrlicherC Christian Ehrlicher
          movdqu  xmm0, XMMWORD PTR [rsp+32]
          movdqu  xmm1, XMMWORD PTR [rsp+48]
          

          As you can see here, only two moves are done instead 8 which one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."

          ? Offline
          ? Offline
          A Former User
          wrote on last edited by A Former User
          #8

          @Christian-Ehrlicher
          Looking for problems where there are none.
          Compiler engineers have solved alot.

          About profiling , How do you profile?

          1 Reply Last reply
          0
          • Christian EhrlicherC Offline
            Christian EhrlicherC Offline
            Christian Ehrlicher
            Lifetime Qt Champion
            wrote on last edited by
            #9

            @Q139 said in How to vectorize operation on struct of data:

            How do you profile?

            callgrind or gperf or similar tools.

            Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
            Visit the Qt Academy at https://academy.qt.io/catalog

            1 Reply Last reply
            0
            • Christian EhrlicherC Christian Ehrlicher
              movdqu  xmm0, XMMWORD PTR [rsp+32]
              movdqu  xmm1, XMMWORD PTR [rsp+48]
              

              As you can see here, only two moves are done instead 8 which one would expect. And when you take a look at the context menu help: "Moves 128, 256 or 512 bits of packed byte/word/doubleword/quadword integer values from the source operand (the second operand) to the destination operand (first operand). This instruction can be used to load a vector register from a memory location, to store the contents of a vector register into a memory location, or to move data between two vector registers."

              ? Offline
              ? Offline
              A Former User
              wrote on last edited by A Former User
              #10

              @Christian-Ehrlicher said in How to vectorize operation on struct of data:

              movdqu xmm0, XMMWORD PTR [rsp+32]
              movdqu xmm1, XMMWORD PTR [rsp+48]

              I think this code sample serves different purposes than intA += intB operations.

              -O2 flag

                      mov     rax, QWORD PTR aaPtr[rip]
                      mov     edx, DWORD PTR bb[rip]
                      add     DWORD PTR [rax], edx
                      mov     edx, DWORD PTR bb[rip+4]
                      add     DWORD PTR [rax+4], edx
                      mov     edx, DWORD PTR bb[rip+8]
                      add     DWORD PTR [rax+8], edx
                      mov     edx, DWORD PTR bb[rip+12]
                      add     DWORD PTR [rax+12], edx
                      mov     edx, DWORD PTR bb[rip+16]
                      add     DWORD PTR [rax+16], edx
                      mov     edx, DWORD PTR bb[rip+20]
                      add     DWORD PTR [rax+20], edx
              

              -O3 flag

                      mov     rax, QWORD PTR aaPtr[rip]
                      movdqu  xmm0, XMMWORD PTR [rax]
                      paddd   xmm0, XMMWORD PTR bb[rip]
                      movups  XMMWORD PTR [rax], xmm0
                      mov     edx, DWORD PTR bb[rip+16]
                      add     DWORD PTR [rax+16], edx
                      mov     edx, DWORD PTR bb[rip+20]
                      add     DWORD PTR [rax+20], edx
              

              I am poorer coder but i get this. https://godbolt.org/z/8h41br

              1 Reply Last reply
              0
              • Christian EhrlicherC Offline
                Christian EhrlicherC Offline
                Christian Ehrlicher
                Lifetime Qt Champion
                wrote on last edited by
                #11

                Don't copy the values by it's own but the complete struct.

                Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                Visit the Qt Academy at https://academy.qt.io/catalog

                ? 1 Reply Last reply
                0
                • Christian EhrlicherC Christian Ehrlicher

                  Don't copy the values by it's own but the complete struct.

                  ? Offline
                  ? Offline
                  A Former User
                  wrote on last edited by A Former User
                  #12

                  @Christian-Ehrlicher Yes , but += is addition operation, not only copy.
                  CompilerExplorer is nice tool to learn ASM and seek more under the hood.

                  Using -O3
                  If struct consists of 4 items it does SIMD copy and add in lesser instructions.
                  If struct consists of 6 it does SIMD on 4 and then 2 copy & add operations separately.
                  Probably reason why it does 4+1+1 or just wont support better instructions for backward compatability?
                  Is there some magic SIMD compiler flag i am missing?

                  Christian EhrlicherC 1 Reply Last reply
                  0
                  • Chris KawaC Offline
                    Chris KawaC Offline
                    Chris Kawa
                    Lifetime Qt Champion
                    wrote on last edited by Chris Kawa
                    #13

                    SIMD instructions work on 128 bit (or 256, or 512) data sets. Ints are 32 bit so there are 4 32 bit operations done in one instruction. If your struct has 6 integers the first four can be processed using SIMD, but there's no 64 bit SIMD addition, so to use SIMD on those 2 remaining values the compiler would have to generate code that allocates temporary 128 bits, copies the two values in the first half of that, does the SIMD addition and then copies back the two values to the original location. That would be slower than just doing the addition without SIMD. If you want it to use SIMD for entire struct you can add two dummy values at the end of your struct, but since it would make your code less readable you need to measure if the increased amount of memory needed for dummy values is justifies in increased computation speed. I'm guessing it's not, but that's something to check with a profiler..

                    1 Reply Last reply
                    1
                    • Chris KawaC Offline
                      Chris KawaC Offline
                      Chris Kawa
                      Lifetime Qt Champion
                      wrote on last edited by
                      #14

                      Also note that SIMD works on aligned data. Your struct has no alignment specification so compiler has to generate slower code for unaligned data. Notice the movdqu instruction. The u stands for "unaligned" and basically means it's slower because the cpu must first align the data to an address the SIMD processor can work with. If you specify alignment for your struct to be "SIMD friendly" like this: struct alignas(128) str { then that instruction turns into movdqa where a stands for "aligned" and processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile.

                      ? 1 Reply Last reply
                      2
                      • Chris KawaC Chris Kawa

                        Also note that SIMD works on aligned data. Your struct has no alignment specification so compiler has to generate slower code for unaligned data. Notice the movdqu instruction. The u stands for "unaligned" and basically means it's slower because the cpu must first align the data to an address the SIMD processor can work with. If you specify alignment for your struct to be "SIMD friendly" like this: struct alignas(128) str { then that instruction turns into movdqa where a stands for "aligned" and processor doesn't have to do extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile.

                        ? Offline
                        ? Offline
                        A Former User
                        wrote on last edited by
                        #15

                        @Chris-Kawa
                        Can you recommend good learning materials on SIMD optimized C++ coding or about optimized coding in general?

                        1 Reply Last reply
                        0
                        • Chris KawaC Offline
                          Chris KawaC Offline
                          Chris Kawa
                          Lifetime Qt Champion
                          wrote on last edited by
                          #16

                          Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.

                          ? 1 Reply Last reply
                          0
                          • Chris KawaC Chris Kawa

                            Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.

                            ? Offline
                            ? Offline
                            A Former User
                            wrote on last edited by A Former User
                            #17

                            Addition does not seem as simple operation anymore.
                            When padding, its effect on memory usage and cache line alignments are considered.

                            One side of the operation is struct from long vector that is taken from memory quite randomly, quite low probability of continuous memory accesses , other side is single struct instance in function.
                            The random RAM access patterns probably are main reason it runs slower.

                            Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                            Since the accesses from RAM are quite random , would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?

                            Christian EhrlicherC 1 Reply Last reply
                            0
                            • ? A Former User

                              @Christian-Ehrlicher Yes , but += is addition operation, not only copy.
                              CompilerExplorer is nice tool to learn ASM and seek more under the hood.

                              Using -O3
                              If struct consists of 4 items it does SIMD copy and add in lesser instructions.
                              If struct consists of 6 it does SIMD on 4 and then 2 copy & add operations separately.
                              Probably reason why it does 4+1+1 or just wont support better instructions for backward compatability?
                              Is there some magic SIMD compiler flag i am missing?

                              Christian EhrlicherC Offline
                              Christian EhrlicherC Offline
                              Christian Ehrlicher
                              Lifetime Qt Champion
                              wrote on last edited by
                              #18

                              @Q139 said in How to vectorize operation on struct of data:

                              but += is addition operation, not only copy.

                              then use '=' ... really that hard??

                              Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                              Visit the Qt Academy at https://academy.qt.io/catalog

                              ? Chris KawaC 2 Replies Last reply
                              0
                              • ? A Former User

                                Addition does not seem as simple operation anymore.
                                When padding, its effect on memory usage and cache line alignments are considered.

                                One side of the operation is struct from long vector that is taken from memory quite randomly, quite low probability of continuous memory accesses , other side is single struct instance in function.
                                The random RAM access patterns probably are main reason it runs slower.

                                Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                                Since the accesses from RAM are quite random , would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?

                                Christian EhrlicherC Offline
                                Christian EhrlicherC Offline
                                Christian Ehrlicher
                                Lifetime Qt Champion
                                wrote on last edited by
                                #19

                                @Q139 said in How to vectorize operation on struct of data:

                                Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                                That's not allowed since than you would not be able to mix it with other libs which do not align (due to missing -On)

                                Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                                Visit the Qt Academy at https://academy.qt.io/catalog

                                1 Reply Last reply
                                1
                                • Christian EhrlicherC Christian Ehrlicher

                                  @Q139 said in How to vectorize operation on struct of data:

                                  but += is addition operation, not only copy.

                                  then use '=' ... really that hard??

                                  ? Offline
                                  ? Offline
                                  A Former User
                                  wrote on last edited by A Former User
                                  #20

                                  @Christian-Ehrlicher said in How to vectorize operation on struct of data:

                                  @Q139 said in How to vectorize operation on struct of data:

                                  but += is addition operation, not only copy.

                                  then use '=' ... really that hard??

                                  I dont understand.

                                  a.a+=b.a
                                  a.b+=b.b
                                  a.c+=b.c
                                  ...
                                  
                                  a.a=a.a+b.a
                                  a.b=a.b+b.b
                                  a.c=a.c+b.c
                                  ...
                                  
                                  1 Reply Last reply
                                  0
                                  • Christian EhrlicherC Christian Ehrlicher

                                    @Q139 said in How to vectorize operation on struct of data:

                                    but += is addition operation, not only copy.

                                    then use '=' ... really that hard??

                                    Chris KawaC Offline
                                    Chris KawaC Offline
                                    Chris Kawa
                                    Lifetime Qt Champion
                                    wrote on last edited by
                                    #21

                                    @Christian-Ehrlicher said:

                                    then use '=' ... really that hard??

                                    I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.

                                    @Q139 said:

                                    The random RAM access patterns probably are main reason it runs slower.

                                    If possible, you can try to prefetch the "random" data into a more local variable beforehand, but that's very case specific and there's no general solution for it. The ideal is linear access always. It's sorta an unachievable goal that you use as a general direction and the closer you get the better it will run. But there's always a margin close to the optimum in which it runs "good enough" and there's no point in further optimization because it's masked by bigger inefficiencies in other parts of the system. E.g. optimizing a function to run in 0.01ms instead of 0.02 is pointless when next thing you do is use I/O with >100ms latency. You must always choose your battles to seek the biggest overall gains and not obsess about single detail.

                                    Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                                    Compiler can't mess with alignment. As @Christian-Ehrlicher said it would generate incompatible data layouts from the same code.

                                    would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?

                                    It doesn't matter what I think. I can be right on one machine/architecture and wrong on another. When optimizing test and measure on as many of your target machines as you can. Intuition is usually working against you when doing low level stuff. Assemble a bag of tools and trust the numbers.

                                    1 Reply Last reply
                                    1
                                    • Christian EhrlicherC Offline
                                      Christian EhrlicherC Offline
                                      Christian Ehrlicher
                                      Lifetime Qt Champion
                                      wrote on last edited by Christian Ehrlicher
                                      #22

                                      @Chris-Kawa said in How to vectorize operation on struct of data:

                                      I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.

                                      ok, but '+=' is replaced with avx instructions in -O3 so all is fine (esp. since there is still no validation that this is a bottleneck in his code).

                                      /edit: end since it's a plain structure one can overload operator +=() and use a simply memcpy which is optimized even in -O1

                                      Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                                      Visit the Qt Academy at https://academy.qt.io/catalog

                                      ? 1 Reply Last reply
                                      1
                                      • Christian EhrlicherC Christian Ehrlicher

                                        @Chris-Kawa said in How to vectorize operation on struct of data:

                                        I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.

                                        ok, but '+=' is replaced with avx instructions in -O3 so all is fine (esp. since there is still no validation that this is a bottleneck in his code).

                                        /edit: end since it's a plain structure one can overload operator +=() and use a simply memcpy which is optimized even in -O1

                                        ? Offline
                                        ? Offline
                                        A Former User
                                        wrote on last edited by A Former User
                                        #23

                                        @Christian-Ehrlicher Was running -O2 for few month , yesterday switched back to -O3 , got ~20% speed increase.

                                        Some time ago i didnot notice big dif between -O2 and -O3 but i have increased(doubled) variable counts in multiple data structures lately.

                                        Idea was that i can get similar speed back with vectorization as with fewer variables.
                                        But cache misses still increase with more data.

                                        Scattered RAM access patterns are probably biggest bottleneck.
                                        Could GPU architecture be good for speeding up algo that accesses memory in scattered fashion from all cores?
                                        Also threadlock is problem if 2 or more threads take same values and fighting against it on 1000 cores seems bad idea.

                                        Chris KawaC 1 Reply Last reply
                                        0
                                        • Christian EhrlicherC Offline
                                          Christian EhrlicherC Offline
                                          Christian Ehrlicher
                                          Lifetime Qt Champion
                                          wrote on last edited by
                                          #24

                                          To achieve a similar code with -O2 instead using -O3 in this loop you can e.g. use

                                            data &operator +=(const data &o) {
                                                __m128i *a1 = (__m128i*)this;
                                                __m128i *a2 = (__m128i*)&o;
                                                for (int i = 0; i < 2; ++i)
                                                          *(a1 + i) += *(a2 + i);
                                                    return *this;
                                            }
                                          

                                          But I would let the compiler do this job and again would do some performance analysis before thinking about such stuff to see where the hot spots really is.

                                          Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
                                          Visit the Qt Academy at https://academy.qt.io/catalog

                                          1 Reply Last reply
                                          3

                                          • Login

                                          • Login or register to search.
                                          • First post
                                            Last post
                                          0
                                          • Categories
                                          • Recent
                                          • Tags
                                          • Popular
                                          • Users
                                          • Groups
                                          • Search
                                          • Get Qt Extensions
                                          • Unsolved