How to vectorize operation on struct of data

Solved · General and Desktop · 28 Posts · 3 Posters · 3.4k Views
• Quoted from an earlier post by Chris Kawa:

    Also note that SIMD works on aligned data. Your struct has no alignment specification, so the compiler has to generate slower code for unaligned data. Notice the movdqu instruction: the u stands for "unaligned" and basically means it's slower, because the CPU must first move the data to an address the SIMD unit can work with. If you specify a "SIMD friendly" alignment for your struct, like this: struct alignas(128) str {, then that instruction turns into movdqa, where a stands for "aligned" and the processor doesn't have to do the extra work. Since this adds alignment to your data there will be some memory footprint increase, so that's again something to profile.
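    As a rough sketch of that trade-off (struct names here are hypothetical; 16 bytes is the minimum alignment 128-bit SSE loads want — alignas(128) as in the post also works, it just pads more):

    ```cpp
    #include <cstdint>

    // Default alignment follows the largest member (4 bytes for int32_t).
    struct str_default { std::int32_t a, b, c, d; };

    // alignas widens the alignment so a movdqa-style aligned load is legal.
    struct alignas(16) str_simd { std::int32_t a, b, c, d; };

    static_assert(alignof(str_default) == 4,  "follows largest member");
    static_assert(alignof(str_simd)    == 16, "widened for SIMD loads");
    static_assert(sizeof(str_simd)     == 16, "4 x 4 bytes, no extra padding here");
    ```

    Here the 4-int struct is already 16 bytes, so alignas costs nothing; with fewer or oddly-sized members sizeof would grow to the next multiple of the alignment, which is the footprint increase mentioned above.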

    • A Former User — #15

    @Chris-Kawa
    Can you recommend good learning materials on SIMD-optimized C++ coding, or on optimized coding in general?

    • Chris Kawa (Lifetime Qt Champion) — #16

      Sorry, no. Different people prefer to learn different ways. I usually just dig into specs and manuals and test things out.


        • A Former User — #17

        Addition no longer seems such a simple operation, once padding and its effect on memory usage and cache-line alignment are considered.

        One side of the operation is a struct taken from a long vector at fairly random positions, so contiguous memory accesses are unlikely; the other side is a single struct instance in the function.
        The random RAM access pattern is probably the main reason it runs slower.

        Do you know whether the compiler at -O3 adds padding/alignment to a single struct instance in a function, or does the programmer need to specify it?

        Since the RAM accesses are quite random, would padding every element in the vector speed the operation up, in your opinion, or are shift operations cheap?

        • A Former User (quoted from an earlier post):

          @Christian-Ehrlicher Yes, but += is an addition operation, not only a copy.
          Compiler Explorer is a nice tool to learn ASM and peek under the hood.

          Using -O3:
          If the struct consists of 4 items it does the SIMD copy and add in fewer instructions.
          If the struct consists of 6, it does SIMD on 4 and then 2 separate copy & add operations.
          Is the reason it does 4+1+1 that it won't use wider instructions for backward compatibility?
          Is there some magic SIMD compiler flag I am missing?
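          A minimal version of the kind of struct and operator under discussion (names are illustrative) that can be pasted into Compiler Explorer with -O3 to reproduce the observation:

          ```cpp
          #include <cstdint>

          // Illustrative 4-int struct; at -O3 GCC/Clang typically turn the
          // member-wise adds below into one 128-bit load/add/store. A 6-int
          // variant usually gets one 128-bit op plus a scalar tail for the
          // last two members.
          struct str4 { std::int32_t a, b, c, d; };

          str4& operator+=(str4& lhs, const str4& rhs) {
              lhs.a += rhs.a;
              lhs.b += rhs.b;
              lhs.c += rhs.c;
              lhs.d += rhs.d;
              return lhs;
          }
          ```

          Flags like -mavx2 widen the registers the compiler may use, but they don't remove the scalar tail when the struct size isn't a multiple of the vector width.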

          • Christian Ehrlicher (Lifetime Qt Champion) — #18

          @Q139 said in How to vectorize operation on struct of data:

          but += is addition operation, not only copy.

          then use '=' ... really that hard??

          Qt Online Installer direct download: https://download.qt.io/official_releases/online_installers/
          Visit the Qt Academy at https://academy.qt.io/catalog


            • Christian Ehrlicher (Lifetime Qt Champion) — #19

            @Q139 said in How to vectorize operation on struct of data:

            Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

            That's not allowed, since then you would not be able to mix it with other libs which do not align (due to missing -On)


              • A Former User — #20

              @Christian-Ehrlicher said in How to vectorize operation on struct of data:

              @Q139 said in How to vectorize operation on struct of data:

              but += is addition operation, not only copy.

              then use '=' ... really that hard??

              I don't understand — these are equivalent:

              a.a+=b.a
              a.b+=b.b
              a.c+=b.c
              ...
              
              a.a=a.a+b.a
              a.b=a.b+b.b
              a.c=a.c+b.c
              ...
              

                • Chris Kawa (Lifetime Qt Champion) — #21

                @Christian-Ehrlicher said:

                then use '=' ... really that hard??

                I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.

                @Q139 said:

                The random RAM access patterns probably are main reason it runs slower.

                If possible, you can try to prefetch the "random" data into a more local variable beforehand, but that's very case specific and there's no general solution for it. The ideal is always linear access. It's sort of an unachievable goal that you use as a general direction: the closer you get, the better it will run. But there's always a margin close to the optimum in which it runs "good enough" and there's no point in further optimization, because it's masked by bigger inefficiencies in other parts of the system. E.g. optimizing a function to run in 0.01ms instead of 0.02ms is pointless when the next thing you do is I/O with >100ms latency. You must always choose your battles to seek the biggest overall gains, and not obsess about a single detail.

                Do you know if compiler at -O3 optimization add padding/align single instance of struct in function or programmer would need to specify it?

                The compiler can't mess with alignment. As @Christian-Ehrlicher said, it would generate incompatible data layouts from the same code.

                would it speed up that operation if all in vector padded in your opinion, or shift operations are cheap?

                It doesn't matter what I think. I can be right on one machine/architecture and wrong on another. When optimizing, test and measure on as many of your target machines as you can. Intuition usually works against you when doing low level stuff. Assemble a bag of tools and trust the numbers.
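                A minimal sketch of the "prefetch into something local" idea from above, under the assumption that the scattered indices are known up front (all names are illustrative):

                ```cpp
                #include <cstddef>
                #include <vector>

                // Gather the randomly-indexed elements into a contiguous scratch
                // buffer once, so the hot loop runs over linear memory the
                // vectorizer and prefetcher can handle well.
                long long sum_gathered(const std::vector<int>& data,
                                       const std::vector<std::size_t>& indices) {
                    std::vector<int> scratch;
                    scratch.reserve(indices.size());
                    for (std::size_t idx : indices)   // the only scattered accesses
                        scratch.push_back(data[idx]);
                    long long total = 0;
                    for (int v : scratch)             // linear, vectorizer-friendly pass
                        total += v;
                    return total;
                }
                ```

                Whether the extra copy pays off depends on how many passes reuse the gathered data, which is exactly the "test and measure" point above.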

                  • Christian Ehrlicher (Lifetime Qt Champion) — #22

                  @Chris-Kawa said in How to vectorize operation on struct of data:

                  I think you miss the point. OP is doing a += b. It's not the same as a = b so he can't just replace that.

                  ok, but '+=' is replaced with AVX instructions at -O3 so all is fine (esp. since there is still no validation that this is a bottleneck in his code).

                  /edit: and since it's a plain structure one can overload operator+=() and use a simple memcpy, which is optimized even at -O1
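                  A minimal sketch of the memcpy idea (struct and function names are illustrative). Note that memcpy gives '=' semantics — a plain copy — not '+=' addition, so it fits the assignment case discussed earlier in the thread:

                  ```cpp
                  #include <cstring>
                  #include <type_traits>

                  struct data4 { int a, b, c, d; };  // illustrative plain struct

                  // memcpy is only well-defined for trivially copyable types.
                  static_assert(std::is_trivially_copyable_v<data4>);

                  // Copies all members in one shot; compilers turn this into a
                  // single 16-byte move even at low optimization levels.
                  void assign(data4& dst, const data4& src) {
                      std::memcpy(&dst, &src, sizeof(data4));
                  }
                  ```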


                    • A Former User — #23

                    @Christian-Ehrlicher I was running -O2 for a few months; yesterday I switched back to -O3 and got a ~20% speed increase.

                    Some time ago I didn't notice a big difference between -O2 and -O3, but I have increased (doubled) the variable counts in multiple data structures lately.

                    The idea was that vectorization could get me back to a speed similar to the one with fewer variables.
                    But cache misses still increase with more data.

                    Scattered RAM access patterns are probably the biggest bottleneck.
                    Could a GPU architecture be good for speeding up an algorithm that accesses memory in a scattered fashion from all cores?
                    Also thread locking is a problem if 2 or more threads take the same values, and fighting that on 1000 cores seems like a bad idea.

                      • Christian Ehrlicher (Lifetime Qt Champion) — #24

                      To achieve similar code with -O2 instead of -O3 in this loop you can e.g. use

                        #include <emmintrin.h>  // SSE2 intrinsics; assumes the struct holds eight 32-bit ints

                        data &operator+=(const data &o) {
                            __m128i *a1 = (__m128i *)this;
                            const __m128i *a2 = (const __m128i *)&o;
                            for (int i = 0; i < 2; ++i) {
                                // element-wise 32-bit adds; loadu/storeu make no alignment assumptions
                                _mm_storeu_si128(a1 + i,
                                    _mm_add_epi32(_mm_loadu_si128(a1 + i),
                                                  _mm_loadu_si128(a2 + i)));
                            }
                            return *this;
                        }

                      But I would let the compiler do this job, and again would do some performance analysis before thinking about such stuff, to see where the hot spots really are.


                        • Chris Kawa (Lifetime Qt Champion) — #25

                        @Q139 said:

                        Could GPU architecture be good for speeding up algo that accesses memory in scattered fashion from all cores?

                        The problem with GPUs is that they usually have their own RAM, so you have to transfer the data over the bus to and from the GPU, and that can be a bottleneck too. GPUs are good if you can make the data stay on the GPU for most or all of the time. It's problematic if you need to juggle between CPU and GPU. There are architectures that use a unified memory structure (modern game consoles are one such example) where all RAM is shared by the CPU and GPU, but that's again a specific hardware situation.

                        Another consideration is that GPU memory is usually organized and accessed very differently from what the CPU does. GPUs work in waves, warps, clusters, tiles or whatever a given vendor calls it. The point being that the caching, synchronization and access patterns are more tightly constrained on GPUs, not less.

                        Also threadlock is problem if 2 or more threads take same values and fighting against it on 1000 cores seems bad idea.

                        Reading the same value is not a problem. Writing is. If you do accumulation, for example, the usual solution is to use partial accumulation on each thread and then do a sync point and an extra pass to accumulate the partial results. There are standard facilities added in recent C++ standards for such parallelism, e.g. the std::reduce algorithm, which can parallelize and vectorize accumulation. There are also many libraries that do this, like PPL or OpenMP. It's just a lot of testing and figuring out what works best for your case.
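                        The partial-accumulation pattern described above can be sketched with plain std::thread (names are illustrative; std::reduce with a parallel execution policy does the same with less ceremony):

                        ```cpp
                        #include <algorithm>
                        #include <cstddef>
                        #include <numeric>
                        #include <thread>
                        #include <vector>

                        // Each thread sums only its own chunk into its own slot, so no
                        // writes are shared; one extra pass after the join (the sync
                        // point) combines the partial results.
                        long long parallel_sum(const std::vector<int>& data, unsigned nthreads) {
                            std::vector<long long> partial(nthreads, 0);
                            std::vector<std::thread> pool;
                            const std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
                            for (unsigned t = 0; t < nthreads; ++t) {
                                pool.emplace_back([&, t] {
                                    const std::size_t lo = t * chunk;
                                    const std::size_t hi = std::min(data.size(), lo + chunk);
                                    for (std::size_t i = lo; i < hi; ++i)
                                        partial[t] += data[i];   // each thread owns one slot
                                });
                            }
                            for (auto& th : pool) th.join();     // sync point
                            return std::accumulate(partial.begin(), partial.end(), 0LL);
                        }
                        ```

                        On a GPU the same shape appears as per-block partial sums followed by a final reduction pass.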


                          • A Former User — #26

                          @Chris-Kawa said in How to vectorize operation on struct of data:

                          The problem with GPUs is that they usually have their own RAM so you have to transfer the data over the bus to and from the GPU and that can be a bottleneck too

                          Transfers could be done rarely, to unload and fill GPU memory, or when blocks of the task are finished.

                          Also I'd assume that GPU threads can work on some memory while other memory locations are being transferred.

                          @Chris-Kawa said in How to vectorize operation on struct of data:

                          The point being that the caching, synchronization and access patterns are more tightly constrained on GPUs, not less.

                          My assumptions about GPU memory access might be wrong.

                          I don't know the GPU architecture limitations well, but better access to scattered memory locations with 1000+ cores would be the only reason to port to the GPU.

                          @Chris-Kawa said in How to vectorize operation on struct of data:

                          Reading same value is not a problem. Writing is.

                          Even if GPU memory access were superior in this case, avoiding write conflicts without sacrificing performance or extra memory seems harder with 1000+ threads.

                          Can you recommend something to read on GPU memory access constraints when working with scattered data?


                            • Chris Kawa (Lifetime Qt Champion) — #27

                            @Q139 said:

                            Avoiding write conflicts without sacrificing performance seems harder with 1000+ threads.

                            Yup, it is. There are atomics on GPUs, but if you're gonna do an atomic operation on a single variable from 1000+ threads then those are gonna get serialized anyway and, depending on your target setup, you can run into TDR issues. When multithreading you often need to use dedicated algorithms, like the partial sums I mentioned. One such example is bitonic sort, which is a very wasteful algorithm and terrible on a single thread, but maps well to GPU hardware and is well suited for parallelism.

                            Can you recommend something to read on GPU memory access constraints when working with scattered data?

                            It's kind of problematic, because GPU technology is very diverse and good practices will vary even from chip to chip. There are some umbrella technologies that try to abstract some details away so you can program for a wider range of hardware, but they sometimes come at the cost of not fully leveraging a given hardware's capabilities. CUDA is one such technology that is popular and has an extensive documentation/example base. OpenCL is another one, and NVidia has a lot of samples and interesting papers for it.

