Skip to content
  • Categories
  • Recent
  • Tags
  • Popular
  • Users
  • Groups
  • Search
  • Get Qt Extensions
  • Unsolved
Collapse
Brand Logo
  1. Home
  2. Special Interest Groups
  3. C++ Gurus
  4. How to increase speed of large for loops

How to increase speed of large for loops

Scheduled Pinned Locked Moved Unsolved C++ Gurus
30 Posts 8 Posters 6.5k Views
  • Oldest to Newest
  • Newest to Oldest
  • Most Votes
Reply
  • Reply as topic
Log in to reply
This topic has been deleted. Only users with topic management privileges can see it.
  • K kane9x

    This post is deleted!

    jsulmJ Online
    jsulmJ Online
    jsulm
    Lifetime Qt Champion
    wrote on last edited by jsulm
    #2

    @kane9x First thing to do would be to get these x, y, z once:

    auto x = cloudB->points[v[p0]].x;
    auto y = cloudB->points[v[p1]].y;
    auto z = cloudB->points[v[p1]].z;
    
    1 Reply Last reply
    4
    • K kane9x

      This post is deleted!

      J.HilkJ Online
      J.HilkJ Online
      J.Hilk
      Moderators
      wrote on last edited by
      #3

      @kane9x
      the 2nd thing to do would be to reduce the number of std::pow /sqrt calls. Those a very time consuming operations,
      or replace it with a faster one than the standard one

      I would suggest looking into this article:
      https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/


      Be aware of the Qt Code of Conduct, when posting : https://forum.qt.io/topic/113070/qt-code-of-conduct


      Q: What's that?
      A: It's blue light.
      Q: What does it do?
      A: It turns blue.

      1 Reply Last reply
      6
      • JohanSoloJ Offline
        JohanSoloJ Offline
        JohanSolo
        wrote on last edited by
        #4

        If I may, std::pow for integer exponents can be trivially replaced by this small template:

        template< int exponent, typename T >
        T power( T base )
        {
            if ( exponent == 0 )
            {
                return T( 1 );
            }
            else if ( exponent < 0 )
            {
                return T( 1 ) / power< -exponent >( base );
            }
            else if ( exponent % 2 == 0 )
            {
                return power< exponent / 2 >( base * base );
            }
            else
            {
                return power< exponent / 2 >( base * base ) * base;
            }
        }
        
        

        `They did not know it was impossible, so they did it.'
        -- Mark Twain

        JonBJ 1 Reply Last reply
        2
        • K kane9x

          This post is deleted!

          jsulmJ Online
          jsulmJ Online
          jsulm
          Lifetime Qt Champion
          wrote on last edited by jsulm
          #5
          This post is deleted!
          1 Reply Last reply
          1
          • JohanSoloJ JohanSolo

            If I may, std::pow for integer exponents can be trivially replaced by this small template:

            template< int exponent, typename T >
            T power( T base )
            {
                if ( exponent == 0 )
                {
                    return T( 1 );
                }
                else if ( exponent < 0 )
                {
                    return T( 1 ) / power< -exponent >( base );
                }
                else if ( exponent % 2 == 0 )
                {
                    return power< exponent / 2 >( base * base );
                }
                else
                {
                    return power< exponent / 2 >( base * base ) * base;
                }
            }
            
            
            JonBJ Offline
            JonBJ Offline
            JonB
            wrote on last edited by
            #6

            @JohanSolo
            I wondered about this too. Or in the OP's case he is always using pow() to square, he could replace with in-line single multiply. However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?

            JohanSoloJ 1 Reply Last reply
            0
            • JonBJ JonB

              @JohanSolo
              I wondered about this too. Or in the OP's case he is always using pow() to square, he could replace with in-line single multiply. However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?

              JohanSoloJ Offline
              JohanSoloJ Offline
              JohanSolo
              wrote on last edited by JohanSolo
              #7

              @JonB said in How to increase speed of large for loops:

              @JohanSolo
              However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?

              I attended this lecture long time ago, on page 5 it looks like at least at the time the std::pow was not as efficient as it could be. I cannot tell for the current implementation of std::pow though.

              `They did not know it was impossible, so they did it.'
              -- Mark Twain

              JonBJ 1 Reply Last reply
              0
              • JohanSoloJ JohanSolo

                @JonB said in How to increase speed of large for loops:

                @JohanSolo
                However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?

                I attended this lecture long time ago, on page 5 it looks like at least at the time the std::pow was not as efficient as it could be. I cannot tell for the current implementation of std::pow though.

                JonBJ Offline
                JonBJ Offline
                JonB
                wrote on last edited by JonB
                #8

                @JohanSolo
                Yep, you may well be right. I believe for many years C compilers have turned multiplication by 2 (or power of 2) into left-shift (or for all I know modern chips' multiplication instructions do this automatically so compiler doesn't have to any more), but that's not the same as a call to function pow(). In any case, the OP should try one of these code optimisations for his squaring and see if it makes much difference.

                He might also trying narrowing down just which instructions are causing time. For example, I don't know whether the cloudB->points[v[p2]] is instantaneous indexed access or what. Unless relying on the compiler to do it for you (which it may do, I don't know), there are a lot of places which could be factored into temporary variable assignments for re-use to guarantee no re-calculation, e.g. cloudB->points, cloudB->points[v[p0]], etc.

                Also, in the loop conditions is ros::ok() instantaneous or costly?

                But it may just be that there are an awful lot of sqrt() to perform, which I imagine is the costliest operation (how does that code calculate square roots? IIRC, at school using Newton's approximation with pen & paper was pretty time-consuming! How does it get done nowadays?). I see @J-Hilk has posted a link which should be examined in this light.

                Finally, the multi-threading. Do you have evidence whether the multiple threads are using separate cores on your machine to do the work, and without waiting on each other or something else? In any case, a spare couple of cores are only going to reduce the time by a factor of 2, which may be of little help for what the OP wants. Verify that the separate threads are not actually slowing the whole calculation down!

                BTW, how often does the result follow the if(p0p1>10) route, causing the inner loop? Is that where it's "slow"? If so, one small possible optimisation: if you are then only interested in the if(p0p2>10 && p1p2>10) route, after you have calculated p0p2 if it is not >10 you don't need to calculate p1p2, don't know how many calculations that would eliminate overall. "Every little helps", as a certain supermarket here says :)

                J.HilkJ 1 Reply Last reply
                3
                • JonBJ JonB

                  @JohanSolo
                  Yep, you may well be right. I believe for many years C compilers have turned multiplication by 2 (or power of 2) into left-shift (or for all I know modern chips' multiplication instructions do this automatically so compiler doesn't have to any more), but that's not the same as a call to function pow(). In any case, the OP should try one of these code optimisations for his squaring and see if it makes much difference.

                  He might also trying narrowing down just which instructions are causing time. For example, I don't know whether the cloudB->points[v[p2]] is instantaneous indexed access or what. Unless relying on the compiler to do it for you (which it may do, I don't know), there are a lot of places which could be factored into temporary variable assignments for re-use to guarantee no re-calculation, e.g. cloudB->points, cloudB->points[v[p0]], etc.

                  Also, in the loop conditions is ros::ok() instantaneous or costly?

                  But it may just be that there are an awful lot of sqrt() to perform, which I imagine is the costliest operation (how does that code calculate square roots? IIRC, at school using Newton's approximation with pen & paper was pretty time-consuming! How does it get done nowadays?). I see @J-Hilk has posted a link which should be examined in this light.

                  Finally, the multi-threading. Do you have evidence whether the multiple threads are using separate cores on your machine to do the work, and without waiting on each other or something else? In any case, a spare couple of cores are only going to reduce the time by a factor of 2, which may be of little help for what the OP wants. Verify that the separate threads are not actually slowing the whole calculation down!

                  BTW, how often does the result follow the if(p0p1>10) route, causing the inner loop? Is that where it's "slow"? If so, one small possible optimisation: if you are then only interested in the if(p0p2>10 && p1p2>10) route, after you have calculated p0p2 if it is not >10 you don't need to calculate p1p2, don't know how many calculations that would eliminate overall. "Every little helps", as a certain supermarket here says :)

                  J.HilkJ Online
                  J.HilkJ Online
                  J.Hilk
                  Moderators
                  wrote on last edited by
                  #9

                  @JonB said in How to increase speed of large for loops:

                  Also, in the loop conditions is ros::ok() instantaneous or costly?

                  that's actually a good point,

                  v.size() and ros::ok() are called each cycle. at least size() is something the op can rationalize away

                  for(size_t p1=0;p1<v.size() && ros::ok();++p1) {

                  to

                  for(size_t p1 (0), end(v.size()); p1<end && ros::ok();++p1) {


                  Be aware of the Qt Code of Conduct, when posting : https://forum.qt.io/topic/113070/qt-code-of-conduct


                  Q: What's that?
                  A: It's blue light.
                  Q: What does it do?
                  A: It turns blue.

                  1 Reply Last reply
                  2
                  • K kane9x

                    This post is deleted!

                    aha_1980A Offline
                    aha_1980A Offline
                    aha_1980
                    Lifetime Qt Champion
                    wrote on last edited by
                    #10

                    @kane9x

                    If your code does not work as fast as expected, you should do two things first:

                    1. Ask yourself if you use the best algorithm for the given problem
                    2. Profile your alogrithm to find out the slowest part. Store the result for later comparism.

                    You cannot start optimizing before these two steps are finished. Next, set up good unit tests that make sure the behavior does not change when refactoring. Then, replace the slowest part with a better implementation.

                    Regards

                    Qt has to stay free or it will die.

                    K 1 Reply Last reply
                    4
                    • aha_1980A aha_1980

                      @kane9x

                      If your code does not work as fast as expected, you should do two things first:

                      1. Ask yourself if you use the best algorithm for the given problem
                      2. Profile your alogrithm to find out the slowest part. Store the result for later comparism.

                      You cannot start optimizing before these two steps are finished. Next, set up good unit tests that make sure the behavior does not change when refactoring. Then, replace the slowest part with a better implementation.

                      Regards

                      K Offline
                      K Offline
                      kane9x
                      Banned
                      wrote on last edited by
                      #11

                      @aha_1980 Thank you, I think what I should do now is to find some algorithm to help me optimize it

                      JonBJ 1 Reply Last reply
                      0
                      • K kane9x

                        @aha_1980 Thank you, I think what I should do now is to find some algorithm to help me optimize it

                        JonBJ Offline
                        JonBJ Offline
                        JonB
                        wrote on last edited by
                        #12

                        @kane9x

                        nearly about 8e+12 iterations

                        I don't know quite what you're trying to do why, but if you mean you have approx a trillion iterations/square roots etc. to calculate that's a very large number to be executing if speed is critical....

                        kshegunovK 1 Reply Last reply
                        1
                        • JonBJ JonB

                          @kane9x

                          nearly about 8e+12 iterations

                          I don't know quite what you're trying to do why, but if you mean you have approx a trillion iterations/square roots etc. to calculate that's a very large number to be executing if speed is critical....

                          kshegunovK Offline
                          kshegunovK Offline
                          kshegunov
                          Moderators
                          wrote on last edited by kshegunov
                          #13

                          @JonB said in How to increase speed of large for loops:

                          I don't know quite what you're trying to do why, but if you mean you have approx a trillion iterations/square roots etc. to calculate that's a very large number to be executing if speed is critical....

                          Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points and do so by using a permutation of those three points from the whole set. Something they should've precalculated and stored and something they should've used the SIMD instructions for.

                          Read and abide by the Qt Code of Conduct

                          kshegunovK JonBJ 2 Replies Last reply
                          3
                          • kshegunovK kshegunov

                            @JonB said in How to increase speed of large for loops:

                            I don't know quite what you're trying to do why, but if you mean you have approx a trillion iterations/square roots etc. to calculate that's a very large number to be executing if speed is critical....

                            Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points and do so by using a permutation of those three points from the whole set. Something they should've precalculated and stored and something they should've used the SIMD instructions for.

                            kshegunovK Offline
                            kshegunovK Offline
                            kshegunov
                            Moderators
                            wrote on last edited by
                            #14

                            PS.
                            Just to be clear, the indirection cloudB->points[v[p0]] is a cache line invalidation every time.

                            Read and abide by the Qt Code of Conduct

                            1 Reply Last reply
                            2
                            • kshegunovK kshegunov

                              @JonB said in How to increase speed of large for loops:

                              I don't know quite what you're trying to do why, but if you mean you have approx a trillion iterations/square roots etc. to calculate that's a very large number to be executing if speed is critical....

                              Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points and do so by using a permutation of those three points from the whole set. Something they should've precalculated and stored and something they should've used the SIMD instructions for.

                              JonBJ Offline
                              JonBJ Offline
                              JonB
                              wrote on last edited by
                              #15

                              @kshegunov said in How to increase speed of large for loops:

                              Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points

                              Yes, I realised it was this sort of thing. However, AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand... :)

                              kshegunovK 1 Reply Last reply
                              0
                              • JonBJ JonB

                                @kshegunov said in How to increase speed of large for loops:

                                Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points

                                Yes, I realised it was this sort of thing. However, AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand... :)

                                kshegunovK Offline
                                kshegunovK Offline
                                kshegunov
                                Moderators
                                wrote on last edited by kshegunov
                                #16

                                @JonB said in How to increase speed of large for loops:

                                AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand...

                                Probably not. But I imagine, him being a smart guy, he'd've tabulated whatever he had already calculated so he didn't need to do it again ... at least seems logical to me.

                                Read and abide by the Qt Code of Conduct

                                JonBJ 1 Reply Last reply
                                0
                                • kshegunovK kshegunov

                                  @JonB said in How to increase speed of large for loops:

                                  AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand...

                                  Probably not. But I imagine, him being a smart guy, he'd've tabulated whatever he had already calculated so he didn't need to do it again ... at least seems logical to me.

                                  JonBJ Offline
                                  JonBJ Offline
                                  JonB
                                  wrote on last edited by
                                  #17

                                  @kshegunov
                                  Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

                                  kshegunovK 1 Reply Last reply
                                  0
                                  • JonBJ JonB

                                    @kshegunov
                                    Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

                                    kshegunovK Offline
                                    kshegunovK Offline
                                    kshegunov
                                    Moderators
                                    wrote on last edited by kshegunov
                                    #18

                                    @JonB said in How to increase speed of large for loops:

                                    Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

                                    Mayhaps. I do like the "we create hardware out of software" approach, I admit, unfortunately this rarely works in practice. Leaving the metaphors to rest for a moment, I implore you to really try to imagine how this is supposed to work and do the following:

                                    1. Notice the inner loop is only interesting if the distance between two points is more than some magic number (not having semi-divine in-code numbers is a matter for another discussion).
                                    2. Notice the inner if is checking if two distances (between two pairs of points) are larger than some arbitrary numbers.
                                    3. Notice that the distance between two points is the same no matter which is first and which is second.
                                    4. Notice that distances are recalculated for every conceivable case of point pairing.
                                    5. Finally (and least importantly), notice that the indirection through some permutation vector brakes data locality and thus invalidates the cache.

                                    Now after a quick think, I hallucinate that 1), 2), 3) and 4) can be fixed rather easily in a single step, without throwing recursive template instantiations at pow, mind you. My "genius" idea is as follows:

                                    1. Go through the pairs of points and save in a container only these pairs (and the distance between them) that satisfy the threshold.
                                      1.1) When doing that it's useful to not repeat, thus the distance from A to B is going to be the same as the distance from B to A, unless living in an alternate world. This should help shave off some unnecessary duplication.
                                      1.2) Before doing that it's also useful to throw away the permutation vector if possible, so 5) to be solved by construction.
                                    2. For the resulting container from 1) (probably a vector) one can see that the innermost if is directly satisfied for any pair of elements ...
                                    3. Step 1) can be parallelized very easily for additional yield.
                                    4. Step 1) can make use of SSE/AVX.

                                    Read and abide by the Qt Code of Conduct

                                    JonBJ 1 Reply Last reply
                                    1
                                    • kshegunovK kshegunov

                                      @JonB said in How to increase speed of large for loops:

                                      Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

                                      Mayhaps. I do like the "we create hardware out of software" approach, I admit, unfortunately this rarely works in practice. Leaving the metaphors to rest for a moment, I implore you to really try to imagine how this is supposed to work and do the following:

                                      1. Notice the inner loop is only interesting if the distance between two points is more than some magic number (not having semi-divine in-code numbers is a matter for another discussion).
                                      2. Notice the inner if is checking if two distances (between two pairs of points) are larger than some arbitrary numbers.
                                      3. Notice that the distance between two points is the same no matter which is first and which is second.
                                      4. Notice that distances are recalculated for every conceivable case of point pairing.
                                      5. Finally (and least importantly), notice that the indirection through some permutation vector brakes data locality and thus invalidates the cache.

                                      Now after a quick think, I hallucinate that 1), 2), 3) and 4) can be fixed rather easily in a single step, without throwing recursive template instantiations at pow, mind you. My "genius" idea is as follows:

                                      1. Go through the pairs of points and save in a container only these pairs (and the distance between them) that satisfy the threshold.
                                        1.1) When doing that it's useful to not repeat, thus the distance from A to B is going to be the same as the distance from B to A, unless living in an alternate world. This should help shave off some unnecessary duplication.
                                        1.2) Before doing that it's also useful to throw away the permutation vector if possible, so 5) to be solved by construction.
                                      2. For the resulting container from 1) (probably a vector) one can see that the innermost if is directly satisfied for any pair of elements ...
                                      3. Step 1) can be parallelized very easily for additional yield.
                                      4. Step 1) can make use of SSE/AVX.
                                      JonBJ Offline
                                      JonBJ Offline
                                      JonB
                                      wrote on last edited by
                                      #19

                                      @kshegunov Indeedy, these are all good points, FAO the OP, @kane9x, not me!

                                      kshegunovK 1 Reply Last reply
                                      0
                                      • JonBJ JonB

                                        @kshegunov Indeedy, these are all good points, FAO the OP, @kane9x, not me!

                                        kshegunovK Offline
                                        kshegunovK Offline
                                        kshegunov
                                        Moderators
                                        wrote on last edited by kshegunov
                                        #20

                                        And I was just thinkin' we are having a nice easygoing conversation ... 'cause when I see this:

                                        template< int exponent, typename T >
                                        T power( T base )
                                        {
                                            // ...
                                        }
                                        

                                        I cringe so badly my face is contorted for a week.

                                        Read and abide by the Qt Code of Conduct

                                        JonBJ 1 Reply Last reply
                                        0
                                        • kshegunovK kshegunov

                                          And I was just thinkin' we are having a nice easygoing conversation ... 'cause when I see this:

                                          template< int exponent, typename T >
                                          T power( T base )
                                          {
                                              // ...
                                          }
                                          

                                          I cringe so badly my face is contorted for a week.

                                          JonBJ Offline
                                          JonBJ Offline
                                          JonB
                                          wrote on last edited by JonB
                                          #21

                                          @kshegunov
                                          No idea what's foul about it, or the bit you've quoted, so you'd better explain? Unless you mean the whole idea of using templates, which of course I never used: C didn't need them, C++ added them as an obfuscation layer, so I'm quite happy without ;-)

                                          Mind you, I looked at @JohanSolo's code above. His definition is a recursive one (return power< exponent / 2 >( base * base ) * base;). I'm surprised. This would be all very well in my old Prolog, but I don't think the C++ compiler is going to recognise & remove tail recursion in the definition. So I don't know what he means by "trivially replaced", why would one want to use such a definition?

                                          kshegunovK 1 Reply Last reply
                                          0

                                          • Login

                                          • Login or register to search.
                                          • First post
                                            Last post
                                          0
                                          • Categories
                                          • Recent
                                          • Tags
                                          • Popular
                                          • Users
                                          • Groups
                                          • Search
                                          • Get Qt Extensions
                                          • Unsolved