How to increase speed of large for loops

@kane9x
The second thing to do would be to reduce the number of std::pow / std::sqrt calls. Those are very time-consuming operations. Or replace them with a faster implementation than the standard one. I would suggest looking into this article:
https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
If I may, std::pow for integer exponents can be trivially replaced by this small template:

```cpp
template< int exponent, typename T >
T power( T base )
{
    if ( exponent == 0 ) {
        return T( 1 );
    } else if ( exponent < 0 ) {
        return T( 1 ) / power< -exponent >( base );  // note: -exponent, or the recursion never terminates
    } else if ( exponent % 2 == 0 ) {
        return power< exponent / 2 >( base * base );
    } else {
        return power< exponent / 2 >( base * base ) * base;
    }
}
```

@JohanSolo
I wondered about this too. In the OP's case he is always using pow() to square, so he could replace it with an inline single multiply. However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?
@JonB said in How to increase speed of large for loops:
@JohanSolo
However, I don't know whether the code for pow() already takes a simple case like this into account and is already efficient?

I attended this lecture a long time ago; on page 5 it looks like, at least at the time, std::pow was not as efficient as it could be. I cannot tell for the current implementation of std::pow though.
@JohanSolo
Yep, you may well be right. I believe for many years C compilers have turned multiplication by 2 (or a power of 2) into a left-shift (or, for all I know, modern chips' multiplication instructions do this automatically so the compiler doesn't have to any more), but that's not the same as a call to the function pow(). In any case, the OP should try one of these code optimisations for his squaring and see if it makes much difference.

He might also try narrowing down just which instructions are costing the time. For example, I don't know whether cloudB->points[v[p2]] is instantaneous indexed access or not. Unless relying on the compiler to do it for you (which it may do, I don't know), there are a lot of places which could be factored into temporary variable assignments for reuse to guarantee no recalculation, e.g. cloudB->points, cloudB->points[v[p0]], etc.

Also, in the loop conditions, is ros::ok() instantaneous or costly?

But it may just be that there are an awful lot of sqrt() calls to perform, which I imagine is the costliest operation (how does that code calculate square roots? IIRC, at school, using Newton's approximation with pen & paper was pretty time-consuming! How does it get done nowadays?). I see @JHilk has posted a link which should be examined in this light.

Finally, the multithreading. Do you have evidence whether the multiple threads are using separate cores on your machine to do the work, and without waiting on each other or something else? In any case, a spare couple of cores are only going to reduce the time by a factor of 2, which may be of little help for what the OP wants. Verify that the separate threads are not actually slowing the whole calculation down!
BTW, how often does the result follow the if (p0p1 > 10) route, causing the inner loop? Is that where it's "slow"? If so, one small possible optimisation: if you are then only interested in the if (p0p2 > 10 && p1p2 > 10) route, after you have calculated p0p2, if it is not > 10 you don't need to calculate p1p2. I don't know how many calculations that would eliminate overall. "Every little helps", as a certain supermarket here says :)
@JonB said in How to increase speed of large for loops:
Also, in the loop conditions is ros::ok() instantaneous or costly?
That's actually a good point: v.size() and ros::ok() are called each cycle. At least size() is something the OP can rationalise away, turning

```cpp
for (size_t p1 = 0; p1 < v.size() && ros::ok(); ++p1) {
```

into

```cpp
for (size_t p1(0), end(v.size()); p1 < end && ros::ok(); ++p1) {
```

If your code does not work as fast as expected, you should do two things first:
- Ask yourself if you use the best algorithm for the given problem
- Profile your algorithm to find out the slowest part. Store the result for later comparison.

You cannot start optimizing before these two steps are finished. Next, set up good unit tests that make sure the behavior does not change when refactoring. Then, replace the slowest part with a better implementation.
Regards


@JonB said in How to increase speed of large for loops:
I don't know quite what you're trying to do or why, but if you mean you have approx. a trillion iterations/square roots etc. to calculate, that's a very large number to be executing if speed is critical....
Maybe I can help with your confusion. The OP is trying to calculate the Euclidean distance for a set of three points, and to do so using a permutation of those three points from the whole set. Something they should've precalculated and stored, and something they should've used SIMD instructions for.

@kshegunov said in How to increase speed of large for loops:
Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points
Yes, I realised it was this sort of thing. However, AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand... :)

@JonB said in How to increase speed of large for loops:
AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand...
Probably not. But I imagine, him being a smart guy, he'd've tabulated whatever he had already calculated so he didn't need to do it again ... at least seems logical to me.

@JonB said in How to increase speed of large for loops:
Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even lookup time is going to get considerable....
Mayhaps. I do like the "we create hardware out of software" approach, I admit; unfortunately this rarely works in practice. Leaving the metaphors to rest for a moment, I implore you to really try to imagine how this is supposed to work and do the following:

1) Notice the inner loop is only interesting if the distance between two points is more than some magic number (not having semi-divine in-code numbers is a matter for another discussion).
2) Notice the inner if is checking if two distances (between two pairs of points) are larger than some arbitrary numbers.
3) Notice that the distance between two points is the same no matter which is first and which is second.
4) Notice that distances are recalculated for every conceivable case of point pairing.
5) Finally (and least importantly), notice that the indirection through some permutation vector breaks data locality and thus invalidates the cache.

Now after a quick think, I hallucinate that 1), 2), 3) and 4) can be fixed rather easily in a single step, without throwing recursive template instantiations at pow, mind you. My "genius" idea is as follows:

1) Go through the pairs of points and save in a container only those pairs (and the distance between them) that satisfy the threshold.
1.1) When doing that it's useful to not repeat: the distance from A to B is going to be the same as the distance from B to A, unless living in an alternate world. This should help shave off some unnecessary duplication.
1.2) Before doing that it's also useful to throw away the permutation vector if possible, so 5) is solved by construction.
2) For the resulting container from 1) (probably a vector) one can see that the innermost if is directly satisfied for any pair of elements ...
3) Step 1) can be parallelized very easily for additional yield.
4) Step 1) can make use of SSE/AVX.

And I was just thinkin' we are having a nice easy-going conversation ... 'cause when I see this:

```cpp
template< int exponent, typename T > T power( T base ) { /* ... */ }
```

I cringe so badly my face is contorted for a week.