```
void thread_function(pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr cloudB,vector<int> v,int p0) {
    for(size_t p1=0;p1<v.size() && ros::ok();++p1) {
        int p0p1 = sqrt( pow(cloudB->points[v[p1]].x-cloudB->points[v[p0]].x,2)
                        +pow(cloudB->points[v[p1]].y-cloudB->points[v[p0]].y,2)
                        +pow(cloudB->points[v[p1]].z-cloudB->points[v[p0]].z,2) ) * 1000;
        if(p0p1>10) {
            for(size_t p2=0;p2<v.size() && ros::ok();++p2) {
                int p0p2 = sqrt( pow(cloudB->points[v[p2]].x-cloudB->points[v[p0]].x,2)
                                +pow(cloudB->points[v[p2]].y-cloudB->points[v[p0]].y,2)
                                +pow(cloudB->points[v[p2]].z-cloudB->points[v[p0]].z,2) ) * 1000;
                int p1p2 = sqrt( pow(cloudB->points[v[p2]].x-cloudB->points[v[p1]].x,2)
                                +pow(cloudB->points[v[p2]].y-cloudB->points[v[p1]].y,2)
                                +pow(cloudB->points[v[p2]].z-cloudB->points[v[p1]].z,2) ) * 1000;
                if(p0p2>10 && p1p2>10) {
                }
            }
        }
    }
    x[p0] = 3;
    cout<<"ended thread="<<p0<<endl;
}
```

This task is essential for my algorithm to complete, and I need suggestions on how to make these loops run very fast. In the code above, `thread_function` is the main function where I currently have the `for` loops. Is there any way to increase its performance?

```
auto x = cloudB->points[v[p0]].x;
auto y = cloudB->points[v[p1]].y;
auto z = cloudB->points[v[p1]].z;
```

The second thing to do would be to reduce the number of `std::pow`/`sqrt` calls, since those are very time-consuming operations, or to replace them with a faster implementation than the standard one.

I would suggest looking into this article:

https://martin.ankerl.com/2012/01/25/optimized-approximative-pow-in-c-and-cpp/
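One way to reduce the `sqrt` calls, sketched here under assumptions: a threshold test can compare squared distances against a squared threshold, so the root is never taken. `Point3`, `squared_distance` and `farther_than` are hypothetical names for illustration, not part of PCL. The original code also truncates `sqrt(...) * 1000` to `int` before comparing with `10`, so its exact cut-off differs slightly from a plain floating-point comparison.

```cpp
// Hypothetical stand-in for a PCL point; only x, y, z are used here.
struct Point3 {
    float x, y, z;
};

// Squared Euclidean distance: three subtractions and three multiplications,
// with no pow() and no sqrt().
inline float squared_distance(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For non-negative d and t, d > t is equivalent to d*d > t*t,
// so the square root is not needed for a threshold test.
inline bool farther_than(const Point3& a, const Point3& b, float threshold) {
    return squared_distance(a, b) > threshold * threshold;
}
```

The threshold is expressed in the same unit as the coordinates, so the `* 1000` scaling of the original would move into the threshold value instead.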

A replacement for `std::pow` with integer exponents known at compile time:

```
template< int exponent, typename T >
T power( T base )
{
    if ( exponent == 0 )
    {
        return T( 1 );
    }
    else if ( exponent < 0 )
    {
        return T( 1 ) / power< -exponent >( base );
    }
    else if ( exponent % 2 == 0 )
    {
        return power< exponent / 2 >( base * base );
    }
    else
    {
        return power< exponent / 2 >( base * base ) * base;
    }
}
```

I wondered about this too. Or, since in the OP's case he is always using `pow()` to square, he could replace it with a single in-line multiply. However, I don't know whether the code for `pow()` already takes a simple case like this into account and is already efficient?
@JohanSolo

However, I don't know whether the code for `pow()` already takes a simple case like this into account and is already efficient?

I attended this lecture a long time ago; on page 5 it looks like, at least at the time, `std::pow` was not as efficient as it could be. I cannot speak for the current implementation of `std::pow`, though.

Yep, you may well be right. I believe for many years C compilers have turned multiplication by 2 (or a power of 2) into a left-shift (or for all I know modern chips' multiplication instructions do this automatically, so the compiler doesn't have to any more), but that's not the same as a call to the function `pow()`. In any case, the OP should try one of these code optimisations for his squaring and see if it makes much difference.

He might also try narrowing down just which instructions are taking the time. For example, I don't know whether `cloudB->points[v[p2]]` is instantaneous indexed access or what. Unless relying on the compiler to do it for you (which it may do, I don't know), there are a lot of places which could be factored into temporary variable assignments for re-use to guarantee no re-calculation, e.g. `cloudB->points`, `cloudB->points[v[p0]]`, etc.
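As a rough sketch of that factoring, assuming a plain `std::vector` of points rather than the real PCL cloud: `Point3` and `count_far_points` are hypothetical names. The lookup for `p0` is hoisted out of the loop, each `p1` lookup happens once per iteration, and `v` is taken by const reference rather than copied.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PCL point; only x, y, z are used here.
struct Point3 {
    float x, y, z;
};

// Counts the indexed points farther than `threshold` from the point at v[p0].
std::size_t count_far_points(const std::vector<Point3>& points,
                             const std::vector<int>& v,
                             std::size_t p0, float threshold) {
    const Point3& a = points[v[p0]];         // looked up once, outside the loop
    const float t2 = threshold * threshold;  // squared once, reused
    std::size_t count = 0;
    for (std::size_t p1 = 0, end = v.size(); p1 < end; ++p1) {
        const Point3& b = points[v[p1]];     // one lookup per iteration
        const float dx = b.x - a.x;
        const float dy = b.y - a.y;
        const float dz = b.z - a.z;
        if (dx * dx + dy * dy + dz * dz > t2) {
            ++count;
        }
    }
    return count;
}
```

An optimising compiler may already perform some of this hoisting by itself; only a measurement can tell whether the manual version helps.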

Also, in the loop conditions is `ros::ok()` instantaneous or costly?

But it may just be that there are an *awful* lot of `sqrt()` calls to perform, which I imagine is the costliest operation (how does that code calculate square roots? IIRC, at school using Newton's approximation with pen & paper was pretty time-consuming! How does it get done nowadays?). I see @J-Hilk has posted a link which should be examined in this light.

Finally, the multi-threading. Do you have evidence whether the multiple threads are using separate cores on your machine to do the work, and without waiting on each other or something else? In any case, a spare couple of cores are only going to reduce the time by a factor of 2, which may be of little help for what the OP wants. Verify that the separate threads are not actually slowing the whole calculation down!

BTW, how often does the result follow the `if(p0p1>10)` route, causing the inner loop? Is that where it's "slow"? If so, one small possible optimisation: if you are then only interested in the `if(p0p2>10 && p1p2>10)` route, then after you have calculated `p0p2`, if it is *not* `>10` you don't need to calculate `p1p2`. I don't know how many calculations that would eliminate overall. "Every little helps", as a certain supermarket here says :)
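A minimal sketch of that early exit, with hypothetical names (`Point3`, `squared_distance`, `count_third_points`) and squared distances standing in for the original `sqrt`/`pow` expressions: the second distance is only computed when the first one has already passed the threshold.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PCL point; only x, y, z are used here.
struct Point3 {
    float x, y, z;
};

inline float squared_distance(const Point3& a, const Point3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// For fixed p0 and p1, counts the points p2 that are farther than
// `threshold` from both of them.
std::size_t count_third_points(const std::vector<Point3>& pts,
                               std::size_t p0, std::size_t p1,
                               float threshold) {
    const float t2 = threshold * threshold;
    std::size_t count = 0;
    for (std::size_t p2 = 0; p2 < pts.size(); ++p2) {
        // First test fails: skip, and the p1-p2 distance is never computed.
        if (squared_distance(pts[p0], pts[p2]) <= t2) continue;
        if (squared_distance(pts[p1], pts[p2]) <= t2) continue;
        ++count;
    }
    return count;
}
```

Writing both distance expressions directly inside a single `a > t && b > t` condition has the same effect, because `&&` does not evaluate its right-hand side when the left-hand side is false.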

Also, in the loop conditions is ros::ok() instantaneous or costly?

That's actually a good point: `v.size()` and `ros::ok()` are called on each cycle. At least `size()` is something the OP can rationalize away, by changing

`for(size_t p1=0;p1<v.size() && ros::ok();++p1) {`

to

`for(size_t p1 (0), end(v.size()); p1<end && ros::ok();++p1) {`

If your code does not work as fast as expected, you should do two things first:

- Ask yourself if you use the best algorithm for the given problem
- Profile your algorithm to find out the slowest part. Store the result for later comparison.

You cannot start optimizing before these two steps are finished. Next, set up good unit tests that make sure the behavior does not change when refactoring. Then, replace the slowest part with a better implementation.
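Short of a real profiler, a coarse stopwatch can at least record a baseline to compare against after each change. The helper below is only a sketch; `time_ms` is a hypothetical name, and it measures wall-clock time for a single run, so repeated runs are advisable.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename Work>
double time_ms(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It can wrap any callable, for example `time_ms([&] { thread_function(cloudB, v, p0); })`, with the result stored for comparison with later versions.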

Regards

nearly about 8e+12 iterations

I don't know quite what you're trying to do or why, but if you mean you have approx a trillion iterations/square roots etc. to calculate, that's a *very large number* to be executing if speed is critical....

I don't know quite what you're trying to do or why, but if you mean you have approx a trillion iterations/square roots etc. to calculate, that's a very large number to be executing if speed is critical....

Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points and do so by using a permutation of those three points from the whole set. Something they should've precalculated and stored and something they should've used the SIMD instructions for.

Just to be clear, the indirection `cloudB->points[v[p0]]` is a cache line invalidation.

Maybe I can help with your confusion. The OP is trying to calculate the euclidean distance for a set of three points

Yes, I realised it was this sort of thing. However, AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand... :)

AFAIK Euclid did not have the aid of a PC and presumably would have struggled to calculate a trillion distances by hand...

Probably not. But I imagine, him being a smart guy, he'd've tabulated whatever he had already calculated so he didn't need to do it again ... at least seems logical to me.

Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

Trouble is, writing down the answers to a trillion square roots takes a lot of space. And with that many even look-up time is going to get considerable....

Mayhaps. I do like the "we create hardware out of software" approach, I admit, unfortunately this rarely works in practice. Leaving the metaphors to rest for a moment, I implore you to really try to imagine how this is supposed to work and do the following:

1) Notice the inner loop is only interesting if the distance between two points is more than some magic number (not having semi-divine in-code numbers is a matter for another discussion).
2) Notice the inner `if` is checking if two distances (between two pairs of points) are larger than some arbitrary numbers.
3) Notice that the distance between two points is the same no matter which is first and which is second.
4) Notice that distances are recalculated for every conceivable case of point pairing.
5) Finally (and least importantly), notice that the indirection through some permutation vector breaks data locality and thus invalidates the cache.

Now after a quick think, I hallucinate that 1), 2), 3) and 4) can be fixed rather easily in a single step, **without throwing recursive template instantiations at pow**, mind you. My

1) Go through the pairs of points and save in a container only those pairs (and the distance between them) that satisfy the threshold.

   1.1) When doing that it's *useful* not to repeat: the distance from A to B is going to be the same as the distance from B to A, unless living in an alternate world. This should help shave off some unnecessary duplication.

   1.2) Before doing that it's also useful to throw away the permutation vector if possible, so that 5) is solved by construction.

2) For the resulting container from 1) (probably a vector) one can see that the innermost `if` is directly satisfied for any pair of elements ...
3) Step 1) can be parallelized very easily for additional yield.
4) Step 1) can make use of SSE/AVX.
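Step 1) could be sketched roughly as follows, assuming the points have already been copied into a contiguous `std::vector` (so the permutation vector is gone) and that squared distances are acceptable. `Point3`, `FarPair` and `collect_far_pairs` are hypothetical names, and parallelisation and SSE/AVX are left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a PCL point; only x, y, z are used here.
struct Point3 {
    float x, y, z;
};

// One stored pair: indices with i < j and the squared distance between them.
struct FarPair {
    std::size_t i;
    std::size_t j;
    float dist2;
};

// Visits every unordered pair exactly once (j starts at i + 1) and keeps
// only the pairs whose distance exceeds `threshold`.
std::vector<FarPair> collect_far_pairs(const std::vector<Point3>& pts,
                                       float threshold) {
    const float t2 = threshold * threshold;
    std::vector<FarPair> result;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = i + 1; j < pts.size(); ++j) {
            const float dx = pts[i].x - pts[j].x;
            const float dy = pts[i].y - pts[j].y;
            const float dz = pts[i].z - pts[j].z;
            const float d2 = dx * dx + dy * dy + dz * dz;
            if (d2 > t2) {
                result.push_back(FarPair{i, j, d2});
            }
        }
    }
    return result;
}
```

For n points there are n(n-1)/2 unordered pairs, so in the worst case the container grows quadratically; whether that fits in memory depends on the size of the cloud and on how many pairs pass the threshold.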

```
template< int exponent, typename T >
T power( T base )
{
// ...
}
```

I cringe so badly my face is contorted for a week.

No idea what's foul about it, or the bit you've quoted, so you'd better explain? Unless you mean the whole idea of using templates, which of course I never used: C didn't need them, C++ added them as an obfuscation layer, so I'm quite happy without ;-)

Mind you, I looked at @JohanSolo's code above. His definition is a recursive one (`return power< exponent / 2 >( base * base ) * base;`). I'm surprised. This would be all very well in my old Prolog, but I don't think the C++ compiler is going to recognise & remove tail recursion in the definition. So I don't know what he means by "trivially replaced", why would one want to use such a definition?

About the recursive template: the compiler expands it at compile time, therefore leading to `power< 4 >( x )` being replaced by `x*x * x*x`, which is apparently (or at least was) way faster than calling `std::pow`. Therefore, I expect `power< 2 >( something )` to be faster than `std::pow( something, 2 )`.

No idea what's foul about it, or the bit you've quoted, so you'd better explain? Unless you mean the whole idea of using templates, which of course I never used: C didn't need them, C++ added them as an obfuscation layer, so I'm quite happy without ;-)

Recurrently instantiating a function for no apparent reason, basically invoking the sophisticated copy-paste machinery that is the compiler's template engine to produce `x * x`, especially when the latter would suffice.

Mind you, I looked at @JohanSolo's code above. His definition is a recursive one (`return power< exponent / 2 >( base * base ) * base;`). I'm surprised. This would be all very well in my old Prolog, but I don't think the C++ compiler is going to recognise & remove tail recursion in the definition. So I don't know what he means by "trivially replaced", why would one want to use such a definition?

Code inlining is kind of a religion. Surely it has its values in the proper places, and most certainly templates make some things easier, then again ... it's very much like chocolate, when you don't eat it, you want it, when you eat it, you want more of it, but in the ultimate scheme of things it makes you fat ...

The most ugly thing about templates, however, is that everything has to be defined for instantiation to take place, which is of course expected. So you can't have abstractions manifested without spilling the guts of the implementations. And of course there exists no such thing as binary compatibility, as everything is recompiled every time ... such a wonderful idea.

@JohanSolo said in How to increase speed of large for loops:

I never thought my little post could produce so much noise...

Well yeah, I'm from eastern europe - all simmering under the hood.

First the snippet is not mine, as I already stated, I took it from a lecture I followed at CERN in 2009.

Yes, I glanced at the slides. FYI even boost's math module doesn't do that kind of nonsense, because fast exponentiation algorithms for integral powers have been known for 50+ years. And if the compiler actually inlines all the (unnecessary) instantiations, depending on the optimizations it applies, you could end up in the same `x * x * x * ... * x` case. The point is computers are rather stupid, they do what we tell them to do, and ultimately everything you write is going to be compiled to **binary**, not to a cool concept from a book (or lecture, or w/e).

The lecturer was Dr Walter Brown, who was presented as: "Dr. Brown has worked for Fermilab since 1996. He is now part of the Computing Division's Future Programs and Experiments Quadrant, specializing in C++ consulting and programming. He participates in the international C++ standardization process and is responsible for several aspects of the forthcoming updated C++ Standard. In addition, he is the Project Editor for the forthcoming C++ Standard on Mathematical Special Functions."

Good for him. I don't know him, nor do I hold people in esteem for their titles. He might be a contemporary Einstein for all I know, but I place merit whenever I judge there to be reason for. In this case, I have not. The lecture, and all the proof of it boiling down to a synthetic test, is not nearly enough for me.

Just as a disclaimer, I've seen quite a lot of "scientific code" to be cynical to the point of not believing academia can (or should) write programs.

About the recursive template: the compiler expands it at compile time, therefore leading to `power< 4 >( x )` being replaced by `x*x * x*x`

No, it leads to `power<4>(x)` being replaced by `power<2>(x) * power<2>(x)`, where `power<2>` is a distinct function. This **may** lead to `x * x * x * x` in assembly, which of course would have the same performance as multiplying the argument manually, or it may be evaluated as `(x * x)`, which is then multiplied by itself, where you may gain a multiplication. The point is your template can't tell the compiler how to produce the efficient binary code.

Therefore, I expect `power< 2 >( something )` to be faster than `std::pow( something, 2 )`.

I expect them to be exactly the same up to a couple of `push`/`pop`s and a single `call`.

I did find it rather surprising that `pow` and `sqrt` were implicated here. I'd like to top off this missive with a quotation that I love from a fictional character:

*You wake up in the morning, your paint's peeling, your curtains are gone, and the water is boiling. Which problem do you deal with first?*

*...*

*None of them! The building's on fire!*

I never thought my little post could produce so much noise...

It's OK, this is all a friendly debate, not a mud-slinging contest!

@JohanSolo, @kshegunov

I don't know what you are going on about with this `power()` stuff and in-line expansion. Just maybe the compiler is clever enough to in-line expand to avoid recursion if your code goes `power<4>(x)`, where the `4` is a compile-time constant. *However*, that definition of `power<>` takes the exponent as a *variable/parameter*. So if your code calls `power<n>(x)` where `n` is a variable, I don't see how any amount of in-lining or optimizations can do anything at all, and you are left with code which will compile to a ridiculously inefficient (time & space) tail-recursive implementation, which you would be mad to use. If you're going to do in-lining, it seems to me it should be done iteratively rather than recursively in C++, no? *That* is what I was commenting on....

@JohanSolo

However, that definition of `power<>` takes the exponent as a variable/parameter. So if your code calls `power<n>(x)` where `n` is a variable

In the `power< n >( x )` expression, `n` must be known at compile time; it's a template parameter. If it is a variable it won't compile (I've just checked to be 1000% sure).
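To illustrate the compile-time requirement, here is a reduced sketch in the same spirit as the `power<>` template quoted earlier. It is not the lecture's code: it is restricted to non-negative exponents and marked `constexpr` (an addition of this sketch) so that the result can itself be checked at compile time. `fourth_power` is a hypothetical helper.

```cpp
// Exponent is a template parameter, so it must be a compile-time constant.
template <unsigned Exponent, typename T>
constexpr T power(T base) {
    return Exponent == 0 ? T(1)
         : Exponent % 2 == 0 ? power<Exponent / 2>(base * base)
                             : power<Exponent / 2>(base * base) * base;
}

// A literal exponent is fine.
inline double fourth_power(double x) {
    return power<4>(x);
}

// The whole call can be evaluated by the compiler.
static_assert(power<4>(3) == 81, "3^4 is computed at compile time");

// By contrast, this would be rejected by the compiler, because `n` is an
// ordinary runtime variable and cannot be used as a template argument:
//
//     int n = 4;
//     double y = power<n>(2.0);
```

Whether the generated machine code ends up faster than `std::pow(x, 2)` or a hand-written `x * x` is a separate question that only a measurement on the target compiler can answer.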

Ohhh, I had no idea templates worked like that...! I get it now.

I hope the compiler generated code copies your (first) parameter into a temporary variable/register when it expands that code in-line, else it could actually be slower....

In any case, to belabour the perhaps-obvious: the squaring won't take much time, it's the square-rooting which will be slow....
