
Integer vs floating point division



  • @jeremy_k

    It is also likely faster.
    

    I found the following thread, which discusses floating point vs integer arithmetic. It states that floating point arithmetic, whether multiplication or division, is faster than integer.

    [Forked from QProgressbar not updating ~kshegunov]



  • It's the conversion to and from double that caught my attention.


  • Qt Champions 2017

    @nagesh said in QProgressbar not updating:

    I found the following thread, which discusses floating point vs integer arithmetic. It states that floating point arithmetic, whether multiplication or division, is faster than integer.

    Your conclusion from that thread is wrong on so many levels, but if you want to discuss it, we should fork this ...



  • @kshegunov please fork the thread for discussion. Is it true that a floating point arithmetic operation (division/multiplication) is faster than an integer operation?


  • Moderators

    @nagesh said in Integer vs floating point division:

    Is it true that a floating point arithmetic operation (division/multiplication) is faster than an integer operation?

    There's no one answer to that. It depends on the underlying hardware. For a couple of decades now most consumer-grade CPUs have had dedicated silicon for floating point operations, so it's often true, but there's also the embedded and smaller-scale market, where CPUs could have half precision floats in hardware with floats and doubles emulated in software, or no hardware floating point support at all. Performance of floating point vs integer will also vary with the CPU vendor and model.

    The point though is that, as @jeremy_k noted, the conversions are the expensive part, so don't do something like this:

    int foo = 666;
    int bar = foo / 42.0;  //foo converted to double -> double division -> result truncation to int
    

    If you want integer results try to stick to integer operations (by rearranging expressions if needed), and if you want floating point results stick to floating point. Also watch out for float-to-double conversions, as those can be costly too, so don't forget the f suffix on float constants.


  • Qt Champions 2017

    Right, so to start the ball rolling I should note that what you see in C++ isn't what you get translated into asm, and what you see in asm may not be what is executed on the CPU (directly). There are several significant points to cover, so let's make a list (lists are the best, aren't they):

    1. There's no information on how said code was compiled or what flags were passed to the compiler
      1.1) As there's no information about the compiler and flags, there's no knowledge of what assembly was generated
      1.2) There's an implicit truncation for each FP operation, which may or may not matter
    2. The accepted answer talks about latency, but that's relevant only sometimes and is most certainly not a sole reason to claim performance
    3. The accepted answer claims all integer divisions are done in the natural DWORD (a.k.a. int32), which isn't the case. If the compiler can avoid it, it will not promote the arguments needlessly

    So from the top:

    1.1:
    The compiler flags matter as they influence the order (and sometimes the types) of instructions generated. It's more relevant in that sense to show the generated asm (with notable mention of the compiler that produced it) instead of reasoning about C++ code.

    1.2:
    This possibly hurts the performance of the FP code, as it typically requires a cvttss2si instruction to convert back to an integer before the addition happens. On the other hand, if the CPU supports it, up to eight float divisions can be done at once (and this usually happens with arrays, if the code is vectorized).

    Here's an example of unrolling the loop which triggers the compiler to automatically vectorize (SIMD) the division (divps): https://godbolt.org/z/bEEE59PTh

    Notice how the four divisions are now a single instruction as the compiler recognized that there's no dependence between them and aggressively optimized for it.

    2:
    This is a long one. There are different numbers of FPUs and ALUs on a modern CPU (they're per core), so you should take that into account. This is also why there's a computer inside the computer (the ghost in the machine): what you see as instructions is not necessarily what's executed, in that specific order. The CPU may (and will) reorder the operations based on how full the pipeline is and what can be put where, to fill the empty gaps (in terms of free cycles) - this is the so-called out-of-order execution.

    Out-of-order execution is also the reason you need atomic semantics with integers; it isn't that the increment isn't atomic as such, but rather that the CPU may reorder it somewhere else, which you typically don't want if the variable is shared between cores.

    All that said, latency comes into play whenever the pipeline is full. That's the number of cycles you need to wait before your instruction can get executed, but note that this doesn't happen all the time, hence it's a very poor predictor of "performance".
    So back to our specific case. My CPU (a Ryzen) has 4 ALUs and a single FPU per core, meaning I can get 4 integer operations in parallel out of the box, while I get only one FP operation per turn. This matters very much whenever you crunch numbers and want to squeeze every last bit out of the FPU. For integer operations that's already done for you by the prefetcher (if the CPU decides it's possible), so you get it "for free", so to speak.

    3:
    The claim that the division is always going to happen in a DWORD (a.k.a. a 32-bit integer) is flat wrong, at least on the side of the asm; how the CPU does it under the hood is really an implementation detail we have no control over.
    So here's the example: https://godbolt.org/z/3xjx375b3
    (Also very similar for signed integers: https://godbolt.org/z/EnbxcKe8v)

    Notice that the div uses a source operand that is of WORD size, thus it operates solely on the 16 bits, not on the whole of the eax register (32 bits). This can also be seen from the next line, where the compiler reads back the value directly from the low WORD of the eax register: mov WORD PTR [rsp-104+rcx], ax.

    So is this faster? Well, unfortunately it really depends, but if the ALUs are free to do it, maybe, even probably. If the FP operation gets vectorized on the other hand, probably not. If the FP results are truncated to integers after the fact, however, probably sticking to integer divisions is going to be better.

    As you can see there's a lot of "if"s and "when"s here, but just claiming that an FP operation is faster than an integer one isn't correct, not by a long shot.



  • Thanks @Chris-Kawa and @kshegunov for the detailed answer to the problem statement.

    I do agree that there is no one answer to this problem statement, as it depends upon a lot of factors, such as hardware architecture, compiler flags, the FPU and many more.

    The above discussion makes it clear that the outcome is uncertain and depends on these factors and the scenario (well detailed in @kshegunov's post).

    So my next questions, to broaden the discussion (a slight deviation from the topic), are as follows:

    1. What guidelines should the programmer follow,
      keeping this uncertainty in mind?
      Asking because, given the uncertainty about performance
      (integer vs float), how can a single guideline fit all cases?

    2. If some guidelines for performance are set, does the programmer need to pay attention to each instruction from a performance point of view, or only when it really matters?

      Eg: Does performance really matter when updating a slider value in an event-driven/timer context, or in a computation taking place in a time-critical domain in a real-time system (eg: a missile fire control system)?

    3. Sometimes, by rearranging expressions, the original meaning of the statement becomes difficult to grasp, from a code maintainability point of view:

       stick to integer operations (by rearranging expressions if needed)

    4. Can this be followed as a guideline?

       however, probably sticking to integer divisions is going to be better

    There was some discussion in the thread about multiplication vs division performance:

    Is it true/universal that multiplication is faster than division?
    If the above statement is true (multiplication is faster, by many instruction cycles),
    then why don't we write the expression using multiplication?

    How about this statement's performance?

    constexpr float MultFactor = (1/42.0f); //One Time calculation
    int foo = 666;
    int bar = foo * MultFactor;
    
    5. Code sample to check the performance of integer division vs float multiplication, as discussed in point (4):
    #include <QElapsedTimer>
    #include <QDebug>
    #include <QtGlobal> // qrand()

    void PerformanceTest()
    {
        int randomvalue   = 0;
        int intExpCount   = 0; //keep track of no. of times int division is faster
        int floatExpCount = 0; //keep track of no. of times float multiplication is faster
        int loopCount     = 100;

        constexpr float MultFactor = (1/42.0f); //One Time calculation
        for( int i=0; i<loopCount; ++i )
        {
            randomvalue = qrand();

            QElapsedTimer elTimer;

            elTimer.start();
            int val3 = randomvalue;
            int val4 = val3 * MultFactor; //modified expression
            //int val4 = val3 / 42.0f; //original expression
            quint64 elapsed2 = elTimer.nsecsElapsed();

            elTimer.start();
            int val1 = randomvalue;
            int val2 = val1 / 42;
            quint64 elapsed1 = elTimer.nsecsElapsed();

            //        qDebug()<<"elapsed1 \t"<<elapsed1;
            //        qDebug()<<"elapsed2 \t"<<elapsed2;
            //        qDebug()<<"\tRandValue is "<<randomvalue<<" Answer is "<<val2<<" "<<val4;

            qint64 DiffTime = qint64(elapsed1) - qint64(elapsed2);
            if(DiffTime >= 0)
                ++floatExpCount; //floating operation took less (or equal) time
            else
                ++intExpCount;
        }

        float percentfloat = (floatExpCount * 100.0f) / loopCount;
        float percentint   = (intExpCount   * 100.0f) / loopCount;

        qDebug()<<"(% Performance) ExprMultiplication in Float \t"<<percentfloat;
        qDebug()<<"(% Performance) ExprDivision in int \t"<<percentint;
    }

    QElapsedTimer is used to check how fast each computation is performed.


  • Qt Champions 2017

    @nagesh said in Integer vs floating point division:

    So my next questions, to broaden the discussion (a slight deviation from the topic), are as follows:

    1. What guidelines should the programmer follow,
      keeping this uncertainty in mind?
      Asking because, given the uncertainty about performance
      (integer vs float), how can a single guideline fit all cases?

    Conventional wisdom applies:
    The output type rules them all, so choose the operation type based on it, which is what Chris mentioned. If you need the end result as an int, do integer division; if you need a float, do an FP division. Don't do type conversions between the two just because you can, be it explicitly or implicitly.

    <rant> C/C++ are notorious for their implicit type conversions, don't fall into that trap </rant>

    2. If some guidelines for performance are set, does the programmer need to pay attention to each instruction from a performance point of view, or only when it really matters?

    99.999999% of the time you write decent code in C++ and leave the compiler to figure out the instructions, and I really mean it. Compilers are very, very, very smart nowadays. If you're optimizing for performance, algorithmic complexity is king; then cache coherency and optimizing for cache misses matter much more than anything else you can do at the lower level; after that comes vectorization (SSE/AVX), and last comes fiddling with single instructions.

    Eg: Does performance really matter when updating a slider value in an event-driven/timer context

    Not really.

    or in a computation taking place in a time-critical domain in a real-time system (eg: a missile fire control system)?

    This is an entirely different universe. Time-critical applications don't run on a preemptive scheduler, so the design there is very much different to begin with.

    3. Sometimes, by rearranging expressions, the original meaning of the statement becomes difficult to grasp, from a code maintainability point of view.

    Write a comment to explain if you feel it's not straightforward. While the compiler is going to optimize many calculations, it will not do magic by any measure - it can optimize something out iff (if and only if) it can prove the result is going to be correct.

    4. Can this be followed as a guideline?
      Is it true/universal that multiplication is faster than division?

    Mostly true, yes.

    If the above statement is true (multiplication is faster, by many instruction cycles),
    then why don't we write the expression using multiplication?

    I assume you mean FP multiplication by the reciprocal. You don't write it as a multiplication because of rounding errors. Even if your reciprocal is exact to the extent of the representable significance of the floating point number, the multiplication introduces half an epsilon of rounding error. This in essence means that your last binary digit might not be correct if you use multiplication by a reciprocal instead of a division.

    There are two cases where you can:

    1. The reciprocal is known and exactly representable, e.g. 0.5, 0.25 and all the integers are exactly representable.
    2. You don't care about the last bit of the mantissa.

    How about this statement's performance?

    constexpr float MultFactor = (1/42.0f); //One Time calculation
    int foo = 666;
    int bar = foo * MultFactor;
    

    The performance of this is absolutely constant. Everything is evaluated at compile time.

    5. Code sample to check the performance of integer division vs float multiplication, as discussed in point (4)

    Designing such tests is hard and fiddly. You must be very sure of what exactly the compiler generates and how to break some of the possible optimizations it may do. If you look at my example code above, I used a volatile as an index to read from the array. I did this so the compiler can never realize it can remove the whole for loop and just calculate one single value (as it's obligated to always do a read from memory for each volatile).

    I'll think about how you could conceivably test this, but it may take some time ...

