float problem
-
@fkaraokur This is not a float issue, but incorrect usage.
Please first take the time to understand what a float is in C/C++/Java/etc.
-
@fkaraokur said in float problem:
How do I get 3 digits after the decimal point?
In addition to @KroMignon's answer: for Qt's qDebug(), also see https://doc.qt.io/qt-5/qstring.html#number-6 .
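A minimal sketch of what that points at, assuming the goal is simply printing with 3 digits after the decimal point (QString::number(double, 'f', precision) is the overload the link describes; the variable name is made up):

```cpp
#include <QDebug>
#include <QString>

int main()
{
    float value = 1.872f; // example value from the thread

    // Format with exactly 3 digits after the decimal point
    qDebug().noquote() << QString::number(value, 'f', 3); // 1.872

    // qDebug() also accepts QTextStream manipulators, so this should work too
    qDebug() << qSetRealNumberPrecision(4) << value;
    return 0;
}
```
-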
I have a question about something!
I've examined the difference between float and double:
float -- 32 bit
double -- 64 bit
Okay, but if we take the number 1.872 as an example, isn't this number 32 bits? With that logic it should work as a float!
1.872
Most accurate representation = 1.8719999790191650390625E0
binary 00111111 11101111 10011101 10110010 -> 4 bytes or 32 bits, true
-
@fkaraokur said in float problem:
@KroMignon @JonB
I've examined the difference between float and double.
float -- 32 bit
double -- 64 bit
This is irrelevant here, both float and double behave the same way, the only difference is the fidelity with which they can represent numbers.
Okay, but if we take the number 1.872 as an example, isn't this number 32 bits?
A number has no bits; it can be stored in 32 bits though, yes.
with that logic it should work as a float!
And it does, quite well in fact. What you expect out of the float is the problem. A float is an approximation, that is, it can store only a fixed set of numbers, and the particular number you've selected is not exactly representable, hence:
Most accurate representation = 1.8719999790191650390625E0
That means that the closest number to 1.872 that can be stored as a floating point is 1.8719999..., as it says, and no amount of bits is going to change this. The only thing you're going to see by moving from 32 to 64 bit floating points is the number of nines increasing. It ain't ever going to get to 1.872 exactly!
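You can see this for yourself; a minimal sketch using only standard C++ iostreams, printing at exaggerated precision (the digits shown for the float match the representation quoted above, assuming IEEE-754):

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    float  f = 1.872f; // closest 32-bit value to 1.872
    double d = 1.872;  // closest 64-bit value to 1.872

    std::cout << std::setprecision(25)
              << f << '\n'  // 1.8719999790191650390625
              << d << '\n'; // closer to 1.872, but still not exactly 1.872
    return 0;
}
```
-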
@kshegunov said in float problem:
and no amount of bits is going to change this. The only thing you're going to see by moving from 32 to 64 bit floating points is the number of nines increasing. It ain't ever going to get to 1.872 exactly!
How do you know this will be the case for this number?
Given that, say, a double can represent twice as many numbers as a float, how do you know 1.872 won't be one of the extra ones? Or, do they use doubles to just increase the range of numbers representable, to bigger & smaller ones? If that is the case, why couldn't they choose to spend their extra bits within the same range for more accuracy, and happen to arrive at 1.872?
-
@JonB said in float problem:
How do you know this will be the case for this number?
Representation.
Given that, say, a double can represent twice as many numbers as a float, how do you know 1.872 won't be one of the extra ones?
Because I'm clever (and humble). More specifically:
1.872 = 1 + 0.5 + 0.25 + 0.125 - 0.003
The first 4 terms are exactly representable, the last one is never going to have an exact representation. No number of divisions by 2 and additions, which incidentally is what the floating point representation is, is going to give you 0.003(0...)
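A small check of that decomposition (a sketch in plain C++; the printed digits for 0.003 are roughly what an IEEE-754 double actually stores):

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    // A sum of powers of two is exact, so this comparison holds bit-for-bit:
    std::cout << std::boolalpha
              << (1.0 + 0.5 + 0.25 + 0.125 == 1.875) << '\n'; // true

    // 0.003 has no finite binary expansion, so only a nearby value gets stored:
    std::cout << std::setprecision(20) << 0.003 << '\n';
    // prints something like 0.0030000000000000000625 rather than 0.003
    return 0;
}
```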
Or, do they use doubles to just increase the range of numbers representable, to bigger & smaller ones?
Both. The double features larger dynamic range and longer mantissa (i.e. number of representable normalized numbers).
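For reference, the standard library exposes both of those quantities directly; a quick sketch comparing them (the numbers in the comments are the usual IEEE-754 values):

```cpp
#include <iostream>
#include <limits>

int main()
{
    using F = std::numeric_limits<float>;
    using D = std::numeric_limits<double>;

    // significand length in bits (including the implicit leading 1)
    std::cout << "float digits:  " << F::digits << '\n'  // 24
              << "double digits: " << D::digits << '\n'; // 53

    // dynamic range: largest finite value
    std::cout << "float max:  " << F::max() << '\n'      // ~3.4e38
              << "double max: " << D::max() << '\n';     // ~1.8e308
    return 0;
}
```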
If that is the case, why couldn't they choose to spend their extra bits within the same range for more accuracy, and happen to arrive at 1.872?
Because when Kahan et al. designed the standard they decided, rightfully, that you'd want both reasonable precision and a large dynamic range. You can't have it all. The floating point trades off absolute spacing between numbers for dynamic range and vice versa.
If you start plotting all representable numbers on the number line you're going to realize quickly that they're all clustered around the zero, and get more "sparse" (i.e. distance between them increases) as you go further out. This is rather clever, though, because when you do some sort of calculation you're almost universally interested in the relative error between two numbers. And by choosing such a representation one actually builds exactly this constraint into the standard. Which is to say, scaling does increase the absolute spacing between the numbers, but keeps the spacing relative to the magnitude. Smart, huh?
Furthermore, one'd want to have the implementation be efficient, and there's nothing more efficient than operations with integers, which is what the FPU does (with some significant complications due to the exponent, however).
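To make the "spacing grows with magnitude" point concrete, here's a small sketch using std::nextafter to print the gap to the next representable double at a few magnitudes (the figures in the closing comment assume IEEE-754 doubles):

```cpp
#include <cmath>
#include <cstdio>

int main()
{
    const double samples[] = {1.0, 1000.0, 1.0e6, 1.0e15};

    for (double x : samples) {
        // Distance from x to the next representable double above it
        double gap = std::nextafter(x, INFINITY) - x;
        std::printf("x = %-8g absolute gap = %g relative gap = %g\n",
                    x, gap, gap / x);
    }
    // The absolute gap grows from ~2e-16 at 1.0 to 0.125 at 1e15,
    // while the relative gap stays around 1e-16 throughout.
    return 0;
}
```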
-
@kshegunov said in float problem:
1.872 = 1 + 0.5 + 0.25 + 0.125 - 0.003
The first 4 terms are exactly representable, the last one is never going to have an exact representation. No number of divisions by 2 and additions, which incidentally is what the floating point representation is, is going to give you 0.003(0...)
Oh! Is that how it works?! So my floating point number wants to be made by adding 2 ^ -n values together to be accurately representable? And 1.872 doesn't happen to be. 1.875 does. I kinda thought the numbers it could represent precisely were "randomly" distributed :)
-
@JonB said in float problem:
Oh! Is that how it works?! So my floating point number wants to be made by adding 2 ^ -n values together to be accurately representable? And 1.872 doesn't happen to be. 1.875 does. I kinda thought the numbers it could represent precisely were "randomly" distributed :)
I hope randomly isn't an example of your hardly conceivable English sarcasm. ;)
But yes, that's how it works, exactly the same as with decimal. Here's a bit of a fuller story:
1.872 in decimal is represented as:
1.872 = 1 * 10^0 + 8 * 10^-1 + 7 * 10^-2 + 2 * 10^-3
The same idea is true for a base 2 number system, however one'd adjust for the base:
1.111 -> 1 * 2^0 + 1 * 2^-1 + 1 * 2^-2 + 1 * 2^-3, which is incidentally 1.875 in decimal.
So that's what the IEEE standard does explicitly:
Representation is split into 2 parts - exponent and mantissa (significand), that's to say each number is represented as m * 2^p, where m is a fractional part in the range [1; 2)* and p is an exponent, which is a biased integer** (in reality it's unsigned). The leading bit of the mantissa (the one that's responsible for the 0th power) is implicit and is always assumed to be raised*** (i.e. signifying 1.(...)). This means the following: each bit in the mantissa, going from higher to lower, is a division by 2^n, hence my using the principal values as a sum (principal values here meaning the specific bits of the mantissa being 1).
Now if you think about it, the multiplication/division by 2 due to the exponent is equivalent to bit-shifts in the mantissa, which is what the FPU does for you when it renormalizes the numbers during calculations. It's always going to try to keep the higher bits in the mantissa raised if possible so you don't lose precision at the lower end. Incidentally this is also why in reality the FP operations are done in extended registers (typically 2 times larger****) to allow storage of bits that would otherwise be lost, so they can be shifted back after normalization; truncation is done at the very end.
* Realistically it's in the range [0.5, 1.0) but for simplicity we roll with a somewhat "wrong" representation.
** It's biased for a specific reason, so when its bits are all 0 the value is the minimum the integer can represent and thus it's implying a denormal FP number.
*** Except when representing a denormal, then the exponent's raw value is 0 (representing the minimum possible value after debiasing) and thus the mantissa is fully explicit. Denormals are a special case to represent numbers very close by absolute value to zero. The IEEE standard allows this for one specific purpose - to represent numbers that it otherwise couldn't in the normalized representation, however loss of precision is traded off for that (i.e. the leading zeroes in the mantissa are the number of bits of precision lost).
**** In fact some of the operations are done iteratively with infinite precision and renormalized on the fly until the required truncated precision is acquired. One such example is the FMA instruction (std::fma).
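As an illustration of that layout, here's a minimal sketch (plain C++, assuming IEEE-754 single precision and using memcpy for the type pun) that splits 1.872f into its sign, exponent and mantissa fields; it reproduces the bit pattern quoted earlier in the thread:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    float f = 1.872f;

    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits); // well-defined way to inspect the bits

    unsigned sign     = bits >> 31;          // 1 bit
    unsigned exponent = (bits >> 23) & 0xFF; // 8 bits, biased by 127
    unsigned mantissa = bits & 0x7FFFFF;     // 23 bits, implicit leading 1

    std::printf("raw      : 0x%08X\n", (unsigned)bits); // 0x3FEF9DB2
    std::printf("sign     : %u\n", sign);               // 0
    std::printf("exponent : %u -> 2^%d\n", exponent, (int)exponent - 127); // 127 -> 2^0
    std::printf("mantissa : 0x%06X\n", mantissa);       // 0x6F9DB2, i.e. 1.871999979...
    return 0;
}
```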