Of course most languages have library functions for this, but suppose I want to do it myself.
Suppose that the float is given like in a C or Java program (except for the 'f' or 'd' suffix), for example "4.2e1", ".42e2" or simply "42". In general, we have the "integer part" before the decimal point, the "fractional part" after the decimal point, and the "exponent". All three are integers.
It is easy to find and process the individual digits, but how do you compose them into a value of type float or double without losing precision?
I'm thinking of multiplying the integer part with 10^n, where n is the number of digits in the fractional part, and then adding the fractional part to the integer part and subtracting n from the exponent. This effectively turns 4.2e1 into 42e0, for example. Then I could use the pow function to compute 10^exponent and multiply the result with the new integer part. The question is, does this method guarantee maximum precision throughout?
Any thoughts on this?
All of the other answers have missed how hard it is to do this properly. You can do a first cut approach at this which is accurate to a certain extent, but until you take into account IEEE rounding modes (et al), you will never have the right answer. I've written naive implementations before with a rather large amount of error.
If you're not scared of math, I highly recommend reading the following article by David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic. You'll get a better understanding for what is going on under the hood, and why the bits are laid out as such.
My best advice is to start with a working atoi implementation, and move out from there. You'll rapidly find you're missing things, but a few looks at strtod's source and you'll be on the right path (which is a long, long path). Eventually you'll praise insert diety here that there are standard libraries.
/* use this to start your atof implementation */
/* atoi - christopher.watford#gmail.com */
/* PUBLIC DOMAIN */
long atoi(const char *value) {
unsigned long ival = 0, c, n = 1, i = 0, oval;
for( ; c = value[i]; ++i) /* chomp leading spaces */
if(!isspace(c)) break;
if(c == '-' || c == '+') { /* chomp sign */
n = (c != '-' ? n : -1);
i++;
}
while(c = value[i++]) { /* parse number */
if(!isdigit(c)) return 0;
ival = (ival * 10) + (c - '0'); /* mult/accum */
if((n > 0 && ival > LONG_MAX)
|| (n < 0 && ival > (LONG_MAX + 1UL))) {
/* report overflow/underflow */
errno = ERANGE;
return (n > 0 ? LONG_MAX : LONG_MIN);
}
}
return (n>0 ? (long)ival : -(long)ival);
}
The "standard" algorithm for converting a decimal number to the best floating-point approximation is William Clinger's How to read floating point numbers accurately, downloadable from here. Note that doing this correctly requires multiple-precision integers, at least a certain percentage of the time, in order to handle corner cases.
Algorithms for going the other way, printing the best decimal number from a floating-number, are found in Burger and Dybvig's Printing Floating-Point Numbers Quickly and Accurately, downloadable here. This also requires multiple-precision integer arithmetic
See also David M Gay's Correctly Rounded Binary-Decimal and Decimal-Binary Conversions for algorithms going both ways.
I would directly assemble the floating point number using its binary representation.
Read in the number one character after another and first find all digits. Do that in integer arithmetic. Also keep track of the decimal point and the exponent. This one will be important later.
Now you can assemble your floating point number. The first thing to do is to scan the integer representation of the digits for the first set one-bit (highest to lowest).
The bits immediately following the first one-bit are your mantissa.
Getting the exponent isn't hard either. You know the first one-bit position, the position of the decimal point and the optional exponent from the scientific notation. Combine them and add the floating point exponent bias (I think it's 127, but check some reference please).
This exponent should be somewhere in the range of 0 to 255. If it's larger or smaller you have a positive or negative infinite number (special case).
Store the exponent as it into the bits 24 to 30 of your float.
The most significant bit is simply the sign. One means negative, zero means positive.
It's harder to describe than it really is, try to decompose a floating point number and take a look at the exponent and mantissa and you'll see how easy it really is.
Btw - doing the arithmetic in floating point itself is a bad idea because you will always force your mantissa to be truncated to 23 significant bits. You won't get a exact representation that way.
You could ignore the decimal when parsing (except for its location). Say the input was:
156.7834e10... This could easily be parsed into the integer 1567834 followed by e10, which you'd then modify to e6, since the decimal was 4 digits from the end of the "numeral" portion of the float.
Precision is an issue. You'll need to check the IEEE spec of the language you're using. If the number of bits in the Mantissa (or Fraction) is larger than the number of bits in your Integer type, then you'll possibly lose precision when someone types in a number such as:
5123.123123e0 - converts to 5123123123 in our method, which does NOT fit in an Integer, but the bits for 5.123123123 may fit in the mantissa of the float spec.
Of course, you could use a method that takes each digit in front of the decimal, multiplies the current total (in a float) by 10, then adds the new digit. For digits after the decimal, multiply the digit by a growing power of 10 before adding to the current total. This method seems to beg the question of why you're doing this at all, however, as it requires the use of the floating point primitive without using the readily available parsing libraries.
Anyway, good luck!
Yes, you can decompose the construction into floating point operations as long as these operations are EXACT, and you can afford a single final inexact operation.
Unfortunately, floating point operations soon become inexact, when you exceed precision of mantissa, the results are rounded. Once a rounding "error" is introduced, it will be cumulated in further operations...
So, generally, NO, you can't use such naive algorithm to convert arbitrary decimals, this may lead to an incorrectly rounded number, off by several ulp of the correct one, like others have already told you.
BUT LET'S SEE HOW FAR WE CAN GO:
If you carefully reconstruct the float like this:
if(biasedExponent >= 0)
return integerMantissa * (10^biasedExponent);
else
return integerMantissa / (10^(-biasedExponent));
there is a risk to exceed precision both when cumulating the integerMantissa if it has many digits, and when raising 10 to the power of biasedExponent...
Fortunately, if first two operations are exact, then you can afford a final inexact operation * or /, thanks to IEEE properties, the result will be rounded correctly.
Let's apply this to single precision floats which have a precision of 24 bits.
10^8 > 2^24 > 10^7
Noting that multiple of 2 will only increase the exponent and leave the mantissa unchanged, we only have to deal with powers of 5 for exponentiation of 10:
5^11 > 2^24 > 5^10
Though, you can afford 7 digits of precision in the integerMantissa and a biasedExponent between -10 and 10.
In double precision, 53 bits,
10^16 > 2^53 > 10^15
5^23 > 2^53 > 5^22
So you can afford 15 decimal digits, and a biased exponent between -22 and 22.
It's up to you to see if your numbers will always fall in the correct range... (If you are really tricky, you could arrange to balance mantissa and exponent by inserting/removing trailing zeroes).
Otherwise, you'll have to use some extended precision.
If your language provides arbitrary precision integers, then it's a bit tricky to get it right, but not that difficult, I did this in Smalltalk and blogged about it at http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html and http://smallissimo.blogspot.fr/2011/09/reviewing-fraction-asfloat.html
Note that these are simple and naive implementations. Fortunately, libc is more optimized.
My first thought is to parse the string into an int64 mantissa and an int decimal exponent using only the first 18 digits of the mantissa. For example, 1.2345e-5 would be parsed into 12345 and -9. Then I would keep multiplying the mantissa by 10 and decrementing the exponent until the mantissa was 18 digits long (>56 bits of precision). Then I would look the decimal exponent up in a table to find a factor and binary exponent that can be used to convert the number from decimal n*10^m to binary p*2^q form. The factor would be another int64 so I'd multiply the mantissa by it such that I obtained the top 64-bits of the resulting 128-bit number. This int64 mantissa can be cast to a float losing only the necessary precision and the 2^q exponent can be applied using multiplication with no loss of precision.
I'd expect this to be very accurate and very fast but you may also want to handle the special numbers NaN, -infinity, -0.0 and infinity. I haven't thought about the denormalized numbers or rounding modes.
For that you have to understand the standard IEEE 754 in order for proper binary representation. After that you can use Float.intBitsToFloat or Double.longBitsToDouble.
http://en.wikipedia.org/wiki/IEEE_754
If you want the most precise result possible, you should use a higher internal working precision, and then downconvert the result to the desired precision. If you don't mind a few ULPs of error, then you can just repeatedly multiply by 10 as necessary with the desired precision. I would avoid the pow() function, since it will produce inexact results for large exponents.
It is not possible to convert any arbitrary string representing a number into a double or float without losing precision. There are many fractional numbers that can be represented exactly in decimal (e.g. "0.1") that can only be approximated in a binary float or double. This is similar to how the fraction 1/3 cannot be represented exactly in decimal, you can only write 0.333333...
If you don't want to use a library function directly why not look at the source code for those library functions? You mentioned Java; most JDKs ship with source code for the class libraries so you could look up how the java.lang.Double.parseDouble(String) method works. Of course something like BigDecimal is better for controlling precision and rounding modes but you said it needs to be a float or double.
Using a state machine. It's fairly easy to do, and even works if the data stream is interrupted (you just have to keep the state and the partial result). You can also use a parser generator (if you're doing something more complex).
I agree with terminus. A state machine is the best way to accomplish this task as there are many stupid ways a parser can be broken. I am working on one now, I think it is complete and it has I think 13 states.
The problem is not trivial.
I am a hardware engineer interested designing floating point hardware. I am on my second implementation.
I found this today http://speleotrove.com/decimal/decarith.pdf
which on page 18 gives some interesting test cases.
Yes, I have read Clinger's article, but being a simple minded hardware engineer, I can't get my mind around the code presented. The reference to Steele's algorithm as asnwered in Knuth's text was helpful to me. Both input and output are problematic.
All of the aforementioned references to various articles are excellent.
I have yet to sign up here just yet, but when I do, assuming the login is not taken, it will be broh. (broh-dot).
Clyde
Related
While writing a custom floating point number parser (for speed reasons) and checking the precision against strtod (that I assume to be extremely accurate) I found that sometimes the naive approach of using
number = (int_part + dec_part/pow(10., no_of_decs)) * pow(10., expo)
seems to be actually "more accurate" (when computation is done using long double and then result converted back to a double) than strtod result and that is surprising.
Do official IEEE754 parsing rules actually mandate a less accurate result?
For example with the string
708530856168225829.3221614e9
the naive computation gives
7.08530856168225761e+26
that seems closer than result of strtod
7.08530856168225898e+26
to the "theoretical" result (that cannot be represented exactly by a 64-bit double)
7.085308561682258293221614e+26
(experiments were done with g++ (GCC) 10.2.0 and clang++ 11.1.0 on Arch linux, and they both agree on ...898e+26 for strtod and ...761e+26 for naive computation)
As you note, 7.085308561682258293221614e+26 is not representable in IEEE-754 double precision (binary64). Therefore, it is not a candidate result and plays no role in determining the result.
The two numbers representable binary64 closest to 708530856168225829.3221614e9 are 708530856168225760595673088 and 708530856168225898034626560. Writing out the original fully and lining them up for inspection with original in the middle, we have:
708530856168225760595673088 representable value below original
708530856168225829322161400 original number
708530856168225898034626560 representable value above original
Subtracting gives the absolute differences between the lower and the original and between the original and the higher:
68726488312 distance to lower
68712465160 distance to higher
and therefore the higher number, 708530856168225898034626560, is closer to the original. This is in fact the result you report, and therefore the software is behaving correctly.
Observe that it is a mistake to think of binary64 in decimal without all significant digits. Writing out the partial decimal numerals as we did the full numbers above, we have:
7.08530856168225761e+26 proposed result
7 08530856168225829.3221614e9 original number
7.08530856168225898e+26 reported result of strtod
with differences:
68322161400 distance to lower
68677838600 distance to higher
Thus, rounding the actual values of the floating-point numbers to decimal numerals without all the digits introduced errors and portrayed incorrect values. Binary floating-point numbers are not and do not represent decimal numerals, and displaying them without all significant digits shows incorrect values.
The value 708530856168225829.3221614e9 is between 2 double.
7.08530856168225 760 59567309...e+26 // lower double
7.08530856168225 829 31514982...e+26 // half way
7 08530856168225 829.3221614e9 // OP's code
7.08530856168225 898 03462656...e+26 // upper double
1 23456789012345 678 90 // Significant digit count
It is very nearly halfway between those 2 double.
In this case I say 7.08530856168225 898 03462656...e+26 from strtod() is the better answer and OP's naïve computation is inferior and due to the cumulative rounding errors injected by the division, multiplication and addition.
Note: IEEE754 parsing does not require infinite precision in parsing text. It is required to use at least N+3 significant decimal digits. (I believe N==17) for binary64 AKA double.
When using truncated text to convert, the answer may differ from using more digits in nearly half-way cases. Still, in this case, even truncating to 20 digits, the upper double is the better choice.
Why do some numbers lose accuracy when stored as floating point numbers?
For example, the decimal number 9.2 can be expressed exactly as a ratio of two decimal integers (92/10), both of which can be expressed exactly in binary (0b1011100/0b1010). However, the same ratio stored as a floating point number is never exactly equal to 9.2:
32-bit "single precision" float: 9.19999980926513671875
64-bit "double precision" float: 9.199999999999999289457264239899814128875732421875
How can such an apparently simple number be "too big" to express in 64 bits of memory?
In most programming languages, floating point numbers are represented a lot like scientific notation: with an exponent and a mantissa (also called the significand). A very simple number, say 9.2, is actually this fraction:
5179139571476070 * 2 -49
Where the exponent is -49 and the mantissa is 5179139571476070. The reason it is impossible to represent some decimal numbers this way is that both the exponent and the mantissa must be integers. In other words, all floats must be an integer multiplied by an integer power of 2.
9.2 may be simply 92/10, but 10 cannot be expressed as 2n if n is limited to integer values.
Seeing the Data
First, a few functions to see the components that make a 32- and 64-bit float. Gloss over these if you only care about the output (example in Python):
def float_to_bin_parts(number, bits=64):
if bits == 32: # single precision
int_pack = 'I'
float_pack = 'f'
exponent_bits = 8
mantissa_bits = 23
exponent_bias = 127
elif bits == 64: # double precision. all python floats are this
int_pack = 'Q'
float_pack = 'd'
exponent_bits = 11
mantissa_bits = 52
exponent_bias = 1023
else:
raise ValueError, 'bits argument must be 32 or 64'
bin_iter = iter(bin(struct.unpack(int_pack, struct.pack(float_pack, number))[0])[2:].rjust(bits, '0'))
return [''.join(islice(bin_iter, x)) for x in (1, exponent_bits, mantissa_bits)]
There's a lot of complexity behind that function, and it'd be quite the tangent to explain, but if you're interested, the important resource for our purposes is the struct module.
Python's float is a 64-bit, double-precision number. In other languages such as C, C++, Java and C#, double-precision has a separate type double, which is often implemented as 64 bits.
When we call that function with our example, 9.2, here's what we get:
>>> float_to_bin_parts(9.2)
['0', '10000000010', '0010011001100110011001100110011001100110011001100110']
Interpreting the Data
You'll see I've split the return value into three components. These components are:
Sign
Exponent
Mantissa (also called Significand, or Fraction)
Sign
The sign is stored in the first component as a single bit. It's easy to explain: 0 means the float is a positive number; 1 means it's negative. Because 9.2 is positive, our sign value is 0.
Exponent
The exponent is stored in the middle component as 11 bits. In our case, 0b10000000010. In decimal, that represents the value 1026. A quirk of this component is that you must subtract a number equal to 2(# of bits) - 1 - 1 to get the true exponent; in our case, that means subtracting 0b1111111111 (decimal number 1023) to get the true exponent, 0b00000000011 (decimal number 3).
Mantissa
The mantissa is stored in the third component as 52 bits. However, there's a quirk to this component as well. To understand this quirk, consider a number in scientific notation, like this:
6.0221413x1023
The mantissa would be the 6.0221413. Recall that the mantissa in scientific notation always begins with a single non-zero digit. The same holds true for binary, except that binary only has two digits: 0 and 1. So the binary mantissa always starts with 1! When a float is stored, the 1 at the front of the binary mantissa is omitted to save space; we have to place it back at the front of our third element to get the true mantissa:
1.0010011001100110011001100110011001100110011001100110
This involves more than just a simple addition, because the bits stored in our third component actually represent the fractional part of the mantissa, to the right of the radix point.
When dealing with decimal numbers, we "move the decimal point" by multiplying or dividing by powers of 10. In binary, we can do the same thing by multiplying or dividing by powers of 2. Since our third element has 52 bits, we divide it by 252 to move it 52 places to the right:
0.0010011001100110011001100110011001100110011001100110
In decimal notation, that's the same as dividing 675539944105574 by 4503599627370496 to get 0.1499999999999999. (This is one example of a ratio that can be expressed exactly in binary, but only approximately in decimal; for more detail, see: 675539944105574 / 4503599627370496.)
Now that we've transformed the third component into a fractional number, adding 1 gives the true mantissa.
Recapping the Components
Sign (first component): 0 for positive, 1 for negative
Exponent (middle component): Subtract 2(# of bits) - 1 - 1 to get the true exponent
Mantissa (last component): Divide by 2(# of bits) and add 1 to get the true mantissa
Calculating the Number
Putting all three parts together, we're given this binary number:
1.0010011001100110011001100110011001100110011001100110 x 1011
Which we can then convert from binary to decimal:
1.1499999999999999 x 23 (inexact!)
And multiply to reveal the final representation of the number we started with (9.2) after being stored as a floating point value:
9.1999999999999993
Representing as a Fraction
9.2
Now that we've built the number, it's possible to reconstruct it into a simple fraction:
1.0010011001100110011001100110011001100110011001100110 x 1011
Shift mantissa to a whole number:
10010011001100110011001100110011001100110011001100110 x 1011-110100
Convert to decimal:
5179139571476070 x 23-52
Subtract the exponent:
5179139571476070 x 2-49
Turn negative exponent into division:
5179139571476070 / 249
Multiply exponent:
5179139571476070 / 562949953421312
Which equals:
9.1999999999999993
9.5
>>> float_to_bin_parts(9.5)
['0', '10000000010', '0011000000000000000000000000000000000000000000000000']
Already you can see the mantissa is only 4 digits followed by a whole lot of zeroes. But let's go through the paces.
Assemble the binary scientific notation:
1.0011 x 1011
Shift the decimal point:
10011 x 1011-100
Subtract the exponent:
10011 x 10-1
Binary to decimal:
19 x 2-1
Negative exponent to division:
19 / 21
Multiply exponent:
19 / 2
Equals:
9.5
Further reading
The Floating-Point Guide: What Every Programmer Should Know About Floating-Point Arithmetic, or, Why don’t my numbers add up? (floating-point-gui.de)
What Every Computer Scientist Should Know About Floating-Point Arithmetic (Goldberg 1991)
IEEE Double-precision floating-point format (Wikipedia)
Floating Point Arithmetic: Issues and Limitations (docs.python.org)
Floating Point Binary
This isn't a full answer (mhlester already covered a lot of good ground I won't duplicate), but I would like to stress how much the representation of a number depends on the base you are working in.
Consider the fraction 2/3
In good-ol' base 10, we typically write it out as something like
0.666...
0.666
0.667
When we look at those representations, we tend to associate each of them with the fraction 2/3, even though only the first representation is mathematically equal to the fraction. The second and third representations/approximations have an error on the order of 0.001, which is actually much worse than the error between 9.2 and 9.1999999999999993. In fact, the second representation isn't even rounded correctly! Nevertheless, we don't have a problem with 0.666 as an approximation of the number 2/3, so we shouldn't really have a problem with how 9.2 is approximated in most programs. (Yes, in some programs it matters.)
Number bases
So here's where number bases are crucial. If we were trying to represent 2/3 in base 3, then
(2/3)10 = 0.23
In other words, we have an exact, finite representation for the same number by switching bases! The take-away is that even though you can convert any number to any base, all rational numbers have exact finite representations in some bases but not in others.
To drive this point home, let's look at 1/2. It might surprise you that even though this perfectly simple number has an exact representation in base 10 and 2, it requires a repeating representation in base 3.
(1/2)10 = 0.510 = 0.12 = 0.1111...3
Why are floating point numbers inaccurate?
Because often-times, they are approximating rationals that cannot be represented finitely in base 2 (the digits repeat), and in general they are approximating real (possibly irrational) numbers which may not be representable in finitely many digits in any base.
While all of the other answers are good there is still one thing missing:
It is impossible to represent irrational numbers (e.g. π, sqrt(2), log(3), etc.) precisely!
And that actually is why they are called irrational. No amount of bit storage in the world would be enough to hold even one of them. Only symbolic arithmetic is able to preserve their precision.
Although if you would limit your math needs to rational numbers only the problem of precision becomes manageable. You would need to store a pair of (possibly very big) integers a and b to hold the number represented by the fraction a/b. All your arithmetic would have to be done on fractions just like in highschool math (e.g. a/b * c/d = ac/bd).
But of course you would still run into the same kind of trouble when pi, sqrt, log, sin, etc. are involved.
TL;DR
For hardware accelerated arithmetic only a limited amount of rational numbers can be represented. Every not-representable number is approximated. Some numbers (i.e. irrational) can never be represented no matter the system.
There are infinitely many real numbers (so many that you can't enumerate them), and there are infinitely many rational numbers (it is possible to enumerate them).
The floating-point representation is a finite one (like anything in a computer) so unavoidably many many many numbers are impossible to represent. In particular, 64 bits only allow you to distinguish among only 18,446,744,073,709,551,616 different values (which is nothing compared to infinity). With the standard convention, 9.2 is not one of them. Those that can are of the form m.2^e for some integers m and e.
You might come up with a different numeration system, 10 based for instance, where 9.2 would have an exact representation. But other numbers, say 1/3, would still be impossible to represent.
Also note that double-precision floating-points numbers are extremely accurate. They can represent any number in a very wide range with as much as 15 exact digits. For daily life computations, 4 or 5 digits are more than enough. You will never really need those 15, unless you want to count every millisecond of your lifetime.
Why can we not represent 9.2 in binary floating point?
Floating point numbers are (simplifying slightly) a positional numbering system with a restricted number of digits and a movable radix point.
A fraction can only be expressed exactly using a finite number of digits in a positional numbering system if the prime factors of the denominator (when the fraction is expressed in it's lowest terms) are factors of the base.
The prime factors of 10 are 5 and 2, so in base 10 we can represent any fraction of the form a/(2b5c).
On the other hand the only prime factor of 2 is 2, so in base 2 we can only represent fractions of the form a/(2b)
Why do computers use this representation?
Because it's a simple format to work with and it is sufficiently accurate for most purposes. Basically the same reason scientists use "scientific notation" and round their results to a reasonable number of digits at each step.
It would certainly be possible to define a fraction format, with (for example) a 32-bit numerator and a 32-bit denominator. It would be able to represent numbers that IEEE double precision floating point could not, but equally there would be many numbers that can be represented in double precision floating point that could not be represented in such a fixed-size fraction format.
However the big problem is that such a format is a pain to do calculations on. For two reasons.
If you want to have exactly one representation of each number then after each calculation you need to reduce the fraction to it's lowest terms. That means that for every operation you basically need to do a greatest common divisor calculation.
If after your calculation you end up with an unrepresentable result because the numerator or denominator you need to find the closest representable result. This is non-trivil.
Some Languages do offer fraction types, but usually they do it in combination with arbitary precision, this avoids needing to worry about approximating fractions but it creates it's own problem, when a number passes through a large number of calculation steps the size of the denominator and hence the storage needed for the fraction can explode.
Some languages also offer decimal floating point types, these are mainly used in scenarios where it is imporant that the results the computer gets match pre-existing rounding rules that were written with humans in mind (chiefly financial calculations). These are slightly more difficult to work with than binary floating point, but the biggest problem is that most computers don't offer hardware support for them.
I have code like :
debt:=debt+(p_sum-p_addition-p_paid);
before this line
debt:=0;
p_sum:=36;
p_addition:=3.6;
p_paid:=32.4;
after this line debt variable gets value like : 1.3322676e-15;
You are using binary floating point arithmetic. Your non-integer values, are not representable in binary floating point. And hence you are subject to rounding errors.
You actually want to operate with decimal rather than binary representations. If you do that then your values will all be represented exactly and the arithmetic will be exact.
In Delphi, the decimal real valued data type is Currency. I suggest that you switch to Currency for these calculations.
Required reading on this topic is What Every Computer Scientist Should Know About
Floating-Point Arithmetic.
What you're seeing is simply scientific notation for a floating point value. This is just one of a few different ways of displaying the value in the Watch List. In this case the "normal format" would be: 0.0000000000000013322676 (Note there are 15 zeroes.) The Watch List has a limit of 18 digits to display the value. So in the interests of giving you the programmer a value as accurate as possible, it uses the scientific notation.
Whenever a floating point number is very small (as in this case) or very large, it is much more efficient to display it in scientific notation. The reason your calculation produces this very small result instead of 0 is due to rounding error as explained in David's Answer.
As for getting the value "back to normal format", there's no point. The value is simply a value in memory representing the result of a calculation. Basically a series of bytes (in this particular case $3B $AA $11 $F7 $FF $FF $D7 $3C). You don't want to fiddle with this value, otherwise you might introduce more rounding errors. Also, as a programmer you should be able to read the more accurate scientific notation.
However, at some point you may want to display this value to your users. In which case you will create an appropriate string representation of the value. The most appropriate format will depend on the nature of your application. (I.e. scientific notation will still be best in some cases.)
If you want to use fixed floating point notation, you could use FormatFloat('0.00', Value); In your case the very small number will be rounded to 0.00. (BTW, you can even put that into your Watch List.)
Read the help on FormatFloat for more information. You can choose to include/exclude thousand separators, or even use scientific notation in your own display format.
Reading Scientific Notation
Scientific notation is actually quite easy to read once you understand the format:
NumbereExponent E.g. 1.3322676e-15
This basically means: take the Number and multiply it by 10 to the power of Exponent. (This equates to shifting the decimal point left/right Exponent number of digits.) A negative exponent makes the number smaller and a positive exponent makes the number bigger.
Some examples:
1 nanosecond is 1.0e-9 seconds, or 0.000000001 seconds
1 second is 1.0e+9 nanoseconds, or 100000000 nanoseconds
1.5e0 = 1.5
1.5e1 = 15
1.5e-1 = 0.15
An important convention of scientific notation is that the number before the exponent always uses exactly 1 digit (non-zero) to the left of the decimal point (with the exponent having been adjusted accordingly). This makes it easy to compare the relative size of two numbers just by looking at the exponent. E.g.
1.01846e+7 is bigger than 9.999999999e+6
1.01846e-6 is bigger than 9.999999999e-7
1.234e+250 is much bigger than 9.876e-250
1.234e+3 is bigger than 1.1234e+3 (only when exponents are equal would you need to compare the actual numbers).
I have a program that solves equations and sometimes the solutions x1 and x2 are numbers with a lot of decimal numbers. For example when Δ = 201 (Δ = discriminant) the square root gives me a floating point number.
I need a good approximation of that number because I also have a function that converts it into a fraction. So I thought to do this:
Result := FormatFloat('0.#####', StrToFloat(solx1));
The solx1 is a double. In this way, the number '456,9067896' becomes '456,90679'.
My question is this: if I approximate in this way, the fraction of 456,9067896 will be correct (and the same) if I have 456,90679?
the fraction of 456,9067896 will be correct (and the same) if I have 456,90679?
No, because 0.9067896 is unequal to 0.90679.
But why do you want to round the numbers? Just let them be as they are. Shorten them only for visual representation.
If you are worried about complete correctness of the result, you should not use floating point numbers at all, because floating points are, by definition, a rounding of real numbers. Only the first 5-6 decimal digits of a 32-bit floating point are generally reliable, the following ones are unreliable, due to machine error.
If you want complete precision, you should be using symbolic maths (rational numbers and symbolic representation for irrational/imaginary numbers).
To compare two floating point values with a given precision, just use the SameValue() function from Math unit or its sibbling CompareValue().
if SameValue(456.9067896, 456.90679, 1E-5) then ...
You can specify the precision on which the comparision will take place.
Or you can use a currency value, which has fixed arithmetic precision of 4 digits. So, it won't have rounding issue any more. But you can not do all mathematic computation with it (huge or tiny numbers are not handled properly): its main use is for accounting computations.
You should better never use string representations to compare floats, since it may be very confusing, and do not have good rounding abilities.
In "F# for Scientists" Jon Harrop says:
Roughly speaking, values of type int approximate real
numbers between min-int and max-int with a constant absolute error of +- 1/2
whereas values of the type float have an approximately-constant relative error that
is a tiny fraction of a percent.
Now, what does it mean? Int type is inaccurate?
Why C# for (1 - 0.9) returns 0.1 but F# returns 0.099999999999978 ? Is C# more accurate and suitable for scientific calculations?
Should we use decimal values instead of double/float for scientific calculations?
For an arbitrary real number, either an integral type or a floating point type is only going to provide an approximation. The integral approximation will never be off by more than 0.5 in one direction or the other (assuming that the real number fits within the range of that integral type). The floating point approximation will never be off by more than a small percentage (again, assuming that the real is within the range of values supported by that floating point type). This means that for smaller values, floating point types will provide closer approximations (e.g. storing an approximation to PI in a float is going to be much more accurate than the int approximation 3). However, for very large values, the integral type's approximation will actually be better than the floating point type's (e.g. consider the value 9223372036854775806.7, which is only off by 0.3 when represented as 9223372036854775807 as a long, but which is represented by 9223372036854780000.000000 when stored as a float).
This is just an artifact of how you're printing the values out. 9/10 and 1/10 cannot be exactly represented as floating point values (because the denominator isn't a power of two), just as 1/3 can't be exactly written as a decimal (you get 0.333... where the 3's repeat forever). Regardless of the .NET language you use, the internal representation of this value is going to be the same, but different ways of printing the value may display it differently. Note that if you evaluate 1.0 - 0.9 in FSI, the result is displayed as 0.1 (at least on my computer).
What type you use in scientific calculations will depend on exactly what you're trying to achieve. Your answer is generally only going to be approximately accurate. How accurate do you need it to be? What are your performance requirements? I believe that the decimal type is actually a fixed point number, which may make it inappropriate for calculations involving very small or very large values. Note also that F# includes arbitrary precision rational numbers (with the BigNum type), which may also be appropriate depending on your input.
No, F# and C# uses the same double type. Floating point is almost always inexact. Integers are exact though.
UPDATE:
The reason why you are seeing a difference is due to the printing of the number, not the actual representation.
For the first point, I'd say it says that int can be used to represent any real number in the intger's range, with a constant maximum error in [-0,5, 0.5]. This makes sense. For instance, pi could be represented by the integer value 3, with an error smaller than 0.15.
Floating point numbers don't share this property; their maximum absolute error is not independent of the value you're trying to represent.
3 - This depends on calculations: sometimes float is a good choice, sometimes you can use int. But there are tasks when you lack of precision for any of float and decimal.
The reason against using int:
> 1/2;;
val it : int = 0
The reason against using float (also known as double in C#):
> (1E-10 + 1E+10) - 1E+10;;
val it : float = 0.0
The reason against BCL decimal:
> decimal 1E-100;;
val it : decimal = 0M
Every listed type has it's own drawbacks.