Floating point literals

I've two questions regarding writing IEEE floating-point constants, as accepted by Z3's FPA logics:
First, in this question, Christoph used the example:
((_ asFloat 11 53) roundTowardZero 0.5 0)
I'm wondering what the final 0 signifies? I've tried:
((_ asFloat 11 53) roundTowardZero 0.5)
And that seems to work as well. Rümmer's paper doesn't seem to require the final 0 either, so I'm curious what role it plays.
Second, when I get a model from Z3, it prints floating-point literals like so:
(as +1.0000000000000002220446049250313080847263336181640625p1 (_ FP 11 53))
How do I interpret the p1 suffix? What other suffixes are possible?
Thanks..

Thanks for pointing these issues out. Both of them are because there is no agreed upon standard for floating-point literals in the input or output yet.
The final 0 in the example represents the (binary) exponent, i.e., (... 0.5 1) == 1.0. We added this simply because numbers sometimes would require a lot of space if the exponent cannot be specified separately. This way, we can often specify them quite succinctly.
The p1 suffix in the output represents the binary exponent, i.e., where e8 means 10^8, the suffix p8 would mean 2^8. Z3 currently uses only binary exponents, so there would always be a p-suffix here, but this may change in the future. The rest of the number is given enough decimal digits to represent a precise result.
Note that the output format is not agreed upon yet by the SMT community. This may change in the future. For instance, there are discussions about whether this should be done in IEEE bit-vector format or an intermediate format that lies somewhere between reals and non-IEEE bit-vectors.
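To make the p-suffix reading concrete, here is a small Python sketch. It is not part of Z3: the helper name parse_fp_literal is made up for illustration, and it assumes the numeral has already been pulled out of the surrounding (as ... (_ FP 11 53)) wrapper. It treats the text before p as an exact decimal and scales it by the power of two given after p.

```python
from fractions import Fraction

def parse_fp_literal(text):
    """Return the exact rational value of a '<decimal>p<binary exponent>' numeral."""
    mantissa, sep, exponent = text.partition("p")
    binary_exponent = int(exponent) if sep else 0   # no p-suffix: exponent 0
    return Fraction(mantissa) * Fraction(2) ** binary_exponent

literal = "+1.0000000000000002220446049250313080847263336181640625p1"
value = parse_fp_literal(literal)
print(value)          # exact rational value
print(float(value))   # the same value as a Python float (binary64)
```

Under this reading the literal from the question is the double immediately above 2.0, and 0.5p1 is 1.0, which matches the (... 0.5 1) == 1.0 example for the input syntax.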

Related

Why C++ strtod parses "708530856168225829.3221614e9" to 7.08530856168225898e+26 instead of 7.08530856168225761e+26?

While writing a custom floating point number parser (for speed reasons) and checking the precision against strtod (that I assume to be extremely accurate) I found that sometimes the naive approach of using
number = (int_part + dec_part/pow(10., no_of_decs)) * pow(10., expo)
seems to be actually "more accurate" (when computation is done using long double and then result converted back to a double) than strtod result and that is surprising.
Do official IEEE754 parsing rules actually mandate a less accurate result?
For example with the string
708530856168225829.3221614e9
the naive computation gives
7.08530856168225761e+26
that seems closer than result of strtod
7.08530856168225898e+26
to the "theoretical" result (that cannot be represented exactly by a 64-bit double)
7.085308561682258293221614e+26
(experiments were done with g++ (GCC) 10.2.0 and clang++ 11.1.0 on Arch linux, and they both agree on ...898e+26 for strtod and ...761e+26 for naive computation)

As you note, 7.085308561682258293221614e+26 is not representable in IEEE-754 double precision (binary64). Therefore, it is not a candidate result and plays no role in determining the result.
The two numbers representable in binary64 closest to 708530856168225829.3221614e9 are 708530856168225760595673088 and 708530856168225898034626560. Writing out the original fully and lining them up for inspection with the original in the middle, we have:
708530856168225760595673088 representable value below original
708530856168225829322161400 original number
708530856168225898034626560 representable value above original
Subtracting gives the absolute differences between the lower and the original and between the original and the higher:
68726488312 distance to lower
68712465160 distance to higher
and therefore the higher number, 708530856168225898034626560, is closer to the original. This is in fact the result you report, and therefore the software is behaving correctly.
Observe that it is a mistake to think of binary64 in decimal without all significant digits. Writing out the partial decimal numerals as we did the full numbers above, we have:
7.08530856168225761e+26 proposed result
7 08530856168225829.3221614e9 original number
7.08530856168225898e+26 reported result of strtod
with differences:
68322161400 distance to lower
68677838600 distance to higher
Thus, rounding the actual values of the floating-point numbers to decimal numerals without all the digits introduced errors and portrayed incorrect values. Binary floating-point numbers are not and do not represent decimal numerals, and displaying them without all significant digits shows incorrect values.
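The comparison above can be reproduced with exact integer arithmetic. The following Python sketch assumes that Python's float() rounds correctly (CPython does) and uses math.nextafter, which needs Python 3.9 or later; the variable names are only illustrative.

```python
import math

text = "708530856168225829.3221614e9"
exact = 708530856168225829322161400             # the decimal string written out in full

parsed = float(text)                            # what a correctly rounding parser returns
upper = int(parsed)                             # exact value of that double
lower = int(math.nextafter(parsed, 0.0))        # the neighbouring double just below it

print(lower, "distance to lower:", exact - lower)
print(upper, "distance to higher:", upper - exact)
```

Converting the doubles to int keeps every digit, which is the point of the answer: the comparison has to be made on the full values, not on 18-digit printouts.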

The value 708530856168225829.3221614e9 lies between two doubles.
7.08530856168225 760 59567309...e+26 // lower double
7.08530856168225 829 31514982...e+26 // half way
7 08530856168225 829.3221614e9 // OP's code
7.08530856168225 898 03462656...e+26 // upper double
1 23456789012345 678 90 // Significant digit count
It is very nearly halfway between those two doubles.
In this case I say 7.08530856168225 898 03462656...e+26 from strtod() is the better answer, and OP's naïve computation is inferior because of the cumulative rounding errors injected by the division, multiplication and addition.
Note: IEEE 754 does not require infinite precision when parsing text. It requires using at least N+3 significant decimal digits (I believe N == 17 for binary64, AKA double).
When converting from truncated text, the answer may differ from the one obtained with more digits in nearly halfway cases. Still, in this case, even truncating to 20 digits, the upper double is the better choice.

Bit encoding for vector of rational numbers

I would like to implement ultra compact storage for structures with rational numbers.
In the book "Theory of Linear and Integer Programming" by Alexander Schrijver, I found the definition of bit sizes (page. 15) of rational number, vector and matrix:
The representation of rational number is clear: single bit for sign and logarithm for quotient and fraction.
I can't figure out how vector can be encoded only in n bits to distinguish between its elements?
For example what if I would like to write vector of two elements:
524 = 1000001100b, 42 = 101010b. How can I use only 2 additional bits to specify when 1000001100 ends and 101010 starts?
The same problem exists with matrix representation.

Of course, it is not possible just to append the integer representations to each other, and add the information about the merging place, since this would take much more bits than given by the formula in the book, which I don't have access to.
I believe this is a problem from coding theory where I am not an expert. But I found something that might point you to the right direction. In this post an "interpolative code" is described among others. If you apply it to your example (524, 42), you get
f (the number of integers to be encoded, all in the range [1,N]) = 2
N = 524
The maximum bit length of the encoded 2 integers is then
f • (2.58 + log (N/f)) = 9,99…, i.e. 10 bits
Thus, it is possible to have ultra compact encoding, although one had to spend a lot of time for coding and decoding.

It is impossible to use only two bits to specify where the quotient ends and the fraction starts. You will need at least as many bits as it takes to record the length of the quotient and/or the length of the fraction. Another way is to use a fixed number of bits for both quotient and fraction, similar to IEEE 754.
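To illustrate what a self-delimiting layout costs, here is a Python sketch using the Elias gamma code. This is a swapped-in textbook technique, not the encoding from Schrijver's book or the interpolative code mentioned above: each positive integer is written as (length - 1) zero bits followed by its binary digits, so the decoder always knows where a number ends.

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer, as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary     # unary length prefix, then the digits

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":                 # count the length prefix
            zeros += 1
            pos += 1
        values.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return values

stream = gamma_encode(524) + gamma_encode(42)
print(len(stream), gamma_decode_all(stream))
```

For 524 (10 binary digits) and 42 (6 binary digits) this takes 19 + 11 = 30 bits, so the delimiting information costs roughly as much as the numbers themselves, far more than two extra bits.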

Good way to approximate a floating point number

I have a program that solves equations and sometimes the solutions x1 and x2 are numbers with a lot of decimal numbers. For example when Δ = 201 (Δ = discriminant) the square root gives me a floating point number.
I need a good approximation of that number because I also have a function that converts it into a fraction. So I thought to do this:
Result := FormatFloat('0.#####', StrToFloat(solx1));
The solx1 is a double. In this way, the number '456,9067896' becomes '456,90679'.
My question is this: if I approximate in this way, the fraction of 456,9067896 will be correct (and the same) if I have 456,90679?

the fraction of 456,9067896 will be correct (and the same) if I have 456,90679?
No, because 0.9067896 is unequal to 0.90679.
But why do you want to round the numbers? Just let them be as they are. Shorten them only for visual representation.
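A quick way to see that the two decimals give different fractions is Python's fractions module, used here only as an illustration (the question itself is about Delphi). Note that Fraction expects a decimal point, not the decimal comma used in the question. The limit_denominator call shows one way to get a compact fraction from the full-precision value instead of from a rounded string; the bound of 100000 is an arbitrary choice.

```python
from fractions import Fraction

full = Fraction("456.9067896")       # exact fraction of the unrounded decimal
rounded = Fraction("456.90679")      # exact fraction of the 5-digit rounding
print(full, rounded, full == rounded)

# Approximate sqrt(201) by a fraction with a bounded denominator,
# starting from the double itself rather than from formatted text.
root = 201 ** 0.5
approx = Fraction(root).limit_denominator(100000)
print(approx, float(approx) - root)
```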

If you are worried about complete correctness of the result, you should not use floating point numbers at all, because floating point values are, by definition, roundings of real numbers. Only the first 6-7 significant decimal digits of a 32-bit float (about 15-16 for a 64-bit double) are generally reliable; the digits after that are unreliable because of rounding error.
If you want complete precision, you should be using symbolic maths (rational numbers and symbolic representation for irrational/imaginary numbers).

To compare two floating point values with a given precision, just use the SameValue() function from the Math unit or its sibling CompareValue().
if SameValue(456.9067896, 456.90679, 1E-5) then ...
You can specify the precision at which the comparison takes place.
Or you can use a Currency value, which has a fixed precision of 4 decimal digits, so it no longer has this rounding issue. But you cannot do every mathematical computation with it (huge or tiny numbers are not handled properly): its main use is accounting computations.
You should never use string representations to compare floats: the results can be very confusing, and string formatting does not give you good control over rounding.
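For readers without Delphi at hand, the same idea can be sketched in Python. The function same_value below is a hypothetical stand-in that assumes a plain absolute-tolerance test, |a - b| <= epsilon; it is not a port of Delphi's SameValue and may differ from it in details such as the default epsilon.

```python
import math

def same_value(a, b, epsilon):
    """True when a and b differ by at most epsilon (absolute tolerance only)."""
    return math.isclose(a, b, rel_tol=0.0, abs_tol=epsilon)

print(same_value(456.9067896, 456.90679, 1e-5))   # difference is about 4e-7
print(same_value(456.9067896, 456.90679, 1e-8))   # tolerance tighter than the difference
```

The tolerance has to be chosen for the problem: with 1e-5 the two values count as equal, with 1e-8 they do not.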

Delphi Math: Why is 0.7<0.70?

If I have variables a, b, and c of type double, let c:=a/b, and give a and b values of 7 and 10, then c's value of 0.7 registers as being LESS THAN 0.70.
On the other hand, if the variables are all type extended, then c's value of 0.7 does not register as being less than 0.70.
This seems strange. What information am I missing?

First, it needs to be noted that float literals in Delphi are of the Extended type. So when you compare a Double to a literal, the Double is probably first "expanded" to Extended, and then compared. (Edit: this is true only in 32-bit applications. In 64-bit applications, Extended is an alias of Double.)
Here, every ShowMessage will be displayed.
procedure DoSomething;
var
  A, B : Double;
begin
  A := 7/10;
  B := 0.7; // Here, we lower the precision of "0.7" to Double
  // Here, A is expanded to Extended, but it has already lost precision.
  // This is (kind of) similar to doing Round(0.7) <> 0.7
  if A <> 0.7 then
    ShowMessage('Weird');
  if A = B then // Here it works correctly.
    ShowMessage('Ok...');
  // Still, the best way to go (SameValue comes from the Math unit)
  if SameValue(A, 0.7, 0.0001) then
    ShowMessage('That will never fail you');
end;
Here is some literature for you:
What Every Computer Scientist Should Know About Floating-Point Arithmetic

There is no representation for the mathematical number 0.7 in binary floating-point. Your statement computes in c the closest double, which (according to what you say, I didn't check) is a little below 0.7.
In extended precision the closest floating-point number to 0.7 is a different value, closer to 0.7 than the closest double is, which is why the double compares as less than the Extended literal. But there still is no exact representation for 0.7. There isn't any at any precision in binary floating-point.
As a rule of thumb, any non-integer number whose last non-zero decimal is not 5 cannot be represented exactly as a binary floating-point number (the converse is not true: 0.05 cannot be represented exactly either).
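The nearest representable values can be computed exactly with rational arithmetic. The Python sketch below assumes a 53-bit significand for Double and a 64-bit significand for the x87 Extended type; nearest_binary is a made-up helper that only handles values in [1/2, 1), which is enough for 0.7.

```python
from fractions import Fraction

def nearest_binary(value, significand_bits):
    """Nearest number with the given significand width, for 1/2 <= value < 1."""
    scale = 2 ** significand_bits
    return Fraction(round(value * scale), scale)   # round() on a Fraction is exact

seven_tenths = Fraction(7, 10)
as_double = nearest_binary(seven_tenths, 53)       # binary64 significand
as_extended = nearest_binary(seven_tenths, 64)     # 80-bit extended significand

print(as_double == Fraction(0.7))                  # agrees with the hardware double
print(float(seven_tenths - as_double))             # gap between 0.7 and the double
print(float(seven_tenths - as_extended))           # gap between 0.7 and the extended value
print(as_double < as_extended)
```

Under these assumptions both roundings land slightly below 0.7, and the double is the smaller of the two, which is consistent with c comparing as less than the literal 0.70.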

It has to do with the number of digits of precision in the two different floating point types you're using, and the fact that a lot of numbers cannot be represented exactly, regardless of precision. (From the pure math side: irrational numbers outnumber rationals)
Take 2/3, for example. It can't be represented exactly in decimal. With 4 significant digits, it would be represented as 0.6667. With 8 significant digits, it would be 0.66666667.
The trailing 7 is a round-up, reflecting that the next digit would be 5 or more if there was room to keep it.
0.6667 is greater than 0.66666667, so the computer will evaluate 2/3 (4 digits) > 2/3 (8 digits).
The same is true with your .7 vs .70 in double and extended vars.
To avoid this specific issue, try to use the same numeric type throughout your code. When working with floating point numbers in general, there are a lot of little things you have to watch out for. The biggest is to not write your code to compare two floats for equality: even if they should be the same value, there are many factors in calculations that can make them end up a very tiny bit different. Instead of comparing for equality, you need to test that the difference between the two numbers is very small. How small the difference has to be is up to you and the nature of your calculations, and it is usually referred to as epsilon, a name taken from calculus theorems and proofs.
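The 2/3 analogy can be tried out directly with Python's decimal module, which lets the number of significant digits be set per computation. This is only a decimal illustration of the effect; Double and Extended round in binary, not decimal.

```python
from decimal import Context, Decimal

two, three = Decimal(2), Decimal(3)
short = Context(prec=4).divide(two, three)    # 2/3 kept to 4 significant digits
longer = Context(prec=8).divide(two, three)   # 2/3 kept to 8 significant digits

print(short, longer, short > longer)

# Comparing with a tolerance instead of testing for equality:
epsilon = Decimal("0.0001")
print(abs(short - longer) <= epsilon)
```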

You're missing This Thing.
See especially the 'Accuracy problems' chapter. See also Pascal's answer.
In order to fix your code without using the Extended type, you must add the Math unit and use the SameValue function from there which is especially built for this purpose.
Be sure to use an Epsilon value different than 0 when you use the SameValue in your case.
For example:
var
  a, b, c: double;
begin
  a:=7; b:=10;
  c:=a/b;
  if SameValue(c, 0.70, 0.001) then
    ShowMessage('Ok')
  else
    ShowMessage('Wrong!');
end;
HTH

How to manually parse a floating point number from a string

Of course most languages have library functions for this, but suppose I want to do it myself.
Suppose that the float is given like in a C or Java program (except for the 'f' or 'd' suffix), for example "4.2e1", ".42e2" or simply "42". In general, we have the "integer part" before the decimal point, the "fractional part" after the decimal point, and the "exponent". All three are integers.
It is easy to find and process the individual digits, but how do you compose them into a value of type float or double without losing precision?
I'm thinking of multiplying the integer part with 10^n, where n is the number of digits in the fractional part, and then adding the fractional part to the integer part and subtracting n from the exponent. This effectively turns 4.2e1 into 42e0, for example. Then I could use the pow function to compute 10^exponent and multiply the result with the new integer part. The question is, does this method guarantee maximum precision throughout?
Any thoughts on this?

All of the other answers have missed how hard it is to do this properly. You can do a first cut approach at this which is accurate to a certain extent, but until you take into account IEEE rounding modes (et al), you will never have the right answer. I've written naive implementations before with a rather large amount of error.
If you're not scared of math, I highly recommend reading the following article by David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic. You'll get a better understanding for what is going on under the hood, and why the bits are laid out as such.
My best advice is to start with a working atoi implementation, and move out from there. You'll rapidly find you're missing things, but a few looks at strtod's source and you'll be on the right path (which is a long, long path). Eventually you'll praise <insert deity here> that there are standard libraries.
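/* use this to start your atof implementation */
/* atoi - christopher.watford#gmail.com */
/* PUBLIC DOMAIN */
#include <ctype.h>   /* isspace, isdigit */
#include <errno.h>   /* errno, ERANGE */
#include <limits.h>  /* LONG_MAX, LONG_MIN */

/* Named my_atol here: it returns long, which would clash with the standard atoi. */
long my_atol(const char *value) {
    unsigned long ival = 0, limit, d, i = 0;
    int c, neg = 0;

    while (isspace((unsigned char)value[i])) ++i;    /* chomp leading spaces */
    c = value[i];
    if (c == '-' || c == '+') {                      /* chomp sign */
        neg = (c == '-');
        i++;
    }
    /* largest magnitude that fits: LONG_MAX, or LONG_MAX + 1 for negatives */
    limit = neg ? (unsigned long)LONG_MAX + 1UL : (unsigned long)LONG_MAX;
    while ((c = (unsigned char)value[i++]) != 0) {   /* parse number */
        if (!isdigit(c)) return 0;
        d = (unsigned long)(c - '0');
        if (ival > (limit - d) / 10) {               /* check before mult/accum */
            errno = ERANGE;                          /* report overflow/underflow */
            return neg ? LONG_MIN : LONG_MAX;
        }
        ival = (ival * 10) + d;                      /* mult/accum */
    }
    if (neg) return ival == limit ? LONG_MIN : -(long)ival;
    return (long)ival;
}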

The "standard" algorithm for converting a decimal number to the best floating-point approximation is William Clinger's How to read floating point numbers accurately, downloadable from here. Note that doing this correctly requires multiple-precision integers, at least a certain percentage of the time, in order to handle corner cases.
Algorithms for going the other way, printing the best decimal number from a floating-point number, are found in Burger and Dybvig's Printing Floating-Point Numbers Quickly and Accurately, downloadable here. This also requires multiple-precision integer arithmetic.
See also David M Gay's Correctly Rounded Binary-Decimal and Decimal-Binary Conversions for algorithms going both ways.
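The role of multiple-precision integers can be shown with a short Python sketch. This is not Clinger's or Gay's algorithm, just the brute-force idea behind them: keep the decimal digits and the power of ten as exact big integers and round only once at the very end. It relies on CPython rounding int-to-float conversion and int/int division correctly; parse_decimal is a hypothetical helper that does no input validation and does not handle overflow to infinity.

```python
def parse_decimal(text):
    """Convert text such as '4.2e1', '.42e2' or '42' to the nearest double."""
    mantissa, _, exp_text = text.lower().partition("e")
    int_part, _, frac_part = mantissa.partition(".")
    sign = -1.0 if int_part.startswith("-") else 1.0
    int_part = int_part.lstrip("+-")
    digits = int((int_part + frac_part) or "0")          # all digits as one big integer
    exponent = int(exp_text or "0") - len(frac_part)     # shift for the fractional digits
    if exponent >= 0:
        return sign * float(digits * 10 ** exponent)     # exact product, one rounding
    return sign * (digits / 10 ** -exponent)             # exact operands, one rounding

for sample in ("4.2e1", ".42e2", "42", "0.1", "708530856168225829.3221614e9"):
    print(sample, parse_decimal(sample), parse_decimal(sample) == float(sample))
```

This is slow for large exponents, which is why the cited papers work hard to use big integers only in the difficult cases.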

I would directly assemble the floating point number using its binary representation.
Read in the number one character after another and first find all digits. Do that in integer arithmetic. Also keep track of the decimal point and the exponent. This one will be important later.
Now you can assemble your floating point number. The first thing to do is to scan the integer representation of the digits for the first set one-bit (highest to lowest).
The bits immediately following the first one-bit are your mantissa.
Getting the exponent isn't hard either. You know the first one-bit position, the position of the decimal point and the optional exponent from the scientific notation. Combine them and add the floating point exponent bias (127 for single precision, but check a reference please).
This biased exponent should be somewhere in the range of 1 to 254. If it's larger you have a positive or negative infinity; if it's smaller you have a zero or a denormalized number (special cases).
Store the exponent as is into bits 23 to 30 of your float.
The most significant bit is simply the sign. One means negative, zero means positive.
It's harder to describe than it really is, try to decompose a floating point number and take a look at the exponent and mantissa and you'll see how easy it really is.
Btw - doing the arithmetic in floating point itself is a bad idea because you will always force your mantissa to be rounded to 24 significant bits (23 stored bits plus the implicit leading one). You won't get an exact representation that way.
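A minimal Python sketch of the bit assembly, restricted to the easy case of an integer below 2^24 with no decimal point or decimal exponent (handling those needs the big-integer work discussed in the other answers). It assumes the IEEE 754 single-precision layout: sign in bit 31, biased exponent in bits 23 to 30, fraction in bits 0 to 22. The helper name assemble_float32 is made up.

```python
import struct

def assemble_float32(n, negative=False):
    """Build a float32 from an integer 0 < n < 2**24 by writing its bit fields."""
    if not 0 < n < 2 ** 24:
        raise ValueError("this sketch only handles integers that fit in 24 bits")
    top = n.bit_length() - 1                      # position of the first set one-bit
    fraction = (n - (1 << top)) << (23 - top)     # bits after the leading one, left-aligned
    biased_exponent = top + 127                   # add the single-precision bias
    bits = (int(negative) << 31) | (biased_exponent << 23) | fraction
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(assemble_float32(42), assemble_float32(42, negative=True))
```

For 42 = 101010b the first one-bit is at position 5, so the stored exponent is 132 and the fraction field starts with 0101.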

You could ignore the decimal when parsing (except for its location). Say the input was:
156.7834e10... This could easily be parsed into the integer 1567834 followed by e10, which you'd then modify to e6, since the decimal was 4 digits from the end of the "numeral" portion of the float.
Precision is an issue. You'll need to check the IEEE spec of the language you're using. If the number of bits in the Mantissa (or Fraction) is larger than the number of bits in your Integer type, then you'll possibly lose precision when someone types in a number such as:
5123.123123e0 - converts to 5123123123 in our method, which does NOT fit in a 32-bit Integer, but the bits for 5.123123123 may fit in the mantissa of the float spec.
Of course, you could use a method that takes each digit in front of the decimal, multiplies the current total (in a float) by 10, then adds the new digit. For digits after the decimal, multiply the digit by a growing power of 10 before adding to the current total. This method seems to beg the question of why you're doing this at all, however, as it requires the use of the floating point primitive without using the readily available parsing libraries.
Anyway, good luck!

Yes, you can decompose the construction into floating point operations as long as these operations are EXACT, and you can afford a single final inexact operation.
Unfortunately, floating point operations soon become inexact, when you exceed precision of mantissa, the results are rounded. Once a rounding "error" is introduced, it will be cumulated in further operations...
So, generally, NO, you can't use such naive algorithm to convert arbitrary decimals, this may lead to an incorrectly rounded number, off by several ulp of the correct one, like others have already told you.
BUT LET'S SEE HOW FAR WE CAN GO:
If you carefully reconstruct the float like this:
if(biasedExponent >= 0)
return integerMantissa * (10^biasedExponent);
else
return integerMantissa / (10^(-biasedExponent));
there is a risk to exceed precision both when cumulating the integerMantissa if it has many digits, and when raising 10 to the power of biasedExponent...
Fortunately, if first two operations are exact, then you can afford a final inexact operation * or /, thanks to IEEE properties, the result will be rounded correctly.
Let's apply this to single precision floats which have a precision of 24 bits.
10^8 > 2^24 > 10^7
Noting that multiple of 2 will only increase the exponent and leave the mantissa unchanged, we only have to deal with powers of 5 for exponentiation of 10:
5^11 > 2^24 > 5^10
Thus, you can afford 7 digits of precision in the integerMantissa and a biasedExponent between -10 and 10.
In double precision, 53 bits,
10^16 > 2^53 > 10^15
5^23 > 2^53 > 5^22
So you can afford 15 decimal digits, and a biased exponent between -22 and 22.
It's up to you to see if your numbers will always fall in the correct range... (If you are really tricky, you could arrange to balance mantissa and exponent by inserting/removing trailing zeroes).
Otherwise, you'll have to use some extended precision.
If your language provides arbitrary precision integers, then it's a bit tricky to get it right, but not that difficult, I did this in Smalltalk and blogged about it at http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html and http://smallissimo.blogspot.fr/2011/09/reviewing-fraction-asfloat.html
Note that these are simple and naive implementations. Fortunately, libc is more optimized.
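The double-precision case described above can be sketched in Python as follows. The function fast_path is hypothetical; it returns None whenever the exactness conditions fail, so that a caller can fall back to a slower exact method. It tests the integer against 2^53 directly, which covers every integer of up to 15 decimal digits and some longer ones.

```python
def fast_path(digits, exponent):
    """Return digits * 10**exponent as a correctly rounded double, or None.

    Exact preconditions: digits < 2**53 and |exponent| <= 22, so both operands
    are exactly representable and the single * or / rounds only once.
    """
    if not 0 <= digits < 2 ** 53 or abs(exponent) > 22:
        return None
    power = float(10 ** abs(exponent))       # exact: 10**22 still fits in 53 bits
    return digits * power if exponent >= 0 else digits / power

print(fast_path(12345, -9))      # 1.2345e-5 written as 12345 * 10^-9
print(fast_path(1567834, 6))     # 156.7834e10 written as 1567834 * 10^6
print(fast_path(1, 23))          # outside the exact range, so None
```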

My first thought is to parse the string into an int64 mantissa and an int decimal exponent using only the first 18 digits of the mantissa. For example, 1.2345e-5 would be parsed into 12345 and -9. Then I would keep multiplying the mantissa by 10 and decrementing the exponent until the mantissa was 18 digits long (>56 bits of precision). Then I would look the decimal exponent up in a table to find a factor and binary exponent that can be used to convert the number from decimal n*10^m to binary p*2^q form. The factor would be another int64 so I'd multiply the mantissa by it such that I obtained the top 64-bits of the resulting 128-bit number. This int64 mantissa can be cast to a float losing only the necessary precision and the 2^q exponent can be applied using multiplication with no loss of precision.
I'd expect this to be very accurate and very fast but you may also want to handle the special numbers NaN, -infinity, -0.0 and infinity. I haven't thought about the denormalized numbers or rounding modes.

For that you have to understand the IEEE 754 standard in order to build the proper binary representation. After that you can use Float.intBitsToFloat or Double.longBitsToDouble.
http://en.wikipedia.org/wiki/IEEE_754

If you want the most precise result possible, you should use a higher internal working precision, and then downconvert the result to the desired precision. If you don't mind a few ULPs of error, then you can just repeatedly multiply by 10 as necessary with the desired precision. I would avoid the pow() function, since it will produce inexact results for large exponents.

It is not possible to convert any arbitrary string representing a number into a double or float without losing precision. There are many fractional numbers that can be represented exactly in decimal (e.g. "0.1") that can only be approximated in a binary float or double. This is similar to how the fraction 1/3 cannot be represented exactly in decimal, you can only write 0.333333...
If you don't want to use a library function directly why not look at the source code for those library functions? You mentioned Java; most JDKs ship with source code for the class libraries so you could look up how the java.lang.Double.parseDouble(String) method works. Of course something like BigDecimal is better for controlling precision and rounding modes but you said it needs to be a float or double.

Using a state machine. It's fairly easy to do, and even works if the data stream is interrupted (you just have to keep the state and the partial result). You can also use a parser generator (if you're doing something more complex).
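A minimal Python sketch of such a state machine, assuming the simple grammar from the question (optional sign, digits with an optional decimal point, optional exponent). The state names and the function scan_float are made up; it only splits the text into sign, digit integer and decimal exponent, leaving the final conversion to whichever method is chosen.

```python
DIGITS = "0123456789"

def scan_float(text):
    """Return (sign, digits, exponent) so that value = sign * digits * 10**exponent."""
    state = "sign"
    sign, digits, frac_len = 1, 0, 0
    exp_sign, exp = 1, 0
    have_digits = have_exp_digits = False
    for ch in text:
        if state == "sign":                      # optional leading sign
            state = "int"
            if ch in "+-":
                sign = -1 if ch == "-" else 1
                continue
        if state == "exp_sign":                  # optional exponent sign
            state = "exp"
            if ch in "+-":
                exp_sign = -1 if ch == "-" else 1
                continue
        if state in ("int", "frac") and ch in DIGITS:
            digits = digits * 10 + int(ch)
            frac_len += state == "frac"          # count digits after the point
            have_digits = True
        elif state == "int" and ch == ".":
            state = "frac"
        elif state in ("int", "frac") and ch in "eE" and have_digits:
            state = "exp_sign"
        elif state == "exp" and ch in DIGITS:
            exp = exp * 10 + int(ch)
            have_exp_digits = True
        else:
            raise ValueError(f"unexpected {ch!r} in state {state}")
    if not have_digits or (state in ("exp_sign", "exp") and not have_exp_digits):
        raise ValueError("incomplete number")
    return sign, digits, exp_sign * exp - frac_len

print(scan_float("4.2e1"), scan_float(".42e2"), scan_float("42"))
```

Because all progress lives in a handful of variables, the loop could be suspended and resumed if the input arrives in pieces, which is the property mentioned above.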

I agree with terminus. A state machine is the best way to accomplish this task as there are many stupid ways a parser can be broken. I am working on one now, I think it is complete and it has I think 13 states.
The problem is not trivial.
I am a hardware engineer interested in designing floating point hardware. I am on my second implementation.
I found this today http://speleotrove.com/decimal/decarith.pdf
which on page 18 gives some interesting test cases.
Yes, I have read Clinger's article, but being a simple-minded hardware engineer, I can't get my mind around the code presented. The reference to Steele's algorithm as answered in Knuth's text was helpful to me. Both input and output are problematic.
All of the aforementioned references to various articles are excellent.
I have yet to sign up here just yet, but when I do, assuming the login is not taken, it will be broh. (broh-dot).
Clyde
