Why do I have big precision errors on GPU?

Why do I have big precision errors on GPU? - ios

I am doing a series of calculations on GPU that requires a good enough precision, but it seems I am getting a much lower precision than when using float on CPU.
For starters, when I load a value of 0.01 in a float buffer, it gets loaded as 0.009995 in the shader. Why is that? I would think 0.01 is a value in range for float vectors (using the simd library available for Metal).
Then, when doing a simple operation like this, the precision gets visibly worse:
simd::float4 p = simd::float4 { -0.04, -0.07, 0, 1 };
simd::float4 v = myMatrix * p;
v *= 1.0 / v.w;
p in the example is what I expect and use in the CPU test; on the GPU it is calculated as { -0.039978, -0.069946, 0.0, 1.0 }, with one integer subtraction and one float multiplication by the already wrong 0.009995.
What I would expect to get from v is { -0.010627, 0.006991, -0.034100 } (calculated with the simd library on CPU, already worse precision than using doubles, { -0.010613, 0.006982, -0.034056 }, but bearable).
What I get instead is { -0.010483, 0.006405, -0.044067 }. This gets much worse with subsequent operations and the result becomes quickly unusable.
Why is the result so different even if using the same precision and why float data is not loaded 1:1? I tried disabling the fast math option for Metal, but it didn't change anything.

Alas, it was not a precision issue, as the way I setup the test wasn't correct, so the GPU wasn't using the numbers I actually thought it used.

You can't have 0.01 in float value because there is no binary representation. That is why 0.009995 was used. I guess SO already has good answers about floating point number representation in binary so you need just search.
Here is good tool for checking how float number looks like in binary. If you enter 0.01 you see this:
Decimal Representation: 0.01
Binary Representation: 00111100001000111101011100001010
After casting to double precision: 0.009999999776482582

Related

Objective C float vs int, CGPoint vs custom int based struct performance

Based on the arguments in this post: Performance of Built-in types, can I conclude that my custom implementation of a int based point structure is faster or more efficient than the float-based CGPoint? I have reviewed many posts concerning the type performance differences but have not found one that includes scenarios further wrapped by a structure.
Thanks.
// Coord
typedef struct {
int x;
int y;
} Coord;
CG_INLINE Coord CoordMake(int x, int y){
Coord coord; coord.x = x; coord.y = y; return coord;
}
CG_INLINE bool CoordEqualToCoord(Coord coord, Coord anotherCoord) {
return coord.x == anotherCoord.x && coord.y == anotherCoord.y;
}
CG_INLINE CGPoint CGPointForCoord(Coord coord) {
return CGPointMake(coord.x, coord.y);
}
EDIT: I have done purely arithmetical tests and the results are really negligible until millions of iterations, which my application will not come close to doing. I will continue to use the Coord typedef but will remove the struct for a few of the reasons #meaning-matters suggests. For the record the tests did show that the int based structure was about 30% faster, but 30% of 0.0001 seconds is not really something anyone should care about. I am still interested in the points and counter-points on which implementation is better.

It depends on what you are doing with it. For ordinary arithmetic, throughput can be similar. Integer latency is usually a bit less. On some processors, the latency to L1 is better for GPRs than FPR. So, for many tests, the results will come out the same or give a small edge for integer computation. The balance will flip the other way for double vs int64_t computation on 32-bit machines. (If you are writing CPU vector code and can get away with 16-bit computation then it would be much faster to use integer.)
However, in the case of calculating coordinates/addresses for purposes of loading or storing data into/from a register, integer is clearly better on a CPU. The reason is that the load or store instruction can take an integer operand as an index into an array, but not a floating point one. To use floating point coordinates, you at minimum have to convert to integer first, then load or store, so it should be always slower. Typically, there will also have to be some rounding mode set as well (e.g. a floor() operation) and maybe some non-trivial operation to account for edging modes, such as a CL_ADDRESS_REPEAT addressing mode. Contrast that to a simple AND operation, which may be all that is necessary to achieve the same thing on integer and it should be clear that integer is a much cheaper format.
On GPUs, which emphasize floating-point computation a bit more and may not invest much in integer computation (even though it is easier), the story is quite different. There you can expect texture unit hardware to use the floating point value directly to find the required data. The floating point arithmetic to find the right coordinate is built in to the hardware and therefore "free" (if we ignore energy consumption considerations) and graphics APIs like GL or CL are built around it.
Generally speaking, though ubiquitous in graphics APIs, floating-point itself is a numerically inferior choice for a coordinate system for sampled data. It lumps too much precision in one corner of the image and may cause quantization errors / inconsistencies at the far corners of the image, leading to reduced precision for linear sampling and unexpected rounding effects. For large enough images, some pixels in the image may become unaddressable by the coordinate system, because no floating-point number exists which references that position. It is probably the case that the default rounding mode, round to nearest ties to even is undesirable for coordinate systems because linear filtering will often place the coordinate half way between two integer values, resulting in a round up for even pixels and round down for odd. This causes pixel duplication rather than the expected result in the worst case where they are all hell ways cases and the stride is 1. It is nice in that it is somewhat easier to use.
A fixed-point coordinate system allows for consistent coordinate precision and rounding across the entire surface and will avoid these problems. Modulo overflow feeds nicely into some common edging modes. Precision is predictable.

Confirmed by a quick search 32-bit int and float operations seem equally fast on ARM processors (and take 1 CPU cycle each). Please look for yourself and do a simple test as Zev Eisenberg correctly suggests.
Then it's not a good idea to start writing your own CGPoint stuff using ints for the following reasons (to name a few):
Incorrect results: Rounding or truncating coordinates to integers will give all kinds of weird/horrible/side effects.
Incompatibility with the multitude of iOS libraries.
A big waste of time.
Not faster.
Creating a messy code base (Knuth is right as Zaph brings in).
As always when trying to optimise: Take a step back and investigate if your current method/algorithm is the best choice (for possibly different scenario's in your application). This is the way to commonly massive improvement of hundreds of percents.

Why does varying float equality test fail in glsl?

If I have a varying float in my shader program:
varying highp float someFloat;
and in the vertex shader, I set it to something.
someFloat = 1.0;
why in my fragment shader does this comparison seem to return false?
someFloat == 1.0 // false
but this returns true?
someFloat > .0 // true
testing on openGL ES in an iPad mini.

It happens on any IEEE 754 floating point number. It is because of the nature of floating point representation. Any language that use the IEEE 754 format will encounter the same problem.
Since 1.0 may not be represented exactly in floating point system as 1.000000000... , hence it is considered dangerous to compare them using ==.
Floating point numbers should always be compared with an epsilon value .
Since floating point calculations involve a bit of uncertainty we can try to allow for this by seeing if two numbers are ‘close’ to each other. If you decide – based on error analysis, testing, or a wild guess – that the result should always be within 0.00001 of the expected result then you can change your comparison to this:
if (fabs(someFloat - 1.0)) < 0.00001)
The maximum error value is typically called epsilon.
Probably you should read What Every Computer Scientist Should Know About Floating-Point Arithmetic

Precision in Erlang

Next code gives me 5.999999999999998 in Result, but right answer is 6.
Alpha = math:acos((4*4 + 5*5 - 3*3) / (2*4*5))
Area = 1/2 * 4 * 5 * math:sin(Alpha)
Is it possible to get 6?

You have run into a problem so common that it has its own web site, What Every Programmer Should Know About Floating-Point Arithmetic. The problem is due to the way floating-point arithmetic works in pretty much every CPU on the market that supports FP arithmetic; it is not specific to Erlang.
If regular floating point arithmetic does not give you the precision or accuracy you need, you can use an arbitrary precision arithmetic library instead of the built-in arithmetic. Perhaps the most well-known such library is GMP, but you'd have to wrap it in NIFs to use it from Erlang.
There is at least one pure-Erlang alternative, but I have no experience with it, so I cannot personally endorse it.

The calculation is done using standard floating point arithmetic on your hardware. Sometimes rounding errors show up.
Do you really need 15 digits of precision?
To get a more "exact" value there are multiple options:
> round(Area). % Will round to integer
6
or you could round to some precision
round(Area * 10000000) / 10000000.
6.0
If the purpose is to print the value, then printing with the default output for floats give you less precision.
io:format("~f~n", [Area]).
6.000000
ok
or with a specific precision
io:format("~.14f~n", [Area]).
6.00000000000000
ok
HTH

Floating point accuracy in F# (and .NET)

In "F# for Scientists" Jon Harrop says:
Roughly speaking, values of type int approximate real
numbers between min-int and max-int with a constant absolute error of +- 1/2
whereas values of the type float have an approximately-constant relative error that
is a tiny fraction of a percent.
Now, what does it mean? Int type is inaccurate?
Why C# for (1 - 0.9) returns 0.1 but F# returns 0.099999999999978 ? Is C# more accurate and suitable for scientific calculations?
Should we use decimal values instead of double/float for scientific calculations?

For an arbitrary real number, either an integral type or a floating point type is only going to provide an approximation. The integral approximation will never be off by more than 0.5 in one direction or the other (assuming that the real number fits within the range of that integral type). The floating point approximation will never be off by more than a small percentage (again, assuming that the real is within the range of values supported by that floating point type). This means that for smaller values, floating point types will provide closer approximations (e.g. storing an approximation to PI in a float is going to be much more accurate than the int approximation 3). However, for very large values, the integral type's approximation will actually be better than the floating point type's (e.g. consider the value 9223372036854775806.7, which is only off by 0.3 when represented as 9223372036854775807 as a long, but which is represented by 9223372036854780000.000000 when stored as a float).
This is just an artifact of how you're printing the values out. 9/10 and 1/10 cannot be exactly represented as floating point values (because the denominator isn't a power of two), just as 1/3 can't be exactly written as a decimal (you get 0.333... where the 3's repeat forever). Regardless of the .NET language you use, the internal representation of this value is going to be the same, but different ways of printing the value may display it differently. Note that if you evaluate 1.0 - 0.9 in FSI, the result is displayed as 0.1 (at least on my computer).
What type you use in scientific calculations will depend on exactly what you're trying to achieve. Your answer is generally only going to be approximately accurate. How accurate do you need it to be? What are your performance requirements? I believe that the decimal type is actually a fixed point number, which may make it inappropriate for calculations involving very small or very large values. Note also that F# includes arbitrary precision rational numbers (with the BigNum type), which may also be appropriate depending on your input.

No, F# and C# uses the same double type. Floating point is almost always inexact. Integers are exact though.
UPDATE:
The reason why you are seeing a difference is due to the printing of the number, not the actual representation.

For the first point, I'd say it says that int can be used to represent any real number in the intger's range, with a constant maximum error in [-0,5, 0.5]. This makes sense. For instance, pi could be represented by the integer value 3, with an error smaller than 0.15.
Floating point numbers don't share this property; their maximum absolute error is not independent of the value you're trying to represent.

3 - This depends on calculations: sometimes float is a good choice, sometimes you can use int. But there are tasks when you lack of precision for any of float and decimal.
The reason against using int:
> 1/2;;
val it : int = 0
The reason against using float (also known as double in C#):
> (1E-10 + 1E+10) - 1E+10;;
val it : float = 0.0
The reason against BCL decimal:
> decimal 1E-100;;
val it : decimal = 0M
Every listed type has it's own drawbacks.

How to manually parse a floating point number from a string

Of course most languages have library functions for this, but suppose I want to do it myself.
Suppose that the float is given like in a C or Java program (except for the 'f' or 'd' suffix), for example "4.2e1", ".42e2" or simply "42". In general, we have the "integer part" before the decimal point, the "fractional part" after the decimal point, and the "exponent". All three are integers.
It is easy to find and process the individual digits, but how do you compose them into a value of type float or double without losing precision?
I'm thinking of multiplying the integer part with 10^n, where n is the number of digits in the fractional part, and then adding the fractional part to the integer part and subtracting n from the exponent. This effectively turns 4.2e1 into 42e0, for example. Then I could use the pow function to compute 10^exponent and multiply the result with the new integer part. The question is, does this method guarantee maximum precision throughout?
Any thoughts on this?

All of the other answers have missed how hard it is to do this properly. You can do a first cut approach at this which is accurate to a certain extent, but until you take into account IEEE rounding modes (et al), you will never have the right answer. I've written naive implementations before with a rather large amount of error.
If you're not scared of math, I highly recommend reading the following article by David Goldberg, What Every Computer Scientist Should Know About Floating-Point Arithmetic. You'll get a better understanding for what is going on under the hood, and why the bits are laid out as such.
My best advice is to start with a working atoi implementation, and move out from there. You'll rapidly find you're missing things, but a few looks at strtod's source and you'll be on the right path (which is a long, long path). Eventually you'll praise insert diety here that there are standard libraries.
/* use this to start your atof implementation */
/* atoi - christopher.watford#gmail.com */
/* PUBLIC DOMAIN */
long atoi(const char *value) {
unsigned long ival = 0, c, n = 1, i = 0, oval;
for( ; c = value[i]; ++i) /* chomp leading spaces */
if(!isspace(c)) break;
if(c == '-' || c == '+') { /* chomp sign */
n = (c != '-' ? n : -1);
i++;
}
while(c = value[i++]) { /* parse number */
if(!isdigit(c)) return 0;
ival = (ival * 10) + (c - '0'); /* mult/accum */
if((n > 0 && ival > LONG_MAX)
|| (n < 0 && ival > (LONG_MAX + 1UL))) {
/* report overflow/underflow */
errno = ERANGE;
return (n > 0 ? LONG_MAX : LONG_MIN);
}
}
return (n>0 ? (long)ival : -(long)ival);
}

The "standard" algorithm for converting a decimal number to the best floating-point approximation is William Clinger's How to read floating point numbers accurately, downloadable from here. Note that doing this correctly requires multiple-precision integers, at least a certain percentage of the time, in order to handle corner cases.
Algorithms for going the other way, printing the best decimal number from a floating-number, are found in Burger and Dybvig's Printing Floating-Point Numbers Quickly and Accurately, downloadable here. This also requires multiple-precision integer arithmetic
See also David M Gay's Correctly Rounded Binary-Decimal and Decimal-Binary Conversions for algorithms going both ways.

I would directly assemble the floating point number using its binary representation.
Read in the number one character after another and first find all digits. Do that in integer arithmetic. Also keep track of the decimal point and the exponent. This one will be important later.
Now you can assemble your floating point number. The first thing to do is to scan the integer representation of the digits for the first set one-bit (highest to lowest).
The bits immediately following the first one-bit are your mantissa.
Getting the exponent isn't hard either. You know the first one-bit position, the position of the decimal point and the optional exponent from the scientific notation. Combine them and add the floating point exponent bias (I think it's 127, but check some reference please).
This exponent should be somewhere in the range of 0 to 255. If it's larger or smaller you have a positive or negative infinite number (special case).
Store the exponent as it into the bits 24 to 30 of your float.
The most significant bit is simply the sign. One means negative, zero means positive.
It's harder to describe than it really is, try to decompose a floating point number and take a look at the exponent and mantissa and you'll see how easy it really is.
Btw - doing the arithmetic in floating point itself is a bad idea because you will always force your mantissa to be truncated to 23 significant bits. You won't get a exact representation that way.

You could ignore the decimal when parsing (except for its location). Say the input was:
156.7834e10... This could easily be parsed into the integer 1567834 followed by e10, which you'd then modify to e6, since the decimal was 4 digits from the end of the "numeral" portion of the float.
Precision is an issue. You'll need to check the IEEE spec of the language you're using. If the number of bits in the Mantissa (or Fraction) is larger than the number of bits in your Integer type, then you'll possibly lose precision when someone types in a number such as:
5123.123123e0 - converts to 5123123123 in our method, which does NOT fit in an Integer, but the bits for 5.123123123 may fit in the mantissa of the float spec.
Of course, you could use a method that takes each digit in front of the decimal, multiplies the current total (in a float) by 10, then adds the new digit. For digits after the decimal, multiply the digit by a growing power of 10 before adding to the current total. This method seems to beg the question of why you're doing this at all, however, as it requires the use of the floating point primitive without using the readily available parsing libraries.
Anyway, good luck!

Yes, you can decompose the construction into floating point operations as long as these operations are EXACT, and you can afford a single final inexact operation.
Unfortunately, floating point operations soon become inexact, when you exceed precision of mantissa, the results are rounded. Once a rounding "error" is introduced, it will be cumulated in further operations...
So, generally, NO, you can't use such naive algorithm to convert arbitrary decimals, this may lead to an incorrectly rounded number, off by several ulp of the correct one, like others have already told you.
BUT LET'S SEE HOW FAR WE CAN GO:
If you carefully reconstruct the float like this:
if(biasedExponent >= 0)
return integerMantissa * (10^biasedExponent);
else
return integerMantissa / (10^(-biasedExponent));
there is a risk to exceed precision both when cumulating the integerMantissa if it has many digits, and when raising 10 to the power of biasedExponent...
Fortunately, if first two operations are exact, then you can afford a final inexact operation * or /, thanks to IEEE properties, the result will be rounded correctly.
Let's apply this to single precision floats which have a precision of 24 bits.
10^8 > 2^24 > 10^7
Noting that multiple of 2 will only increase the exponent and leave the mantissa unchanged, we only have to deal with powers of 5 for exponentiation of 10:
5^11 > 2^24 > 5^10
Though, you can afford 7 digits of precision in the integerMantissa and a biasedExponent between -10 and 10.
In double precision, 53 bits,
10^16 > 2^53 > 10^15
5^23 > 2^53 > 5^22
So you can afford 15 decimal digits, and a biased exponent between -22 and 22.
It's up to you to see if your numbers will always fall in the correct range... (If you are really tricky, you could arrange to balance mantissa and exponent by inserting/removing trailing zeroes).
Otherwise, you'll have to use some extended precision.
If your language provides arbitrary precision integers, then it's a bit tricky to get it right, but not that difficult, I did this in Smalltalk and blogged about it at http://smallissimo.blogspot.fr/2011/09/clarifying-and-optimizing.html and http://smallissimo.blogspot.fr/2011/09/reviewing-fraction-asfloat.html
Note that these are simple and naive implementations. Fortunately, libc is more optimized.

My first thought is to parse the string into an int64 mantissa and an int decimal exponent using only the first 18 digits of the mantissa. For example, 1.2345e-5 would be parsed into 12345 and -9. Then I would keep multiplying the mantissa by 10 and decrementing the exponent until the mantissa was 18 digits long (>56 bits of precision). Then I would look the decimal exponent up in a table to find a factor and binary exponent that can be used to convert the number from decimal n*10^m to binary p*2^q form. The factor would be another int64 so I'd multiply the mantissa by it such that I obtained the top 64-bits of the resulting 128-bit number. This int64 mantissa can be cast to a float losing only the necessary precision and the 2^q exponent can be applied using multiplication with no loss of precision.
I'd expect this to be very accurate and very fast but you may also want to handle the special numbers NaN, -infinity, -0.0 and infinity. I haven't thought about the denormalized numbers or rounding modes.

For that you have to understand the standard IEEE 754 in order for proper binary representation. After that you can use Float.intBitsToFloat or Double.longBitsToDouble.
http://en.wikipedia.org/wiki/IEEE_754

If you want the most precise result possible, you should use a higher internal working precision, and then downconvert the result to the desired precision. If you don't mind a few ULPs of error, then you can just repeatedly multiply by 10 as necessary with the desired precision. I would avoid the pow() function, since it will produce inexact results for large exponents.

It is not possible to convert any arbitrary string representing a number into a double or float without losing precision. There are many fractional numbers that can be represented exactly in decimal (e.g. "0.1") that can only be approximated in a binary float or double. This is similar to how the fraction 1/3 cannot be represented exactly in decimal, you can only write 0.333333...
If you don't want to use a library function directly why not look at the source code for those library functions? You mentioned Java; most JDKs ship with source code for the class libraries so you could look up how the java.lang.Double.parseDouble(String) method works. Of course something like BigDecimal is better for controlling precision and rounding modes but you said it needs to be a float or double.

Using a state machine. It's fairly easy to do, and even works if the data stream is interrupted (you just have to keep the state and the partial result). You can also use a parser generator (if you're doing something more complex).

I agree with terminus. A state machine is the best way to accomplish this task as there are many stupid ways a parser can be broken. I am working on one now, I think it is complete and it has I think 13 states.
The problem is not trivial.
I am a hardware engineer interested designing floating point hardware. I am on my second implementation.
I found this today http://speleotrove.com/decimal/decarith.pdf
which on page 18 gives some interesting test cases.
Yes, I have read Clinger's article, but being a simple minded hardware engineer, I can't get my mind around the code presented. The reference to Steele's algorithm as asnwered in Knuth's text was helpful to me. Both input and output are problematic.
All of the aforementioned references to various articles are excellent.
I have yet to sign up here just yet, but when I do, assuming the login is not taken, it will be broh. (broh-dot).
Clyde

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart