I've searched a lot about CGFloat, Float and Double in Swift, but I still don't fully understand them.
From the Apple docs: https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/TheBasics.html
Double represents a 64-bit floating-point number.
Float represents a 32-bit floating-point number.
So on a 32-bit CPU architecture, my personal understanding is that Float is the maximum accuracy available in Swift and there is no Double, isn't it? Or do they have to do something like a pseudo-Double? I ask because Apple recommends using Double over Float.
For CGFloat: https://developer.apple.com/library/content/documentation/Cocoa/Conceptual/Cocoa64BitGuide/64BitChangesCocoa/64BitChangesCocoa.html#//apple_ref/doc/uid/TP40004247-CH4-SW9
Floating point quantities in the Core Graphics framework (Quartz), which are float on 32-bit architectures, are being expanded to double to provide a wider range and accuracy for graphical quantities. Core Graphics declares a new type for floating-point quantities, CGFloat, and declares it conditionally for both 32-bit and 64-bit
Does it mean that Apple wants to take advantage of the CPU architecture, 32- or 64-bit, to maximize accuracy?
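For reference, the conditional declaration that the second quote describes boils down to something like the following sketch (a simplification, not the verbatim Core Graphics header; the CGFloatLike name is made up for illustration):

// Simplified sketch of a conditionally sized floating-point typedef,
// in the spirit of Core Graphics' CGFloat (not the actual CGBase.h text).
#if defined(__LP64__) && __LP64__
typedef double CGFloatLike;   /* 64-bit architectures: the type is a double */
#else
typedef float  CGFloatLike;   /* 32-bit architectures: the type is a float  */
#endif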
Related
While writing a custom floating-point number parser (for speed reasons) and checking its precision against strtod (which I assume to be extremely accurate), I found that the naive approach of using
number = (int_part + dec_part/pow(10., no_of_decs)) * pow(10., expo)
sometimes seems to be "more accurate" (when the computation is done using long double and the result is then converted back to a double) than the strtod result, and that is surprising.
Do the official IEEE 754 parsing rules actually mandate a less accurate result?
For example with the string
708530856168225829.3221614e9
the naive computation gives
7.08530856168225761e+26
that seems closer than result of strtod
7.08530856168225898e+26
to the "theoretical" result (that cannot be represented exactly by a 64-bit double)
7.085308561682258293221614e+26
(Experiments were done with g++ (GCC) 10.2.0 and clang++ 11.1.0 on Arch Linux, and they both agree on ...898e+26 for strtod and ...761e+26 for the naive computation.)
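For anyone who wants to reproduce the comparison, here is a minimal C++ sketch of the two approaches. The hard-coded decomposition only mirrors the formula above, and the long double path assumes an x86-style 80-bit long double (results may differ where long double is the same as double):

#include <cstdio>
#include <cstdlib>
#include <cmath>

int main() {
    const char *s = "708530856168225829.3221614e9";

    // Library conversion.
    double via_strtod = std::strtod(s, nullptr);

    // Naive decomposition of the same literal: integer part, decimal part,
    // number of decimals (7), explicit exponent (9).
    long double int_part = 708530856168225829.0L;
    long double dec_part = 3221614.0L;
    long double naive = (int_part + dec_part / std::pow(10.0L, 7)) * std::pow(10.0L, 9);

    std::printf("strtod: %.17g\n", via_strtod);
    std::printf("naive : %.17g\n", (double)naive);
}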
As you note, 7.085308561682258293221614e+26 is not representable in IEEE-754 double precision (binary64). Therefore, it is not a candidate result and plays no role in determining the result.
The two binary64-representable numbers closest to 708530856168225829.3221614e9 are 708530856168225760595673088 and 708530856168225898034626560. Writing the original out fully and lining them up for inspection, with the original in the middle, we have:
708530856168225760595673088 representable value below original
708530856168225829322161400 original number
708530856168225898034626560 representable value above original
Subtracting gives the absolute differences between the lower and the original and between the original and the higher:
68726488312 distance to lower
68712465160 distance to higher
and therefore the higher number, 708530856168225898034626560, is closer to the original. This is in fact the result you report, and therefore the software is behaving correctly.
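One way to check this on your own machine is to look at the parsed value's neighbouring doubles. A small sketch (it relies on printf producing the exact decimal expansion at the requested precision, as glibc does):

#include <cstdio>
#include <cstdlib>
#include <cmath>

int main() {
    double parsed = std::strtod("708530856168225829.3221614e9", nullptr);
    double below  = std::nextafter(parsed, 0.0);       // representable value just below
    double above  = std::nextafter(parsed, HUGE_VAL);  // representable value just above

    // 27 significant digits is enough to show these integers exactly.
    std::printf("below : %.27g\n", below);
    std::printf("parsed: %.27g\n", parsed);
    std::printf("above : %.27g\n", above);
}

The original string's value falls between two adjacent values in this list, and a correctly rounding strtod returns the nearer of those two, which is parsed itself.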
Observe that it is a mistake to think of binary64 in decimal without all significant digits. Writing out the partial decimal numerals as we did the full numbers above, we have:
7.08530856168225761e+26 proposed result
708530856168225829.3221614e9 original number
7.08530856168225898e+26 reported result of strtod
with differences:
68322161400 distance to lower
68677838600 distance to higher
Thus, rounding the actual values of the floating-point numbers to decimal numerals without all the digits introduced errors and portrayed incorrect values. Binary floating-point numbers are not and do not represent decimal numerals, and displaying them without all significant digits shows incorrect values.
The value 708530856168225829.3221614e9 is between 2 doubles:
7.08530856168225 760 59567309...e+26 // lower double
7.08530856168225 829 31514982...e+26 // half way
7 08530856168225 829.3221614e9 // OP's code
7.08530856168225 898 03462656...e+26 // upper double
1 23456789012345 678 90 // Significant digit count
It is very nearly halfway between those 2 doubles.
In this case I say 7.08530856168225 898 03462656...e+26 from strtod() is the better answer, and OP's naïve computation is inferior due to the cumulative rounding errors injected by the division, multiplication and addition.
Note: IEEE 754 parsing does not require infinite precision when parsing text. It is required to use at least N+3 significant decimal digits (I believe N == 17 for binary64, AKA double).
When using truncated text for the conversion, the answer may differ from using more digits in nearly-halfway cases. Still, in this case, even truncating to 20 digits, the upper double is the better choice.
Alright, so this has been bugging me for a while now, and I could not find anything on MSDN that goes into the specifics that I need.
This is more of a 3 part question, so here it goes:
1-) When creating the swapchain, applications specify backbuffer pixel formats, most often either B8G8R8A8 or R8G8B8A8. This gives 8 bits per color channel, so a total of 4 bytes is used per pixel... so why does the pixel shader have to return a color as a float4 when float4 is actually 16 bytes?
2-) When binding textures to the Pixel Shader, my textures are in DXGI_FORMAT_B8G8R8A8_UNORM format, so why does the sampler need a float4 per pixel to work?
3-) Am I missing something here? Am I overthinking this, or what?
Please provide links to support your claim, preferably from MSDN!
GPUs are designed to perform calculations on 32-bit floating-point data, at least if they want to support D3D11. As of D3D10 you can also perform 32-bit signed and unsigned integer operations. There's no requirement or language support for types smaller than 4 bytes in HLSL, so there's no "byte/char" or "short" for 1- and 2-byte integers or lower-precision floating point.
Any DXGI formats that use the "FLOAT", "UNORM" or "SNORM" suffix are non-integer formats, while "UINT" and "SINT" are unsigned and signed integer. Any reads performed by the shader on the first three types will be provided to the shader as 32 bit floating point irrespective of whether the original format was 8 bit UNORM/SNORM or 10/11/16/32 bit floating point. Data in vertices is usually stored at a lower precision than full-fat 32bit floating point to save memory, but by the time it reaches the shader it has already been converted to 32bit float.
On output (to UAVs or Render Targets) the GPU compresses the "float" or "uint" data to whatever format the target was created with. If you try outputting float4(4.4, 5.5, 6.6, 10.1) to a target that is 8-bit normalised then it'll simply be clamped to (1.0, 1.0, 1.0, 1.0) and only consume 4 bytes per pixel.
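To make the write-side conversion concrete, here is a rough sketch of what storing a shader output into an 8-bit UNORM target amounts to (my own approximation of the idea, not the exact D3D-specified conversion rules):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

// Approximate float -> 8-bit UNORM: clamp to [0,1], scale to 0..255, round.
static uint8_t FloatToUnorm8(float v) {
    v = std::clamp(v, 0.0f, 1.0f);
    return static_cast<uint8_t>(std::lround(v * 255.0f));
}

// Read side: 8-bit UNORM -> the 32-bit float the shader actually sees.
static float Unorm8ToFloat(uint8_t u) {
    return static_cast<float>(u) / 255.0f;
}

int main() {
    // The float4(4.4, 5.5, 6.6, 10.1) example above saturates to 1.0 per channel:
    const float out[4] = {4.4f, 5.5f, 6.6f, 10.1f};
    for (float c : out) {
        uint8_t stored = FloatToUnorm8(c);
        std::printf("%g -> %u -> %g\n", c, (unsigned)stored, Unorm8ToFloat(stored));
    }
}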
So to answer your questions:
1) Because shaders only operate on 32 bit types, but the GPU will compress/truncate your output as necessary to be stored in the resource you currently have bound according to its type. It would be madness to have special keywords and types for every format that the GPU supported.
2) The "sampler" doesn't "need a float4 per pixel to work". I think you're mixing your terminology. The declaration that the texture is a Texture2D<float4> is really just stating that this texture has four components and is of a format that is not an integer format. "float" doesn't necessarily mean the source data is 32 bit float (or actually even floating point) but merely that the data has a fractional component to it (eg 0.54, 1.32). Equally, declaring a texture as Texture2D<uint4> doesn't mean that the source data is 32 bit unsigned necessarily, but more that it contains four components of unsigned integer data. However, the data will be returned to you and converted to 32 bit float or 32 bit integer for use inside the shader.
3) You're missing the fact that the GPU decompresses textures / vertex data on reads and compresses it again on writes. The amount of storage used for your vertices/texture data is only as much as the format that you create the resource in, and has nothing to do with the fact that the shader is operating on 32 bit floats / integers.
Based on the arguments in this post: Performance of Built-in types, can I conclude that my custom implementation of an int-based point structure is faster or more efficient than the float-based CGPoint? I have reviewed many posts concerning type performance differences but have not found one that covers scenarios where the types are further wrapped in a structure.
Thanks.
// Coord
typedef struct {
    int x;
    int y;
} Coord;

CG_INLINE Coord CoordMake(int x, int y) {
    Coord coord;
    coord.x = x;
    coord.y = y;
    return coord;
}

CG_INLINE bool CoordEqualToCoord(Coord coord, Coord anotherCoord) {
    return coord.x == anotherCoord.x && coord.y == anotherCoord.y;
}

CG_INLINE CGPoint CGPointForCoord(Coord coord) {
    return CGPointMake(coord.x, coord.y);
}
EDIT: I have done purely arithmetical tests and the differences are really negligible until millions of iterations, which my application will not come close to doing. I will continue to use the Coord typedef but will remove the struct for a few of the reasons @meaning-matters suggests. For the record, the tests did show that the int-based structure was about 30% faster, but 30% of 0.0001 seconds is not really something anyone should care about. I am still interested in the points and counter-points on which implementation is better.
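For reference, the kind of arithmetic-only timing test mentioned in the edit can be sketched roughly like this (my own C++ harness, not the OP's actual test; it is only meaningful when built with optimizations, and the absolute numbers vary by compiler and CPU):

#include <chrono>
#include <cstdio>

struct CoordI { int x, y; };      // int-based point
struct CoordD { double x, y; };   // CGFloat-style point (double on 64-bit)

int main() {
    const int iterations = 10000000;

    CoordI ci{1, 0};
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        ci.x += 1;
        ci.y ^= ci.x * 3;   // XOR keeps the checksum bounded (no overflow)
    }
    auto t1 = std::chrono::steady_clock::now();

    CoordD cd{1.0, 0.0};
    for (int i = 0; i < iterations; ++i) {
        cd.x += 1.0;
        cd.y += cd.x * 3.0;
    }
    auto t2 = std::chrono::steady_clock::now();

    // The checksums are printed only to stop the compiler from removing the loops.
    using us = std::chrono::microseconds;
    std::printf("int:    %lld us (checksum %d)\n",
                (long long)std::chrono::duration_cast<us>(t1 - t0).count(), ci.y);
    std::printf("double: %lld us (checksum %g)\n",
                (long long)std::chrono::duration_cast<us>(t2 - t1).count(), cd.y);
}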
It depends on what you are doing with it. For ordinary arithmetic, throughput can be similar. Integer latency is usually a bit lower. On some processors, the latency to L1 is better for GPRs than for FPRs. So, for many tests, the results will come out the same or give a small edge to integer computation. The balance will flip the other way for double vs. int64_t computation on 32-bit machines. (If you are writing CPU vector code and can get away with 16-bit computation, then it would be much faster to use integer.)
However, in the case of calculating coordinates/addresses for the purpose of loading or storing data into/from a register, integer is clearly better on a CPU. The reason is that a load or store instruction can take an integer operand as an index into an array, but not a floating-point one. To use floating-point coordinates, you at minimum have to convert to integer first, then load or store, so it should always be slower. Typically, there will also have to be some rounding mode applied as well (e.g. a floor() operation) and maybe some non-trivial operation to account for edging modes, such as a CL_ADDRESS_REPEAT addressing mode. Contrast that to a simple AND operation, which may be all that is necessary to achieve the same thing with integers, and it should be clear that integer is a much cheaper format.
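A tiny illustration of the addressing point (my own example, not from the original answer):

#include <cmath>

// Integer coordinate: the index feeds the load directly.
float load_int(const float *row, int x) {
    return row[x];
}

// Floating-point coordinate: round/convert first (plus whatever the edging
// mode requires), then load. Here the wrap assumes width is a power of two.
float load_float(const float *row, float x, int width) {
    int ix = (int)std::floor(x);   // pick a rounding rule
    ix &= (width - 1);             // e.g. a repeat/wrap addressing mode
    return row[ix];
}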
On GPUs, which emphasize floating-point computation a bit more and may not invest much in integer computation (even though it is easier), the story is quite different. There you can expect texture unit hardware to use the floating point value directly to find the required data. The floating point arithmetic to find the right coordinate is built in to the hardware and therefore "free" (if we ignore energy consumption considerations) and graphics APIs like GL or CL are built around it.
Generally speaking, though ubiquitous in graphics APIs, floating point itself is a numerically inferior choice for a coordinate system for sampled data. It lumps too much precision in one corner of the image and may cause quantization errors / inconsistencies at the far corners of the image, leading to reduced precision for linear sampling and unexpected rounding effects. For large enough images, some pixels in the image may become unaddressable by the coordinate system, because no floating-point number exists which references that position. It is probably the case that the default rounding mode, round-to-nearest-ties-to-even, is undesirable for coordinate systems, because linear filtering will often place the coordinate halfway between two integer values, resulting in a round up for even pixels and a round down for odd ones. This causes pixel duplication rather than the expected result in the worst case, where they are all halfway cases and the stride is 1. Floating point is nice in that it is somewhat easier to use.
A fixed-point coordinate system allows for consistent coordinate precision and rounding across the entire surface and will avoid these problems. Modulo overflow feeds nicely into some common edging modes. Precision is predictable.
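A sketch of what such a fixed-point coordinate might look like (my own illustration; the 16.16 split is an arbitrary choice):

#include <cstdint>

// 16.16 fixed point: 16 integer bits, 16 fractional bits.
// Precision is the same (1/65536) everywhere in the image, and wrap-around
// for a repeat addressing mode is a single mask.
typedef int32_t Fixed16_16;

static inline Fixed16_16 FixedFromFloat(float f) {
    return (Fixed16_16)(f * 65536.0f);
}

static inline int32_t FixedWhole(Fixed16_16 v)    { return v >> 16; }      // integer pixel
static inline int32_t FixedFraction(Fixed16_16 v) { return v & 0xFFFF; }   // weight for linear filtering

// Repeat addressing for a power-of-two width: modulo overflow is just a mask.
static inline int32_t WrapRepeat(int32_t x, int32_t width_pow2) {
    return x & (width_pow2 - 1);
}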
Confirmed by a quick search: 32-bit int and float operations seem equally fast on ARM processors (taking 1 CPU cycle each). Please look for yourself and do a simple test, as Zev Eisenberg correctly suggests.
Then it's not a good idea to start writing your own CGPoint stuff using ints for the following reasons (to name a few):
Incorrect results: rounding or truncating coordinates to integers will give all kinds of weird/horrible side effects.
Incompatibility with the multitude of iOS libraries.
A big waste of time.
Not faster.
Creating a messy code base (Knuth is right, as Zaph brings up).
As always when trying to optimise: take a step back and investigate whether your current method/algorithm is the best choice (for the possibly different scenarios in your application). That is commonly the way to massive improvements of hundreds of percent.
I code on XNA and only have access to shader model 3, hence no bit-shift operators. I need to pack two random 16-bit floating-point variables (meaning NOT in the range [0,1] but ANY RANDOM FLOAT VARIABLE) into two 8-bit variables. There is no way to normalize them.
I thought about doing the bit-shifting manually, but I can't find a good article on how to convert a random decimal float (not [0,1]) into binary and back.
Thanks
This is not really a good idea - a 16-bit float already has very limited range and precision. Remember that 8 bits leaves you with just 256 possible values!
Getting an 8-bit value into a shader is trivial. As a colour is one method. You can use each channel as a normalised range, from 0 to 1.
Of course, you say you don't want to normalise your values. So I assume you want to maintain the nice floating-point property of a wide range with better precision closer to zero.
(Now would be a good time to read some background info on floating-point. Especially about half-precision floating-point and minifloats and microfloats.)
One way to do that would be to encode your values using a logarithm and an exponent (to encode and decode, respectively). This is basically exactly what the floating-point format itself does. The exact maths will depend on the precision and the range that you desire (which 256 values will you represent?), so I will leave it as an exercise.
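A rough sketch of that logarithmic encode/decode idea (my own construction, shown in C++ for illustration; the range constants are arbitrary and sign/zero handling is omitted, so you would tune all of this to the 256 values you actually care about):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Encode a positive value in [kMin, kMax] into 8 bits on a logarithmic scale,
// and decode it back. Precision is relative, like a floating-point format.
static const float kMin = 1.0f / 1024.0f;   // arbitrary lower bound
static const float kMax = 1024.0f;          // arbitrary upper bound

uint8_t encode8(float v) {
    v = std::min(std::max(v, kMin), kMax);
    float t = std::log(v / kMin) / std::log(kMax / kMin);   // map to 0..1
    return (uint8_t)std::lround(t * 255.0f);
}

float decode8(uint8_t b) {
    float t = b / 255.0f;
    return kMin * std::exp(t * std::log(kMax / kMin));      // inverse mapping
}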
In "F# for Scientists" Jon Harrop says:
Roughly speaking, values of type int approximate real numbers between min-int and max-int with a constant absolute error of ±1/2, whereas values of the type float have an approximately-constant relative error that is a tiny fraction of a percent.
Now, what does that mean? Is the int type inaccurate?
Why does C# return 0.1 for (1 - 0.9) while F# returns 0.099999999999978? Is C# more accurate and more suitable for scientific calculations?
Should we use decimal values instead of double/float for scientific calculations?
For an arbitrary real number, either an integral type or a floating point type is only going to provide an approximation. The integral approximation will never be off by more than 0.5 in one direction or the other (assuming that the real number fits within the range of that integral type). The floating point approximation will never be off by more than a small percentage (again, assuming that the real is within the range of values supported by that floating point type). This means that for smaller values, floating point types will provide closer approximations (e.g. storing an approximation to PI in a float is going to be much more accurate than the int approximation 3). However, for very large values, the integral type's approximation will actually be better than the floating point type's (e.g. consider the value 9223372036854775806.7, which is only off by 0.3 when represented as 9223372036854775807 as a long, but which is represented by 9223372036854780000.000000 when stored as a float).
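A quick C++ illustration of those two kinds of error (the same idea, just outside .NET; the printed digits depend on the formatting you ask for):

#include <cstdio>

int main() {
    // Small value: a float's relative error beats an int's +/-0.5 absolute error.
    int   pi_int   = 3;                   // off from pi by about 0.14
    float pi_float = 3.14159265358979f;   // off by roughly 1e-7

    // Huge value: a 64-bit int's +/-0.5 absolute error beats a float's relative error.
    long long big_int   = 9223372036854775807LL;   // off from ...806.7 by 0.3
    float     big_float = 9223372036854775806.7f;  // rounds to 2^63, off by about 1.3

    std::printf("pi  as int:    %d\n", pi_int);
    std::printf("pi  as float:  %.9g\n", pi_float);
    std::printf("big as int64:  %lld\n", big_int);
    std::printf("big as float:  %.1f\n", big_float);
}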
This is just an artifact of how you're printing the values out. 9/10 and 1/10 cannot be exactly represented as floating point values (because the denominator isn't a power of two), just as 1/3 can't be exactly written as a decimal (you get 0.333... where the 3's repeat forever). Regardless of the .NET language you use, the internal representation of this value is going to be the same, but different ways of printing the value may display it differently. Note that if you evaluate 1.0 - 0.9 in FSI, the result is displayed as 0.1 (at least on my computer).
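The same effect is easy to reproduce outside .NET. For example, in C++ the stored bits of 1.0 - 0.9 are fixed; only the precision you request when printing changes what you see:

#include <cstdio>

int main() {
    double d = 1.0 - 0.9;
    std::printf("%.1f\n",  d);   // prints 0.1
    std::printf("%.17g\n", d);   // prints the same stored value with full digits
}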
What type you use in scientific calculations will depend on exactly what you're trying to achieve. Your answer is generally only going to be approximately accurate. How accurate do you need it to be? What are your performance requirements? I believe that the decimal type is actually a fixed point number, which may make it inappropriate for calculations involving very small or very large values. Note also that F# includes arbitrary precision rational numbers (with the BigNum type), which may also be appropriate depending on your input.
No, F# and C# use the same double type. Floating point is almost always inexact. Integers are exact, though.
UPDATE:
The difference you are seeing is due to the printing of the number, not the actual representation.
For the first point, I'd say it means that int can be used to represent any real number in the integer's range, with a constant maximum error in [-0.5, 0.5]. This makes sense. For instance, pi could be represented by the integer value 3, with an error smaller than 0.15.
Floating point numbers don't share this property; their maximum absolute error is not independent of the value you're trying to represent.
3 - This depends on the calculations: sometimes float is a good choice, sometimes you can use int. But there are tasks where you lack precision with both float and decimal.
The reason against using int:
> 1/2;;
val it : int = 0
The reason against using float (also known as double in C#):
> (1E-10 + 1E+10) - 1E+10;;
val it : float = 0.0
The reason against BCL decimal:
> decimal 1E-100;;
val it : decimal = 0M
Every listed type has its own drawbacks.