How are numbers represented in a computer, and what is the role of floating point and two's complement?

I have a very general question about how computers work with numbers.
In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter whether the number represented is an int or a float.
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this purely a matter for compilers (C/C++, ...) and VMs (.NET/Java)?
Is it true that all integers are represented using two's complement?
I have read about CPUs that use co-processors to perform floating-point arithmetic. To tell a CPU to use one, special assembler instructions exist, such as add.s (single precision) and add.d (double precision). When I have some C++ code that uses a float, will such assembler instructions be in the output?
I am totally confused at the moment. It would be great if you could help me with this.
Thank you!
Stefan

In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter whether the number represented is an int or a float.
This is correct for the representation in memory. But computers execute instructions and keep the data they are currently working on in registers. On a typical computer, both instructions and registers are specialized: some for signed integers in two's complement representation, others for IEEE 754 binary32 and binary64 arithmetic.
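As a minimal sketch of the point that the bits themselves carry no type, the following C++20 snippet (a hypothetical example, assuming IEEE 754 binary32 floats, which virtually all current hardware uses) reinterprets a float's bit pattern as an integer:

    #include <bit>
    #include <cstdint>
    #include <cstdio>

    int main() {
        float f = 1.0f;
        // View the same 32 bits as an unsigned integer. IEEE 754 binary32
        // encodes 1.0f as sign 0, biased exponent 127, mantissa 0, which
        // is the bit pattern 0x3F800000.
        std::uint32_t bits = std::bit_cast<std::uint32_t>(f);
        std::printf("0x%08X\n", bits);  // prints 0x3F800000
    }

The memory holds only the 32 bits; whether they mean the float 1.0 or the integer 1065353216 is decided by the instructions you run on them.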
So to answer your first question:
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this purely a matter for compilers (C/C++, ...) and VMs (.NET/Java)?
Two's complement and IEEE 754 binary floating point are very much choices made by the instruction set architecture (ISA), which provides specialized instructions and registers to deal with these formats in particular.
Is it true that all integers are represented using two's complement?
You can represent integers however you want. But if you represent your signed integers using two's complement, the typical ISA will provide instructions to operate on them efficiently. If you make another choice, you will be on your own.
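As for the question about compiler output: yes, on hardware with a floating-point unit, a C++ float operation is typically compiled straight to a dedicated floating-point instruction. A hedged sketch (the exact mnemonic depends on the ISA; add.s is the MIPS form from the question, x86-64 uses addss, AArch64 uses fadd):

    // A plain single-precision addition in C++.
    float add(float a, float b) {
        return a + b;  // typically compiles to one hardware FP add,
                       // e.g. "addss xmm0, xmm1" on x86-64
    }

You can check this yourself by compiling with -S (GCC/Clang) and reading the generated assembly.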

Related

Performance cost of float ↔︎ half conversion in Metal

I have a Metal-based Core Image convolution kernel that was using half-precision variables to keep track of sums and weights. However, I have now found that the range of 16-bit half is not enough in some cases, which means I need 32-bit float for some variables.
Now I'm wondering what's more performant:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
or change all samplers and local vars to float type so that no conversion is necessary.
The latter would mean that all arithmetic is performed in 32-bit precision, though it is only needed for some operations.
Is there any documentation or benchmark I can run to find the cost of float ↔︎ half conversion in Metal?
I believe you should go with option A:
use half as much as possible (for the samplers and most local vars) and only convert to float when needed (which means quite a lot, inside the loop)
based on the discussion in the WWDC 2016 talk entitled "Advanced Metal Shader Optimization".
The section from around 17:17 to 18:58 is the relevant one for this topic. The speaker, Fiona, mentions a couple of important things:
A8 and later GPUs have 16-bit registers, which means that 32-bit floating-point formats (like float) use twice as many registers, and thus twice as much bandwidth, energy, and so on. So using half saves registers (which is always good) and energy.
On A8 and later GPUs, "data type conversions are typically free, even between float and half [emphasis added]." Fiona even poses the questions you might be asking yourself about all of those conversions, and says that it is still probably fast, because the conversions are free. Furthermore, according to the Metal Shading Language Specification Version 2.3 (p. 218):
For textures that have half-precision floating-point pixel color values, the conversions from half to float are lossless
so you don't have to worry about losing precision there either.
There are some other relevant points in that section worth looking into as well, but I believe this is enough to justify going with option A. A sketch of the resulting pattern follows.
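As a hedged illustration of that pattern (half for texture data and most locals, float only for the accumulators that need the range), here is a minimal Metal compute-kernel sketch; the kernel name and the 3x3 box filter are hypothetical, and boundary handling is omitted for brevity:

    #include <metal_stdlib>
    using namespace metal;

    // Hypothetical 3x3 box filter: read in half, accumulate in float.
    kernel void boxFilter(texture2d<half, access::read>  src [[texture(0)]],
                          texture2d<half, access::write> dst [[texture(1)]],
                          uint2 gid [[thread_position_in_grid]])
    {
        float sum = 0.0f;  // float accumulator: a long sum could overflow half's range
        for (int dy = -1; dy <= 1; ++dy) {
            for (int dx = -1; dx <= 1; ++dx) {
                half v = src.read(uint2(int2(gid) + int2(dx, dy))).r;  // half sample
                sum += float(v);  // half -> float conversion, typically free on A8+
            }
        }
        dst.write(half4(half(sum / 9.0f)), gid);  // back to half for storage
    }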

Precision of Q31 and SP for FFT ARM Cortex-M7

I would like to understand whether using fixed point Q31 is better than floating-point (single precision) for DSP applications where accuracy is important.
More details: I am currently working with an ARM Cortex-M7 microcontroller, and I need to perform an FFT with high accuracy using the CMSIS library. I understand that single precision has 24 bits of mantissa while Q31 has 31 bits, so the precision of Q31 should be better; but I read everywhere that for algorithms requiring multiplication and so on, the floating-point representation should be used, and I do not understand why.
Thanks in advance.
Getting maximum value out of fixed point (that extra 6 or 7 bits of mantissa accuracy), as well as avoiding a ton of possible underflow and overflow problems, requires knowing precisely the bounds (min and max) of every arithmetic operation in your CMSIS algorithms for every valid set of input data.
In practice, a complete error analysis turns out to be difficult, and the added operations needed to rescale all intermediate values to optimal ranges reduce performance so much that only a narrow set of cases seems worth the effort over using either IEEE single or double precision, which the M7 supports in hardware, and whose floating-point exponent range hides an enormous amount (but not all!) of the intermediate-result scaling issues.
But for some simpler DSP algorithms, analyzing and fixing the scaling isn't a problem. It is hard to tell which without analyzing the numeric range of every arithmetic operation in the algorithm you need. And sometimes the work required to use integer arithmetic has to be done anyway, because the available processor doesn't support floating-point arithmetic well, or at all.
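To make the scaling bookkeeping concrete, here is a hedged C++ sketch of two basic Q31 primitives (the function names are hypothetical; CMSIS-DSP ships optimized, saturating versions of these operations):

    #include <cstdint>

    // A Q31 value x represents the real number x / 2^31, so x lies in [-1, 1).

    // Multiply: the 64-bit product has 62 fractional bits, so shift right
    // by 31 to return to Q31. Truncation discards the low bits (rounding
    // error), and the one corner case INT32_MIN * INT32_MIN would need
    // saturation in production code.
    int32_t q31_mul(int32_t a, int32_t b) {
        int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
        return static_cast<int32_t>(wide >> 31);
    }

    // Addition can silently wrap: you must prove |a + b| < 1 for all valid
    // inputs, or pre-scale. Halving both inputs is the usual fix in an FFT
    // butterfly, at the cost of one bit of precision per stage.
    int32_t q31_add_halved(int32_t a, int32_t b) {
        return (a >> 1) + (b >> 1);
    }

That per-stage halving is exactly the rescaling described above: across the roughly log2(N) stages of an N-point FFT it gives up a bit of headroom per stage, which is why the raw 31-vs-24-bit comparison overstates the fixed-point advantage.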

What are real numbers in Dafny?

What are real numbers in Dafny? Are they represented as IEEE 754-2008 floating-point numbers? If not, what are they? That is, what is the specification of the real type in Dafny?
Dafny's real numbers are not floating point numbers.
From a verification perspective, they are the mathematical real numbers, and Dafny reasons about them using Z3's theory of real arithmetic.
From a compilation perspective, Dafny actually compiles them to BigRationals (arbitrary-precision rationals), which is made possible by the fact that Dafny doesn't have any built-in operations for creating irrational real numbers.

Lua: subtracting decimal numbers doesn't return correct precision

I am using Lua 5.1
print(10.08 - 10.07)
Rather than printing 0.01, the above prints 0.0099999999999998.
Any idea how to get 0.01 from this subtraction?
You did get (very nearly) 0.01 from the subtraction; it just comes with a tiny amount of lost precision.
Lua uses the C type double to represent numbers. This is, on nearly every platform you will use, a 64-bit binary floating-point value with about 16 significant decimal digits of precision. However, no amount of binary precision is sufficient to represent 0.01 exactly, because its binary expansion repeats forever. The situation is analogous to trying to write 1/3 exactly in decimal.
Furthermore, you are subtracting two values that are very close in magnitude, which by itself causes an additional loss of significant digits (this is known as catastrophic cancellation).
The solution depends on what your context is. If you are doing accounting, then I would strongly recommend that you not use floating point values to represent account values because these small errors will accumulate and eventually whole cents (or worse) will appear or disappear. It is much better to store accounts in integer cents, and divide by 100 for display.
In general, the answer is to be aware of the issues that are inherent to floating point, and one of them is this sort of small loss of precision. It is easily handled by rounding answers to an appropriate number of digits for display, and never comparing results of calculations for equality.
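To see the stored values directly, here is a hedged C++ sketch (C++ rather than Lua because Lua 5.1's number type is exactly the C double underneath, and printf exposes the digits; it assumes IEEE 754 binary64, which nearly every platform uses):

    #include <cstdio>

    int main() {
        // Lua 5.1 prints numbers with the C format "%.14g"; using the same
        // format here reproduces the output from the question.
        std::printf("%.14g\n", 10.08 - 10.07);  // 0.0099999999999998
        // With more digits, even the literal 0.01 is visibly inexact: the
        // nearest double is 0.010000000000000000208...
        std::printf("%.20g\n", 0.01);
    }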
Some resources for background:
The semi-official explanation at the Lua Users Wiki
This great page of IEEE floating point calculators where you can enter values in decimal and see how they are represented, and vice-versa.
The Wikipedia article on IEEE floating point.
The Wikipedia article on floating-point numbers in general.
What Every Computer Scientist Should Know About Floating-Point Arithmetic is the most complete discussion of the fine details.
Edit: Added the WECSSKAFPA document after all. It really is worth studying, although it will likely seem a bit overwhelming on the first pass. In the same vein, Knuth Volume 2 has extensive coverage of arithmetic in general and a large section on floating point. And, since lhf reminded me of its existence, I inserted the Lua wiki explanation of why floating point is OK to use as the sole numeric type as the first item on the list.
Use Lua's string.format function:
print(string.format('%.2f', 10.08 - 10.07))
Use an arbitrary precision library.
Use a Decimal Number Library for Lua.

Digital complements

What is the application of complements such as 1's complement and 2's complement?
To express negative numbers in binary format.
From Wikipedia:
The two's complement of the number then behaves like the negative of the original number in most arithmetic, and it can coexist with positive numbers in a natural way.
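A quick worked example may help (a hedged C++ sketch; 8-bit width chosen for readability):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Two's complement negation: invert every bit, then add 1.
        // In 8 bits: 5 = 00000101 -> invert -> 11111010 -> +1 -> 11111011.
        uint8_t five = 5;
        uint8_t neg  = static_cast<uint8_t>(~five + 1);
        std::printf("%u\n", neg);                       // 251: the raw bit pattern 0xFB
        std::printf("%d\n", static_cast<int8_t>(neg));  // -5 when read as signed
    }

The same bit pattern is 251 or -5 depending only on whether you read it as unsigned or signed, which is exactly why two's complement lets negative and positive numbers coexist naturally: the adder hardware doesn't need to care.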
