x87 FPU and integer arithmetic?

I'm trying to understand using the FPU for 64-bit integer arithmetic. I write this (ATT syntax):
fildq A
fildq B
faddp
fistpq C
The result in C is A + B + 1. If I start with an "finit" instruction, it gives me the correct value A + B. I thought that the unwanted +1 was maybe because it was adding in a carry bit, but using gdb I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F, the tag register is 0xFFFF (= stack empty), and all the others (including the status register, where all the condition bits are located) are zero.
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, given that all the FPU registers seem to have the same values in both cases. Can anyone shed any light on this for me?

[…] I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F […]
Are you sure?
finit is supposed to load 0x37F, one additional bit set in comparison to 0x27F.
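The difference is in the precision control field (bits 8 and 9 of the control word).
The default value, 0x37F, selects extended precision (a 64-bit significand, stored in the 80-bit format), whilst your observed value, 0x27F, selects double precision (a 53-bit significand, as in the 64-bit format).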
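To check a control word value by hand, the precision control field can be pulled out with a shift and a mask. This is only a sketch in Python rather than assembly, and the helper name precision_control is made up for illustration; the field position and encodings follow the x87 control word layout.

```python
# Decode the precision control (PC) field of an x87 control word.
# PC occupies bits 8-9: 00 = single, 10 = double, 11 = extended.
PC_NAMES = {
    0b00: "single (24-bit significand)",
    0b01: "reserved",
    0b10: "double (53-bit significand)",
    0b11: "extended (64-bit significand)",
}


def precision_control(control_word: int) -> str:
    """Describe the PC field of a control word (hypothetical helper)."""
    return PC_NAMES[(control_word >> 8) & 0b11]


for cw in (0x27F, 0x37F):
    print(hex(cw), "->", precision_control(cw))
```

The two values differ only in bit 8, which is enough to switch between double and extended precision.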
The result in C is A + B + 1. […]
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, […]
With sufficiently large A and B you're likely seeing a loss of precision from fadd: with only a 53-bit significand, a sum that needs more than 53 bits has to be rounded.
Unmasking the precision exception will confirm this.
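The rounding can be reproduced outside the FPU. Python floats are IEEE 754 doubles, so they have the same 53-bit significand the x87 uses with control word 0x27F. The operand values below are made up to sit right at the 53-bit limit; they are not the values from the question.

```python
# Emulate an integer addition whose result is rounded to a 53-bit significand.
a = 2**53   # hypothetical operand at the edge of what a double holds exactly
b = 3

exact = a + b                        # Python integers are exact
rounded = int(float(a) + float(b))   # the sum is rounded to 53 bits

print("exact:  ", exact)
print("rounded:", rounded)
print("error:  ", rounded - exact)
```

Here the exact sum lies halfway between two representable values, and round-to-nearest-even picks the upper one, so the stored result is one too large, the same symptom as in the question.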
I think you were using the inline assembly capabilities of your favorite compiler.
This is certainly convenient if you don’t wanna bother about menial tasks, yet apparently your compiler’s run-time system loads 0x27F at startup for compatibility considerations.
Study its manual (and possibly source code) for details.


Is it possible to clear the FPU?

I'm using Delphi XE6 to perform a complicated floating point calculation. I realize the limitations of floating point numbers, so I understand the inaccuracies inherent in them. However, in this particular case, I always get one of two different values at the end of the calculation.
I get the first value and, after a while (I haven't figured out why or when), it flips to the second value, and then I can't get the first value again unless I restart my application. I can't really be more specific as the calculation is very complicated. I could almost understand if the value was somewhat random, but just two different states is a little confusing. This only happens with the 32-bit compiler; the 64-bit compiler gives one single answer no matter how many times I try it. This number is different from the two from the 32-bit calculation, but I understand why that is happening and I'm fine with it. I need consistency, not total accuracy.
My one suspicion is that perhaps the FPU is being left in a state after some calculation that affects subsequent calculations, hence my question about clearing all registers and the FPU stack to level the playing field. I'd call this CLEARFPU before I start the calculation.
After some more investigation I realized I was looking in the wrong place. What you see is not what you get with floating point numbers. I was looking at the string representation of the numbers and thinking: here are 4 numbers going into a calculation, ALL EQUAL, and the result is different. It turns out the numbers only seemed to be the same. I started logging the hex equivalent of the numbers, worked my way back, and found that an external DLL used for matrix multiplication was the cause of the error. I replaced the matrix multiplication with a routine written in Delphi and all is well.
Floating point calculations are deterministic. The inputs are the input data and the floating point control word. With the same input, the same calculation will yield repeatable output.
If you have unpredictable results, then there will be a reason for it. Either the input data or the floating point control word is varying. You have to diagnose what that reason is. Until you understand the problem fully, you should not be looking for a solution. Do not attempt to apply a sticking plaster without understanding the disease.
So the next step is to isolate and reproduce the problem in a simple piece of code. Once you can reproduce the issue you can solve the problem.
Possible explanations include using uninitialized data, or external code modifying the floating point control word. But there could be other reasons.
Uninitialized data is plausible. Perhaps more likely is that some external code is modifying the floating point control word. Instrument your code to log the floating point control word at various stages of execution, to see if it ever changes unexpectedly.
You've probably been bitten by a combination of optimization and excess x87 FPU precision, resulting in the same bit of floating-point code in your source being duplicated with different assembly code implementations that have different rounding behaviour.
The problem with x87 FPU math
The basic problem is that while the x87 FPU supports 32-bit, 64-bit and 80-bit floating-point values, it only has 80-bit registers, and the precision of operations is determined by the state of the bits in the floating point control word, not by the instruction used. Changing these bits is expensive, so most compilers don't, and so all floating point operations end up being performed at the same precision regardless of the data types involved.
So if the compiler sets the FPU to use 80-bit rounding and you add three 64-bit floating point variables, the code generated will often add the first two variables, keeping the unrounded result in an 80-bit FPU register. It would then add the third 64-bit variable to the 80-bit value in the register, resulting in another unrounded 80-bit value in an FPU register. This can result in a different value being calculated than if the result was rounded to 64-bit precision after each step.
If that resulting value is then stored in a 64-bit floating-point variable, the compiler might write it to memory, rounding it to 64 bits at this point. But if the value is used in later floating point calculations, the compiler might keep it in a register instead. This means that what rounding occurs at this point depends on the optimizations the compiler performs. The more it's able to keep values in an 80-bit FPU register for speed, the more the result will differ from what you'd get if all floating point operations were rounded according to the size of the actual floating point types used in the code.
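The effect of rounding after every step versus keeping a wider intermediate can be imitated in Python. Python has no 80-bit type, so this sketch uses 32-bit floats as the "variable" type and 64-bit doubles as the "register" type; the ratio differs from the x87 case but the mechanism is the same. The helper round_to_single and the operand values are made up for illustration.

```python
import struct


def round_to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Three "narrow" variables; b and c are each half a unit in the last place of a.
a, b, c = 1.0, 2.0**-24, 2.0**-24

# Result rounded to the narrow type after every addition.
stepwise = round_to_single(round_to_single(a + b) + c)

# Intermediate kept in the wide type, rounded once when finally stored.
kept_wide = round_to_single(a + b + c)

print("rounded each step:", repr(stepwise))
print("kept wide:        ", repr(kept_wide))
```

Rounding at each step discards b and c one at a time, while the wide intermediate accumulates them into something large enough to survive the final rounding, so the two results differ in the last bit.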
Why SSE floating-point math is better
With 64-bit code the x87 FPU isn't normally used; equivalent scalar SSE instructions are used instead. With these instructions the precision of the operation is determined by the instruction used. So with the example of adding three numbers, the compiler would emit instructions that add the numbers using 64-bit precision. It doesn't matter if the result gets stored in memory or stays in a register, the value remains the same, so optimization doesn't affect the result.
How optimization can turn deterministic FP code into non-deterministic FP code
So far this would explain why you'd get a different result with 32-bit code and 64-bit code, but it doesn't explain why you can get a different result with the same 32-bit code. The problem here is that optimizations can change your code in surprising ways. One thing the compiler can do is duplicate code for various reasons, and this can cause the same floating point code to be executed in different code paths with different optimizations applied.
Since optimization can affect floating point results, this can mean the different code paths give different results even though there's only one code path in the source code. If the code path chosen at run time is non-deterministic, then this can cause non-deterministic results even when, in the source code, the result isn't dependent on any non-deterministic factor.
An example
So for example, consider this loop. It performs a long running calculation, so every few seconds it prints a message letting the user know how many iterations have been completed so far. At the end of the loop there's a simple summation performed using floating-point arithmetic. While there's a non-deterministic factor in the loop, the floating-point operation isn't dependent on it. It's always performed regardless of whether a progress update is printed or not.
while ... do
begin
  if TimerProgress() then
    count := 0
  else
    count := count + 1;
  sum := sum + value
end
As an optimization, the compiler might move the last summing statement into the end of both blocks of the if statement. This lets both blocks finish by jumping back to the start of the loop, saving a jump instruction. Otherwise one of the blocks has to end with a jump to the summing statement.
This transforms the code into this:
while ... do
begin
  if TimerProgress() then
  begin
    count := 0;
    sum := sum + value
  end
  else
  begin
    count := count + 1;
    sum := sum + value
  end
end
This can result in the two summations being optimized differently. It may be that in one code path the variable sum can be kept in a register, but in the other path it's forced out into memory. If x87 floating point instructions are used here, this can cause sum to be rounded differently depending on a non-deterministic factor: whether or not it's time to print the progress update.
Possible solutions
Whatever the source of your problem, clearing the FPU state isn't going to solve it. The fact that the 64-bit version works provides a possible solution: using SSE math instead of x87 math. I don't know if Delphi supports this, but it's a common feature of C compilers. It's very hard and expensive to make x87 based floating-point math conform to the C standard, so many C compilers support using SSE math instead.
Unfortunately, a quick search of the Internet suggests the Delphi compiler doesn't have an option for using SSE floating-point math in 32-bit code. In that case your options would be more limited. You can try disabling optimization; that should prevent the compiler from creating differently optimized versions of the same code. You could also try changing the rounding precision in the x87 floating-point control word. By default it uses 80-bit precision, but if all your floating point variables are 64-bit, then changing the FPU to use 64-bit precision should significantly reduce the effect optimization has on rounding.
To do the latter you can probably use the Set8087CW procedure MBo mentioned, or maybe System.Math.SetPrecisionMode.

Delphi 64-bit: finding incorrect casts?

I'm working on adapting a large Delphi code base to 64-bit. In many cases there are lines where pointers are cast to/from 32-bit values, similar to this:
p1,p2 : pointer;
p2 := Pointer(Integer(p1) + 42);
Where I can find these casts I have replaced them with NativeInt-casts instead to make them correct in 64-bit mode.
However I'm not sure I have found them all. Sometimes the casts are more subtle so just text-searching for the string "integer(" is not sufficient either.
Since the "integer(" casts will fail in 64-bit if the pointer value is above the range of the Integer type, I have an idea: what if I could force the memory manager to allocate memory above 4 GB (so the pointer values use more than 32 bits)? Then I would get runtime errors and could more easily find the casts that are wrong. Is this possible? Or can anyone recommend some other technique?
There's no magic trick to finding these casts beyond the sort of text search that you are using. It would be really nice if the compiler warned of such a cast. I find it very disappointing that it doesn't.
When you do find such a problem, don't change to NativeInt. Change the pointers to be typed pointers, and use pointer arithmetic.
p1, p2: PByte;
inc(p1, 10);
p2 := p1;
inc(p2, 42);
Then your code will be safe forever.
There are still some situations where you need to cast to integers. For example when passing addresses to SendMessage. But cast these to either WPARAM or LPARAM as appropriate.
Your idea of forcing runtime errors is sound and, thankfully for you, not original! You should use the full version of FastMM and define AlwaysAllocateTopDown. This forces the calls that FastMM makes to VirtualAlloc to pass the MEM_TOP_DOWN flag. This will flush out most of your erroneous casts as runtime pointer truncation errors.
However, that will only force top down allocation for memory allocated by your memory manager. Other modules in your process will use the default policy of bottom up. You can set a machine wide setting to change that default policy. Set HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference to REG_DWORD with value 0x100000 and reboot.
Note that this might cause your machine to have stability problems. Many applications cannot cope with this. In particular there are very few anti-virus products that can cope with this setting. MSE is the one that I found works with machine wide top down allocation. What's more the 64 bit debugger does not run under top down allocation! So you have to do this kind of testing without the debugger. My QC report is still open and this problem has not been addressed, even in XE3.
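To see why top-down allocation exposes these casts, it helps to mimic the arithmetic. The following is a rough Python model, not Delphi: the helper names and both addresses are made up, and ctypes.c_int32 stands in for the truncation that a cast to a 32-bit Integer performs in a 64-bit process.

```python
import ctypes

MASK64 = 0xFFFFFFFFFFFFFFFF


def through_integer(address: int) -> int:
    """Model Pointer(Integer(p) + 42): only the low 32 bits survive the cast."""
    as_int32 = ctypes.c_int32(address).value   # truncates, no overflow check
    return (as_int32 + 42) & MASK64            # widened back to pointer size


def through_native_int(address: int) -> int:
    """Model Pointer(NativeInt(p) + 42): all 64 bits are kept."""
    return (address + 42) & MASK64


low = 0x0000000000401000    # made-up address below 4 GB
high = 0x00007FF612345678   # made-up address above 4 GB

for addr in (low, high):
    print(hex(addr), hex(through_integer(addr)), hex(through_native_int(addr)))
```

For the low address both routes agree, which is why the bug stays hidden while allocations happen to land below 4 GB. For the high address the 32-bit route produces a pointer into unrelated memory.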

Heap overflow exploit

I understand that overflow exploitation requires three steps:
1. Injecting arbitrary code (shellcode) into the target process memory space.
2. Taking control over eip.
3. Setting eip to execute the arbitrary code.
I read ben hawkens articles about heap exploitation and understood a few tactics for ultimately overwriting a function pointer to point to my code.
In other words, I understand step 2.
I do not understand step 1 and 3.
How do I inject my code into the process memory space?
During step 3 I overwrite a function pointer with a pointer to my shellcode. How can I calculate or know what address my injected code was injected into? (In a stack overflow this problem is solved by using "jmp esp".)
In a heap overflow, supposing that the system does not have ASLR activated, you will know the address of the memory chunks (aka, the buffers) you use in the overflow.
One option is to place the shellcode where the buffer is, given that you can control the contents of the buffer (as the application user). Once you have placed the shellcode bytes in the buffer, you only have to jump to that buffer address.
One way to perform that jump is by, for example, overwriting a .dtors entry. Once the vulnerable program finishes, the shellcode - placed in the buffer - will be executed. The complicated part is the .dtors overwriting. For that you will have to use the published heap exploiting techniques.
The prerequisites are that ASLR is deactivated (to know the address of the buffer before executing the vulnerable program) and that the memory region where the buffer is placed must be executable.
One more thing: steps 2 and 3 are the same. If you control eip, it's logical that you will point it to the shellcode (the arbitrary code).
P.S.: Bypassing ASLR is more complex.
Step 1 requires a vulnerability in the attacked code.
Common vulnerabilities include:
buffer overflow (common in C code, happens if the program reads an arbitrarily long string into a fixed buffer)
evaluation of unsanitized data (common in SQL and script languages, but can occur in other languages as well)
Step 3 requires detailed knowledge of the target architecture.
How do I inject my code into process space?
This is quite a statement/question. It requires an 'exploitable' region of code in said process space. For example, Windows is currently rewriting most strcpy() to strncpy() if at all possible. I say if possible because not all areas of code that use strcpy can successfully be changed over to strncpy. Why? Because of the crux of the difference shown below:
strcpy(buffer, copied);
strncpy(buffer, copied, sizeof(buffer)); /* the bound is the destination's size */
This is what makes strncpy so difficult to use in real-world scenarios. A 'magic number' has to be supplied to most strncpy operations (the sizeof() operator creates this magic number).
As coders, we are taught that using hard-coded values, such as strict compliance with a char buffer[1024];, is really bad coding practice.
But in comparison, using buffer[]=""; or buffer[1024]=""; is the heart of the exploit. However, if, for example, we change this code to the latter, we get another exploit introduced into the system...
char * buffer;
char * copied;
strcpy(buffer, copied);//overflow this right here...
int size = 1024;
char buffer[size];
char copied[size];
strncpy(buffer,copied, size);
This will stop overflows, but introduce an exploitable region in RAM due to size being predictable and structured into 1024 blocks of code/data.
Therefore, original poster, looking for strcpy for example, in a program's address space, will make the program exploitable if strcpy is present.
There are many reasons why strcpy is favoured by programmers over strncpy. Magic numbers, variable input/output data size...programming styles...etc...
Check various hacker books for examples of this ~
BUT, try;
pop eax
pop eax
call pointer
jmp label
mov esp, eax
jmp $
This is an example that is non-working due to the fact that I do NOT want to be held responsible for writing the next Morris Worm! But any decent programmer will get the gist of this code and know immediately what I am talking about here.
I hope your overflow techniques work in the future, my son!

Why does this code cause the machine to crash?

I am trying to run this code but it keeps crashing:
for n from 30 thru 1 step -1 do Ibwd:append([[n-1,rnd(1/(2*n-1)-a*last(first(Ibwd)),d)]],Ibwd);
Maxima crashes when it evaluates the last line. Any ideas why it may happen?
Thank you so much.
The problem is that the difference becomes negative and your rounding function dies horribly with a negative argument. To find this out, I changed your loop to:
for n from 30 thru 1 step -1 do (
  print (1/(2*n-1)-a*last(first(Ibwd))),
  print (a*last(first(Ibwd))),
  Ibwd: append([[n-1,rnd(1/(2*n-1)-a*last(first(Ibwd)),d)]],Ibwd),
  print (Ibwd));
The last difference printed before everything fails miserably is -316539/6125000. So now try
and see the same problem. This all stems from the fact that you're taking the log of a negative number, which Maxima interprets as a complex number by analytic continuation. Maxima doesn't evaluate this until it absolutely has to and, somewhere in the evaluation code, something's dying horribly.
I don't know the "fix" for your specific example, since I'm not exactly sure what you're trying to do, but hopefully this gives you enough info to find it yourself.
If you want to deconstruct a floating point number, let's first make sure that it is a bigfloat.
say z: 34.1
You can access the parts of a bigfloat by using lisp, and you can also access the mantissa length in bits by ?fpprec.
Thus ?second(z)*2^(?third(z)-?fpprec) gives you :
and bfloat(%) gives you :
If you want the mantissa of z as an integer, look at ?second(z)
Now I am not sure what it is that you are trying to accomplish in base 10, but Maxima does not do internal arithmetic in base 10.
If you want more bits or fewer, you can set fpprec, which is linked to ?fpprec. fpprec is the "approximate base 10" precision.
Thus fpprec is initially 16 and ?fpprec is correspondingly 56.
You can easily change them both, e.g. fpprec:100 corresponds to ?fpprec of 335.
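The two pairs quoted above (16 and 56, 100 and 335) are consistent with a simple rule: take the number of bits needed to write 10^fpprec and add two guard bits. I'm stating that rule as an assumption that matches these numbers, not as a quote from the Maxima source. A Python sketch, with a made-up function name:

```python
def binary_precision(decimal_digits: int) -> int:
    """Mantissa bits for a given decimal precision (assumed rule, see above)."""
    return (10**decimal_digits).bit_length() + 2


for digits in (16, 100):
    print(digits, "decimal digits ->", binary_precision(digits), "bits")
```

Either way, the decimal setting is only approximate; the arithmetic itself is done on a binary mantissa of that many bits.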
If you are diddling around with float representations, you might benefit from knowing
that you can look at any of the lisp by typing, for example,
which prints the internal form using the Lisp print function.
You can also trace any function, your own or system function, by trace.
For example you could consider doing this:
If you want to use machine floats, I suggest you use, for the last line,
for n from 30 thru 1 step -1 do
  Ibwd: append([[n-1, rnd(1/(2.0*n-1.0) - a*last(first(Ibwd)), d)]], Ibwd);
Note the decimal points. But even that is not quite enough, because integration inserts exact structures like atan(10). Trying to round these things, or compute the log of them, is probably not what you want to do. I suspect that Maxima is unhappy because log is given some messy expression that turns out to be negative, even though it initially thought otherwise. It hands the number to the lisp log program, which is perfectly happy to return an appropriate common-lisp complex number object. Unfortunately, most of Maxima was written BEFORE LISP HAD COMPLEX NUMBERS.
Thus the result (log -0.5)= #C(-0.6931472 3.1415927) is entirely unexpected to the rest of Maxima. Maxima has its own form for complex numbers, e.g. 3+4*%i.
In particular, the Maxima display program predates the common lisp complex number format and does not know what to do with it.
The error (stack overflow !!!) is from the display program trying to display a common lisp complex number.
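The same split between a "real only" log and a complex-aware log exists in other languages, which makes the value above easy to check. This is a Python illustration of the number involved, not of Maxima's internals:

```python
import cmath
import math

x = -0.5

# The real-valued log refuses a negative argument outright.
try:
    math.log(x)
    real_log_failed = False
except ValueError:
    real_log_failed = True

# The complex log returns log|x| + i*pi, matching #C(-0.6931472 3.1415927).
z = cmath.log(x)

print("real log failed:", real_log_failed)
print("complex log:    ", z)
```

Code that only expects one of these two behaviours will be surprised by the other, which is what appears to happen when the Lisp complex object reaches Maxima's display code.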
How to fix all this? Well, you could try changing your program so it computes what you really want, in which case it probably won't trigger this error. Maxima's display program should be fixed, too. Also, I suspect there is something unfortunate in simplification of logs of numbers that are negative but not obviously so.
This is probably waaay too much information for the original poster, but maybe the paragraph above will help out and also possibly improve Maxima in one or more places.
It appears that your program triggers an error in Maxima's simplification (algebraic identities) code. We are investigating and I hope we have a bug fix soon.
In the meantime, here is an idea. Looks like the bug is triggered by rnd(x, d) when x < 0. I guess rnd is supposed to round x to d digits. To handle x < 0, try this:
rnd(x, d) := if x < 0 then -rnd1(-x, d) else rnd1(x, d);
rnd1(x, d) := (... put the present definition of rnd here ...);
When I do that, the loop runs to completion and Ibwd is a list of values, but I don't know what values to expect.
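For comparison, here is the same sign-splitting idea sketched in Python. The original rnd isn't shown in the question, so rnd1 below is a hypothetical stand-in that rounds to d significant digits using a logarithm, which is the part that breaks for negative input.

```python
import math


def rnd1(x: float, d: int) -> float:
    """Round a positive x to d significant digits (stand-in for the original rnd)."""
    exponent = math.floor(math.log10(x))   # only valid for x > 0
    return round(x, d - 1 - exponent)


def rnd(x: float, d: int) -> float:
    """Handle zero and negative x by rounding the magnitude and restoring the sign."""
    if x == 0:
        return 0.0
    return -rnd1(-x, d) if x < 0 else rnd1(x, d)


# The negative difference reported earlier in this thread, to 3 digits.
print(rnd(-316539 / 6125000, 3))
```

The wrapper never passes a negative number to the logarithm, so the problematic case simply cannot arise.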

How does a stackless language work?

I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?
The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.
The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.
One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1 MB), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.
Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpreted, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover the entire DAG; this way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the call DAG).
There are several reasons for using heap allocation for stack frames:
1) If the program does deep recursion dependent on the specific problem it is solving, it is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check to see if there's enough stack left, and if not, reallocate a bigger chunk, copy the old stack and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there's literally no allocatable memory left.
2) The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So, one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created called a "cactus stack".
3) Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.
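Heap-allocated frames are easy to imitate in a high-level language. The following Python sketch is a toy, not how any of the implementations mentioned here work: the Frame class and sum_to function are invented for illustration. Each "call" allocates a frame object linked to its caller, so the depth is limited by available memory rather than by the native stack.

```python
class Frame:
    """One activation record, allocated on the heap and linked to its caller."""
    __slots__ = ("n", "parent")

    def __init__(self, n, parent):
        self.n = n
        self.parent = parent


def sum_to(n: int) -> int:
    """Compute n + (n-1) + ... + 1 the way a recursion would, frame by frame."""
    frame = None
    while n > 0:                  # "call": allocate a frame per nested call
        frame = Frame(n, frame)
        n -= 1
    result = 0                    # base case reached
    while frame is not None:      # "return": unwind by following parent links
        result += frame.n
        frame = frame.parent
    return result


print(sum_to(100_000))
```

A depth of 100,000 would exceed Python's default recursion limit if written as an ordinary recursive function, but is no problem here. With several tasks sharing the same parent frames, the parent links would form the cactus stack described in 2).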
The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access heap-per-thread; it costs typically 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP register much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and popping), just in chunks. It also generates "frame local" subroutine calls that don't switch stacks for lots of generated utility code where the stack demand is known in advance.
Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.
Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.
There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.
In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.
In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.
EDIT: I know some architectures have dedicated stacks, but they aren't necessary.
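As a rough illustration of rolling your own stack, here is a toy model in Python: a flat block of "memory" plus a stack pointer that is adjusted on every push and pop. The Machine class is made up for this example, and it has the stack grow downward, which is one common convention among several.

```python
class Machine:
    """A flat memory region with a software-managed stack pointer."""

    def __init__(self, size: int):
        self.memory = [0] * size
        self.sp = size              # empty stack: pointer sits past the region

    def push(self, value: int) -> None:
        if self.sp == 0:
            raise OverflowError("stack region exhausted")
        self.sp -= 1                # decrement, then store
        self.memory[self.sp] = value

    def pop(self) -> int:
        if self.sp == len(self.memory):
            raise IndexError("stack is empty")
        value = self.memory[self.sp]
        self.sp += 1                # load, then increment
        return value


m = Machine(8)
for v in (10, 20, 30):
    m.push(v)
print(m.pop(), m.pop(), m.sp)
```

Nothing here is special to the language or the machine: the "stack" is just a convention about how one pointer into ordinary memory is used.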
Call me ancient, but I can remember when the FORTRAN standards and COBOL did not support recursive calls, and therefore didn't require a stack. Indeed, I recall the implementations for CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.
For the record, instead of a call stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call target location and then branched to the location following it. At the end, a subroutine would perform an indirect jump to the call target location. That reloaded the saved PC, effectively returning to the caller.
Obviously, that does not work with recursive calls. (And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion ...)
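The failure mode can be imitated in a modern language by giving a routine exactly one fixed storage slot instead of a fresh frame per call. This Python sketch is only an analogy: it models static storage for a local value, not the RJ instruction or the saved return address, and Python's own stack still handles the returns.

```python
def make_static_factorial():
    """Build a factorial whose only 'frame' is one statically allocated slot."""
    frame = {"n": 0}

    def fact(n):
        frame["n"] = n                    # argument goes into the single slot
        if frame["n"] <= 1:
            return 1
        partial = fact(frame["n"] - 1)    # nested call overwrites the slot
        return frame["n"] * partial       # reads the clobbered value

    return fact


fact = make_static_factorial()
print(fact(5))   # not 120: every level sees n == 1 after the innermost call
```

Once the innermost call has stored its own argument, every outer level reads that value back instead of its own, so the recursion quietly produces the wrong answer rather than failing loudly.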
There is an easy-to-understand description of continuations in this article: http://www.defmacro.org/ramblings/fp.html
Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.
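To make that concrete, here is a small continuation-passing sketch in Python. It is my own toy example, not taken from the linked article: each step returns a thunk instead of calling deeper, and a driver loop (a "trampoline") runs the thunks one at a time. The pending work lives in heap-allocated closures rather than in native stack frames.

```python
import math


def fact_cps(n, k):
    """Factorial in continuation-passing style; k receives the result."""
    if n <= 1:
        return lambda: k(1)
    # Defer the recursive step and extend the continuation with "multiply by n".
    return lambda: fact_cps(n - 1, lambda r: lambda: k(n * r))


def trampoline(step):
    """Run thunks one at a time so the native stack never grows."""
    while callable(step):
        step = step()
    return step


result = trampoline(fact_cps(2000, lambda r: r))
print(result == math.factorial(2000))
```

A depth of 2000 is past Python's default recursion limit for a plain recursive factorial, yet the native stack here stays only a couple of frames deep throughout.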
Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:
a == b
But, does this?
isequal(a, b) { return a == b; }
No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.
What about recursion? No problem. A simple linear recursion like:
bang(x) { return x == 1 ? 1 : x * bang(x-1); }
Can still be inlined, because really it's just a for loop in disguise:
bang(x) {
    int result = 1;
    for (int i = x; i > 1; i--) result *= i;
    return result;
}
In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:
ax = x;
NOTDONE:
if (ax > 1) {
    x = x * (--ax);
    goto NOTDONE;
}
There is one case where you have to make a small trade off. This can't be inlined:
fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }
Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do very well either. If you don't believe me, just call fib(1000) and see what happens to your precious computer.
Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function call frame would cause extreme memory thrashing. The operating system does after all have to manage this memory. I would think that the way to avoid this memory thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contiguous in memory and call it a stack.
