Masking floating point exceptions with Set8087CW, SetMXCSR and TWebBrowser - Delphi

As I am receiving a "Floating point division by zero" exception from time to time when using TWebBrowser and TEmbeddedWB, I discovered that I need to mask division-by-zero exceptions with Set8087CW or SetMXCSR.
Q1: What would be the best approach to do this:
to mask such exceptions early in application startup and never touch them again (the app is multithreaded)?
to use the OnBeforeNavigate and OnDocumentComplete events to mask / unmask exceptions? (is there a chance that an exception could occur after the document is loaded?)
Q2: What would be the best "command" to mask only "division by zero" and nothing else - and if the application is 32-bit, is there a need to mask the 64-bit exceptions too?
The application I am working on has a TWebBrowser control available at all times for displaying email contents.
Also, if anyone can clarify: is this a particular bug in Microsoft's TWebBrowser control, or just a difference between Delphi/C++ Builder and Microsoft tools? What would happen if I hosted TWebBrowser inside a Visual C++ application and a division by zero occurred? It wouldn't be translated into a language exception, so what would happen then - how would Visual C++ handle the "division by zero" condition?
It is kind of strange that Microsoft didn't notice this problem for such a long time - and it is equally strange that Embarcadero never noticed it either. Because masking a floating point exception effectively also masks that exception for your own program's code.
UPDATE
My final solution after some examination is:
SetExceptionMask(GetExceptionMask() << exZeroDivide);
By default, GetExceptionMask() returns TFPUExceptionMask() << exDenormalized << exUnderflow << exPrecision. So obviously some exceptions are already masked - this just adds exZeroDivide to the set of masked exceptions.
As a result, every floating point division by zero now results in +INF instead of an exception. I can live with that - in the production version of the code it will be masked to avoid errors, and in the debug version it will be unmasked to detect floating point division by zero.
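For illustration, here is a minimal Delphi sketch of that production/debug split (the DEBUG conditional and the helper name are assumptions of mine, not part of the original code):

// Requires the System.Math unit.
// Hypothetical helper: call once, early in startup, before any threads exist.
procedure ConfigureZeroDivideMask;
begin
{$IFDEF DEBUG}
  // Debug build: leave exZeroDivide unmasked so floating point division
  // by zero still raises EZeroDivide and is caught in the debugger.
{$ELSE}
  // Release build: add exZeroDivide to whatever is already masked;
  // division by zero then yields +INF instead of raising.
  SetExceptionMask(GetExceptionMask + [exZeroDivide]);
{$ENDIF}
end;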

Assuming that you have no need for floating point exceptions to be unmasked in your application code, far and away the simplest thing to do is to mask exceptions at some point in your initialization code.
The best way to do this is like so:
SetExceptionMask(exAllArithmeticExceptions);
This will set the 8087 control word on 32 bit targets, and the MXCSR on 64 bit targets. You will find SetExceptionMask in the Math unit.
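As a sketch, with hypothetical project and unit names, the call can sit right at the top of the .dpr so it runs before any forms are created or threads started (newly created Delphi threads pick up the default control word):

program MailViewer;  // hypothetical project name

uses
  System.Math,
  Vcl.Forms,
  MainUnit;  // hypothetical unit containing TMainForm

begin
  // Mask all arithmetic FP exceptions before any FP code runs
  // and before any threads are created.
  SetExceptionMask(exAllArithmeticExceptions);
  Application.Initialize;
  Application.CreateForm(TMainForm, MainForm);
  Application.Run;
end.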
If you wish to have floating point exceptions unmasked in your code then it gets tricky. One strategy would be to run your floating point code in a dedicated thread that unmasks the exceptions. This can certainly work, but not if you rely on the RTL functions Set8087CW and SetMXCSR. Note that everything in the RTL that controls the FP units routes through these functions; SetExceptionMask, for example, does.
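A minimal sketch of that dedicated-thread strategy (the class is hypothetical; note the caveat that follows):

uses
  System.Classes, System.Math;

type
  TUnmaskedFPWorker = class(TThread)  // hypothetical worker thread
  protected
    procedure Execute; override;
  end;

procedure TUnmaskedFPWorker.Execute;
begin
  // Unmask zero divide and invalid op for this thread's FP context only.
  // Caveat: the RTL routines underneath are not threadsafe, as explained below.
  SetExceptionMask(GetExceptionMask - [exZeroDivide, exInvalidOp]);
  // ... floating point work that should raise on errors ...
end;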
The problem is that Set8087CW and SetMXCSR are not threadsafe. It seems hard to believe that Embarcadero could be so inept as to produce fundamental routines that operate on thread context and yet fail to be threadsafe. But that is what they have done.
It's surprisingly hard to undo the mess that they have left, and to do so involves quite a bit of code patching. The lack of thread safety is down to the (mis)use of the global variables Default8087CW and DefaultMXCSR. If two threads call Set8087CW or SetMXCSR at the same time then these global variables can have the effect of leaking the value from one thread to the other.
You could replace Set8087CW and SetMXCSR with versions that did not change global state, but it's sadly not that simple. The global state is used in various other places. This may seem immodest, but if you want to learn more about this matter, read my document attached to this QC report: http://qc.embarcadero.com/wc/qcmain.aspx?d=107411

Related

Delphi 64-bit: finding incorrect casts?

I'm working on adapting a large Delphi code base to 64 bits. In many cases there are lines where pointers are cast to/from 32-bit values, similar to this:
var
  p1, p2: Pointer;
begin
  inc(Integer(p1), 10);
  p2 := Pointer(Integer(p1) + 42);
Where I have found such casts, I have replaced them with NativeInt casts to make them correct in 64-bit mode.
However, I'm not sure I have found them all. Sometimes the casts are more subtle, so just text-searching for the string "integer(" is not sufficient either.
Since the "integer(" casts will fail in 64-bit if the pointer value is above the range of integer type I have an idea: what if I could force the memory manager to allocate memory above 4gb (so the pointer values are using more than 32-bits)? Then I would get runtime errors and can more easily find the casts that are wrong. Is this possible? Or can anyone recommend some other technique?
There's no magic trick to finding these casts beyond the sort of text search that you are using. It would be really nice if the compiler warned of such a cast. I find it very disappointing that it doesn't.
When you do find such a problem, don't change to NativeInt. Change the pointers to be typed pointers, and use pointer arithmetic.
var
  p1, p2: PByte;
....
  inc(p1, 10);
  p2 := p1;
  inc(p2, 42);
Then your code will be safe forever.
There are still some situations where you need to cast to integers. For example when passing addresses to SendMessage. But cast these to either WPARAM or LPARAM as appropriate.
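For example, a sketch with illustrative variable names:

var
  Wnd: HWND;
  S: string;
...
  // LPARAM is pointer-sized on both 32-bit and 64-bit targets, so the
  // address survives the trip through SendMessage intact.
  SendMessage(Wnd, WM_SETTEXT, 0, LPARAM(PChar(S)));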
Your idea of forcing runtime errors is sound and, thankfully for you, not original! You should use the full version of FastMM and define AlwaysAllocateTopDown. This forces the calls that FastMM makes to VirtualAlloc to pass the MEM_TOP_DOWN flag. This will flush out most of your erroneous casts as runtime pointer truncation errors.
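With the full FastMM4 sources in your project, the define goes in FastMM4Options.inc; as a sketch:

// In FastMM4Options.inc: enable top-down allocation for testing, so that
// FastMM calls VirtualAlloc with MEM_TOP_DOWN and pointers land high.
{$define AlwaysAllocateTopDown}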
However, that will only force top down allocation for memory allocated by your memory manager. Other modules in your process will use the default policy of bottom up. You can set a machine wide setting to change that default policy. Set HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management\AllocationPreference to REG_DWORD with value 0x100000 and reboot.
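If you prefer to script that setting rather than edit the registry by hand, here is a hedged Delphi sketch using TRegistry (the helper name is mine; it requires administrative rights, and a reboot is still needed):

uses
  Winapi.Windows, System.Win.Registry;

// Hypothetical helper: writes the machine-wide top-down allocation policy.
procedure EnableMachineWideTopDown;
var
  Reg: TRegistry;
begin
  Reg := TRegistry.Create(KEY_WRITE);
  try
    Reg.RootKey := HKEY_LOCAL_MACHINE;
    if Reg.OpenKey('System\CurrentControlSet\Control\Session Manager\Memory Management', False) then
      Reg.WriteInteger('AllocationPreference', $100000);  // 0x100000 = top down
  finally
    Reg.Free;
  end;
end;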
Note that this might cause your machine to have stability problems. Many applications cannot cope with it. In particular, there are very few anti-virus products that can cope with this setting; MSE is the one that I found works with machine-wide top-down allocation. What's more, the 64-bit debugger does not run under top-down allocation! So you have to do this kind of testing without the debugger. My QC report is still open, and this problem has not been addressed, even in XE3.

False autovectorization in Intel C compiler (icc)

I need to vectorize some huge loops in a program with SSE. In order to save time I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I make use of the compiler directives #pragma simd, #pragma aligned, #pragma ivdep. When compiling with the several -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the assembly generated by the compiler seems to confirm that, since you can find plenty of vector instructions there that work with packed single-precision operands (all operations in the serial code handle float operands).
The problem is that when I take hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized by hand a simplified version of the problem in question, and in that case I do get something like 3 times fewer FP operations.
Has anyone experienced something similar?
Spills may destroy the advantage of vectorization, so 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop, and you may be hitting the scalar version even though a vector version is present. icc versions issued in the last year or two have fixed some problems in this area.

How does a stackless language work?

I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?
The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.
The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.
One common problem all such applications have is, "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1Mb), and typical application call structure never gets anywhere near to using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry Dave, I can't do that"), by virtue of reaching off the end of its stack.
Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover the entire DAG; this way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the call DAG.)
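A toy sketch of the scheme in Delphi (all names are illustrative): each "call" pushes a heap-allocated frame record linked to its caller, and each "return" pops one, so the recursion consumes no contiguous machine stack:

type
  PFrame = ^TFrame;
  TFrame = record
    Caller: PFrame;  // link back to the calling frame
    N: Integer;      // this frame's "argument"
  end;

function Fact(N: Integer): Integer;
var
  Top, F: PFrame;
begin
  // "Call" phase: push one heap frame per recursive step.
  Top := nil;
  while N > 1 do
  begin
    New(F);
    F^.Caller := Top;
    F^.N := N;
    Top := F;
    Dec(N);
  end;
  // "Return" phase: pop frames, combining results on the way back up.
  Result := 1;
  while Top <> nil do
  begin
    Result := Result * Top^.N;
    F := Top;
    Top := Top^.Caller;
    Dispose(F);
  end;
end;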
There are several reasons for using heap allocation for stack frames:
1) The program does deep recursion that depends on the specific problem it is solving. It is very hard to preallocate a "big stack" area in advance because the needed size isn't known. One can awkwardly arrange function calls to check whether there's enough stack left and, if not, reallocate a bigger chunk, copy the old stack and readjust all the pointers into the stack; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there's literally no allocatable memory left.
2) The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to the parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created, called a "cactus stack".
3) Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.
The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access heap-per-thread; it typically costs 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP registers much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and popping), just in chunks. It also generates "frame local" subroutine calls that don't switch stacks for lots of generated utility code where the stack demand is known in advance.
Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.
Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.
There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.
In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.
In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.
EDIT: I know some architectures have dedicated stacks, but they aren't necessary.
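As a concrete toy, here is such a hand-rolled stack sketched in Delphi (all names are illustrative): an array and an index play the roles of the memory region and the stack register:

type
  TSoftStack = record
    Data: array[0..1023] of Integer;  // the chosen "region of memory"
    Top: Integer;                     // plays the stack register; start at 0
  end;

procedure Push(var S: TSoftStack; Value: Integer);
begin
  S.Data[S.Top] := Value;  // store, then advance the "stack pointer"
  Inc(S.Top);
end;

function Pop(var S: TSoftStack): Integer;
begin
  Dec(S.Top);              // retreat the "stack pointer", then load
  Result := S.Data[S.Top];
end;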
Call me ancient, but I can remember when the FORTRAN standards and COBOL did not support recursive calls, and therefore didn't require a stack. Indeed, I recall the implementations for CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.
For the record, instead of a call stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call target location and then branched to the location following it. At the end, the subroutine would perform an indirect jump to the call target location, which reloaded the saved PC and effectively returned to the caller.
Obviously, that does not work with recursive calls. (And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion ...)
There is an easy-to-understand description of continuations in this article: http://www.defmacro.org/ramblings/fp.html
Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.
Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:
a == b
But, does this?
isequal(a, b) { return a == b; }
No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.
What about recursion? No problem. A tail-recursive function like:
bang(x) { return x == 1 ? 1 : x * bang(x-1); }
Can still be inlined, because really it's just a for loop in disguise:
bang(x) {
    for (int i = x - 1; i >= 1; i--) x *= i;
    return x;
}
In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:
ax = x;
NOTDONE:
if (ax > 1) {
    x = x * (--ax);
    goto NOTDONE;
}
There is one case where you have to make a small trade off. This can't be inlined:
fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }
Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do very well either. If you don't believe me, just call fib(1000) and see what happens to your precious computer.
Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function call frame would cause extreme memory thrashing. The operating system does, after all, have to manage this memory. I would think that the way to avoid this thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contiguous in memory and call it a stack.
