Virtual memory without hardware support

While reading this question and its answer, I couldn't help but wonder: why is it obligatory for the hardware to support virtual memory?
Couldn't I simulate this behavior with software only (e.g., the OS could represent all memory as some table, intercept all memory-related actions, and do the mapping itself)?
Is there any OS that implements such techniques?

As far as I know, no.
Intercept all memory-related actions? It doesn't look impossible, but it would be very, very slow.
For example, consider this code:
void f(int *a1, int *b1, int *c1, int *d1)
{
    const int n = 100000;
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
}
(From here >o<)
This simple loop is compiled into the following by gcc 4.8.3 with -std=c99 -O3:
push %edi ; 1
xor %eax,%eax
push %esi ; 2
push %ebx ; 3
mov 0x10(%esp),%ecx ; 4
mov 0x14(%esp),%esi ; 5
mov 0x18(%esp),%edx ; 6
mov 0x1c(%esp),%ebx ; 7
mov (%esi,%eax,4),%edi ; 8
add %edi,(%ecx,%eax,4) ; 9
mov (%ebx,%eax,4),%edi ; 10
add %edi,(%edx,%eax,4) ; 11
add $0x1,%eax
cmp $0x186a0,%eax
jne 15 <_f+0x15> ; 12
pop %ebx ; 13
pop %esi ; 14
pop %edi ; 15
ret ; 16
Even this really simple function compiles to 16 machine instructions that access memory. The OS's simulation code would probably be hundreds of instructions long, so we can guess that every memory access would slow down by a factor of hundreds at least.
Moreover, that assumes you could trap only the memory-accessing instructions. Your processor probably doesn't have such a feature, so you would have to single-step everything, for example with x86's Trap Flag, and inspect every instruction every time.
It gets worse: inspecting the instructions is not enough. You would want the IP (Instruction Pointer) to follow your OS's virtual-memory rules as well, so you would have to check whether IP crossed a page boundary after every instruction, and deal very carefully with instructions that change IP, such as jmp, call, ret, ...
It cannot be implemented efficiently. Speed is one of the most important properties of an operating system; if the OS becomes even slightly slow, the whole system is affected, and in this case it would not be slight: your computer would slow down enormously. Moreover, as described above, implementing this is very difficult. I'd rather write an emulator for a processor with hardware-supported virtual memory than do this crazy job!
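To make the overhead concrete, here is a rough sketch of what every single load or store would have to become under a software-only scheme (purely illustrative; the table layout and all names are invented for this example):

#include <cstdint>

constexpr uint32_t PAGE_BITS = 12;   // assume 4 KiB pages
extern uint32_t page_table[];        // virtual page -> physical page, maintained by the "OS"

uint8_t *translate(uint32_t vaddr)
{
    uint32_t vpage  = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    // A real implementation would also check permissions and handle
    // page faults here, adding even more instructions per access.
    uintptr_t paddr = (static_cast<uintptr_t>(page_table[vpage]) << PAGE_BITS) | offset;
    return reinterpret_cast<uint8_t *>(paddr);
}

Every a1[j] += b1[j]; in the loop above would become two translate() calls plus the add, and that is before instruction fetches are checked at all.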

Related

How to detect a Xeon Phi (Knights Landing)

Intel engineers wrote that we should use VZEROUPPER/VZEROALL to avoid the costly transition to the non-VEX state on all processors, including future Xeon processors, but not on Xeon Phi: https://software.intel.com/pt-br/node/704023
People have also measured and found that VZEROUPPER and VZEROALL are expensive on Knights Landing:
36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
See the above link.
So my code would be the following, if I have used just ymm0 and ymm1:
if [we are running on a Xeon Phi]
vpxor ymm0,ymm0,ymm0
vpxor ymm1,ymm1,ymm1
else
vzeroall
endif
How can I detect Xeon Phi (Knights Landing and later Xeon Phi processors) to implement the above code?
We now have the following situation with VZEROUPPER/VZEROALL:
These instructions are not needed and are very costly on Xeon Phi Knights Landing: 36 clock cycles for both instructions in 64-bit mode (30 clock cycles in 32-bit mode).
These instructions are very cheap and are needed on Xeon and Core processors (Skylake/Kaby Lake), and will be needed on Xeon for the foreseeable future, to avoid the costly transition to the non-VEX state.
The advertising materials claim that Xeon Phi (Knights Landing) is fully compatible with other Xeon processors.
Is there a reliable way to detect Xeon Phi, for the purpose of avoiding VZEROUPPER/VZEROALL?
There is an article "How to detect Knights Landing AVX-512 support (Intel® Xeon Phi™ processor)" by James R., updated February 22, 2016, but it focuses only on the specific new instructions that became available on Knights Landing. So it is still not very clear about the VEX transitions.
It would be good to know whether Intel plans to implement a CPUID bit that shows whether non-VEX state transitions are costly. For example:
Bit set to 0: VEX state transitions are costly, but VZEROUPPER/VZEROALL are cheap and should be used to clear the state;
Bit set to 1: there is no transition penalty, and VZEROUPPER/VZEROALL are not needed.
The above-mentioned article about detecting Knights Landing suggests checking the bits for AVX-512F+CD+ER+PF, which were all introduced with Knights Landing.
So the code checks all these bits at once; if all are set, then we are on Knights Landing:
uint32_t avx2_bmi12_mask = (1 << 16) | // AVX-512F
                           (1 << 26) | // AVX-512PF
                           (1 << 27) | // AVX-512ER
                           (1 << 28);  // AVX-512CD
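For completeness, this mask would be tested against CPUID leaf 7 (EAX=7, ECX=0), whose feature flags come back in EBX. A sketch with GCC-style inline asm (the function name is made up, and real code should first confirm that leaf 7 is supported):

#include <cstdint>

bool hasKnlFeatureBits()
{
    uint32_t eax = 7, ebx = 0, ecx = 0, edx = 0;
    __asm__ __volatile__("cpuid"
                         : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
    const uint32_t mask = (1u << 16) | (1u << 26) | (1u << 27) | (1u << 28);
    return (ebx & mask) == mask;
}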
It would also be good to know whether Intel plans to add all these bits to plain Xeon (non-Phi) or Core processors in the near future, so that they too would support the AVX-512F+CD+ER+PF features introduced with Knights Landing.
If Xeon and Core processors ever do support AVX-512F+CD+ER+PF, we won't be able to distinguish Xeon from Xeon Phi this way.
Please advise.
If you specifically want to check for being on a KNL (rather than the more general "does the CPU I am running on have feature X?"), you can do that by looking at the "Extended Family", "Family" and "Model" fields in %eax after calling cpuid with %eax == 1 and %ecx == 0. C++ code like that below will do the job.
However, as others have implicitly pointed out, this is a very specific test and will, for instance, fail on future Knights cores, so you would likely be better off doing as has been suggested and checking for AVX-512 features that are not in Xeon, i.e. AVX-512ER and AVX-512PF. (Of course, such instructions could appear in future Xeons, so this is not guaranteed in the long term, but, quoting Keynes: "In the long run we are all dead" :-))
#include <cstdint>

class cpuidState
{
    uint32_t orig_eax;   /* Values sent in to the cpuid instruction */
    uint32_t orig_ecx;
    uint32_t eax;        /* Values received back from it. */
    uint32_t ebx;
    uint32_t ecx;
    uint32_t edx;

    void cpuid()
    {
        __asm__ __volatile__("cpuid"
                             : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx));
    }

    void update(uint32_t eaxVal, uint32_t ecxVal)
    {
        orig_eax = eaxVal;
        orig_ecx = ecxVal;
        eax = eaxVal;
        ecx = ecxVal;
        cpuid();
    }

    void ensureCorrectLeaf(uint32_t eaxVal, uint32_t ecxVal)
    {
        if (orig_eax != eaxVal || orig_ecx != ecxVal)
            update(eaxVal, ecxVal);
    }

public:
    cpuidState() : orig_eax(-1), orig_ecx(-1) { }

    // Include the Extended Model in the test. Without it we see some Xeons as KNL :-(
    bool onKNL() { ensureCorrectLeaf(1, 0); return (eax & 0x0f0ff0) == 0x50670; }
};
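Usage is then simply (a sketch):

cpuidState cpu;
if (cpu.onKNL()) {
    // clear the ymm registers you used individually with vpxor; skip vzeroall
} else {
    // use vzeroall / vzeroupper as usual
}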

x86 Assembly: Writing a Program to Test Memory Functionality for Entire 1MB of Memory

Goal:
I need to write a program that tests the write functionality of an entire 1MB of memory on a byte-by-byte basis for a system using an Intel 80186 microprocessor. In other words, I need to write a 0 to every byte in memory and then check whether a 0 was actually written. I then need to repeat the process using a value of 1. Finally, the addresses of any memory locations that did not successfully have a 0 or 1 written to them during the respective write operations need to be stored on the stack.
Discussion:
I am an Electrical Engineering student in college (not Computer Science) and am relatively new to x86 assembly language and MASM 6.11. I am not looking for a complete solution. However, I am going to need some guidance.
Earlier in the semester, I wrote a program that filled a portion of memory with 0's. I believe that this will be a good starting point for my current project.
Source Code For Early Program:
;****************************************************************************
;Program Name: Zeros
;File Name: PROJ01.ASM
;DATE: 09/16/14
;FUNCTION: FILL A MEMORY SEGMENT WITH ZEROS
;HISTORY:
;AUTHOR(S):
;****************************************************************************
NAME ZEROS
MYDATA SEGMENT
MYDATA ENDS
MYSTACK SEGMENT STACK
DB 0FFH DUP(?)
End_Of_Stack LABEL BYTE
MYSTACK ENDS
ASSUME SS:MYSTACK, DS:MYDATA, CS:MYCODE
MYCODE SEGMENT
START: MOV AX, MYSTACK
MOV SS, AX
MOV SP, OFFSET End_Of_Stack
MOV AX, MYDATA
MOV DS, AX
MOV AX, 0FFFFh ;Moves a Hex value of 65535 into AX
MOV BX, 0000h ;Moves a Hex value of 0 into BX
CALL Zero_fill ;Calls procedure Zero_fill
MOV AX, 4C00H ;Performs a clean exit
INT 21H
Zero_fill PROC NEAR ;Declares procedure Zero_fill with near directive
MOV DX, 0000h ;Moves 0H into DX
MOV CX, 0000h ;Moves 0H into CX. This will act as a counter.
Start_Repeat: INC CX ;Increments CX by 1
MOV [BX], DX ;Moves the contents of DX to the memory address of BX
INC BX ;Increments BX by 1
CMP CX, 10000h ;Compares the value of CX with 10000H. If equal, Z-flag set to one.
JNE Start_Repeat ;Jumps to Start_Repeat if CX does not equal 10000H.
RET ;Removes 16-bit value from stack and puts it in IP
Zero_fill ENDP ;Ends procedure Zero_fill
MYCODE ENDS
END START
Requirements:
1. Employ explicit segment structure.
2. Use the ES:DI register pair to address the test memory area.
3. Non-destructive access: before testing each memory location, I need to store the original contents of the byte, which need to be restored after testing is complete.
4. I need to store the addresses of any memory locations that fail the test on the stack.
5. I need to determine the highest RAM location.
Plan:
1. In a loop: Write 0000H to memory location, Check value at that mem location, PUSH values of ES and DI to the stack if check fails.
2. In a loop: Write FFFFH to memory location, Check value at that mem location, PUSH values of ES and DI to the stack if check fails.
Source Code Implementing Preliminary Plan:
;****************************************************************************
;Program Name: Memory Test
;File Name: M_TEST.ASM
;DATE: 10/7/14
;FUNCTION: Test operational status of each byte of memory between a starting
; location and an ending location
;HISTORY: Template code from Assembly Project 1
;AUTHOR(S):
;****************************************************************************
NAME M_TEST
MYDATA SEGMENT
MYDATA ENDS
MYSTACK SEGMENT STACK
DB 0FFH DUP(?)
End_Of_Stack LABEL BYTE
MYSTACK ENDS
ESTACK SEGMENT COMMON
ESTACK ENDS
ASSUME SS:MYSTACK, DS:MYDATA, CS:MYCODE, ES:ESTACK
MYCODE SEGMENT
START: MOV AX, MYSTACK
MOV SS, AX
MOV SP, OFFSET End_Of_Stack
MOV AX, MYDATA
MOV DS, AX
MOV AX, 0FFFFH ;Moves a Hex value of 65535 into AX
MOV BX, 0000H ;Moves a Hex value of 0 into BX
CALL M_TEST ;Calls procedure M_TEST
MOV AX, 4C00H ;Performs a clean exit
INT 21H
M_TEST PROC NEAR ;Declares procedure M_TEST with near directive
MOV DX, 0000H ;Fill DX with 0's
MOV AX, 0FFFFH ;Fill AX with 1's
MOV CX, 0000H ;Moves 0H into CX. This will act as a counter.
Start_Repeat: MOV [BX], DX ;Moves the contents of DX to the memory address of BX
CMP [BX], 0000H ;Compare value at memory location [BX] with 0H. If equal, Z-flag set to one.
JNE SAVE ;IF Z-Flag NOT EQUAL TO 0, Jump TO SAVE
MOV [BX], AX ;Moves the contents of AX to the memory address of BX
CMP [BX], 0FFFFH ;Compare value at memory location [BX] with FFFFH. If equal, Z-flag set to one.
JNE SAVE ;IF Z-Flag NOT EQUAL TO 0, Jump TO SAVE
INC CX ;Increments CX by 1
INC BX ;Increments BX by 1
CMP CX, 10000H ;Compares the value of CX with 10000H. If equal, Z-flag set to one.
JNE Start_Repeat ;Jumps to Start_Repeat if CX does not equal 10000H.
SAVE: PUSH ES
PUSH DI
RET ;Removes 16-bit value from stack and puts it in IP
M_TEST ENDP ;Ends procedure M_TEST
MYCODE ENDS
END START
My commenting might not be accurate.
Questions:
1. How do I use ES:DI to address the test memory area?
2. What is the best way to hold on to the initial memory value so that I can replace it when I'm done testing a specific memory location? I believe registers AX - DX are already in use.
Also, if I have updated code and questions, should I post it on this same thread, or should I create a new post with a link to this one?
Any other advice would be greatly appreciated.
Thanks in advance.
How do I use ES:DI to address the test memory area?
E.g. mov al, es:[di]
What is the best way to hold on to the initial memory value so that I can replace it when I'm done testing a specific memory location? I believe registers AX - DX are already in use.
Right. You could use al to store the original value and have 0 and 1 pre-loaded in bl and cl and then do something like this (off the top of my head):
mov al, es:[di]   ; load/save original value
mov es:[di], bl   ; store zero
cmp bl, es:[di]   ; check that it sticks
jne pushbad       ; jump if it didn't
mov es:[di], cl   ; same for 'one'
cmp cl, es:[di]
jne pushbad
mov es:[di], al   ; restore original value
jmp nextAddr
pushbad:
mov es:[di], al   ; restore original value (may be redundant as the mem is bad)
push es
push di
nextAddr:
...
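Since the goal is the full 1 MB, keep the real-mode address arithmetic in mind: physical = segment * 16 + offset, so you can let DI sweep a 64 KiB block and then bump ES for the next block. A quick illustration in C++ (just the arithmetic, not part of the assembly program):

#include <cstdint>

// Real-mode 8086/80186 addressing forms a 20-bit physical address.
uint32_t physical(uint16_t seg, uint16_t off)
{
    return (static_cast<uint32_t>(seg) << 4) + off;
}
// To sweep 1 MiB with ES:DI, let DI run 0x0000..0xFFFF, then add
// 0x1000 to ES for the next 64 KiB block (16 blocks in total).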
A few words about how to also test the memory locations occupied by our own routine: we can copy our routine into the framebuffer of the display device and run it from there.
Note: If we want to store to, or compare, a memory location with an immediate value, then we have to specify how many bytes we want to access. (In contrast, when a register is the source or the target, the assembler already knows its size, so we do not need to specify it.)
Accessing one byte at one address (with an immediate value):
CMP BYTE[BX], 0 ; with NASM (Netwide Assembler)
MOV BYTE[BX], 0
CMP BYTE PTR[BX], 0 ; with MASM (Microsoft Macro Assembler)
MOV BYTE PTR[BX], 0
Accessing two bytes at two addresses together
(this executes faster if the target address is even, i.e. word-aligned):
CMP WORD[BX], 0 ; with NASM
MOV WORD[BX], 0
CMP WORD PTR[BX], 0 ; with MASM
MOV WORD PTR[BX], 0
If you start with the assumption that any location in RAM might be faulty, then you can't use RAM to store your code or your data. This includes temporary usage: for example, you can't temporarily store your code in RAM and then copy it to display memory, because you risk copying corrupted code from RAM to display memory.
With this in mind, the only case where this makes sense is code in ROM testing the RAM, e.g. during the firmware's POST (Power On Self Test). Furthermore, this means you can't use the stack at all: not for keeping track of faulty areas, and not even for calling functions/routines.
Note that you might assume that you can test a small area (e.g. find the first 1 KiB that isn't faulty) and then use that RAM for storing results, etc. This would be a false assumption.
RAM faults have many causes. The first set of causes is "open connection" and "shorted connection" on either the address bus or the data bus. For a simple example, if address line 12 happens to be an open circuit, the end result will be that the first 4 KiB always has identical contents to the second 4 KiB of RAM. You can test the first 4 KiB of RAM as much as you like and decide it's "good", but then when you test the second 4 KiB of RAM you trash the contents of the first 4 KiB.
There is a "smart sequence" of tests, sketched below. Specifically, test the address lines from highest to lowest (e.g. write different values to 0x000000 and 0x800000 and check that they're both correct; then do the same for 0x000000 and 0x400000, then 0x000000 and 0x200000, and so on, until you get to addresses 0x000000 and 0x000001). However, the way RAM chips are connected to the CPU is not necessarily a simple direct mapping. For example, maybe the highest bit of the address selects the bank of RAM, in which case you'd have to test both 0x000000 and 0x400000 and also 0x800000 and 0xC00000 to cover both banks.
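The idea in C++, as a sketch (this assumes a flat, writable view of the RAM under test, which real POST code running from ROM does not actually have):

#include <cstdint>

bool testAddressLines(volatile uint8_t *base, uint32_t size)
{
    // Walk the address lines from highest to lowest.
    for (uint32_t bit = size >> 1; bit != 0; bit >>= 1) {
        base[0]   = 0x55;      // value at the low address
        base[bit] = 0xAA;      // a different value one address line up
        if (base[0] != 0x55)   // aliasing reveals an open/shorted line
            return false;
    }
    return true;
}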
Once you're sure the address lines work, you can do similar tests for the data lines and the RAM itself. The most common test is called a "walking ones" test, where you store 0x01, then 0x02, and so on (up to 0x80). This detects things like "sticky bits" (e.g. where a bit's state happens to be "stuck" to its neighbour's state). If you only write (e.g.) 0x00 and test it, then write 0xFF and test it, you will miss most RAM faults.
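A minimal sketch of that byte-level check (same flat-view assumption; the restore matches the question's non-destructive requirement):

#include <cstdint>

bool walkingOnes(volatile uint8_t *p)
{
    const uint8_t saved = *p;            // preserve original contents
    for (unsigned bit = 0x01; bit <= 0x80; bit <<= 1) {
        *p = static_cast<uint8_t>(bit);  // exactly one bit set
        if (*p != bit) {                 // a stuck or shorted bit
            *p = saved;
            return false;
        }
    }
    *p = saved;
    return true;
}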
Also, be very careful with "open connection". On some machines bus capacitance can play tricks on you: you write a value, the bus capacitance "stores" the previous value, and when you read it back it looks correct even though there's no connection. To avoid this risk, you need to write a different value somewhere else in between, e.g. write 0x55 to the address you're testing, then write 0xAA somewhere else, then read the original address back (and hope you get 0x55 because the RAM works, and not 0xAA). With this in mind (and for performance), you might do "walking ones" in one area of RAM while doing "walking zeros" in the next area, so that you're always alternating between reading a value from one area and the inverted value from the other.
Finally, some RAM problems depend on noise, temperature, etc. In these cases you can do extremely thorough RAM tests, declare it all perfect, and then suffer RAM corruption two minutes later. This is why the typical advice is to run something like "memtest" for 8 hours or so if you really want to test RAM properly.

ELEVENWORDINLINE when to use it?

I was always wondering what I can do with things like these:
ONEWORDINLINE(w1)
TWOWORDINLINE(w1, w2)
THREEWORDINLINE(w1, w2, w3)
up to
TENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
ELEVENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11)
TWELVEWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12)
How do I use these macros?
When should I use them?
And why from 1 to 12, and
not up to, say, 100?
That is long-dead technology of the ancient Mac programmers, sentry to a tomb best left untouched. But, if you're interested, off we go, brave adventurer!
Here are the relevant #defines straight from Apple:
#define ONEWORDINLINE(w1) = w1
#define TWOWORDINLINE(w1,w2) = {w1,w2}
#define THREEWORDINLINE(w1,w2,w3) = {w1,w2,w3}
/* ...etc... */
#define TWELVEWORDINLINE(w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12) = {w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12}
Now to some explaining.
A little history lesson: back in the ancient days when the Mac was implemented on the (just as ancient) Motorola 68k, Apple set up its system calls in the following highly compact and very usable way: it mapped them to words starting with 1010b (0xA), which the Motorola devs had reserved for this use. These sys-calls and their mappings were called A-Traps, after that hex value (no relation to "IT'S A TRAP!", honestly). In hex they looked like this: 0xA869 (this example is the A-Trap for the FixRatio(short numer, short denom) system call). This technology was originally created for the Mac Toolbox API.
When a Mac-on-68k-targeting compiler (these should set the TARGET_OS_MAC and TARGET_CPU_68K macros to 1 or TRUE and TARGET_RT_MAC_CFM to 0 or FALSE, BTW) saw a function prototype with an assignment (=) after it, it treated the prototype as referring to an A-Trap system call, indicated by the A-Trap value to the right of the assignment operator, which could be a single one-word integer literal starting with 0xA (0xA???). So, ONEWORDINLINE was basically a stylish macro way of saying "it's an A-Trap!".
So, here's what a sys-call function prototype declaration for 68k would look like:
EXTERN_API(Fixed) FixRatio(short numer, short denom) ONEWORDINLINE(0xA869);
This would be preprocessed to something like this:
extern Fixed FixRatio(short numer, short denom) = 0xA869;
Now, you might be thinking: if we're indexing system calls by one word, and one-fourth of that word is taken up by the 0xA prefix, that allows only 4096 functions at maximum (there were far fewer in reality, as many A-Traps actually mapped to the same system-call subroutines, only with different parameters); how is that enough? Well, obviously, it isn't. That's where selectors come in.
A-Traps like _HFSDispatch (0xA260) were called "selectors" because they had the job of selecting and calling another subroutine determined by values on the stack. So, when a function prototype was "assigned" an "array" of one-word integer literals, all but the last one (called the selector code(s)) got pushed onto the stack, and the last one was treated as an A-Trap selector that grabbed the words pushed onto the stack and called the appropriate subroutine. The maximum number of words in such an array was 12 because that was enough for the Mac Toolbox.
The macros TWOWORDINLINE through TWELVEWORDINLINE handled selector A-Traps. For example:
EXTERN_API(OSErr) ActivateTSMDocument(TSMDocumentID idocID) TWOWORDINLINE(0x7002, 0xAA54);
would be preprocessed to something like
extern OSErr ActivateTSMDocument(TSMDocumentID idocID) = {0x7002, 0xAA54};
Here, 0x7002 is the selector code, and 0xAA54 is the selector A-Trap.
So, to sum it all up, you only need these if you want to do some coding for pre-1994 Macs running on a Motorola 68k. So iOS isn't really in place here ;)
Disclaimer: I know this stuff in theory only and I may have made a mistake somewhere. If there are any old-timers experienced with this stuff, please correct me if I got something wrong!

How to Switch Xcode 4.2's iOS disassembly from Thumb to ARM?

My iOS app is built with the Apple LLVM 3.0 compiler in Thumb mode. For armv7, I'm pretty sure that's actually Thumb-2.
I'm reimplementing my two most time-consuming functions in ARM assembly code. The callers of these functions are Thumb, so I use Thumb-to-ARM interworking instructions to switch to ARM in the prologue of my functions, giving me access to ARM's richer instruction set and greater number of registers. At function exit I use ARM-to-Thumb interworking to return to Thumb mode.
GDB's disassembly is correct for the Thumb code, but when I am in ARM mode, it disassembles the ARM instructions as if each one were a pair of completely nonsensical Thumb instructions. Is there some way I can tell GDB to switch to ARM disassembly, then upon returning to Thumb code, use the Thumb disassembler?
Google is no help. There are apparently other forks of GDB that can do that, but I haven't figured out a way to do it with GDB.
LLDB apparently supports ARM debugging, but it does not yet work on iOS devices in Xcode 4.2. When I choose the LLDB debugger in Product -> Edit Scheme, then set a breakpoint in my code, my App hangs before hitting the breakpoint.
It's been a long time since I've done any assembly of any sort, so I am brushing up on the ARM calling conventions by implementing functions that take various parameters and return various types of results in both C and ARM assembly. The lower_case functions are C and the CamelCase functions are assembly. I call abiTest as the very first thing from main(), and use assert() to ensure that it returns YES.
BOOL abiTest( void )
{
    void_no_args();
    VoidNoArgs();

    if ( 42 != int_no_args() )
        return NO;
    if ( 42 != IntNoArgs() )
        return NO;

    return YES;
}
Here is the source for IntNoArgs. .thumb_func is a directive for the linker; my research seems to indicate that you want it even for ARM functions if you mix the two types of code.
    .globl  _IntNoArgs
    .align  1
    .code   16
    .thumb_func _IntNoArgs
_IntNoArgs:
    @ int IntNoArgs( void );
    .loc 1 __LINE__ 0
    adr     r0, Larm1       @ Larm1 is a PC-relative address; r0's low bit will be cleared
    bx      r0              @ Switch to ARM mode, then branch to Larm1 (the next instruction)
    .align  2
    .code   32
Larm1:
    stmfd   sp!, { r7, lr }
    mov     r0, #42
    ldmfd   sp!, { r7, lr }
    bx      lr
Here is how GDB disassembles _IntNoArgs. The first two lines are correct; the remainder are completely wrong:
0x000172c8 <+0000> add r0, pc, #0 (adr r0, 0x172cc <VoidNoArgs+4>)
0x000172ca <+0002> bx r0
0x000172cc <+0004> lsls r0, r0
0x000172ce <+0006> stmdb sp!, {r1, r3, r5}
0x000172d2 <+0010> b.n 0x17a16
0x000172d4 <+0012> lsls r0, r0
0x000172d6 <+0014> ldmia.w sp!, {r1, r2, r3, r4, r8, r9, r10, r11, r12, sp, lr, pc}
The disassembly stops here because the ldmia.w instruction appears to put a new value into the program counter after taking it from the stack, thereby returning from the subroutine. After I step over this instruction with "si", the disassembly pane shows:
0x000172d8 <+0016> vrhadd.u16 d14, d14, d31
The si command always does the right thing, advancing just one instruction whether we are in Thumb or ARM mode. So GDB must know the current instruction set architecture; it's just that the disassembler is not getting that information.
There is a bit in one of the ARM's registers (the T bit in the CPSR) that indicates the current mode. Some forks of GDB can use the value of that bit when deciding which ISA to disassemble for, but apparently not Xcode 4.2's GDB.
Xcode's GDB has a command "set arm disassembler", with a corresponding "show arm disassembler", that looks like it would help, but it doesn't. That is apparently meant to support ARM variants other than the ones iOS devices use.
"set fallback-mode" can be set to arm, thumb or auto in other forks of GDB, but not Xcode's. Ditto for "set disassembler-flavor".
What I would REALLY REALLY REALLY like is a machine-level debugger that works just like MacsBug did on the Classic Mac OS. While GDB is generally capable of assembly debugging, it totally sucks for that purpose. That's not anyone's fault, really: it is designed for source-level debugging. A good assembly debugger is designed for that from the ground up.
The ABI function call guide states that on iOS, switching between ARM and Thumb mode can be done only at function boundaries. Make sure each of your functions is ARM-only or Thumb-only, and the debugger will work fine.

How slow is NaN arithmetic in the Intel x64 FPU?

Hints and allegations abound that arithmetic with NaNs can be 'slow' in hardware FPUs. Specifically in a modern x64 FPU, e.g. on a Nehalem i7, is that still true? Do FPU multiplies get churned out at the same speed regardless of the values of the operands?
I have some interpolation code that can wander off the edge of our defined data, and I'm trying to determine whether it's faster to check for NaNs (or some other sentinel value) here, there and everywhere, or just at convenient points.
Yes, I will benchmark my particular case (it could be dominated by something else entirely, like memory bandwidth), but I was surprised not to see a concise summary somewhere to help with my intuition.
I'll be doing this from the CLR, if it makes a difference as to the flavor of NaNs generated.
For what it's worth, using the SSE instruction mulsd with NaN is pretty much exactly as fast as with the constant 4.0 (chosen by a fair dice roll, guaranteed to be random).
This code:
for (unsigned i = 0; i < 2000000000; i++)
{
double j = doubleValue * i;
}
generates this machine code (inside the loop) with clang (I assume the .NET virtual machine also uses SSE instructions when it can):
movsd -16(%rbp), %xmm0 ; gets the constant (NaN or 4.0) into xmm0
movl -20(%rbp), %eax ; puts i into a register
cvtsi2sdq %rax, %xmm1 ; converts i to a double and puts it in xmm1
mulsd %xmm0, %xmm1 ; multiplies xmm0 (the constant) with xmm1 (i)
movsd %xmm1, -32(%rbp) ; puts the result somewhere on the stack
And with two billion iterations, the NaN version (as defined by the C macro NAN from <math.h>) took about 0.017 seconds less to execute on my i7. The difference was probably caused by the task scheduler.
So, to be fair, they're exactly as fast.
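If you want to reproduce the measurement, a self-contained harness along these lines will do (a sketch; the volatile reads and writes keep the compiler from deleting the loop, mirroring the unoptimized code above):

#include <chrono>
#include <cmath>
#include <cstdio>

int main()
{
    volatile double value = NAN;   // swap in 4.0 for the comparison run
    volatile double sink  = 0.0;
    const auto t0 = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < 2000000000u; i++)
        sink = value * i;          // one mulsd per iteration
    const auto t1 = std::chrono::steady_clock::now();
    std::printf("%f s (sink=%f)\n",
                std::chrono::duration<double>(t1 - t0).count(),
                static_cast<double>(sink));
}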
