I have seen this code:
procedure FillDWord(var Dest; Count, What: dword); assembler ;
asm
PUSH EDI
MOV EDI, Dest
MOV EAX, What
MOV ECX, Count
CLD
REP STOSD
POP EDI
end;
I googled CLD and it says it clears the direction flag. Is it important here? After I removed it, the function still seems to work fine.
The direction flag controls whether the EDI register is incremented or decremented during the execution of REP STOSD.
In case of a cleared direction flag (e.g. after execution of CLD) the pointer will be incremented, so the function does a memory fill.
The CLD is in this code because the programmer probably was not able to guarantee that the direction flag was cleared. Therefore he made sure that it is cleared before executing REP STOSD.
If the code works when CLD is removed, then the direction flag was clear at the entry of the function. Since the direction flag is not part of the calling conventions that was just by luck. It could be the other way next time, and in this case your program will very likely crash.
Clearing/setting the flag is a very fast operation, so it's good practice to add them to the assembler code. This also makes it easy for other programmers to understand your function because the state of the direction flag is explicitly defined.
The stosd instruction can work either up the memory, incrementing EDI, or down the memory, decrementing it. This depends on the value of the direction ("D") flag. If the flag is set to 1 upon function entry and never explicitly cleared, the function will misbehave wildly. There's no convention on the default value of that flag, so the function plays it safe.
EDIT: Egor says Delphi has a convention :) Still, better safe than sorry.
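To make the effect of the flag concrete, here is a toy model of REP STOSD in Python. It is only a sketch: the function name rep_stosd and the bytearray standing in for memory are my own assumptions, not anything from the code in the question.

```python
import struct

def rep_stosd(mem, edi, eax, ecx, df):
    """Toy model of REP STOSD: store EAX at [EDI], ECX times.

    df is the direction flag: 0 moves EDI up by 4 after each store,
    1 moves it down by 4. Returns the final value of EDI.
    """
    step = -4 if df else 4
    while ecx:
        struct.pack_into("<I", mem, edi, eax)  # little-endian 32-bit store
        edi += step
        ecx -= 1
    return edi

# Same call, different direction flag.
mem_up = bytearray(16)
end_up = rep_stosd(mem_up, edi=4, eax=0xAABBCCDD, ecx=2, df=0)

mem_down = bytearray(16)
end_down = rep_stosd(mem_down, edi=4, eax=0xAABBCCDD, ecx=2, df=1)
```

With df=0 the two stores land at offsets 4 and 8, which is the intended fill. With df=1 they land at offsets 4 and 0, so the second store overwrites memory in front of the buffer, which is the kind of corruption CLD guards against.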
I'm studying computer architecture at my university, and I suspect I'm missing some basic concepts of computer systems and the C language. A few things really confuse me. I kept searching but couldn't find the answer I wanted, and it only confused me more, so I'm asking here.
1. I thought a register holds an instruction, a storage address, or any other kind of data inside the CPU. I also learned the memory layout:
------------------
stack
dynamic data
static data
text
reserved part
------------------
Does a register have this memory layout inside the CPU? Or am I confusing it with the layout of the memory component, one of the computer's five components (input, output, memory, control, datapath)? I thought this was the layout of one of those five components.
RISC-V (while loop in C)
Loop:
slli x10, x22, 3
add x10, x10, x25
ld x9, 0(x10)
bne x9, x24, Exit
addi x22, x22, 1
beq x0, x0, Loop
Exit:...
Then where does this operation happen? In registers?
I learned RISC-V Registers like below.
x0: the constant value 0
x1: return address
...
x5-x7, x28-x31: temporaries
...
If the registers are in the memory layout I drew above, then where are x0, x1, and so on contained? It stops making sense from here, so I'm confused about how I should picture a register.
Everything is so abstract in my mind that I guess the question sounds a bit weird. If anything is unclear, please comment.
Then register is having this memory layout in CPU?
No, that makes zero sense, your thinking is on the wrong track here.
The register file is its own separate space, not part of memory address space. It's not indexable with a variable, only by hard-coding register numbers into instructions, so there's not really any sense in which x2 is the "next register after x1" or anything. e.g. you can't loop over registers. They're just two separate 32 or 64-bit data-storage spaces that software can use however they want.
The natural categories to break them up are based on software / calling conventions:
stack pointer
call-preserved registers (function calls don't modify them, or conversely if you want to use one in a function you have to save/restore it)
call-clobbered registers (function calls must be assumed to step on them, and conversely can be used without saving/restoring)
the zero register.
Also arg-passing vs. return-value registers.
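As a rough illustration of the split between the register file and memory, the loop from the question can be modeled in Python. This is a toy sketch under my own assumptions: run_loop is a made-up name, the register file is a plain 32-element list, and memory is a dict keyed by byte address.

```python
def run_loop(memory, base, k):
    """Toy model of the RISC-V loop: while (save[i] == k) i += 1."""
    x = [0] * 32      # register file: 32 slots inside the CPU, no addresses
    x[25] = base      # x25 holds the address of save[0]
    x[24] = k         # x24 holds k
    x[22] = 0         # x22 holds i
    while True:
        x[10] = x[22] << 3        # slli x10, x22, 3   (i * 8)
        x[10] = x[10] + x[25]     # add  x10, x10, x25 (address of save[i])
        x[9] = memory[x[10]]      # ld   x9, 0(x10)    (the only memory access)
        if x[9] != x[24]:         # bne  x9, x24, Exit
            break
        x[22] = x[22] + 1         # addi x22, x22, 1
    return x[22]                  # beq x0, x0, Loop is the `while True`

# Memory is a separate space, indexed by address; the array lives there.
memory = {0x1000 + 8 * i: v for i, v in enumerate([7, 7, 7, 3])}
final_i = run_loop(memory, base=0x1000, k=7)
```

Every arithmetic step reads and writes entries of x, the register file. Only the ld line touches memory, and that is where the stack / static data / text layout applies.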
I wrote this function to round singles to integers:
function Round(const Val: Single): Integer;
begin
asm
cvtss2si eax,Val
mov Result,eax
end;
end;
It works, but I need to change the rounding mode. Apparently, per this, I need to set the MXCSR register.
How do I do this in Delphi?
The reason I am doing this in the first place is I need "away-from-zero" rounding (like in C#), which is not possible even via SetRoundingMode.
On modern Delphi, to set MXCSR you can call SetMXCSR from the System unit. To read the current value use GetMXCSR.
Do beware that SetMXCSR, just like Set8087CW is not thread-safe. Despite my efforts to persuade Embarcadero to change this, it seems that this particular design flaw will remain with us forever.
On older versions of Delphi you use the LDMXCSR and STMXCSR opcodes. You might write your own versions like this:
function GetMXCSR: LongWord;
asm
PUSH EAX
STMXCSR [ESP].DWord
POP EAX
end;
procedure SetMXCSR(NewMXCSR: LongWord);
//thread-safe version that does not abuse the global variable DefaultMXCSR
var
MXCSR: LongWord;
asm
AND EAX, $FFC0 // Remove flag bits
MOV MXCSR, EAX
LDMXCSR MXCSR
end;
These versions are thread-safe and I hope will compile and work on older Delphi versions.
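For reference, the bit manipulation involved can be sketched in Python. The helper names below are hypothetical. The layout assumed is the documented one: exception flags in bits 0-5 (what the AND with $FFC0 clears) and the rounding-control field in bits 13-14.

```python
FLAG_MASK = 0x003F      # bits 0-5: sticky exception flags
RC_MASK = 0x6000        # bits 13-14: rounding control
RC_SHIFT = 13

# Rounding-control encodings
RC_NEAREST, RC_DOWN, RC_UP, RC_TRUNCATE = 0, 1, 2, 3

def clear_exception_flags(mxcsr):
    """Same effect as AND EAX, $FFC0 on a 16-bit MXCSR value."""
    return mxcsr & 0xFFC0

def rounding_mode(mxcsr):
    """Extract the rounding-control field."""
    return (mxcsr & RC_MASK) >> RC_SHIFT

def with_rounding_mode(mxcsr, mode):
    """Return mxcsr with the rounding-control field replaced by mode."""
    return (mxcsr & ~RC_MASK) | ((mode & 3) << RC_SHIFT)

default_mxcsr = 0x1F80  # power-on default: all exceptions masked, round to nearest
truncating = with_rounding_mode(default_mxcsr, RC_TRUNCATE)
```

The value you would pass to SetMXCSR is the result of with_rounding_mode applied to whatever GetMXCSR returned, so that the mask bits stay as they were.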
Do note that using the name Round for your function is likely to cause a lot of confusion. I would advise that you do not do that.
Finally, I checked in the Intel documentation and both of the Intel floating point units (x87, SSE) offer just the rounding modes specified by the IEEE754 standard. They are:
Round to nearest (even)
Round down (toward −∞)
Round up (toward +∞)
Round toward zero (Truncate)
So, your desired rounding mode is not available.
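Since the hardware does not offer it, away-from-zero rounding has to be done in software. A minimal sketch in Python (the function name is my own, and this is one possible approach rather than the only one):

```python
import math

def round_away_from_zero(x):
    """Round half away from zero: 2.5 -> 3, -2.5 -> -3."""
    # Work on the magnitude, round half up, then restore the sign.
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

samples = [2.5, -2.5, 0.5, 1.4, -1.6]
rounded = [round_away_from_zero(v) for v in samples]
```

One caveat: abs(x) + 0.5 is itself a floating-point addition, so inputs a hair below a half (such as the largest double under 0.5) can be pushed up to the next integer. If that matters, compare the fractional part explicitly instead of adding 0.5.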
Here I have a few lines of delphi code running in delphi 7:
var
ptr:Pointer ;
begin
ptr:=AllocMem(40);
ptr:=Pchar('OneNationUnderGod');
if ptr<>nil then
FreeMem(ptr);
end;
Upon running this code snippet, FreeMem(ptr) will raise an error: 'invalid pointer operation'. If I delete the statement:
ptr:=Pchar('OneNationUnderGod');
then no error occurs. Now I have two questions:
1. Why is this happening? 2. If I have to use the PChar statement, how should I free the memory allocated earlier?
Much appreciation for your help!
The problem is that you are modifying the address held in the variable ptr.
You call AllocMem to allocate a buffer, which you refer to using ptr. That much is fine. But you must never change the value of ptr, the address of the buffer. And you do change it.
You wrote:
ptr:=AllocMem(40);
ptr:=Pchar('OneNationUnderGod');
The second line is the problem. You have modified ptr and now ptr refers to something else (a string literal held in read-only memory as it happens). You have now lost track of the buffer allocated by your call to AllocMem. You asked AllocMem for a new block of memory and then immediately discarded that block of memory.
What you presumably mean to do is to copy the string. Perhaps like this:
ptr := AllocMem(40);
StrCopy(ptr, 'OneNationUnderGod');
Now we are fine to call FreeMem, because ptr still contains the address that the call to AllocMem provided.
ptr := AllocMem(40);
try
StrCopy(ptr, 'OneNationUnderGod');
// do stuff with ptr
finally
FreeMem(ptr);
end;
Clearly in real code you would find a better and more robust way to specify buffer lengths than a hard-coded value.
In your code, assuming the above fix is applied, the test for ptr being nil is needless. AllocMem never returns nil. Failure of AllocMem results in an exception being raised.
Having said all that, it's not usual to operate on string buffers in this way. It is normal to use Delphi strings. If you need a PChar, for instance to use with interop, make one with PChar(str) where str is of type string.
You say that you must use dynamically allocated PChar buffers. Perhaps that is so, but I very much doubt it.
It crashes because you are freeing static memory that was not dynamically allocated. There is no need to free memory used by literals at all. Only free memory that is dynamically allocated.
I'd try to make what you've done more explicit. You seem to mistake the name of a variable for its value. Considering the values, what you actually did was
ptrA:=AllocMem(40);
ptrB:=Pchar('OneNationUnderGod');
if ptrB<>nil then
FreeMem(ptrB);
Every new assignment changes the value, overwriting the previous one, so you are freeing a different pointer from the one that was allocated.
You may read documentation for functions like StrNew, StrDispose, StrCopy, StrLCopy and sample codes with those to see some patterns of working with PChar strings.
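The bookkeeping behind the error can be modeled with a toy allocator in Python. Everything here is hypothetical (ToyHeap, the addresses, the error text borrowed from the question). It only illustrates that a heap can free nothing but the addresses it handed out.

```python
class ToyHeap:
    """Toy allocator: remembers which addresses it handed out."""

    def __init__(self):
        self._next = 0x1000
        self.live = set()

    def alloc(self, size):
        addr = self._next
        self._next += size
        self.live.add(addr)
        return addr

    def free(self, addr):
        # An address the heap never returned cannot be freed.
        if addr not in self.live:
            raise ValueError("invalid pointer operation")
        self.live.remove(addr)

heap = ToyHeap()
LITERAL_ADDR = 0x400000   # stands in for the address of the string literal

ptr = heap.alloc(40)      # ptr := AllocMem(40)
ptr = LITERAL_ADDR        # ptr := PChar('...') overwrites the only copy
try:
    heap.free(ptr)
    outcome = "freed"
except ValueError as exc:
    outcome = str(exc)
leaked_blocks = len(heap.live)   # the 40-byte block can no longer be freed
```

The free fails because ptr no longer holds an address the heap knows about, and the original block is leaked because nothing holds its address any more.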
I included the iOS tag, but I'm running in the simulator on a Core i7 MacBook Pro (x86-64, right?), so I think that's immaterial.
I'm currently debugging a crash in Flurry's video ads. I have a breakpoint set on Objective-C exceptions. When the breakpoint is hit I am in objc_msgSend. The callstack contains a mix of private Flurry and iOS methods, nothing public and nothing that I've written. Calling register read from the objc_msgSend stack frame outputs the following:
(lldb) register read
General Purpose Registers:
eax = 0x1ac082d0
ebx = 0x009600b5 "spaceWillDismiss:interstitial:"
ecx = 0x03e2cddb "makeKeyAndVisible"
edx = 0x0000003f
edi = 0x0097c6f3 "removeWindow"
esi = 0x00781e65 App`-[FlurryAdViewController removeWindow] + 12
ebp = 0xbfffd608
esp = 0xbfffd5e8
ss = 0x00000023
eflags = 0x00010202 App`-[FeedTableCell setupVisibleCommentAndLike] + 1778 at FeedTableCell.m:424
eip = 0x049bd09b libobjc.A.dylib`objc_msgSend + 15
cs = 0x0000001b
ds = 0x00000023
es = 0x00000023
fs = 0x00000000
gs = 0x0000000f
I've got a few questions about this output.
I assumed $ebx contains the selector that caused the crash and $edi is the last executing method. Is that the case?
$eip is where I crashed. Is that usually the case?
$eflags references an instance method that, as far as I know, has nothing to do with this crash. What is that?
Is there any other information I can pry out of these registers?
I can't speak to iOS/Objective-C frame layouts specifically, so I can't answer your question about EBX and EDI. But I can help you regarding EIP and EFLAGS and give you some general hints about ESP/EBP and the selector registers. (By the way, the simulator is simulating a 32-bit x86 environment; you can tell because your registers are 32 bits long.)
EIP is the instruction pointer register, also known as the program counter, which contains the address of the currently executing machine instruction. Thus it will point to where your program crashed, or more generally, where your program is when it hits a breakpoint, dumps core etc.
EIP is saved and restored to implement function calls (at the machine code level -- inlining may result in high-level language calls not performing actual calls). In memory-unsafe languages, a stack buffer overflow can overwrite the saved value of the instruction pointer, causing the return instruction to return to the wrong place. If you're lucky, the overwritten value will trigger a segfault on the next memory fetch, but the value of EIP will be arbitrary and unhelpful in debugging the problem. If you're unlucky, an attacker crafted the new EIP to point to useful code, so many environments use "stack cookies" or "canaries" to detect these overwrites before restoring the saved/overwritten EIP, in which case the EIP value may be useful.
EFLAGS isn't a memory address, and arguably isn't a general purpose register. Each bit of EFLAGS is a flag that can be set or tested by various instructions. The most important flags are the carry, zero and sign flags, which are set by arithmetic instructions and used for conditional branching. Your debugger is misinterpreting it as a memory address and displaying it as the closest function, but that isn't actually related to your crash. (The + 1778 is the giveaway: this means EFLAGS points 1778 bytes into the function, but the function is unlikely to actually be 1778 bytes long.)
ESP is the stack pointer and EBP is (usually) the frame pointer (also called the base pointer). These registers bound the current frame on the call stack. Your debugger usually can show you the values of stack variables and the current call stack based on these pointers. In case of corruption, sometimes you can manually inspect the stack to recover EBP and manually unwind the call stack. Note that code can be compiled without frame pointers (frame pointer omission), freeing EBP for other uses; this is common on x86 because there are so few general-purpose registers.
SS, CS, DS, ES, FS and GS hold segment selectors, used in the bad old days before paging to implement segmentation. Today FS and GS are commonly used by operating systems to locate per-process and per-thread state blocks; they are the only segment registers whose base address is still honored in 64-bit mode on x86-64. The segment registers are generally not helpful for debugging.
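To show that EFLAGS is a set of bits and not an address, here is a small Python sketch that decodes the value from the register dump above. The function name and the table are mine; the bit positions are the standard ones for the flags listed.

```python
# Bit positions of some commonly inspected EFLAGS bits.
EFLAGS_BITS = {
    "CF": 0,    # carry
    "PF": 2,    # parity
    "AF": 4,    # auxiliary carry
    "ZF": 6,    # zero
    "SF": 7,    # sign
    "TF": 8,    # trap (single-step)
    "IF": 9,    # interrupt enable
    "DF": 10,   # direction
    "OF": 11,   # overflow
    "RF": 16,   # resume
}

def decode_eflags(value):
    """Return the names of the flags that are set, lowest bit first."""
    return [name for name, bit in EFLAGS_BITS.items() if value & (1 << bit)]

set_flags = decode_eflags(0x00010202)
```

For 0x00010202 this gives IF and RF. The remaining set bit, bit 1, is a reserved bit that always reads as 1, so nothing in the value points at FeedTableCell.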
I am in need of the fastest hash function possible in Delphi 2009 that will create hashed values from a Unicode string that will distribute fairly randomly into buckets.
I originally started with Gabr's HashOf function from GpStringHash:
function HashOf(const key: string): cardinal;
asm
xor edx,edx { result := 0 }
and eax,eax { test if 0 }
jz @End { skip if nil }
mov ecx,[eax-4] { ecx := string length }
jecxz @End { skip if length = 0 }
@loop: { repeat }
rol edx,2 { edx := (edx shl 2) or (edx shr 30)... }
xor dl,[eax] { ... xor Ord(key[eax]) }
inc eax { inc(eax) }
loop @loop { until ecx = 0 }
@End:
mov eax,edx { result := edx }
end; { HashOf }
But I found that this did not produce good numbers from Unicode strings. I noted that Gabr's routines have not been updated to Delphi 2009.
Then I discovered HashNameMBCS in SysUtils of Delphi 2009 and translated it to this simple function (where "string" is a Delphi 2009 Unicode string):
function HashOf(const key: string): cardinal;
var
I: integer;
begin
Result := 0;
for I := 1 to length(key) do
begin
Result := (Result shl 5) or (Result shr 27);
Result := Result xor Cardinal(key[I]);
end;
end; { HashOf }
I thought this was pretty good until I looked at the CPU window and saw the assembler code it generated:
Process.pas.1649: Result := 0;
0048DEA8 33DB xor ebx,ebx
Process.pas.1650: for I := 1 to length(key) do begin
0048DEAA 8BC6 mov eax,esi
0048DEAC E89734F7FF call $00401348
0048DEB1 85C0 test eax,eax
0048DEB3 7E1C jle $0048ded1
0048DEB5 BA01000000 mov edx,$00000001
Process.pas.1651: Result := (Result shl 5) or (Result shr 27);
0048DEBA 8BCB mov ecx,ebx
0048DEBC C1E105 shl ecx,$05
0048DEBF C1EB1B shr ebx,$1b
0048DEC2 0BCB or ecx,ebx
0048DEC4 8BD9 mov ebx,ecx
Process.pas.1652: Result := Result xor Cardinal(key[I]);
0048DEC6 0FB74C56FE movzx ecx,[esi+edx*2-$02]
0048DECB 33D9 xor ebx,ecx
Process.pas.1653: end;
0048DECD 42 inc edx
Process.pas.1650: for I := 1 to length(key) do begin
0048DECE 48 dec eax
0048DECF 75E9 jnz $0048deba
Process.pas.1654: end; { HashOf }
0048DED1 8BC3 mov eax,ebx
This seems to contain quite a bit more assembler code than Gabr's code.
Speed is of the essence. Is there anything I can do to improve either the pascal code I wrote or the assembler that my code generated?
Followup.
I finally went with the HashOf function based on SysUtils.HashNameMBCS. It seems to give a good hash distribution for Unicode strings, and appears to be quite fast.
Yes, there is a lot of assembler code generated, but the Delphi code that generates it is so simple and uses only bit-shift operations, so it's hard to believe it wouldn't be fast.
ASM output is not a good indication of algorithm speed. Also, from what I can see, the two pieces of code are doing almost identical work. The biggest differences seem to be the memory access strategy and that the first uses rotate-left instead of the equivalent set of instructions (shl | shr -- most higher-level programming languages leave out the rotate operators). The latter may pipeline better than the former.
ASM optimization is black magic and sometimes more instructions execute faster than fewer.
To be sure, benchmark both and pick the winner. If you like the output of the second but the first is faster, plug the second's values into the first.
rol edx,5 { edx := (edx shl 5) or (edx shr 27)... }
Note that different machines will run the code in different ways, so if speed is REALLY of the essence then benchmark it on the hardware that you plan to run the final application on. I'm willing to bet that over megabytes of data the difference will be a matter of milliseconds -- which is far less than the operating system is taking away from you.
PS. I'm not convinced this algorithm creates even distribution, something you explicitly called out (have you run the histograms?). You may look at porting this hash function to Delphi. It may not be as fast as the above algorithm but it appears to be quite fast and also gives good distribution. Again, we're probably talking on the order of milliseconds of difference over megabytes of data.
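For anyone who wants to run the histograms, here is a sketch of the rotate-and-xor hash in Python together with a bucket count. The names rol32, hash_of and bucket_histogram are mine, and the string is walked as UTF-16 code units on the assumption that this matches what key[I] yields in Delphi 2009.

```python
from collections import Counter

def rol32(value, count):
    """Rotate a 32-bit value left; the same as (shl count) or (shr 32-count)."""
    count %= 32
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

def hash_of(key):
    """Rotate-left-by-5 then xor, over the UTF-16 code units of key."""
    data = key.encode("utf-16-le")
    result = 0
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)
        result = rol32(result, 5) ^ unit
    return result

def bucket_histogram(keys, bucket_count):
    """Count how many keys land in each bucket."""
    return Counter(hash_of(k) % bucket_count for k in keys)

keys = ["key%d" % n for n in range(1000)]
histogram = bucket_histogram(keys, 16)
```

Comparing the largest and smallest counts in histogram against len(keys) / 16 gives a quick feel for how even the distribution is on your own key set, which is the thing that matters here.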
We held a nice little contest a while back, improving on a hash called "MurmurHash"; Quoting Wikipedia :
It is noted for being exceptionally fast, often two to four times faster than comparable algorithms such as FNV, Jenkins' lookup3 and Hsieh's SuperFastHash, with excellent distribution, avalanche behavior and overall collision resistance.
You can download the submissions for that contest here.
One thing we learned was that sometimes optimizations don't improve results on every CPU. My contribution was tweaked to run well on AMD, but performed not so well on Intel. The other way around happened too (Intel optimizations running sub-optimally on AMD).
So, as Talljoe said : measure your optimizations, as they might actually be detrimental to your performance!
As a side-note: I don't agree with Lee; Delphi is a nice compiler and all, but sometimes I see it generating code that just isn't optimal (even when compiling with all optimizations turned on). For example, I regularly see it clearing registers that had already been cleared just two or three statements before. Or EAX is put into EBX, only to have it shifted and put back into EAX. That sort of thing. I'm just guessing here, but hand-optimizing that sort of code will surely help in tight spots.
Above all though; First analyze your bottleneck, then see if a better algorithm or datastructure can be used, then try to optimize the pascal code (like: reduce memory-allocations, avoid reference counting, finalization, try/finally, try/except blocks, etc), and then, only as a final resort, optimize the assembly code.
I've written two assembly "optimized" functions in Delphi, or rather implemented known fast hash algorithms in both fine-tuned Pascal and Borland Assembler. The first was an implementation of SuperFastHash, and the second was a MurmurHash2 implementation triggered by a request from Tommi Prami on my blog to translate my C# version to a Pascal implementation. This spawned a discussion continued on the Embarcadero Discussion BASM Forums, that in the end resulted in about 20 implementations (check the latest benchmark suite) which ultimately showed that it would be difficult to select the best implementation due to the big differences in cycle times per instruction between Intel and AMD.
So, try one of those, but remember, getting the fastest every time would probably mean changing the algorithm to a simpler one, which would hurt your distribution. Fine-tuning an implementation takes lots of time, so you had better create a good validation and benchmarking suite to check your implementations.
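As a rough idea of what such a suite looks like, here is a minimal sketch in Python. All names are hypothetical and the two hash variants are stand-ins: the point is only the order of operations, validate against a reference first, then time.

```python
import timeit

def hash_reference(units):
    """Straightforward version, used as the source of truth."""
    h = 0
    for u in units:
        rotated = ((h << 5) | (h >> 27)) & 0xFFFFFFFF
        h = rotated ^ u
    return h

def hash_candidate(units):
    """Stand-in for a tuned variant: same math, written more compactly."""
    h = 0
    for u in units:
        h = (((h << 5) | (h >> 27)) & 0xFFFFFFFF) ^ u
    return h

def validate(candidate, reference, samples):
    """A candidate only counts if it matches the reference on every sample."""
    return all(candidate(s) == reference(s) for s in samples)

def benchmark(func, samples, repeat=3, number=20):
    """Best-of-repeat wall time for hashing all samples `number` times."""
    runs = timeit.repeat(lambda: [func(s) for s in samples],
                         repeat=repeat, number=number)
    return min(runs)

samples = [[(i * 31 + j) & 0xFFFF for j in range(64)] for i in range(50)]
is_valid = validate(hash_candidate, hash_reference, samples)
timings = {f.__name__: benchmark(f, samples)
           for f in (hash_reference, hash_candidate)}
```

The numbers in timings only mean something on the machine that produced them, which is the Intel versus AMD lesson from the contest.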
There has been a bit of discussion in the Delphi/BASM forum that may be of interest to you. Have a look at the following:
http://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0