I want to obtain the ASM (Intel assembly) code generated by Julia

I wrote a simple for loop in Delphi and translated it to Julia. The execution time of the Delphi program, compared with the Julia one, is just pathetic: Julia is 7 times faster - see the program and the results.
I am trying to figure out how this is possible, because Delphi was supposed to be one of the fastest languages on the planet!
I want to compare the ASM code generated by Julia with the ASM code generated by Delphi. In Delphi, I need only one click to get that code. Where can I see the ASM code for a specific function in Julia?
using BenchmarkTools
println("----------- Test for loops")
# test for loops
function for_fun(a)
    total = 0
    big = 0
    small = 0
    for i in 1:a
        total += 1
        if i > 500000
            big += 1
        else
            small += 1
        end
    end
    return (total, small, big)
end
res_for = for_fun(1000000000)
println(res_for)
@btime for_fun(1000000000)

You use the @code_native macro applied to the function call. Here's an example:
julia> @code_native 1+1
        .text
        .file   "+"
        .globl  "julia_+_13305"                 # -- Begin function julia_+_13305
        .p2align        4, 0x90
        .type   "julia_+_13305",@function
"julia_+_13305":                        # @"julia_+_13305"
; ┌ @ int.jl:87 within `+`
        .cfi_startproc
# %bb.0:                                # %top
        leaq    (%rdi,%rsi), %rax
        retq
.Lfunc_end0:
        .size   "julia_+_13305", .Lfunc_end0-"julia_+_13305"
        .cfi_endproc
; └
                                        # -- End function
        .section        ".note.GNU-stack","",@progbits
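To inspect the loop from the question, define for_fun and apply the macro to a call with a representative argument (the exact assembly you get depends on your Julia version and CPU); @code_llvm shows the LLVM IR instead:
julia> @code_native for_fun(1000000000)
julia> @code_llvm for_fun(1000000000)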

How can I store the value 2^128-1 in memory (16 bytes)?

According to this link (What are the sizes of tword, oword and yword operands?) we can store a number using this convention:
16 bytes (128 bit): oword, DO, RESO, DDQ, RESDQ
I tried the following:
section .data
number do 2538
Unfortunately, the following error is returned:
Integer supplied to a DT, DO or DY instruction
I don't understand why it doesn't work.
If your assembler does not support 128-bit integer constants with do, you can achieve the same thing with dq by splitting the constant into two 64-bit halves, e.g.
section .data
number do 0x000102030405060708090a0b0c0d0e0f
could be implemented as
section .data
number dq 0x08090a0b0c0d0e0f,0x0001020304050607
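For the value in the title, 2^128-1 is simply all-ones in both halves:
section .data
number dq 0xFFFFFFFFFFFFFFFF,0xFFFFFFFFFFFFFFFF ; low qword, high qword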
Unless some other code needs it in memory, it's cheaper to generate on the fly a vector with all 128 bits set to 1 = 0xFF... repeating = 2^128-1:
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
; You can store to memory if you want, e.g. to set a bitmap to all-ones.
movups [rdx], xmm0
See also What are the best instruction sequences to generate vector constants on the fly?
For the use-case you described in comments, there's no reason to mess with static data in .data or .rodata, or static storage in .bss. Just make space on the stack and pass pointers to that.
call_something_by_ref:
sub rsp, 24
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
mov rdi, rsp
movaps [rdi], xmm0 ; one byte shorter than movaps [rsp], xmm0
lea rsi, [rdi+8]
call some_function
add rsp, 24
ret
Notice that this code has no immediate constants larger than 8 bits (for data or addresses), and it only touches memory that's already hot in cache (the bottom of the stack). And yes, store-forwarding does work from wide vector stores to integer loads when some_function dereferences RDI and RSI separately.

Forth and processor flags

Why doesn't Forth use processor flags for conditional execution?
Instead the result of a comparison is placed on the parameter stack. Is it because the inner interpreter loop may alter flags when going to the next instruction? Or is it simply to abstract conditional logic?
E.g. on x86 the flags register holds the results of a comparison; most processors, if not all, have a flags register.
As Forth is a stack-based language, in order to define operations inside the language, their results must alter something that is inside the language, and the flags register isn't in the language. Obviously, in the case of an optimizing compiler, any approach that gives the same final result is equally acceptable.
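For example, at a Forth prompt a comparison simply leaves a well-formed flag (0 for false, -1 for true) on the data stack, which . prints like any other value:
5 3 > .  \ prints -1: the result of > is an ordinary stack item
3 5 > .  \ prints 0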
It depends on the Forth, and the level of optimization.
: tt 0 if ." true" else ." false" then ;
In SwiftForth (x86_64 GNU/Linux):
see tt
808376F 4 # EBP SUB 83ED04
8083772 EBX 0 [EBP] MOV 895D00
8083775 0 # EBX MOV BB00000000
808377A EBX EBX OR 09DB
808377C 0 [EBP] EBX MOV 8B5D00
808377F 4 [EBP] EBP LEA 8D6D04
8083782 808379D JZ 0F8415000000
8083788 804D06F ( (S") ) CALL E8E298FCFF
808378D "true"
8083793 804C5BF ( TYPE ) CALL E8278EFCFF
8083798 80837AE JMP E911000000
808379D 804D06F ( (S") ) CALL E8CD98FCFF
80837A2 "false"
80837A9 804C5BF ( TYPE ) CALL E8118EFCFF
80837AE RET C3 ok
In Gforth:
see tt
: tt
0
IF .\" true"
ELSE .\" false"
THEN ; ok

asm usage of memory location operands

I am having trouble with the definition of 'memory location'. According to the 'Intel 64 and IA-32 Software Developer's Manual', many instructions can use a memory location as an operand.
For example MOVBE (move data after swapping bytes):
Instruction: MOVBE m32, r32
The question now is how a memory location is defined.
I tried to use variables defined in the .bss section:
section .bss
memory: resb 4 ; reserve 4 bytes
memorylen: equ $-memory
section .text
global _start
_start:
MOV R9D, 0x6162630A
MOV [memory], R9D
SHR [memory], 1
MOVBE [memory], R9D
EDIT:->
MOV EAX, 0x01
MOV EBX, 0x00
int 0x80
<-EDIT
If SHR is commented out, yasm (yasm -f elf64 .asm) assembles without problems, but when executing, the console shows: Illegal instruction
And if MOVBE is commented out instead, the following error occurs when assembling: error: invalid size for operand 1
How do I have to allocate memory for using the 'm' option shown by the instruction set reference?
[CPU=x64, Compiler=yasm]
If that is all your code, you are falling off the end into an uninitialized region, so you will get a fault. That has nothing to do with allocating memory, which you did right. You need to add code to terminate your program using an exit system call, or at least put in an endless loop so you avoid the fault (kill your program using ctrl+c or equivalent).
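For example, on 64-bit Linux a minimal exit sequence using the native syscall convention (rather than the 32-bit int 0x80 interface) would be:
MOV RAX, 60   ; sys_exit on x86-64 Linux
XOR RDI, RDI  ; exit status 0
SYSCALL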
Update: While the above is true, the illegal instruction here is more likely caused by the fact that your CPU simply does not support the MOVBE instruction, because not all do. If you look in the reference, you can see it says #UD If CPUID.01H:ECX.MOVBE[bit 22] = 0. That is trying to tell you that a particular flag bit in the ECX register returned by leaf 01H of the CPUID instruction shows support for this instruction. If you are on Linux, you can conveniently check in /proc/cpuinfo whether you have the movbe flag or not.
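If you would rather test for it in code than in /proc/cpuinfo, a sketch of that CPUID check (the label name is made up) could look like:
MOV EAX, 1          ; CPUID leaf 01H
CPUID               ; feature bits returned in ECX/EDX (clobbers EBX too)
TEST ECX, 1 << 22   ; CPUID.01H:ECX.MOVBE[bit 22]
JZ no_movbe         ; hypothetical label: handle unsupported CPUs there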
As for the invalid operand size: you should generally specify the operand size when it cannot be deduced from the instruction. That said, SHR accepts all sizes (byte, word, dword, qword), so you should really not get that error at all, but you might get an operation of an unexpected default size. You should use SHR dword [memory], 1 in this case, and that also makes yasm happy.
Oh, and +1 for reading the intel manual ;)

Re-configurable Memory Instance in verilog with DATA-IN and DATA-OUT are passed as parameter

How can I make a memory module in which the DATA bus widths are passed as parameters to each instance, so that my design re-configures itself according to those parameters? For example, assume byte-addressable memory where the DATA-IN bus width is 32 bits (4 bytes written each cycle) and DATA-OUT is 16 bits (2 bytes read each cycle). For another instance, DATA-IN is 64 bits and DATA-OUT is 16 bits. My design should work for all such instances.
What I have tried is to generate write pointer values according to the design parameters, e.g. for a 32-bit DATA-IN the write pointer will increment by 4 every cycle while writing; for 64 bits the increment will be 8, and so on.
The problem is: how do I make 4 or 8 or 16 bytes be written in a single cycle according to the parameters passed to the instance?
// Something like the following is what I want to implement. This memory
// instance can be thought of as the internal memory of a FIFO that has
// different data widths for reading and writing, if you want an application
// for such a memory.
module mem #(parameter DIN=16, parameter DOUT=8, parameter ADDR=4, parameter BYTE=8)
(
    input  [DIN-1:0]  din,
    output [DOUT-1:0] dout,
    input             wen, ren, clk
);
    localparam DEPTH = (1<<ADDR);
    reg [BYTE-1:0] mem [0:DEPTH-1];
    reg wpointer = 5'b00000;
    reg rpointer = 5'b00000;
    reg [BYTE-1:0] tmp [0:DIN/BYTE-1];

    function [ADDR:0] ptr;
        input [4:0] index;
        integer i;
        begin
            for (i=0; i<DIN/BYTE; i=i+1) begin
                mem[index] = din[(BYTE*(i+1)-1):BYTE*(i)]; // something like this I want to implement; I know this line is not allowed in Verilog, but is there any alternative to this?
                index = index + 1;
            end
            ptr = index;
        end
    endfunction

    always @(posedge clk) begin
        if (wen==1)
            wpointer <= wptr(wpointer);
    end

    always @(posedge clk) begin
        if (ren==1)
            rpointer <= ptr(rpointer);
    end
endmodule
din[(BYTE*(i+1)-1):BYTE*(i)] will not compile in Verilog because the MSB and LSB select bits are both variables; Verilog requires a constant range. The +: indexed part-select (also known as a slice) allows a variable base index with a constant width. It was introduced in IEEE Std 1364-2001 § 4.2.1. You can also read more about it in IEEE Std 1800-2012 § 11.5.1, or refer to previously asked questions: What is `+:` and `-:`? and Indexing vectors and arrays with +:.
din[BYTE*i +: BYTE] should work for you; alternatively you can use din[BYTE*(i+1)-1 -: BYTE].
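For example, with BYTE=8 and i=2, both forms select the same bits; the base expression may contain variables, but the width must be a constant:
// din[BYTE*i +: BYTE]       selects din[23:16] (base 16, ascending, width 8)
// din[BYTE*(i+1)-1 -: BYTE] selects din[23:16] (base 23, descending, width 8)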
Also, you should use non-blocking assignments (<=) to mem. In your code a read and a write can happen at the same time, and with blocking assignments there is a race condition when both access the same byte. It may synthesize, but your RTL and gate-level simulations may generate different results. I also strongly advise against using functions for assigning memory. To avoid nasty surprises, functions in synthesizable code need to be self-contained, without references to anything outside the function, and with any internal variables always reset to a static constant at the start of the function.
With the guidelines mentioned above, I'd recommend recoding to something like the below. This is a template to start with, not a free lunch. I left out the out-of-range index compensation for you to figure out on your own.
...
localparam DEPTH = (1<<ADDR);
reg [BYTE-1:0] mem [0:DEPTH-1];
reg [ADDR-1:0] wpointer, rpointer;
integer i;

initial begin // init values for pointers (FPGA, not ASIC)
    wpointer = {ADDR{1'b0}};
    rpointer = {ADDR{1'b0}};
end

always @(posedge clk) begin
    if (ren==1) begin
        for (i=0; i < DOUT/BYTE; i=i+1) begin
            dout[BYTE*i +: BYTE] <= mem[rpointer+i];
        end
        rpointer <= rpointer + (DOUT/BYTE);
    end
    if (wen==1) begin
        for (i=0; i < DIN/BYTE; i=i+1) begin
            mem[wpointer+i] <= din[BYTE*i +: BYTE];
        end
        wpointer <= wpointer + (DIN/BYTE);
    end
end
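Hooking up the geometries from the question is then just a matter of parameter overrides at instantiation. A sketch (instance and signal names are made up for illustration):
// 32-bit write port, 16-bit read port
mem #(.DIN(32), .DOUT(16), .ADDR(6), .BYTE(8)) u_mem_a (
    .din(din_a), .dout(dout_a), .wen(wen_a), .ren(ren_a), .clk(clk)
);
// 64-bit write port, 16-bit read port
mem #(.DIN(64), .DOUT(16), .ADDR(6), .BYTE(8)) u_mem_b (
    .din(din_b), .dout(dout_b), .wen(wen_b), .ren(ren_b), .clk(clk)
);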

How slow is NaN arithmetic in the Intel x64 FPU?

Hints and allegations abound that arithmetic with NaNs can be 'slow' in hardware FPUs. Specifically in a modern x64 FPU, e.g. on a Nehalem i7, is that still true? Do FPU multiplies get churned out at the same speed regardless of the values of the operands?
I have some interpolation code that can wander off the edge of our defined data, and I'm trying to determine whether it's faster to check for NaNs (or some other sentinel value) here, there and everywhere, or just at convenient points.
Yes, I will benchmark my particular case (it could be dominated by something else entirely, like memory bandwidth), but I was surprised not to see a concise summary somewhere to help with my intuition.
I'll be doing this from the CLR, if it makes a difference as to the flavor of NaNs generated.
For what it's worth, using the SSE instruction mulsd with NaN is pretty much exactly as fast as with the constant 4.0 (chosen by a fair dice roll, guaranteed to be random).
This code:
for (unsigned i = 0; i < 2000000000; i++)
{
    double j = doubleValue * i;
}
generates this machine code (inside the loop) with clang (I assume the .NET virtual machine uses SSE instructions when it can too):
movsd -16(%rbp), %xmm0 ; gets the constant (NaN or 4.0) into xmm0
movl -20(%rbp), %eax ; puts i into a register
cvtsi2sdq %rax, %xmm1 ; converts i to a double and puts it in xmm1
mulsd %xmm0, %xmm1 ; multiplies xmm0 (the constant) with xmm1 (i)
movsd %xmm1, -32(%rbp) ; puts the result somewhere on the stack
And with two billion iterations, the NaN version (as defined by the C macro NAN from <math.h>) took about 0.017 seconds less to execute on my i7. The difference was probably caused by the task scheduler.
So to be fair, they're exactly as fast.
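If you want to reproduce this, a minimal harness along these lines should do; the volatile qualifiers keep the load and the multiply inside the loop even when the compiler optimizes (NAN comes from <math.h>):
#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile double doubleValue = NAN; /* swap in 4.0 for the comparison run */
    volatile double j;
    clock_t t0 = clock();
    for (unsigned i = 0; i < 2000000000u; i++)
        j = doubleValue * i;
    printf("%.3f s\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    (void)j;
    return 0;
}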
