Can I move 256-bit from memory location immediately to YMM registers?

To fill an xmm register, I use the following in GCC inline asm:
"movlpd mytest_1(%rip),%xmm1 \n\t"
"movhpd mytest_1+8(%rip),%xmm1 \n\t"
Can this be done more simply?
Furthermore:
Can the same approach move four quadwords, aligned or not, into ymm0 in one step?
I am looking for the reverse of vmovdqa ymm1, mem256 (source -> destination).

"movlpd mytest_1(%rip),%xmm1 \n\t"
"movhpd mytest_1+8(%rip),%xmm1 \n\t"
These two instructions can be combined into a single movdqu/movdqa, because x86 is a little-endian architecture and the low quadword sits at the lower address:
"movdqu mytest_1(%rip),%xmm1 \n\t" // 16-byte unaligned or
"movdqa mytest_1(%rip),%xmm1 \n\t" // for 16-byte aligned 'mytest_1'
Both have AVX forms (vmovdqu/vmovdqa) that transfer 32 bytes (256 bits) at once:
"vmovdqu mytest_1(%rip),%ymm1 \n\t" // 32-byte unaligned or
"vmovdqa mytest_1(%rip),%ymm1 \n\t" // for 32-byte aligned 'mytest_1'
Regarding the second part of your question:
I look for the reverse of Vmovdqa ymm1, mem256 source -> destination.
This does work in both directions, e.g. the possible instructions for vmovdqa:
Instruction | Op/En | 64/32-bit mode | CPUID flag | Description
VMOVDQA ymm1, ymm2/m256 | RM | V/V | AVX | Move aligned packed integer values from ymm2/mem to ymm1.
VMOVDQA ymm2/m256, ymm1 | MR | V/V | AVX | Move aligned packed integer values from ymm1 to ymm2/mem.
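As a rough cross-check in C (a sketch, not part of the answer above), the function below loads the same 16 bytes once with the movlpd/movhpd pair and once with a single unaligned load, then compares the two results. The name halves_equal_full_load is made up for illustration. The intrinsics come from <emmintrin.h>, and a plain memcpy fallback is assumed for targets without SSE2.

```c
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns 1 if loading two 8-byte halves gives the same 16 bytes
   as one 16-byte load from the same address. */
int halves_equal_full_load(const double src[2])
{
    double two_step[2], one_step[2];
#if defined(__SSE2__)
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, &src[0]);              /* movlpd: low 8 bytes  */
    v = _mm_loadh_pd(v, &src[1]);              /* movhpd: high 8 bytes */
    _mm_storeu_pd(two_step, v);
    _mm_storeu_pd(one_step, _mm_loadu_pd(src)); /* one unaligned 16-byte load */
#else
    /* Assumed fallback for non-SSE2 targets: same comparison with memcpy. */
    memcpy(&two_step[0], &src[0], sizeof(double));
    memcpy(&two_step[1], &src[1], sizeof(double));
    memcpy(one_step, src, 2 * sizeof(double));
#endif
    return memcmp(two_step, one_step, sizeof two_step) == 0;
}
```

Calling it with any two doubles should return 1, which is the property that lets one full-width load replace the two half loads.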

Related

Detecting boundaries and resetting circular buffer pointer in both directions

I am working with an 8051 microcontroller, but my question is more algorithm specific.
I have created a circular buffer in memory for random incoming data from external sources. Suppose the buffer is 32 bytes and I receive 34 bytes of data. I accept that two bytes are dropped, but if I want to read the last 5 bytes, I have to wrap around to the end of the buffer again after reading the first 2.
Here's an example in 8051 code of what I'm trying to achieve:
BUFFER equ 40h ;our buffer = 40-5Fh (32 bytes)
BUFFERMASK equ 5Fh ;Mask so buffer doesn't go past 32nd byte
initialization:
mov R1,#BUFFER ;R1=our buffer pointer
mov @R1,#xxh ;Add some incoming data
inc R1
anl R1,#BUFFERMASK
mov @R1,#xxh ;Add some incoming data
inc R1
anl R1,#BUFFERMASK
...
mov @R1,#xxh ;Add some incoming data
inc R1
anl R1,#BUFFERMASK
;At this point we filled a large chunk of the buffer with data.
;Lets assume the buffer wrapped around and address is 41h
;and we want to read the data in reverse
mov A,@R1 ;Get last byte at 41h
dec R1
??? R1,??? (anl won't work here :( )
mov A,@R1 ;Get byte at 40h
dec R1
??? R1,??? (anl won't work here :( )
mov A,@R1 ;Get byte at 5Fh (how do we jump with a logic statement?)
dec R1
??? R1,??? (anl won't work here :( )
I understand that I could get away with a CJNE (compare and jump if not equal), but that has three disadvantages: 1.) each CJNE needs its own label, 2.) the carry flag is modified after execution, and 3.) an extra clock cycle is wasted if the boundary is hit.
Is there any way I could pull this off with simple anl/orl (AND or OR) logic? I am willing to change the memory address of the cyclic buffer if that creates an advantage in my situation.
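One branch-free possibility, sketched here in C rather than 8051 assembly and not taken from the thread: mask the pointer down to its 5-bit offset, then OR the buffer base back in. With the buffer at 40h-5Fh this wraps correctly after both an increment (60h becomes 40h) and a decrement (3Fh becomes 5Fh). The names wrap_pointer, BUFFER_BASE and OFFSET_MASK are made up for illustration. On the 8051 this would presumably be an anl followed by an orl, with the pointer passed through the accumulator.

```c
#include <stdint.h>

#define BUFFER_BASE 0x40u  /* buffer occupies 0x40..0x5F, as in the question */
#define OFFSET_MASK 0x1Fu  /* 32-byte buffer: keep only the low 5 bits */

/* Force any 8-bit pointer value back into the range 0x40..0x5F.
   Works after an increment past 0x5F and after a decrement below 0x40. */
uint8_t wrap_pointer(uint8_t p)
{
    return (uint8_t)((p & OFFSET_MASK) | BUFFER_BASE);
}
```

The trick relies on the buffer size being a power of two and the base address being aligned to that size, which 40h with 32 bytes satisfies.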

How can I store the value 2^128-1 in memory (16 bytes)?

According to the question "What are the sizes of tword, oword and yword operands?", we can store a number using this convention:
16 bytes (128 bit): oword, DO, RESO, DDQ, RESDQ
I tried the following:
section .data
number do 2538
Unfortunately the following error returns:
Integer supplied to a DT, DO or DY instruction
I don't understand why it doesn't work

If your assembler does not support 128 bit integer constants with do then you can achieve the same thing with dq by splitting the constant into two 64 bit halves, e.g.
section .data
number do 0x000102030405060708090a0b0c0d0e0f
could be implemented as
section .data
number dq 0x08090a0b0c0d0e0f,0x0001020304050607
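The ordering of the two halves (low quadword first) can be sanity-checked with a small C sketch. This is an illustration under the assumption of a little-endian memory layout, which the function writes out explicitly so that it does not depend on the host. The name store_u128_le is hypothetical.

```c
#include <stdint.h>

/* Write a 128-bit value, given as two 64-bit halves, into 16 bytes
   in little-endian order: low half first, least significant byte first. */
void store_u128_le(uint8_t out[16], uint64_t high, uint64_t low)
{
    for (int i = 0; i < 8; i++) {
        out[i]     = (uint8_t)(low  >> (8 * i));
        out[8 + i] = (uint8_t)(high >> (8 * i));
    }
}
```

With high = 0x0001020304050607 and low = 0x08090a0b0c0d0e0f the first byte in memory is 0x0f and the last is 0x00, matching what the dq line above lays out. Passing UINT64_MAX for both halves gives sixteen 0xFF bytes, i.e. 2^128-1.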

Unless some other code needs it in memory, it's cheaper to generate on the fly a vector with all 128 bits set to 1 = 0xFF... repeating = 2^128-1:
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
;You can store to memory if you want, e.g. to set a bitmap to all-ones.
movups [rdx], xmm0
See also "What are the best instruction sequences to generate vector constants on the fly?"
For the use-case you described in comments, there's no reason to mess with static data in .data or .rodata, or static storage in .bss. Just make space on the stack and pass pointers to that.
call_something_by_ref:
sub rsp, 24
pcmpeqw xmm0, xmm0 ; xmm0 = 0xFF... repeating
mov rdi, rsp
movaps [rdi], xmm0 ; one byte shorter than movaps [rsp], xmm0
lea rsi, [rdi+8]
call some_function
add rsp, 24
ret
Notice that this code has no immediate constants larger than 8 bits (for data or addresses), and it only touches memory that's already hot in cache (the bottom of the stack). And yes, store-forwarding does work from wide vector stores to integer loads when some_function dereferences RDI and RSI separately.

asm usage of memory location operands

I am having trouble with the definition of 'memory location'. According to the 'Intel 64 and IA-32 Software Developer's Manual', many instructions can use a memory location as an operand.
For example MOVBE (move data after swapping bytes):
Instruction: MOVBE m32, r32
The question is now how a memory location is defined;
I tried to use variables defined in the .bss section:
section .bss
memory: resb 4 ;reserve 4 byte
memorylen: equ $-memory
section .text
global _start
_start:
MOV R9D, 0x6162630A
MOV [memory], R9D
SHR [memory], 1
MOVBE [memory], R9D
EDIT:->
MOV EAX, 0x01
MOV EBX, 0x00
int 0x80
<-EDIT
If SHR is commented out, yasm (yasm -f elf64 .asm) assembles the file without problems, but running the program prints: Illegal Instruction
And if MOVBE is commented out the following error occurs when compiling: error: invalid size for operand 1
How do I have to allocate memory for using the 'm' option shown by the instruction set reference?
[CPU=x64, Compiler=yasm]

If that is all your code, you are falling off the end into an uninitialized region, so you will get a fault. That has nothing to do with allocating memory, which you did right. You need to add code to terminate your program using an exit system call, or at least put in an endless loop so you avoid the fault (kill your program using ctrl+c or equivalent).
Update: While the above is true, the illegal instruction here is more likely caused by the fact that your CPU simply does not support the MOVBE instruction, because not all do. If you look in the reference, you can see it says #UD If CPUID.01H:ECX.MOVBE[bit 22] = 0. That is telling you that bit 22 of the ECX register returned by leaf 01H of the CPUID instruction indicates support for this instruction. If you are on Linux, you can conveniently check in /proc/cpuinfo whether you have the movbe flag or not.
As for the invalid operand size: you should generally specify the operand size when it cannot be deduced from the instruction. That said, SHR accepts all sizes (byte, word, dword, qword) so you should really not get that error at all, but you might get an operation of unexpected default size. You should use SHR dword [memory], 1 in this case, and that also makes yasm happy.
Oh, and +1 for reading the Intel manual ;)
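For completeness, the CPUID check described above can also be done from C. This is a minimal sketch that assumes GCC or clang, whose <cpuid.h> provides __get_cpuid. The function name cpu_has_movbe and the -1 "unknown" return value are made up for illustration.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Returns 1 if CPUID.01H:ECX bit 22 (MOVBE) is set, 0 if it is clear,
   and -1 if the answer cannot be determined (non-x86 build or leaf 1
   not available). */
int cpu_has_movbe(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return -1;
    return (int)((ecx >> 22) & 1u);
#else
    return -1;
#endif
}
```

A program could call this once at startup and fall back to BSWAP plus MOV when it returns 0.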

x86 Assembly: Writing a Program to Test Memory Functionality for Entire 1MB of Memory

Goal:
I need to write a program that tests the write functionality of an entire 1MB of memory on a byte by byte basis for a system using an Intel 80186 microprocessor. In other words, I need to write a 0 to every byte in memory and then check if a 0 was actually written. I need to then repeat the process using a value of 1. Finally, any memory locations that did not successfully have a 0 or 1 written to them during their respective write operation needs to be stored on the stack.
Discussion:
I am an Electrical Engineering student in college (Not Computer Science) and am relatively new to x86 assembly language and MASM611. I am not looking for a complete solution. However, I am going to need some guidance.
Earlier in the semester, I wrote a program that filled a portion of memory with 0's. I believe that this will be a good starting point for my current project.
Source Code For Early Program:
;****************************************************************************
;Program Name: Zeros
;File Name: PROJ01.ASM
;DATE: 09/16/14
;FUNCTION: FILL A MEMORY SEGMENT WITH ZEROS
;HISTORY:
;AUTHOR(S):
;****************************************************************************
NAME ZEROS
MYDATA SEGMENT
MYDATA ENDS
MYSTACK SEGMENT STACK
DB 0FFH DUP(?)
End_Of_Stack LABEL BYTE
MYSTACK ENDS
ASSUME SS:MYSTACK, DS:MYDATA, CS:MYCODE
MYCODE SEGMENT
START: MOV AX, MYSTACK
MOV SS, AX
MOV SP, OFFSET End_Of_Stack
MOV AX, MYDATA
MOV DS, AX
MOV AX, 0FFFFh ;Moves a Hex value of 65535 into AX
MOV BX, 0000h ;Moves a Hex value of 0 into BX
CALL Zero_fill ;Calls procedure Zero_fill
MOV AX, 4C00H ;Performs a clean exit
INT 21H
Zero_fill PROC NEAR ;Declares procedure Zero_fill with near directive
MOV DX, 0000h ;Moves 0H into DX
MOV CX, 0000h ;Moves 0H into CX. This will act as a counter.
Start_Repeat: INC CX ;Increments CX by 1
MOV [BX], DX ;Moves the contents of DX to the memory address of BX
INC BX ;Increments BX by 1
CMP CX, 10000h ;Compares the value of CX with 10000H. If equal, Z-flag set to one.
JNE Start_Repeat ;Jumps to Start_Repeat if CX does not equal 10000H.
RET ;Removes 16-bit value from stack and puts it in IP
Zero_fill ENDP ;Ends procedure Zero_fill
MYCODE ENDS
END START
Requirements:
1. Employ explicit segment structure.
2. Use the ES:DI register pair to address the test memory area.
3. Non destructive access: Before testing each memory location, I need to store the original contents of the byte. Which needs to be restored after testing is complete.
4. I need to store the addresses of any memory locations that fail the test on the stack.
5. I need to determine the highest RAM location.
Plan:
1. In a loop: Write 0000H to memory location, Check value at that mem location, PUSH values of ES and DI to the stack if check fails.
2. In a loop: Write FFFFH to memory location, Check value at that mem location, PUSH values of ES and DI to the stack if check fails.
Source Code Implementing Preliminary Plan:
;****************************************************************************
;Program Name: Memory Test
;File Name: M_TEST.ASM
;DATE: 10/7/14
;FUNCTION: Test operational status of each byte of memory between a starting
; location and an ending location
;HISTORY: Template code from Assembly Project 1
;AUTHOR(S):
;****************************************************************************
NAME M_TEST
MYDATA SEGMENT
MYDATA ENDS
MYSTACK SEGMENT STACK
DB 0FFH DUP(?)
End_Of_Stack LABEL BYTE
MYSTACK ENDS
ESTACK SEGMENT COMMON
ESTACK ENDS
ASSUME SS:MYSTACK, DS:MYDATA, CS:MYCODE, ES:ESTACK
MYCODE SEGMENT
START: MOV AX, MYSTACK
MOV SS, AX
MOV SP, OFFSET End_Of_Stack
MOV AX, MYDATA
MOV DS, AX
MOV AX, FFFFH ;Moves a Hex value of 65535 into AX
MOV BX, 0000H ;Moves a Hex value of 0 into BX
CALL M_TEST ;Calls procedure M_TEST
MOV AX, 4C00H ;Performs a clean exit
INT 21H
M_TEST PROC NEAR ;Declares procedure M_TEST with near directive
MOV DX, 0000H ;Fill DX with 0's
MOV AX, FFFFH ;Fill AX with 1's
MOV CX, 0000H ;Moves 0H into CX. This will act as a counter.
Start_Repeat: MOV [BX], DX ;Moves the contents of DX to the memory address of BX
CMP [BX], 0000H ;Compare value at memory location [BX] with 0H. If equal, Z-flag set to one.
JNE SAVE ;IF Z-Flag NOT EQUAL TO 0, Jump TO SAVE
MOV [BX], AX ;Moves the contents of AX to the memory address of BX
CMP [BX], FFFFH ;Compare value at memory location [BX] with FFFFH. If equal, Z-flag set to one.
JNE SAVE ;IF Z-Flag NOT EQUAL TO 0, Jump TO SAVE
INC CX ;Increments CX by 1
INC BX ;Increments BX by 1
CMP CX, 10000H ;Compares the value of CX with 10000H. If equal, Z-flag set to one.
JNE Start_Repeat ;Jumps to Start_Repeat if CX does not equal 10000H.
SAVE: PUSH ES
PUSH DI
RET ;Removes 16-bit value from stack and puts it in IP
M_TEST ENDP ;Ends procedure Zero_fill
MYCODE ENDS
END START
My commenting might not be accurate.
Questions:
1. How do I use ES:DI to address the test memory area?
2. What is the best way to hold on to the initial memory value so that I can replace it when I'm done testing a specific memory location? I believe registers AX - DX are already in use.
Also, if I have updated code and questions, should I post it on this same thread, or should I create a new post with a link to this one?
Any other advice would be greatly appreciated.
Thanks in advance.

How do I use ES:DI to address the test memory area?
E.g. mov al, es:[di]
What is the best way to hold on to the initial memory value so that I can replace it when I'm done testing a specific memory location? I believe registers AX - DX are already in use.
Right. You could use al to store the original value and have 0 and 1 pre-loaded in bl and cl and then do something like this (off the top of my head):
mov al, es:[di] ; load/save original value
mov es:[di], bl ; store zero
cmp bl, es:[di] ; check that it sticks
jne pushbad ; jump if it didn't
mov es:[di], cl ; same for 'one'
cmp cl, es:[di]
jne pushbad
mov es:[di], al ; restore original value
jmp nextAddr
pushbad:
mov es:[di], al ; restore original value (may be redundant as the mem is bad)
push es
push di
nextAddr:
...
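The same save / write / verify / restore sequence, written as a C sketch for readers who find that easier to follow. This is only an illustration of the steps above: the name test_byte is hypothetical, and the patterns 0x00 and 0xFF are an assumption for "zero" and "one".

```c
#include <stdint.h>

/* Non-destructive test of one byte: returns 1 if both patterns read
   back correctly, 0 otherwise. The original value is always restored. */
int test_byte(volatile uint8_t *p)
{
    uint8_t original = *p;   /* save the original contents */
    int ok = 1;

    *p = 0x00;               /* all zeros */
    if (*p != 0x00)
        ok = 0;

    *p = 0xFF;               /* all ones */
    if (*p != 0xFF)
        ok = 0;

    *p = original;           /* restore (may not stick if the cell is bad) */
    return ok;
}
```

The volatile qualifier keeps the compiler from optimizing away the read-backs, which would otherwise look redundant.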

A few words about also testing the memory that our own routine occupies: we can copy the routine into the framebuffer of the display device and run it from there.
..
Note: To store an immediate value to a memory location, or to compare a memory location with an immediate value, we have to specify how many bytes we want to access. (By contrast, when a register is the source or the destination, the assembler already knows the operand size, so we do not need to specify it.)
Accessing one byte at one address (with an immediate value):
CMP BYTE[BX], 0 ; with NASM (Netwide Assembler)
MOV BYTE[BX], 0
CMP BYTE PTR[BX], 0 ; with MASM (Microsoft Macro Assembler)
MOV BYTE PTR[BX], 0
Accessing two bytes at two consecutive addresses together
(faster if the target address is even, i.e. word-aligned):
CMP WORD[BX], 0 ; with NASM
MOV WORD[BX], 0
CMP WORD PTR[BX], 0 ; with MASM
MOV WORD PTR[BX], 0

If you start with the assumption that any location in RAM might be faulty; then this means you can't use RAM to store your code or your data. This includes temporary usage - for example, you can't temporarily store your code in RAM and then copy it to display memory, because you risk copying corrupted code from RAM to display memory.
With this in mind; the only case where this makes sense is code in ROM testing the RAM - e.g. during the firmware's POST (Power On Self Test). Furthermore; this means that you can't use the stack at all - not for keeping track of faulty areas, or even for calling functions/routines.
Note that you might assume that you can test a small area (e.g. find the first 1 KiB that isn't faulty) and then use that RAM for storing results, etc. This would be a false assumption.
For RAM faults there are many causes. The first set of causes is "open connection" and "shorted connection" on either the address bus or the data bus. For a simple example, if address line 12 happens to be open circuit, the end result will be that the first 4 KiB always has identical contents to the second 4 KiB of RAM. You can test the first 4 KiB of RAM as much as you like and decide it's "good", but then when you test the second 4 KiB of RAM you trash the contents of the first 4 KiB of RAM.
There is a "smart sequence" of tests. Specifically, test the address lines from highest to lowest (e.g. write different values to 0x000000 and 0x800000 and check that they're both correct; then do the same for 0x000000 and 0x400000, then 0x000000 and 0x200000, and so on until you get to addresses 0x000000 and 0x000001). However, the way RAM chips are connected to the CPU is not necessarily as simple as a direct mapping. For example, maybe the highest bit of the address selects which bank of RAM; and in that case you'd have to test both 0x000000 and 0x400000 and also 0x800000 and 0xC00000 to test both banks.
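A minimal C sketch of that address-line sequence, assuming a simple direct mapping with no bank selection and a region whose size is a power of two. The function name address_lines_ok and the patterns 0x55/0xAA are choices made for illustration, and the test overwrites the locations it touches.

```c
#include <stddef.h>
#include <stdint.h>

/* Tests address lines from highest to lowest within a region of
   `size` bytes (size must be a power of two, at least 2).
   For each address bit, different values are written to offset 0 and
   to the offset with only that bit set, then both are read back.
   Returns 1 if every pair reads back correctly, 0 otherwise. */
int address_lines_ok(volatile uint8_t *base, size_t size)
{
    for (size_t bit = size >> 1; bit != 0; bit >>= 1) {
        base[0]   = 0x55;
        base[bit] = 0xAA;
        if (base[0] != 0x55 || base[bit] != 0xAA)
            return 0;   /* the two addresses alias or a write was lost */
    }
    return 1;
}
```

If an address line were open, the two writes would land on the same cell and the read of offset 0 would return 0xAA instead of 0x55.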
Once you're sure the address lines work; then you can do similar for data lines and the RAM itself. The most common test is called a "walking ones" test; where you store 0x01, then 0x02, and so on (up to 0x80). This detects things like "sticky bits" (e.g. where a bit's state happens to be "stuck" to its neighbour's state). If you only write (e.g.) 0x00 and test it then write 0xFF and test it, then you will miss most RAM faults.
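The "walking ones" pattern for a single byte could look like the following C sketch. The function name walking_ones_ok is hypothetical, and the sketch does not save or restore the original contents.

```c
#include <stdint.h>

/* Writes 0x01, 0x02, 0x04, ... 0x80 to one byte and checks each value
   after writing it. Returns 1 if all eight patterns read back
   correctly, 0 on the first mismatch. */
int walking_ones_ok(volatile uint8_t *p)
{
    for (unsigned pattern = 0x01; pattern <= 0x80; pattern <<= 1) {
        *p = (uint8_t)pattern;
        if (*p != (uint8_t)pattern)
            return 0;
    }
    return 1;
}
```

A "walking zeros" variant would write the bitwise complement of each pattern instead.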
Also; be very careful with "open connection". On some machines bus capacitance can play tricks on you, where you write a value and the bus capacitance "stores" the previous value, so that when you read it back it looks like it's correct even when there's no connection. To avoid this risk, you need to write a different value in between - e.g. write 0x55 to the address you're testing, then write 0xAA somewhere else, then read the original value back (and hope you get 0x55 because the RAM works, and not 0xAA). With this in mind (for performance) you may consider doing "walking ones" in one area of RAM while also doing "walking zeros" in the next area of RAM; so that you're always alternating between reading a value from one area and reading the inverted value from the other.
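The "write something different in between" step can be sketched in C as follows. The name read_back_ok is made up, and the two pointers are assumed to refer to different locations.

```c
#include <stdint.h>

/* Writes 0x55 to the location under test, then 0xAA to a different
   location so the bus no longer carries 0x55, then reads the first
   location back. Returns 1 if the value survived, 0 otherwise. */
int read_back_ok(volatile uint8_t *target, volatile uint8_t *elsewhere)
{
    *target = 0x55;
    *elsewhere = 0xAA;       /* changes what bus capacitance could "remember" */
    return *target == 0x55;
}
```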
Finally, some RAM problems depend on noise, temperature, etc. In these cases you can do extremely thorough RAM tests, say that it's all perfect, then suffer from RAM corruption 2 minutes afterwards. This is why (e.g.) the typical advice is to run something like "memtest" for 8 hours or so if you really want to test RAM properly.

How slow is NaN arithmetic in the Intel x64 FPU?

Hints and allegations abound that arithmetic with NaNs can be 'slow' in hardware FPUs. Specifically in the modern x64 FPU, e.g. on a Nehalem i7, is that still true? Do FPU multiplies get churned out at the same speed regardless of the values of the operands?
I have some interpolation code that can wander off the edge of our defined data, and I'm trying to determine whether it's faster to check for NaNs (or some other sentinel value) here there and everywhere, or just at convenient points.
Yes, I will benchmark my particular case (it could be dominated by something else entirely, like memory bandwidth), but I was surprised not to see a concise summary somewhere to help with my intuition.
I'll be doing this from the CLR, if it makes a difference as to the flavor of NaNs generated.

For what it's worth, using the SSE instruction mulsd with NaN is pretty much exactly as fast as with the constant 4.0 (chosen by a fair dice roll, guaranteed to be random).
This code:
for (unsigned i = 0; i < 2000000000; i++)
{
double j = doubleValue * i;
}
generates this machine code (inside the loop) with clang (I assume the .NET virtual machine uses SSE instructions when it can too):
movsd -16(%rbp), %xmm0 ; gets the constant (NaN or 4.0) into xmm0
movl -20(%rbp), %eax ; puts i into a register
cvtsi2sdq %rax, %xmm1 ; converts i to a double and puts it in xmm1
mulsd %xmm0, %xmm1 ; multiplies xmm0 (the constant) with xmm1 (i)
movsd %xmm1, -32(%rbp) ; puts the result somewhere on the stack
And with two billion iterations, the NaN version (NaN as defined by the C macro NAN from <math.h>) took about 0.017 seconds less to execute on my i7. The difference was probably caused by the task scheduler.
So to be fair, they're exactly as fast.
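To repeat the measurement on another machine, a small harness along these lines could be used. It is a sketch, not the answerer's original program: the names mul_loop and seconds_for are made up, timing uses clock() from <time.h>, and the volatile store is an assumption added to keep the compiler from deleting the loop.

```c
#include <math.h>   /* NAN */
#include <time.h>   /* clock, CLOCKS_PER_SEC */

/* Multiplies `value` by 0, 1, ..., n-1 and returns the last product.
   The volatile store forces every multiplication to be performed. */
double mul_loop(double value, unsigned n)
{
    volatile double product = 0.0;
    for (unsigned i = 0; i < n; i++)
        product = value * i;
    return product;
}

/* Returns the CPU time in seconds spent in mul_loop(value, n). */
double seconds_for(double value, unsigned n)
{
    clock_t start = clock();
    (void)mul_loop(value, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing seconds_for(NAN, n) with seconds_for(4.0, n) for a large n, over several runs, gives the same kind of comparison as above. Results depend on the CPU and the compiler flags.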
