LC-3: BLKW: how to specify the memory location to store data at?

In LC-3, when you use BLKW, how do you arrange for the block of memory to be at location x3102 instead of at the next available memory location?

First, a side note: all the memory of the 16-bit address space is there for you to use, according to the memory map (at least the range x3000-xFDFF), and it is initialized to zero; you can use it at will.
Generally speaking, LC-3 assemblers don't allow multiple .ORIG directives throughout a file; they require exactly one, at the beginning of the file. If they did allow subsequent .ORIG directives, that would be a way to accomplish what you're asking about.
But even if they did, we'd frequently run into instruction offset encoding limitations, so there's an alternate solution I'll show below.
But first, let's look at the instruction offset/immediate encoding limitations.
The usual data memory access instruction formats have a very limited offset, only 9 bits' worth (-256 to +255, relative to the incremented PC). So, for example, the following won't work:
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LEA R2, TESTB ; will get an error due to the instruction offset encoding limitation
LD R3, TESTB ; will get an error due to the instruction offset encoding limitation
HALT
TESTA
.FILL #1234
.BLKW x1FA ; exactly enough padding to relocate TESTB to x3200
TESTB
.FILL #4321 ; which we can initialize with a non-zero value
.END
This is illustrative: while it successfully places TESTB at x3200, TESTB cannot be reached by either the LEA or the LD instruction, because x3200 is beyond the reach of the 9-bit PC-relative displacement.
(There is also the practical limitation that, as instructions are added, the .BLKW operand has to shrink to match, which is clearly painful. This bookkeeping would also have been eliminated by supporting a .ORIG directive mid-file.)
So, the alternative for large blocks and the like is to use that other, zero-initialized memory, and to reference it through pointer variables located nearby: LD loads an address (where LEA would have been used), and LDI accesses a value (where LD would have been used).
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LD R2, TESTBRef ; will put x3100 into R2
LDI R3, TESTBRef ; will access the memory value at address x3100
HALT
TESTA
.FILL #1234
TESTBRef ; a nearby data pointer, which points using the full 16 bits
.FILL x3100
.END
In this latter version, there is no declaration reserving storage at x3100, nor can we give that storage any non-zero initial values (e.g. no strings, no pre-defined arrays). Instead, a data-to-data pointer, TESTBRef, is used. Unlike code-to-code/data references (i.e. instructions referencing code or data), data-to-code/data pointers (i.e. data referencing code or data) have the full 16 bits available for pointing.
So, once we take this approach of simply using other memory, we forgo the automatic placement of labels after other labels (for those other areas), and we also forgo non-zero initialization.
Some LC-3 assemblers allow multiple files, each with its own .ORIG directive; by using multiple files, we can place code and data at various locations in the address space. However, the code-to-data instruction offset encoding limits still apply, so you'll likely end up managing such other memory areas manually anyway, and using data pointers as well.
Note that the JSR instruction has an 11-bit offset so code-to-code references can reach farther than code-to-data references.

Related

register and memory, risc-v

I'm studying computer architecture at my university, and I guess I don't know the basics of computer systems and C language concepts. A few things really confuse me; I kept searching about them but couldn't find the answers I wanted, which confused me more, so I'm posting the question here.
1. I thought a register holds an instruction, a storage address, or any other kind of data in the CPU. And I also learned the memory layout:
------------------
stack
dynamic data
static data
text
reserved part
------------------
So do registers have this memory layout inside the CPU? Or am I just confusing it with the layout of the memory component among the computer's five components (input, output, memory, control, datapath)? I thought this was the layout of one of those five components.
RISC-V (while loop in C)
Loop:
slli x10, x22, 3
add x10, x10, x25
ld x9, 0(x10)
bne x9, x24, Exit
addi x22, x22, 1
beq x0, x0, Loop
Exit:...
Then where does this operation happen? In registers?
I learned RISC-V Registers like below.
x0: the constant value 0
x1: return address
...
x5-x7, x28-x31: temporaries
...
If registers are inside the memory layout I drew above, then where are those x0, x1 things contained? It doesn't make sense to me from here. So I'm confused about how I should picture what a register looks like.
Everything is so abstract in my mind, so I guess the question sounds a bit weird. If anything is unclear, please comment.
So do registers have this memory layout inside the CPU?
No, that makes zero sense; your thinking is on the wrong track here.
The register file is its own separate space, not part of the memory address space. It's not indexable with a variable, only by hard-coding register numbers into instructions, so there's not really any sense in which x2 is the "next register after x1" or anything; e.g. you can't loop over registers. Registers and memory are just two separate data-storage spaces (32- or 64-bit, depending on the ISA variant) that software can use however it wants.
The natural categories to break them up are based on software / calling conventions:
stack pointer
call-preserved registers (function calls don't modify them, or conversely if you want to use one in a function you have to save/restore it)
call-clobbered registers (function calls must be assumed to step on them, and conversely can be used without saving/restoring)
the zero register.
Also arg-passing vs. return-value registers.

aarch64 inline assembly stack pointer constraint memory address with offset for Clang 6+

I noticed that, at different optimization levels, Clang 6 sometimes uses ldp (load NEON register pair) for vld1 NEON load intrinsics at adjacent memory addresses.
I am trying to use inline assembly to manually force more load-pair instructions. The source array is held on the stack, and when Clang itself produces ldp instructions it uses the stack pointer with an offset; however, when I pass the indexed array into inline assembly, it expands into an x register holding the address. This works, but it causes a performance regression. I believe this is because reading from the stack is faster, while an x register as a source address might be pointing to the heap, which may in turn reference back to the stack, though I am not sure; or perhaps it is reading duplicate data from the heap.
This is an example what I am using now.
asm (
"ldp %q[DST1], %q[DST2], [%[SRC]]" "\n"
: [DST1] "=w" (TMP1), [DST2] "=w" (TMP2)
: [SRC] "X" (&K2[8])
);
and this is what Clang expands it into
ldp q19, q4, [x11]
But I want to use a stack pointer with offset address, automatically resolved from the indexed K2 array variable. e.g.
ldp q19, q4, [sp,#32]
The offsets of the stack pointer address in the disassembled code are not adjacent, so I cannot just hard code the sp register and enter an offset to load sequential data. This is because Clang 6 is consolidating identical values in other arrays used by other functions into the stack.
GCC has aarch64 machine constraints like k, which is for the stack pointer (sp) register, and Ump, which is meant for stp/ldp store/load-pair instruction addresses; I never got those to work on either GCC or Clang, and the latter has no equivalent constraints in its sparse documentation.
My preference is to use Clang 6 as it is producing code that is over 6% faster than GCC 8 because it is arranging most of the instructions in a performance critical loop to dual issue properly.
Is there any way to enter an array with an index as an input into inline assembly and have it automatically resolve to a stack-pointer-with-offset address in Clang 6?
Have you tried using a memory source operand like [SRC] "m" (K2[8])? Without that, you haven't even told the compiler that the memory contents are also an input to the inline asm, so it might reorder your asm wrt. stores, or do dead-store elimination.
Letting the compiler pick the addressing mode is the entire point of "m" operands.

Assembly memory addressing in big endian format

Kinda stuck here and was hoping for a pointer on memory addressing.
In theory, these represent R1 through R4. I assume 0x60 is R1, and 0x6C is R4, incrementing by a word each time. Is that the case?
If I wanted to run
ADD R1, R2
Would it store the result of the addition of 0x60 and 0x6C in memory location 0x60? Or am I looking at this wrong?
ARM registers do not correspond to any memory location. In some contexts ("spill slots" on the stack, "task state" used for multitasking) there will be memory locations reserved to save the contents of some or all registers, but they must be explicitly copied back and forth.
The problem you're trying to do is poorly worded, but I think the table gives the values of memory locations 0x60 through 0x6C, and, separately, the text ("[R1] = ..., [R2] = ..., etc") gives the values of the registers. If I'm reading this right, the instruction labeled (a) will copy the low byte of the value at memory location 0x62, which is either 0x9A or 0x90, I'm not sure which, into register R1, sign-extending it. I hope that's enough to get you unstuck.

Memory access using __m128i address

I'm working on a project that uses SSE in unconventional ways. One aspect of it is that addresses of memory locations are kept, duplicated, in an __m128i variable.
My task is to get a value from memory using this address, as fast as possible. The value we want to get from memory is also 128 bits long. I know that keeping an address in an __m128i is an abuse of SSE, but it cannot be done any other way; the addresses have to be duplicated.
My current implementation:
Get the lower 64 bits of the duplicated address using MOVQ
Having the address, use MOVAPS to get the value from memory
In assembly it looks like this:
MOVQ %xmm1, %rax
MOVAPS (%rax), %xmm2
Question: can it be done faster? May be some optimizations can be applied if we do this multiple times in a row?
That movq / dereference sequence is your best bet if you have addresses stored in xmm registers.
Haswell's gather implementation is slower than manually loading things, so using VGATHERQPS (qword indices -> float data) is unlikely to be a win. Maybe with a future CPU design that has a much faster gather.
But the real question is why you would have addresses in XMM registers in the first place, especially duplicated into both halves of the register. This just seems like a bad idea that takes extra time to set up and extra time to use (especially on AMD hardware, where a move between general-purpose and vector registers takes 5 or 10 cycles, vs. 1 on Intel). It would be better to load addresses from RAM directly into general-purpose registers.

What happens when memory "wraps" on an IA-32 supporting machine?

I'm creating a 64-bit model of IA-32 and am representing memory as a 0-based array of 2**64 bytes (the language I'm modeling this in uses ** as the exponentiation operator). This means that valid indices into the array are from 0 to 2**64-1. Now, to model the possible modes of accessing that memory, one can treat one element as an 8-bit number, two elements as a (little-endian) 16-bit number, etc.
My question is: what should my model do if the program asks for a 16-bit (or 32-bit, etc.) number at location 2**64-1? Right now, the model says that the returned value is Memory(2**64-1) + (2**8 * Memory(0)). I'm not updating any flags (which feels wrong). Is wrapping like this the correct behavior? Should I be setting any flags when the wrap happens?
I have a copy of Intel-64-ia-32-ISA.pdf which I'm using as a reference, but it's 1,479 pages, and I'm having a hard time finding the answer to this particular question.
The answer is in Volume 3A, section 5.3: "Limit checking."
For ia-32:
When the effective limit is FFFFFFFFH (4 GBytes), these accesses [which extend beyond the end of the segment] may or may not cause the indicated exceptions. Behavior is implementation-specific and may vary from one execution to another.
For 64-bit mode:
In 64-bit mode, the processor does not perform runtime limit checking on code or data segments. However, the processor does check descriptor-table limits.
I tested it (did anyone expect that?) for 64bit numbers with this code:
mov dword [0], 0xDEADBEEF
mov dword [-4], 0x01020304
mov rdi, [-4]
call writelonghex
In a custom OS, with pages mapped as appropriate, running in VirtualBox. (writelonghex just writes rdi to the screen as a 16-digit hexadecimal number.)
The result: yes, it does just wrap. Nothing funny happens.
No flags are affected (the manual doesn't explicitly say that no flags are set for address wrapping, but it does say that mov reg, [mem] never affects them, and that includes this case), and no interrupt/trap/whatever happens (unless, of course, one or both of the pages touched are not present).

Resources