Memory access using __m128i address - sse

I'm working on one project that uses SSE in non-conventional ways. One aspect of it is that addresses of memory locations are kept duplicated in a __m128i variable.
My task is to get the value from memory using this address, as fast as possible. The value we want to get from memory is also 128 bits long. I know that keeping an address in __m128i is an abuse of SSE, but it cannot be done any other way; the addresses have to be duplicated.
My current implementation:
Get lower 64 bit of duplicated address using MOVQ
Having address, use MOVAPS to get value from the memory
In assembly it looks like this:
MOVQ %xmm1, %rax
MOVAPS (%rax), %xmm2
Question: can it be done faster? May be some optimizations can be applied if we do this multiple times in a row?

That movq / dereference sequence is your best bet if you have addresses stored in xmm registers.
Haswell's gather implementation is slower than manually loading things, so using VGATHERQPS (qword indices -> float data) is unlikely to be a win. Maybe with a future CPU design that has a much faster gather.
But the real question is why you would have addresses in XMM registers in the first place, especially duplicated into both halves of the register. This just seems like a bad idea that takes extra time to set up and extra time to use (especially on AMD hardware, where moves between GP and vector registers take 5 to 10 cycles, vs. 1 for Intel). It would be better to load addresses from RAM directly into GP registers.

MIPS location of registers

I think I have a fairly basic MIPS question but am still getting my head wrapped around how addressing works in MIPS.
My question is: What is the address of the register $t0?
I am looking at the following memory allocation diagram from the MIPS "green sheet"
I had two ideas:
The register $t0 has a register number of 8 so I'm wondering if it would have an address of 0x0000 0008 and be in the reserved portion of the memory block.
Or would it fall in the Static Data Section and have an address of 0x1000 0008?
I know that MARS and different assemblers might start the addressing differently as described in this related question:
How is the la instruction translated in MIPS?
I'm trying to understand what the "base" address is for register $t0 so I have a better understanding of how offset(base) addressing works.
For example what the address of 8($t0) would be
Thanks for the help!
feature       Registers   Memory
count         very few    vast
speed         fast        slow
Named         yes         no
Addressable   no          yes
There are typically 32 or fewer registers for programmers to use.  On a 32-bit machine we can address 2^32 different bytes of memory.
Registers are fast, while memory is slow, potentially taking dozens of cycles depending on cache features & conditions.
On a load-store machine, registers can be used directly in most instructions (by naming them in the machine-code instruction), whereas memory access requires a separate load or store instruction.  Computational instructions on such a machine typically allow naming up to 3 registers (e.g. target, source1, source2).  Memory operands have to be brought into registers for computation (and sometimes moved back to memory).
Registers can be named in instructions, but they do not have addresses and cannot be indexed.  On MIPS no register is aliased to an address in memory.  It is hard to put even a smallish array (e.g. an array of 10) in registers, because registers have no addresses and cannot be indexed.  Memory has numerical addresses, so we can rely on storing arrays and objects in a predictable pattern of addresses.  (Memory generally doesn't have names, just addresses; however, there are usually special memory locations for working with various I/O devices, and, as you note, memory is partitioned into sections that have start (and end) addresses.)
To be clear, memory-based aliases have been designed into some processors of the past.  The HP/1000 (circa 1970s-1980s), for example, had 2 registers (A & B), and they had aliases at memory locations 0 and 1, respectively.  However, this aliasing of CPU registers to memory is generally no longer done on modern processors.
For example what the address of 8($t0) would be
8($t0) refers to the memory address of (the contents of register $t0) + 8.  With proper usage, the program fragment would be using $t0 as a pointer, i.e., a variable that holds a memory address.

Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?

I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:
.text
test:
#xorps %xmm0, %xmm0
cvtsi2ss %edi, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
addss %xmm0, %xmm0
retq
.global test
This function follows GCC/Clang's x86-64 calling convention for the function declaration extern "C" float test(int); Note the commented-out xorps instruction. Uncommenting this instruction dramatically improves the performance of the function. Testing on my machine with an i7-8700K, Google Benchmark shows the function without the xorps instruction takes 8.54 ns (CPU), while the function with the xorps instruction takes 1.48 ns. I've tested this on multiple computers with various OSes, processors, processor generations, and different processor manufacturers (Intel and AMD), and they all exhibit a similar performance difference. Repeating the addss instruction makes the slowdown more pronounced (to a point), and the slowdown still occurs using other instructions here (e.g. mulss) or even a mix of instructions, so long as they all depend on the value in %xmm0 in some way. It's worth pointing out that the improvement appears only when xorps is executed on each function call: sampling the performance with a loop (as Google Benchmark does) with the xorps call outside the loop still shows the slower performance.
Since this is a case where exclusively adding instructions improves performance, this appears to be caused by something really low-level in the CPU. Since it occurs across a wide variety of CPU's, it seems like this must be intentional. However, I couldn't find any documentation that explains why this happens. Does anybody have an explanation for what's going on here? The issue seems to be dependent on complicated factors, as the slowdown I saw in my original code only occurred on a specific optimization level (-O2, sometimes -O1, but not -Os), without inlining, and using a specific compiler (Clang, but not GCC).
cvtsi2ss %edi, %xmm0 merges the float into the low element of XMM0 so it has a false dependency on the old value. (Across repeated calls to the same function, creating one long loop-carried dependency chain.)
xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).
Your CPU is a Skylake derivative, so those are the numbers; earlier Intel CPUs have 3-cycle latency and 1-cycle throughput, using a dedicated FP-add execution unit instead of running addss on the FMA units. https://agner.org/optimize/. Probably function call/ret overhead prevents you from seeing the full 8x expected speedup from the latency-times-bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should get that speedup if you remove the xorps dep-breaking from a loop within a single function.
GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.
Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function. (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when looping inside one function, instead of via multiple calls to the same function.
Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for more about how OoO exec can overlap short to medium length independent dep chains. Also related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) is about unrolling a dot-product with multiple accumulators to hide FMA latency.
https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.
You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).
As I mentioned in Why does the latency of the sqrtsd instruction change based on the input? Intel processors
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency). AVX finally lets us avoid this with vsqrtsd %src,%src, %dst at least for register sources if not memory, and similarly vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)
If cvtsi2ss/sd had zeroed the upper elements of the destination we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the scalar int->fp conversion is 2 uops and the vector strategy is 1+1, but not for double, where the packed int->fp conversion costs a shuffle + FP uop.)
This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend to the full 64-bit register instead of leaving it unmodified (aka merging). Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register? (writing 8 and 16-bit registers do cause false dependencies on AMD CPUs, and Intel since Haswell).

What happens when memory "wraps" on an IA-32 supporting machine?

I'm creating a 64-bit model of IA-32 and am representing memory as a 0-based array of 2**64 bytes (the language I'm modeling this in uses ** as the exponentiation operator). This means that valid indices into the array are from 0 to 2**64-1. Now, to model the possible modes of accessing that memory, one can treat one element as an 8-bit number, two elements as a (little-endian) 16-bit number, etc.
My question is, what should my model do if they ask for a 16-bit (or 32-bit, etc.) number from location 2**64-1? Right now, what the model does is say that the returned value is Memory(2**64-1) + (2**8 * Memory(0)). I'm not updating any flags (which feels wrong). Is wrapping like this the correct behavior? Should I be setting any flags when the wrapping happens?
I have a copy of Intel-64-ia-32-ISA.pdf which I'm using as a reference, but it's 1,479 pages, and I'm having a hard time finding the answer to this particular question.
The answer is in Volume 3A, section 5.3: "Limit checking."
For ia-32:
When the effective limit is FFFFFFFFH (4 GBytes), these accesses [which extend beyond the end of the segment] may or may not cause the indicated exceptions. Behavior is implementation-specific and may vary from one execution to another.
For 64-bit mode:
In 64-bit mode, the processor does not perform runtime limit checking on code or data segments. However, the processor does check descriptor-table limits.
I tested it (did anyone expect that?) for 64bit numbers with this code:
mov dword [0], 0xDEADBEEF
mov dword [-4], 0x01020304
mov rdi, [-4]
call writelonghex
In a custom OS, with pages mapped as appropriate, running in VirtualBox. writelonghex just writes rdi to the screen as a 16-digit hexadecimal number. The result: DEADBEEF01020304.
So yes, it does just wrap. Nothing funny happens.
No flags should be affected (though the manual doesn't say that no flags should be set for address wrapping, it does say that mov reg, [mem] doesn't affect them ever, and that includes this case), and no interrupt/trap/whatever happens (unless of course one or both pages touched are not present).

Heap overflow exploit

I understand that overflow exploitation requires three steps:
1. Injecting arbitrary code (shellcode) into the target process's memory space.
2. Taking control of EIP.
3. Setting EIP to execute the arbitrary code.
I read Ben Hawkes's articles about heap exploitation and understood a few tactics for ultimately overwriting a function pointer to point to my code.
In other words, I understand step 2.
I do not understand steps 1 and 3.
How do I inject my code into the process's memory space?
During step 3 I overwrite a function pointer with a pointer to my shellcode. How can I calculate/know what address my injected code was injected at? (In stack overflows this problem is solved by using "jmp esp".)
In a heap overflow, supposing that the system does not have ASLR activated, you will know the address of the memory chunks (aka, the buffers) you use in the overflow.
One option is to place the shellcode where the buffer is, given that you can control the contents of the buffer (as the application user). Once you have placed the shellcode bytes in the buffer, you only have to jump to that buffer address.
One way to perform that jump is by, for example, overwriting a .dtors entry. Once the vulnerable program finishes, the shellcode - placed in the buffer - will be executed. The complicated part is the .dtors overwriting. For that you will have to use the published heap exploiting techniques.
The prerequisites are that ASLR is deactivated (to know the address of the buffer before executing the vulnerable program) and that the memory region where the buffer is placed must be executable.
One more thing: steps 2 and 3 are the same. If you control eip, it follows that you will point it to the shellcode (the arbitrary code).
P.S.: Bypassing ASLR is more complex.
Step 1 requires a vulnerability in the attacked code.
Common vulnerabilites include:
buffer overflow (common in C code; happens if the program reads an arbitrarily long string into a fixed buffer)
evaluation of unsanitized data (common in SQL and script languages, but can occur in other languages as well)
Step 3 requires detailed knowledge of the target architecture.
How do I inject my code into process space?
This is quite a statement/question. It requires an 'exploitable' region of code in said process space. For example, Windows is currently rewriting most strcpy() calls to strncpy() if at all possible. I say if possible because not all areas of code that use strcpy can successfully be changed over to strncpy. Why? Because of the crux of the difference shown below:
strcpy(buffer, copied);
or
strncpy(buffer, copied, sizeof(buffer));
This is what makes strncpy difficult to use in real-world scenarios: most strncpy calls need a 'magic number' for the destination size (here produced by the sizeof operator).
As coders we are taught that hard-coding sizes, such as a strict char buffer[1024];, is really bad practice.
BUT, in comparison, using buffer[] = ""; or buffer[1024] = ""; is the heart of the exploit. HOWEVER, if for example we change this code to the latter we get another exploit introduced into the system...
char * buffer;
char * copied;
strcpy(buffer, copied); // overflow this right here...
OR THIS:
int size = 1024;
char buffer[size];
char copied[size];
strncpy(buffer,copied, size);
This will stop overflows, but introduce an exploitable region in RAM, because the size is predictable and the data is structured into 1024-byte blocks.
Therefore, to the original poster: searching a program's address space for strcpy, for example, will show the program to be exploitable if strcpy is present.
There are many reasons why strcpy is favoured by programmers over strncpy. Magic numbers, variable input/output data size...programming styles...etc...
HOW DO I FIND MYSELF IN MY CODE (MY LOCATION)
Check various hacker books for examples of this ~
BUT, try;
label:
pop eax
pop eax
call pointer
jmp label
pointer:
mov esp, eax
jmp $
This is an example that is non-working, due to the fact that I do NOT want to be held responsible for writing the next Morris Worm! But any decent programmer will get the gist of this code and know immediately what I am talking about here.
I hope your overflow techniques work in the future, my son!

Purpose of memory alignment

Admittedly I don't get it. Say you have a memory with a memory word length of 1 byte. Why can't you access a 4-byte-long variable in a single memory access on an unaligned address (i.e. not divisible by 4), as is the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects. From this question on cache-line sizes:
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single-byte fetches rather than one efficient word fetch, but many language designers decided it would be easier just to outlaw unaligned access and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line; because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2-byte variable that was split across 2 chunks, that required double the memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16 GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
SPARC and x86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs. four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
