Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster? - clang

I was messing around with optimizing a function using Google Benchmark, and ran into a situation where my code was unexpectedly slowing down in certain situations. I started experimenting with it, looking at the compiled assembly, and eventually came up with a minimal test case that exhibits the issue. Here's the assembly I came up with that exhibits this slowdown:
.text
test:
    # xorps %xmm0, %xmm0
    cvtsi2ss %edi, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    addss %xmm0, %xmm0
    retq
.global test
This function follows GCC/Clang's x86-64 calling convention for the declaration extern "C" float test(int);. Note the commented-out xorps instruction: uncommenting it dramatically improves the performance of the function. Benchmarking on my machine with an i7-8700K, Google Benchmark shows the function without the xorps instruction takes 8.54ns (CPU time), while the function with it takes 1.48ns. I've tested this on multiple computers with various OSes, processors, processor generations, and manufacturers (Intel and AMD), and they all show a similar performance difference. Repeating the addss instruction makes the slowdown more pronounced (up to a point), and the slowdown still occurs with other instructions (e.g. mulss), or even a mix of instructions, as long as they all depend on the value in %xmm0 in some way. It's worth pointing out that the improvement only appears when xorps runs on every call: sampling the performance in a loop (as Google Benchmark does) with the xorps call hoisted outside the loop still shows the slower performance.
Since this is a case where exclusively adding instructions improves performance, it appears to be caused by something really low-level in the CPU, and since it occurs across a wide variety of CPUs, it seems like it must be intentional. However, I couldn't find any documentation that explains why this happens. Does anybody have an explanation for what's going on here? The issue seems to depend on complicated factors, as the slowdown I saw in my original code only occurred at a specific optimization level (-O2, sometimes -O1, but not -Os), without inlining, and with a specific compiler (Clang, but not GCC).

cvtsi2ss %edi, %xmm0 merges the float into the low element of XMM0 so it has a false dependency on the old value. (Across repeated calls to the same function, creating one long loop-carried dependency chain.)
xor-zeroing breaks the dep chain, allowing out-of-order exec to work its magic. So you bottleneck on addss throughput (0.5 cycles) instead of latency (4 cycles).
Your CPU is a Skylake derivative, so those are the numbers; earlier Intel CPUs have 3-cycle latency and 1/clock throughput, using a dedicated FP-add execution unit instead of running addss on the FMA units. https://agner.org/optimize/. Function call/ret overhead probably prevents you from seeing the full 8x speedup expected from the latency * bandwidth product of 8 in-flight addss uops in the pipelined FMA units; you should get that speedup if you remove xorps dep-breaking from a loop within a single function.
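As a rough sanity check on those numbers (a back-of-the-envelope estimate, assuming the 8700K was running near its single-core turbo of roughly 4.7 GHz): without xorps, the 8 dependent addss serialize to 8 x 4 = 32 cycles per call, plus a few cycles of cvtsi2ss latency, which lines up with the measured 8.54 ns being around 40 cycles. With the chain broken, the 8 addss uops overlap almost completely, so each call is dominated by call/ret and cvtsi2ss overhead: 1.48 ns is around 7 cycles. That gives the ~5-6x you measured rather than the ideal 8x.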
GCC tends to be very "careful" about false dependencies, spending extra instructions (front-end bandwidth) to break them just in case. In code that bottlenecks on the front-end (or where total code size / uop-cache footprint is a factor) this costs performance if the register was actually ready in time anyway.
Clang/LLVM is reckless and cavalier about it, typically not bothering to avoid false dependencies on registers not written in the current function. (i.e. assuming / pretending that registers are "cold" on function entry). As you show in comments, clang does avoid creating a loop-carried dep chain by xor-zeroing when looping inside one function, instead of via multiple calls to the same function.
Clang even uses 8-bit GP-integer partial registers for no reason in some cases where that doesn't save any code-size or instructions vs. 32-bit regs. Usually it's probably fine, but there's a risk of coupling into a long dep chain or creating a loop-carried dependency chain if the caller (or a sibling function call) still has a cache-miss load in flight to that reg when we're called, for example.
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for more about how OoO exec can overlap short to medium length independent dep chains. Also related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) is about unrolling a dot-product with multiple accumulators to hide FMA latency.
https://www.uops.info/html-instr/CVTSI2SS_XMM_R32.html has performance details for this instruction across various uarches.
You can avoid this if you can use AVX, with vcvtsi2ss %edi, %xmm7, %xmm0 (where xmm7 is any register you haven't written recently, or which is earlier in a dep chain that leads to the current value of EDI).
As I mentioned in Why does the latency of the sqrtsd instruction change based on the input? Intel processors:
This ISA design wart is thanks to Intel optimizing for the short term with SSE1 on Pentium III. P3 handled 128-bit registers internally as two 64-bit halves. Leaving the upper half unmodified let scalar instructions decode to a single uop. (But that still gives PIII sqrtss a false dependency.) AVX finally lets us avoid this with vsqrtsd %src, %src, %dst, at least for register sources if not memory, and likewise vcvtsi2sd %eax, %cold_reg, %dst for the similarly near-sightedly designed scalar int->fp conversion instructions.
(GCC missed-optimization reports: 80586, 89071, 80571.)
If cvtsi2ss/sd had zeroed the upper elements of registers we wouldn't have this stupid problem / wouldn't need to sprinkle xor-zeroing instructions around; thanks Intel. (Another strategy is to use SSE2 movd %eax, %xmm0, which does zero-extend, then a packed int->fp conversion which operates on the whole 128-bit vector. This can break even for float, where the scalar int->fp conversion is 2 uops and the vector strategy is 1+1, but not for double, where the packed int->fp conversion costs a shuffle + FP uop.)
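As a minimal sketch of that movd + packed-conversion idea with intrinsics (the helper name is made up; assumes SSE2 is available):

#include <immintrin.h>

// hypothetical helper: int -> float without a false dependency on the old xmm value
float int_to_float_no_merge(int x) {
    __m128i vi = _mm_cvtsi32_si128(x);   // movd: writes the whole register (zero-extends)
    __m128  vf = _mm_cvtepi32_ps(vi);    // cvtdq2ps: packed int->float, also writes the full register
    return _mm_cvtss_f32(vf);            // take element 0
}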
This is exactly the problem that AMD64 avoided by making writes to 32-bit integer registers implicitly zero-extend to the full 64-bit register instead of leaving it unmodified (aka merging). Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register? (Writes to 8- and 16-bit registers do cause false dependencies on AMD CPUs, and on Intel since Haswell.)

Related

LC-3: BLKW how to specify memory location to store data at?

In LC-3 when you use BLKW how do you initialize the block of memory to be at location x3102 instead of the next available memory location?
First, a side note: all the memory of the 16-bit address space is there for you to use, according to the memory map (e.g. at least the range x3000-xFDFF), and it is initialized to zero; you can use it at will.
Generally speaking, LC-3 assemblers don't allow multiple .ORIG directives throughout a file; instead, they require exactly one, at the beginning of the file. If they did allow subsequent .ORIG directives, that would be a way to accomplish what you're asking about.
But even if they did, we'd frequently run into instruction offset encoding limitations, so there's an alternate solution I'll show below.
But first, let's look at the instruction offset/immediate encoding limitations.
The usual data memory access instruction formats have a very limited offset, only 9 bits worth (+/- about 256), and the offset is pc-relative. So, for example, the following won't work:
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LEA R2, TESTB ; will get an error due to instruction offset encoding limitation
LD R3, TESTB ; will get an error due to instruction offset encoding limitation
HALT
TESTA
.FILL #1234
.BLKW xFA ; exactly enough padding to relocate TESTB to x3100
TESTB
.FILL #4321 ; which we can initialize with a non-zero value
.END
This is illustrative: while it will successfully place TESTB at x3100, TESTB cannot be reached by either the LEA or the LD instruction due to the limited 9-bit pc-relative displacement.
(There is also the practical limitation that as instructions are added, the .BLKW operand has to shrink correspondingly, which is clearly painful — this aspect would have been eliminated by supporting a .ORIG directive within the file.)
So, the alternative for large blocks and other far-away data is to resort to using zero-initialized memory, and referencing that other memory through pointer variables located nearby: using LD to load an address rather than LEA, and LDI to access a value rather than LD.
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LD R2, TESTBRef ; will put x3100 into R2
LDI R3, TESTBRef ; will access the memory value at address x3100
HALT
TESTA
.FILL #1234
TESTBRef ; a nearby data pointer, that points using the full 16-bits
.FILL x3100
.END
In the latter, above, there is no declaration to reserve storage at x3100, nor can we give that storage at x3100 any non-zero initial values (e.g. no strings, no pre-defined arrays).
A data-to-data pointer, TESTBRef, is used. Unlike code-to-code/data references (i.e. instructions referencing code or data), data-to-code/data pointers (i.e. data referencing code or data) have all 16 bits available for pointing.
So, once we use this approach, of simply using other memory, we forgo the automatic placement of labels after other labels (for those other areas), and also forgo non-zero initialization.
Some LC-3 assemblers allow multiple files, each with its own .ORIG directive — so by using multiple files, we can place code & data at varied locations in the address space. However, the code-to-data instruction offset encoding limits still apply, so you'll likely end up managing such other memory areas manually anyway, and using data pointers as well.
Note that the JSR instruction has an 11-bit offset so code-to-code references can reach farther than code-to-data references.

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading this data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd first copy the data into a temporary stack array, vs. copying the data as values directly with _mm_set_ps. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps() when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks, as in the sketch below. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents, while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
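For example, a sketch of that two-chunk load with intrinsics, for the simple case where a[i+0..1] should land in the low half and a[i+2..3] in the high half (add a shufps afterwards if you want the swapped-pair order from the example above); the pointer casts are the usual idiom for these load intrinsics:

__m128 lo = _mm_castpd_ps(_mm_load_sd((const double*)&a[i]));   // movsd load: low 2 floats, high half zeroed
__m128 v  = _mm_loadh_pi(lo, (const __m64*)&a[i + 2]);          // movhps: fill the high 2 floats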
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
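For reference, the manual 4-element gather with single-instruction intrinsics usually ends up looking something like this sketch (p0..p3 are hypothetical pointers to the four scalars; it uses unpcklps/movlhps rather than shufps, for 4 loads + 3 shuffles):

__m128 v0 = _mm_load_ss(p0);           // movss: [ *p0, 0, 0, 0 ]
__m128 v1 = _mm_load_ss(p1);
__m128 v2 = _mm_load_ss(p2);
__m128 v3 = _mm_load_ss(p3);
__m128 lo = _mm_unpacklo_ps(v0, v1);   // unpcklps: [ *p0, *p1, 0, 0 ]
__m128 hi = _mm_unpacklo_ps(v2, v3);   // unpcklps: [ *p2, *p3, 0, 0 ]
__m128 v  = _mm_movelh_ps(lo, hi);     // movlhps:  [ *p0, *p1, *p2, *p3 ]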
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and another part needs it another way. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Do any CPUs have hardware support for bounds checking?

It doesn't seem like it would be difficult to associate ranges with segments of memory. Then have an assembly instruction which treats 2 integers as "location" & "offset" (another for "data" if setting), and returns the data and error code. This would mean no longer having to make a choice between speed and security/safety when working with arrays.
Another example might be a function which verifies that instructions originating in a particular memory range cannot physically access memory outside that range. If all hardware connected to the motherboard had this capability (and were made to be compatible with each other), it would be trivial to make perfect virtual machines that run at nearly the same speed as the physical machine.
Yes.
Decades ago, Lisp machines performed simultaneous validation checks (e.g. type checks and bounds checks) as the program ran with the assumption the program and state were valid, jumping "back in time" if the check failed - unfortunately this ability to get "free" runtime validation was lost when conventional (i.e. x86) machines became dominant.
https://en.wikipedia.org/wiki/Lisp_machine
Lisp Machines ran the tests in parallel with the more conventional single instruction additions. If the simultaneous tests failed, then the result was discarded and recomputed; this meant in many cases a speed increase by several factors. This simultaneous checking approach was used as well in testing the bounds of arrays when referenced, and other memory management necessities (not merely garbage collection or arrays).
Fortunately we're finally learning from the past and slowly, piecemeal, reintroducing those innovations - Intel's "MPX" (Memory Protection eXtensions) for x86, introduced in Skylake-generation processors, provides hardware bounds-checking - though it isn't perfect.
(x86 is a regression in other ways too: IBM's mainframes had true hardware-accelerated system virtualization in the 1980s - we didn't get it on x86 until 2005 with Intel's "VT-x" and AMD's "AMD-V" extensions).
x86 BOUND
Technically, x86 does have hardware bounds-checking: the BOUND instruction was introduced in 1982 with the Intel 80186/80188 (and is also present on the 286 and later, but not on the original 8086 or 8088).
While the BOUND instruction does provide hardware bounds-checking, I understand it indirectly caused performance issues: reportedly it interacts badly with the hardware branch predictor (according to a Reddit thread, though I'm unsure why), and it requires the bounds to be specified as a tuple in memory, which is terrible for performance. I understand that at runtime it's no faster than manually emitting the instructions for "if index not in range [x,y] then signal the BR exception to the program or OS" (so you might imagine the BOUND instruction was added for the convenience of people who coded assembly by hand, which was quite common in the 1980s).
The BOUND instruction is still present in today's processors, but it was not carried over into AMD64 (x64) - likely for the performance reasons explained above, and also because very few people were likely using it (and compilers could trivially replace it with a manual bounds check that might have better performance anyway, since it could use registers).
Another disadvantage of storing the array bounds in memory is that code elsewhere (that wasn't subject to BOUND checking) could overwrite the previously written bounds for another pointer and circumvent the check that way - this is mostly a problem with code that intentionally tries to disable safety features (i.e. malware). And if the bounds were stored on the stack, then given how easy it is to corrupt the stack, the check has even less utility.
Intel MPX
Intel MPX was introduced in the Skylake architecture in 2015 and should be present in all Skylake and subsequent processor models in the mainstream Intel Core family (including Xeon, and non-SoC versions of Celeron and Pentium). Intel also implemented MPX in the Goldmont architecture (Atom, and SoC versions of Celeron and Pentium) from 2016 onwards.
MPX is superior to BOUND in that it provides dedicated registers to store the bounds range, so the bounds-check should be almost zero-cost compared to BOUND, which required a memory access. On the Intel 486 the BOUND instruction takes 7 cycles (compare to CMP, which takes only 2 cycles even if an operand is a memory address). In Skylake the MPX equivalents (BNDMK, BNDCL and BNDCU) are all 1-cycle instructions, and BNDMK can be amortized as it only needs to be executed once for each new pointer.
I cannot find any information on whether or not AMD has implemented their own version of MPX yet (as of June 2017).
Critical thoughts on MPX
Unfortunately, the current state of MPX is not all that rosy - a February 2017 paper by Oleksenko, Kuvaiskii, et al., "Intel MPX Explained" (PDF link; caution: not yet peer-reviewed at the time), is rather critical:
Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX’s performance overheads are still high (~50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions on memory layout require substantial code changes for some programs.
Also note that, compared to the Lisp Machines of yore, Intel MPX checks are still executed inline - whereas on Lisp Machines (if my understanding is correct) the bounds checks happened concurrently in hardware, with a retroactive jump backwards if a check failed; thus, so long as a running program's pointers did not point to out-of-bounds locations, there would be zero runtime performance cost. So if you have this C code:
char arr[10];
arr[9] = 'a';
arr[8] = 'b';
Then under MPX this would be executed:
Time Instruction Notes
1 BNDMK arr, arr+9 Set bounds arr to arr+9.
2 BNDCL arr Check `arr` meets lower-bound.
3 BNDCU arr Check `arr` meets upper-bound.
4 MOV 'a' arr+9 Assign 'a' to arr+9.
5 MOV 'b' arr+8 Assign 'b' to arr+8.
But on a Lisp machine (if it were magically possible to compile C to Lisp...), then the program-reader-hardware in the computer has the ability to execute additional "side" instructions concurrently with the "actual" instructions, allowing the "side" instructions to instruct the computer to disregard the results from the "actual" instructions in the event of an error:
Time Actual instruction Side instruction
1 MOV 'a' arr+9 ENSURE arr+9 BETWEEN arr, arr+9
2 MOV 'b' arr+8 ENSURE arr+8 BETWEEN arr, arr+9
I understand the instructions-per-cycle for the "side" instructions are not the same as for the "actual" instructions - so the side-check for the instruction at Time=1 might only complete after the "actual" instructions have already progressed on to Time=3 - but if the check failed, it would pass the instruction pointer of the failing instruction to the exception handler, which would direct the program to disregard the results of the instructions executed after Time=1. I don't know how they could achieve that without massive amounts of memory or some mandatory execution pauses, possibly memory-fencing too - that's outside the scope of my answer, but it is at least theoretically possible.
(Note in this contrived example I'm using constexpr index values that a compiler can prove will never be out-of-bounds so would omit the MPX checks entirely - so pretend they're user-supplied variables instead :) ).
I'm not an expert in x86 (nor do I have any experience in microprocessor design, save for a CS500-level course I took at UW and didn't do the homework for...), but I don't believe concurrent execution of bounds-checks or "time travel" is possible with x86's current design, despite the existing implementation of out-of-order execution - I might be wrong, however. I speculate that if all pointer types were promoted to 3-tuples (struct BoundedPointer<T> { T* ptr; T* min; T* max; } - which technically already happens with MPX and other software-based bounds-checks, as every guarded pointer has its bounds defined when BNDMK is called), then the protection could be provided for free by the MMU - but now every pointer would consume 24 bytes of memory instead of the current 8 bytes (or compare to the measly 4 bytes under 32-bit x86). RAM is plentiful, but still a finite resource that shouldn't be wasted.
MPX in GCC
GCC supported MPX from version 5.0 until 9.1 ( https://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler ), when it was removed due to its maintenance burden.
MPX in Visual Studio / Visual C++
Visual Studio 2015 Update 1 (2015.1) added "experimental" support for MPX with the /d2MPX switch ( https://blogs.msdn.microsoft.com/vcblog/2016/01/20/visual-studio-2015-update-1-new-experimental-feature-mpx/ ). Support is still present in Visual Studio 2017 but Microsoft has not announced if it's considered a mainstream (i.e. non-experimental) feature yet.
MPX in Clang / LLVM
Clang has partially supported manual use of MPX in the past, but that support was fully removed in version 10.0.
As of July 2021, LLVM still seems capable of outputting MPX instructions, but I can't see any evidence of an MPX "pass".
MPX in Intel C/C++ Compiler
The Intel C/C++ Compiler has supported MPX since version 15.0.
The XL compilers, available for the IBM POWER processors on the Little Endian Linux, Big Endian Linux, and AIX operating systems, have a different implementation of array bounds checking.
Using the -qcheck or its synonym -C option turns on various kinds of checking. -qcheck=bounds checks array bounds. When this is used, the compilers check that every array reference has a valid subscript.
The hardware instruction used is a conditional trap, comparing the subscript to the upper limit and trapping if the subscript is too large or too small. In C and C++ the lower limit is 0. In Fortran it defaults to 1 but can be any integer. When it is not zero, the lower limit is subtracted from the subscript being checked, and the check compares that to the upper limit minus the lower limit.
When the limit is known at compile time and small enough, a conditional trap immediate instruction is enough. When the limit is calculated at execution time or is greater than 65535, a conditional trap instruction comparing two registers is needed.
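Conceptually, the inserted check behaves like this C sketch (this is only an illustration of the comparison described above, not actual XL output; XL emits a single conditional trap instruction rather than a branch, and __builtin_trap is just a stand-in for that trap):

/* illustration only: trap if subscript is outside [lower, upper] */
static inline void check_subscript(long sub, long lower, long upper)
{
    /* normalizing to a zero-based range lets one unsigned compare
       catch both "too small" and "too large" */
    if ((unsigned long)(sub - lower) > (unsigned long)(upper - lower))
        __builtin_trap();
}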
The performance impact is small for several reasons:
1. The conditional trap instructions are fast.
2. They are executed in a standard integer pipeline. Since most POWER CPUs have 2 or 4 integer pipelines, there is usually an otherwise empty slot to put the trap in, so it is often essentially zero cost.
3. When it can, the optimizer moves the conditional trap out of loops so it is executed only once, checking all loop iterations at once.
4. When it can prove the actual subscript cannot exceed the limit, the optimizer discards the instruction.
5. When it can prove the subscript will always be invalid, the optimizer uses an unconditional trap.
6. If necessary, -qcheck can be used during testing and skipped for production builds, but the overhead is small enough that this is not usually necessary.
If my memory is correct, one long-ago paper reported a 2% slowdown in one case and 0% in another. Since that CPU had only one integer pipeline, the slowdown should be significantly smaller on modern CPUs.
Other checking using the same mechanism is available to detect dereferencing NULL pointers, dividing an integer by zero, using an uninitialized auto variable, specially written asserts, etc.
This doesn't include all kinds of invalid memory usage, but it does handle the most common kind, does it very efficiently, and is very easy to use.
GCC supports -fbounds-check for similar purposes, but at this time it is only available for the Fortran front end (gfortran).

Is the stack only preserved above the stack pointer?

I sometimes see disassembled programs which have instructions like:
mov %eax, -4(%esp)
which stores eax to stack at esp-4, without changing esp.
I'd like to know whether in general, you could put data into the stack beyond the stack pointer, and have those data be preserved (not altered unless I do it specifically).
Also, does this depend on which OS I use?
It matters which OS you use, because different OSes have different ABIs. (See the x86 tag wiki if you don't know what that means).
There are two ways I can see that mov %eax, -4(%esp) could be sane:
In the Linux x32 ABI (long mode with 32bit pointers), where there's a 128B red zone like in the normal x86-64 ABI. Compilers frequently generate code using the address-size prefix when they can't prove that e.g. 4(%rdi) would be the same as 4(%edi) in every case (e.g. wraparound). Unfortunately gcc 5.3 still uses 32bit addressing for locals on the stack, which could only wrap if %rsp == 0 (since the ABI requires it to be 16B-aligned).
Anyway, void foo(void) { volatile int x = 10; } compiles to
movl $10, -4(%esp) / ret with gcc 5.3 -O3 -mx32 on the Godbolt Compiler Explorer.
In (kernel) code that runs with interrupts disabled. Since nothing asynchronous other than DMA can happen, nothing can clobber your stack memory. (Although x86 has NMIs: Non-maskable interrupts. Depending on the handler for NMIs, and whether they can be blocked at all, NMIs could clobber memory below the stack pointer, I think.)
In user-space, your signal handlers aren't the only thing that can asynchronously clobber memory below the stack pointer:
As Jester points out in comments on dwelch's answer, pages below the stack pointer can be discarded (asynchronously of course), so a process that temporarily uses a lot of stack isn't wasting all those pages forever. If %esp happens to be at a page boundary, -4(%esp) is in a different page. And instead of faulting in a newly-allocated page of stack memory, accesses to unmapped pages below the stack pointer turn into segfaults on Linux.
Unless you have a guarantee otherwise (e.g. the red zone), you must assume that everything below %esp is scribbled over between every instruction. None of the standard 32-bit ABIs have a red zone, and the Windows 64-bit ABI also lacks one. Asynchronous use of the stack (usually by signal handlers in Linux) is a whole-program thing, not something the compiler can determine just from the current compilation unit (even in cases where the compiler could prove that -4(%esp) is in the same page as (%esp)).
Note that the Linux x32 ABI is a 64bit ABI for AMD64 aka x86-64, not i386 aka IA32 aka x86-32. It's much more like the usual AMD64 ABI, since it was designed after.
EDIT
Not sure what you mean by above and below, since some folks "see" addresses increasing up and others increasing down.
But it doesn't matter: if the stack was initialized at address X and is currently at Y, then the data between X and Y must be preserved (one end not inclusive). The memory on either side is fair game.
The compiler, not the operating system, makes this happen: it moves the stack pointer to cover whatever it needs for that function, and moves it back when done. Each nested call consumes more and more stack, and each return gives a little back.

Memory access using __m128i address

I'm working on a project that uses SSE in non-conventional ways. One of the quirks is that addresses of memory locations are kept duplicated in an __m128i variable.
My task is to get the value from memory at this address, as fast as possible. The value we want to get from memory is also 128 bits long. I know that keeping an address in __m128i is an abuse of SSE, but it cannot be done any other way: the addresses have to be duplicated.
My current implementation:
Get the lower 64 bits of the duplicated address using MOVQ
With the address in a GP register, use MOVAPS to get the value from memory
In assembly it looks like this:
MOVQ %xmm1, %rax
MOVAPS (%rax), %xmm2
Question: can it be done faster? Maybe some optimizations can be applied if we do this multiple times in a row?
That movq / dereference sequence is your best bet if you have addresses stored in xmm registers.
Haswell's gather implementation is slower than manually loading things, so using VGATHERQPS (qword indices -> float data) is unlikely to be a win. Maybe with a future CPU design that has a much faster gather.
But the real question is why you would have addresses in XMM registers in the first place, especially duplicated into both halves of the register. This just seems like a bad idea that takes extra time to set up and extra time to use (especially on AMD hardware, where a move between GP and vector registers takes 5 or 10 cycles, vs. 1 for Intel). It would be better to load addresses from RAM directly into GP registers.
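If this code is written with intrinsics rather than hand-written asm, the same movq / movaps sequence looks roughly like the following sketch (addr_vec is assumed to hold the duplicated address, and the pointed-to data is assumed to be 16-byte aligned):

void *p = (void *)_mm_cvtsi128_si64(addr_vec);         // movq: extract the low 64 bits into a GP register
__m128i value = _mm_load_si128((const __m128i *)p);    // 16-byte aligned load from that address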
