Memory-to-memory copy - memory

Is there a mechanism of efficiently simulating a memory-to-memory copy using x86 instructions?
I need to guarantee an instant in time when both the source address and the destination address point to equal values. I am currently using a loop (i.e. do { reg = *A; *B = reg; } while (reg != *A) ), but I'd like to optimize on that.
EDIT: Missed some important information: I'm copying pointers.

REP MOVSB (google it), there's also WORD and DWORD versions, probably QWORDs by now too

Related

Pointer to result of AVX load (_mm256_load_si256)

If I have some class with a field like __m256i* loaded_v, and a method like:
void load() {
loaded_v = &_mm256_load_si256(reinterpret_cast<const __m256i*>(vector));
}
For how long will loaded_v be a valid pointer? Since there are a limited number of registers, I would imagine that eventually loaded_v will refer to a different value, or some other weird behavior will happen. However, I would like to reduce the number of loads I do.
I'm writing a packed bit array class, and I would like to use AVX intrinsics to increase performance. However, it is inefficient to load my array of bits every time I do some operation (and, or, xor, etc). Therefore, I would like to be able to explicitly call load() before performing some batch of operations. However, I don't understand how exactly AVX registers are handled. Could anyone help me out, or point to me to some documentation for this issue?
The optimizing compiler would use registers automatically.
It may put a __m256 variable into memory, or in a register, or may use a register in one part of you code, and spill it in another. This can be done not only with standalone automatic storage (stack) variable, but also with member of a class, especially if the class instance is an automatic storage variable itself.
In case of registers usage, __m256 variable would correspond one of ymm registers (one of 16 in x86-64, one of 8 in 32-bit compilation, or one of 32 in x86-64 with AVX512), there's no need to indirectly refer to it.
The _mm256_load_si256 intrinsic doesn't necessarily compile to vmovdqa. For example, this code:
#include <immintrin.h>
__m256i f(__m256i a, const void* p)
{
__m256i b = _mm256_load_si256(reinterpret_cast<const __m256i*>(p));
return _mm256_xor_si256(a, b);
}
Compiles as following (https://godbolt.org/z/ve67YPn4T):
vpxor ymm0, ymm0, YMMWORD PTR [rdx]
ret 0
C and C++ are high level languages; the intrinsics should be seen as a way to convey the semantic to the compiler, not instruction mnemonics.
You should load a value into a variable,
__m256i loaded_v;
loaded_v = _mm256_load_si256(reinterpret_cast<const __m256i*>(vector));
or a temporary:
__m256_whatever_operation(_mm256_load_si256(reinterpret_cast<const __m256i*>(vector)), other_operand);
And you should follow the usual C or C++ rules.
If you repeatedly load an indirect value from a pointer, it may be helpful to cache it in a variable, so that compiler would see the value does not change between loads, and use this as an optimization opportunity. Sure compiler may miss this opportunity anyway, or find it even without cached variable (possibly with the help of the strict aliasing rule).

OS memory allocation addresses

Quick curious question, memory allocation addresses are choosed by the language compiler or is it the OS which chooses the addresses for the memory asked?
This is from a doubt about virtual memory, where it could be quickly explained as "let the process think he owns all the memory", but what happens on 64 bits architectures where only 48 bits are used for memory addresses if the process wants a higher address?
Lets say you do a int a = malloc(sizeof(int)); and you have no memory left from the previous system call so you need to ask the OS for more memory, is the compiler the one who determines the memory address to allocate this variable, or does it just ask the OS for memory and it allocates it on the address returned by it?
It would not be the compiler, especially since this is dynamic memory allocation. Compilation is done well before you actually execute your program.
Memory reservation for static variables happens at compile time. But the static memory allocation will happen at start-up, before the user defined Main.
Static variables can be given space in the executable file itself, this would then be memory mapped into the process address space. This is only one of few times(?) I can image the compiler actually "deciding" on an address.
During dynamic memory allocation your program would ask the OS for some memory and it is the OS that returns a memory address. This address is then stored in a pointer for example.
The dynamic memory allocation in C/C++ is simply done by runtime library functions. Those functions can do pretty much as they please as long as their behavior is standards-compliant. A trivial implementation of compliant but useless malloc() looks like this:
void * malloc(size_t size) {
return NULL;
}
The requirements are fairly relaxed -- the pointer has to be suitably aligned and the pointers must be unique unless they've been previously free()d. You could have a rather silly but somewhat portable and absolutely not thread-safe memory allocator done the way below. There, the addresses come from a pool that was decided upon by the compiler.
#include "stdint.h"
// 1/4 of available address space, but at most 2^30.
#define HEAPSIZE (1UL << ( ((sizeof(void*)>4) ? 4 : sizeof(void*)) * 2 ))
// A pseudo-portable alignment size for pointerĊšbwitary types. Breaks
// when faced with SIMD data types.
#define ALIGNMENT (sizeof(intptr_t) > sizeof(double) ? sizeof(intptr_t) : siE 1Azeof(double))
void * malloc(size_t size)
{
static char buffer[HEAPSIZE];
static char * next = NULL;
void * result;
if (next == NULL) {
uintptr_t ptr = (uintptr_t)buffer;
ptr += ptr % ALIGNMENT;
next = (char*)ptr;
}
if (size == 0) return NULL;
if (next-buffer > HEAPSIZE-size) return NULL;
result = next;
next += size;
next += size % ALIGNMENT;
return result;
}
void free(void * ptr)
{}
Practical memory allocators don't depend upon such static memory pools, but rather call the OS to provide them with newly mapped memory.
The proper way of thinking about it is: you don't know what particular pointer you are going to get from malloc(). You can only know that it's unique and points to properly aligned memory if you've called malloc() with a non-zero argument. That's all.

Heap overflow exploit

I understand that overflow exploitation requires three steps:
1.Injecting arbitrary code (shellcode) into target process memory space.
2.Taking control over eip.
3.Set eip to execute arbitrary code.
I read ben hawkens articles about heap exploitation and understood few tactics about how to ultimatly override a function pointer to point to my code.
In other words, I understand step 2.
I do not understand step 1 and 3.
How do I inject my code to the process memory space ?
During step 3 I override a function pointer with a
Pointer to my shellcode, How can I calculate\know what address
Was my injected code injected into ? (This problem is solved
In stackoverflow by using "jmp esp).
In a heap overflow, supposing that the system does not have ASLR activated, you will know the address of the memory chunks (aka, the buffers) you use in the overflow.
One option is to place the shellcode where the buffer is, given that you can control the contents of the buffer (as the application user). Once you have placed the shellcode bytes in the buffer, you only have to jump to that buffer address.
One way to perform that jump is by, for example, overwriting a .dtors entry. Once the vulnerable program finishes, the shellcode - placed in the buffer - will be executed. The complicated part is the .dtors overwriting. For that you will have to use the published heap exploiting techniques.
The prerequisites are that ASLR is deactivated (to know the address of the buffer before executing the vulnerable program) and that the memory region where the buffer is placed must be executable.
On more thing, steps 2 and 3 are the same. If you control eip, it's logic that you will point it to the shellcode (the arbitrary code).
P.S.: Bypassing ASLR is more complex.
Step 1 requires a vulnerability in the attacked code.
Common vulnerabilites include:
buffer overflow (common i C code, happens if the program reads an arbitrary long string into a fixed buffer)
evaluation of unsanitized data (common in SQL and script languages, but can occur in other languages as well)
Step 3 requires detailed knowledge of the target architecture.
How do I inject my code into process space?
This is quite a statement/question. It requires an 'exploitable' region of code in said process space. For example, Windows is currently rewriting most strcpy() to strncpy() if at all possible. I say if possible
because not all areas of code that use strcpy can successfully be changed over to strncpy. Why? BECAUSE ~# of this crux in difference shown below;
strcpy($buffer, $copied);
or
strncpy($buffer, $copied, sizeof($copied));
This is what makes strncpy so difficult to implement in real world scenarios. There has to be installed a 'magic number' on most strncpy operations (the sizeof() operator creates this magic number)
As coders' we are taught using hard coded values such as a strict compliance with a char buffer[1024]; is really bad coding practise.
BUT ~ in comparison - using buffer[]=""; or buffer[1024]=""; is the heart of the exploit. HOWEVER, if for example we change this code to the latter we get another exploit introduced into the system...
char * buffer;
char * copied;
strcpy(buffer, copied);//overflow this right here...
OR THIS:
int size = 1024;
char buffer[size];
char copied[size];
strncpy(buffer,copied, size);
This will stop overflows, but introduce a exploitable region in RAM due to size being predictable and structured into 1024 blocks of code/data.
Therefore, original poster, looking for strcpy for example, in a program's address space, will make the program exploitable if strcpy is present.
There are many reasons why strcpy is favoured by programmers over strncpy. Magic numbers, variable input/output data size...programming styles...etc...
HOW DO I FIND MYSELF IN MY CODE (MY LOCATION)
Check various hacker books for examples of this ~
BUT, try;
label:
pop eax
pop eax
call pointer
jmp label
pointer:
mov esp, eax
jmp $
This is an example that is non-working due to the fact that I do NOT want to be held responsible for writing the next Morris Worm! But, any decent programmer will get the jist of this code and know immediately what I am talking about here.
I hope your overflow techniques work in the future, my son!

Beginner assembly programming memory usage question

I've been getting into some assembly lately and its fun as it challenges everything i have learned. I was wondering if i could ask a few questions
When running an executable, does the entire executable get loaded into memory?
From a bit of fiddling i've found that constants aren't really constants? Is it just a compiler thing?
const int i = 5;
_asm { mov i, 0 } // i is now 0 and compiles fine
So are all variables assigned with a constant value embedded into the file as well?
Meaning:
int a = 1;
const int b = 2;
void something()
{
const int c = 3;
int d = 4;
}
Will i find all of these variables embedded in the file (in a hex editor or something)?
If the executable is loaded into memory then "constants" are technically using memory? I've read around on the net people saying that constants don't use memory, is this true?
Your executable's text (i.e. code) and data segments get mapped into the process's virtual address space when the executable starts up, but the bytes might not actually be copied from the disk until those memory locations are accessed. See http://en.wikipedia.org/wiki/Demand_paging
C-language constants actually exist in memory, because you have to be able to take the address of them. (That is, &i.) Constants are usually found in the .rdata segment of your executable image.
A constant is going to take up memory somewhere--if you have the constant number 42 in your program, there must be somewhere in memory where the 42 is stored, even if that means that it's stored as the argument of an immediate-mode instruction.
The OS loads the code and data segments in order to prepare them for execution.
If the executable has a resource segment, the application loads parts of it at demand.
It's true that const variables take memory space but compilers are free to optimize
for memory usage and code size, and embed their values in the code.
(in case they don't detect any address references for those variables)
const char * aka C strings, usually are interned by the compilers, to save memory.

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine, we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!
In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory mapped register, which is not the case for all systems, but it sounds like is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values might be uint16_t, INT16U,
// or something)
uint16_t volatile* pReg = (int16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"
Each register in this hardware is exposed as a two-byte array, the first element is aligned at a two-byte boundary (its address is even). memcpy() runs a cycle and copies one byte at each iteration, so it copies from these registers this way (all loops unrolled, char is one byte):
*((char*)target) = *((char*)register);// evenly aligned - address is always even
*((char*)target + 1) = *((char*)register + 1);//oddly aligned - address is always odd
However the second line works incorrectly for some hardware specific reasons. If you copy two bytes at a time instead of one at a time, it is instead done this way (short int is two bytes):
*((short int*)target) = *((short*)register;// evenly aligned
Here you copy two bytes in one operation and the first byte is evenly aligned. Since there's no separate copying from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are venely aligned and copies in tow bytes chunks if they are.
If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side affects, depending on the register and its function, of course, so it's important to access hardware registers with the proper sized access so you can read the entire register in one go.
Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
void wordcpy(short *dest, const short *src, size_t bytecount)
{
int i;
for (i = 0; i < bytecount/2; ++i)
*dest++ = *src++;
}
I think all the detail is contained in that thread you posted so I'll try and break it down a little;
Specifically;
If you access a 16-bit hardware register using two 8-bit
accesses, the high-order byte doesn't read correctly (it
always read as 0xFF for me). This is fair enough since
TI's docs state that 16-bit hardware registers must be
read and written using 16-bit-wide instructions, and
normally would be, unless you're using memcpy() to
read them.
So the problem here is that the hardware registers only report the correct value if their values are read in a single 16-bit read. This would be equivalent to doing;
uint16 value = *(regAddress);
This reads from the address into the value register using a single 16-byte read. On the other hand you have memcpy which is copying data a single-byte at a time. Something like;
while (n--)
{
*(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8-bits (1 byte) at a time, resulting in the values being invalid.
The solution posted in that thread is to use a version of memcpy that will copy the data using 16-bit reads whereever the source and destination are a6-bit aligned.
What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.
Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory mapped hardware registers. That is a limitation of the processor - there's simply no way to get a valid value reading a byte at at time.
The suggested solution is to write your own memcpy() which only works on word-aligned addresses, and reads 16-bit words at a time. This is fairly straightforward - the link gives both a c and an assembly version. The only gotcha is to make sure you always do the 16 bit copies from validly aligned address. You can do that in 2 ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.

Resources