What is the difference between non-packed and packed instructions in the context of SIMD operations?

What is the difference between non-packed and packed instructions in the context of SIMD operations?
I was reading an article on optimizing your code for SSE:
http://www.cortstratton.org/articles/OptimizingForSSE.php#batch
and this question arose when I read
"As an added bonus, movss is a non-packed instruction, which allows us to make better use of the parallel instruction decoders.."
So what is the difference?

To my understanding, packed means that conceptually more than one value is transferred or used as an operand, whereas non-packed means that only one value is processed; with non-packed instructions, no parallel processing takes place.

SSE supports two modes of operation:
Packed mode - instructions operate in parallel on all data operands
Scalar mode - instructions operate on the least significant pairs of packed data operands.
Source
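To make the distinction concrete, here is a minimal sketch using SSE intrinsics (my own illustration, not from the article): _mm_add_ps compiles to the packed addps and adds all four float lanes at once, while _mm_add_ss compiles to the scalar addss and adds only the lowest lane, passing the upper three through unchanged.

#include <xmmintrin.h>  // SSE intrinsics

void packed_vs_scalar(void)
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     // lanes low-to-high: {1, 2, 3, 4}
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); // lanes low-to-high: {10, 20, 30, 40}

    __m128 packed = _mm_add_ps(a, b); // addps: {11, 22, 33, 44} - all four lanes
    __m128 scalar = _mm_add_ss(a, b); // addss: {11, 2, 3, 4} - low lane only, rest from a
    (void)packed;
    (void)scalar;
}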

Related

How do stack machines efficiently store data types of different sizes?

Suppose I have the following primitive stack implementation for a virtual machine:
unsigned long stack[512];
unsigned short top = 0;

void push(unsigned long qword) {
    stack[top] = qword;
    top++;
}

void pop() {
    top--;
}

unsigned long get() {
    return stack[top - 1];
}
This stack actually works fine (except that it doesn't check for an overflow), but I now have the following problem: it is quite inefficient.
Here is an example:
Let's say I want to push a byte onto the stack. I would now have to cast it to a long and then push it onto the stack. But now a whole 7 bytes are not being used. This feels kind of wrong.
So now I have the following question:
How do stack machines efficiently store data types of different sizes? Do they do the same as in this implementation?
There are different metrics of efficiency. Using an eight-byte long to store a single byte raises the memory consumption. On the other hand, memory is not the major concern on most of today’s machines. Further, a stack is typically a pre-allocated amount of memory. So as long as the entire memory block has not been exhausted, it is entirely irrelevant whether the unused seven bytes are within that long or on the other side of the location marked by top.
In terms of CPU time, you don’t gain any advantage from transferring a quantity smaller than the hardware’s bus size. In the best case, it makes no difference. In the worst case, transferring a single byte boils down to reading a long from memory, manipulating one byte of it, and writing the long back. In the latter case, it would be more efficient to expand the byte to a long and overwrite all eight bytes explicitly.
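To make that worst case concrete, here is a small illustrative sketch (my own, assuming an 8-byte stack slot):

#include <stdint.h>

void store_byte_rmw(uint64_t *slot, uint8_t b)
{
    // worst case: read the whole long, patch one byte, write it back
    uint64_t v = *slot;
    v = (v & ~(uint64_t)0xFF) | b;
    *slot = v;
}

void store_byte_widened(uint64_t *slot, uint8_t b)
{
    // often cheaper: expand the byte and overwrite all eight bytes
    *slot = (uint64_t)b;
}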
This is reflected by the design of the Java bytecode, for example. Not only does it drop support for pushing and popping quantities smaller than 32 bits; it doesn’t even support arithmetic instructions for them¹. So for most use cases, you don’t even know that a quantity could be a byte before pushing. Only formal parameter types and array types may refer to byte.
But note that a JVM isn’t even a stack engine in the narrowest sense. There is no support for pushing and popping arbitrary numbers of items. As explained in this answer, expressing the intent using a stack allows very compact instructions. But Java bytecode doesn’t allow branching to code locations with a different number of items on the stack, so it doesn’t support pushing or popping items in a loop. In other words, for each instruction, the actual offset into the stack is predictable, and the operand types are known as well. So it’s always possible to transform Java bytecode straightforwardly to an IR that doesn’t use a stack. Such transformed code could use instructions with arbitrary operand sizes, if that has a benefit on the particular target architecture.
¹ And that was accounting for hardware in use a quarter century ago
There's no "one true" way of doing this, and the Java VM uses a few different strategies. All types less than 32-bits in size are widened to 32-bits. Pushing 1 byte to the stack effectively pushes 4 bytes to the stack. The benefit is simplicity when there are fewer native value sizes to deal with.
Another strategy is used for 64-bit values. They occupy two stack slots instead of one. The JVM has specific opcodes which indicate which type of value they expect on the stack, and the verifier ensures that no opcode is attempting to access a variable off the stack that doesn't match the type that should be there.
A third strategy is used for object references. The actual pointer size can be 32 bits or 64 bits, depending on the CPU capabilities, whether the JVM is running in 64-bit mode, etc. The JVM has specific opcodes for handling object references, and the verifier checks this too.
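A rough sketch of the first two strategies (my own toy model in C, not JVM source; the names slots, push_byte, and push_long are mine): sub-32-bit values are widened to one 32-bit slot, and 64-bit values occupy two slots.

#include <stdint.h>

uint32_t slots[512];
unsigned top = 0;

void push_byte(int8_t b) { slots[top++] = (uint32_t)(int32_t)b; } // sign-extended to one slot
void push_int(int32_t v) { slots[top++] = (uint32_t)v; }

void push_long(int64_t v) // occupies two stack slots
{
    slots[top++] = (uint32_t)((uint64_t)v & 0xFFFFFFFFu); // low half
    slots[top++] = (uint32_t)((uint64_t)v >> 32);         // high half
}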

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as a structure of arrays. What is the best practice for loading the data into registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data into a temporary stack array, vs. copying the data as values directly. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps() when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents, while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
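For illustration, a sketch of that movsd + movhps pattern (the helper load_two_pairs is my own name, assuming each pair is contiguous and already in the desired lane order; add a shufps afterwards if reordering is needed):

#include <xmmintrin.h>  // SSE: _mm_loadh_pi
#include <emmintrin.h>  // SSE2: _mm_load_sd, _mm_castpd_ps

static inline __m128 load_two_pairs(const float *a, const float *b)
{
    // movsd: load 64 bits into the low half, zeroing the high half
    __m128 lo = _mm_castpd_ps(_mm_load_sd((const double *)a));
    // movhps: load 64 bits into the high half, merging with the low half
    return _mm_loadh_pi(lo, (const __m64 *)b); // {a[0], a[1], b[0], b[1]}
}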
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
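For example, a manual four-element "gather" built only from intrinsics that map to single instructions (the gather4 helper and its parameter names are mine):

#include <xmmintrin.h>

static inline __m128 gather4(const float *p0, const float *p1,
                             const float *p2, const float *p3)
{
    __m128 v01 = _mm_unpacklo_ps(_mm_load_ss(p0), _mm_load_ss(p1)); // {*p0, *p1, 0, 0}
    __m128 v23 = _mm_unpacklo_ps(_mm_load_ss(p2), _mm_load_ss(p3)); // {*p2, *p3, 0, 0}
    return _mm_movelh_ps(v01, v23);                                 // {*p0, *p1, *p2, *p3}
}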
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides.) But sometimes one part of the program needs the data one way, and the other part needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just eat up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Difference between Record and Packed Record [duplicate]

While reviewing some code in our legacy Delphi 7 program, I noticed that everywhere there is a record it is marked packed. This of course means that the record is stored byte-for-byte and is not aligned for faster CPU access. The packing seems to have been done blindly, as an attempt to outsmart the compiler or something -- basically valuing a few bytes of memory over faster access.
An example record:
TFooTypeRec = packed record
  RID : Integer;
  Description : String;
  CalcInTotalIncome : Boolean;
  RequireAddress : Boolean;
end;
Should I fix this and make every record normal or "not" packed? Or with modern CPUs and memory is this negligible and probably a waste of time? Are there any problems that can result from unpacking?
There is no way to answer this question without a full understanding of how each of those packed records is used in your application code. It is the same as asking "Should I change this variable declaration from Int64 to Byte?"
Without knowing what values that variable will be expected and required to maintain the answer could be yes. Or it could be no.
Similarly in your case. If a record needs to be packed then it should be left packed. If it does not need to be packed then there is no harm in not packing it. If you are not sure or cannot tell, then the safest course is to leave them as they are.
As a guide to making this determination (should you decide to proceed), situations where record packing is required or recommended include:
persistence of record values
sharing of record values with [potentially] differently compiled code
strict compatibility with externally defined structures
deliberately overlaying a type layout over differently structured memory
This isn't necessarily an exhaustive list, and what these all have in common is:
records comprising a series of values in adjacent bytes that must and can be relied upon by any potential producer or consumer of the record without possibility of interference from the compiler or other factors
What I would recommend is that (if possible and practical) you determine what purpose packing serves in each case and add documentation to that effect to the record declaration itself so that anyone in the future with the same question doesn't have to go through that discovery process, e.g.:
type
  TSomeRecordType = packed record
    // This record must be packed as it is used for persistence
    ..
  end;

  TSomeExternType = packed record
    // This record must be packed as it is required to be compatible
    // in memory with an externally defined struct (ref: extern code docs)
    ..
  end;
The main idea of using packed records is not that you save a few bytes of memory! Instead, it is about guaranteeing that the variables are where you expect them to be in memory. Without such a guarantee, it would be impossible (or, at least, difficult) to manage memory manually on the heap and write to and read from files.
Hence, the program might malfunction if you 'unpack' the records!
If the record is stored/retrieved as packed or transferred in any way to a receiver that expects it to be packed, then do not change it.
Update:
There is a String field declared in your example. That looks suspicious, because a Delphi String is a pointer to heap-allocated data, so storing the record in a binary file will not preserve the string content.
A packed record's length is exactly the sum of its members' sizes.
A non-packed record is optimized for better performance: its members are aligned, which can make the record larger.
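For readers more at home in C (which the other questions here use), a rough analogue of the packed / non-packed distinction; this is only an illustration, using GCC/Clang attribute syntax rather than anything Delphi-specific:

#include <stdint.h>
#include <stdio.h>

struct Aligned {                        // compiler may insert padding for alignment
    uint8_t  flag;
    uint32_t id;
};

struct __attribute__((packed)) Packed { // byte-for-byte, no padding
    uint8_t  flag;
    uint32_t id;
};

int main(void)
{
    printf("aligned: %zu bytes, packed: %zu bytes\n",
           sizeof(struct Aligned), sizeof(struct Packed)); // typically 8 vs 5
    return 0;
}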

Heap overflow exploit

I understand that overflow exploitation requires three steps:
1. Injecting arbitrary code (shellcode) into the target process's memory space.
2. Taking control of eip.
3. Setting eip to execute the arbitrary code.
I read Ben Hawkes's articles about heap exploitation and understood a few tactics for ultimately overwriting a function pointer to point to my code.
In other words, I understand step 2.
I do not understand step 1 and 3.
How do I inject my code into the process's memory space?
And during step 3, when I overwrite a function pointer with a pointer to my shellcode, how can I calculate or know what address my injected code was injected at? (For stack overflows this problem is solved by using "jmp esp".)
In a heap overflow, supposing that the system does not have ASLR activated, you will know the address of the memory chunks (aka, the buffers) you use in the overflow.
One option is to place the shellcode where the buffer is, given that you can control the contents of the buffer (as the application user). Once you have placed the shellcode bytes in the buffer, you only have to jump to that buffer address.
One way to perform that jump is by, for example, overwriting a .dtors entry. Once the vulnerable program finishes, the shellcode - placed in the buffer - will be executed. The complicated part is the .dtors overwriting. For that you will have to use the published heap exploiting techniques.
The prerequisites are that ASLR is deactivated (to know the address of the buffer before executing the vulnerable program) and that the memory region where the buffer is placed must be executable.
One more thing: steps 2 and 3 are really the same. If you control eip, it's only logical that you will point it at the shellcode (the arbitrary code).
P.S.: Bypassing ASLR is more complex.
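To illustrate just the control-flow part of that in isolation, here is a deliberately harmless C sketch: the function pointer is assigned directly rather than actually overflowed, and all the names (benign, shell, fp) are mine. A real heap overflow would achieve the same assignment by overrunning buf until it clobbers the adjacent object.

#include <stdio.h>
#include <stdlib.h>

void benign(void) { puts("expected path"); }
void shell(void)  { puts("attacker-chosen path"); } // stands in for shellcode

int main(void)
{
    char *buf = malloc(64);                 // attacker-controlled buffer
    void (**fp)(void) = malloc(sizeof *fp); // adjacent heap object holding a function pointer
    *fp = benign;

    // A real overflow would overrun buf until it reaches *fp.
    // Here we just simulate the end result of that overwrite:
    *fp = shell;

    (*fp)(); // control flow is now hijacked
    free(buf);
    free(fp);
    return 0;
}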
Step 1 requires a vulnerability in the attacked code.
Common vulnerabilities include:
buffer overflow (common in C code; happens if the program reads an arbitrarily long string into a fixed-size buffer)
evaluation of unsanitized data (common in SQL and script languages, but can occur in other languages as well)
Step 3 requires detailed knowledge of the target architecture.
How do I inject my code into process space?
This is quite a statement/question. It requires an 'exploitable' region of code in said process space. For example, Windows is currently rewriting most strcpy() calls to strncpy() wherever possible. I say wherever possible because not all areas of code that use strcpy can successfully be changed over to strncpy. Why? Because of the crux of the difference shown below:
strcpy(buffer, copied);
or
strncpy(buffer, copied, sizeof(copied));
This is what makes strncpy so difficult to use in real-world scenarios: a 'magic number' has to be supplied for most strncpy operations (the sizeof() operator produces this magic number).
As coders, we are taught that using hard-coded values such as char buffer[1024]; is really bad coding practice.
BUT, in comparison, using buffer[] = ""; or buffer[1024] = ""; is the heart of the exploit. HOWEVER, if we change this code to the following, we get another exploit introduced into the system...
char *buffer;
char *copied;
strcpy(buffer, copied); // overflow this right here...
OR THIS:
int size = 1024;
char buffer[size];
char copied[size];
strncpy(buffer, copied, size);
This will stop overflows, but it introduces an exploitable region in RAM, because the size is predictable and the data is structured into 1024-byte blocks of code/data.
Therefore, original poster: finding strcpy, for example, in a program's address space tells you the program is exploitable.
There are many reasons why strcpy is favoured by programmers over strncpy: magic numbers, variable input/output data sizes, programming styles, etc.
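For what it's worth, the usual bounded-copy pattern sizes the copy by the destination rather than the source and forces NUL termination, since strncpy does not terminate on truncation (a sketch; bounded_copy is my own name):

#include <string.h>

void bounded_copy(char *dst, size_t dstsize, const char *src)
{
    strncpy(dst, src, dstsize - 1); // leave room for the terminator
    dst[dstsize - 1] = '\0';        // strncpy won't add it on truncation
}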
HOW DO I FIND MYSELF IN MY CODE (MY LOCATION)
Check various hacker books for examples of this ~
BUT, try:
label:
    pop eax
    pop eax
    call pointer
    jmp label

pointer:
    mov esp, eax
    jmp $
This is a non-working example, because I do NOT want to be held responsible for writing the next Morris Worm! But any decent programmer will get the gist of this code and know immediately what I am talking about here.
I hope your overflow techniques work in the future, my son!

Reading from 16-bit hardware registers

On an embedded system we have a setup that allows us to read arbitrary data over a command-line interface for diagnostic purposes. For most data, this works fine, we use memcpy() to copy data at the requested address and send it back across a serial connection.
However, for 16-bit hardware registers, memcpy() causes some problems. If I try to access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly.
Has anyone encountered this issue? I'm a 'high-level' (C#/Java/Python/Ruby) guy that's moving closer to the hardware and this is alien territory.
What's the best way to deal with this? I see some info, specifically, a somewhat confusing [to me] post here. The author of this post has exactly the same issue I do but I hate to implement a solution without fully understanding what I'm doing.
Any light you can shed on this issue is much appreciated. Thanks!
In addition to what Eddie said, you typically need to use a volatile pointer to read a hardware register (assuming a memory mapped register, which is not the case for all systems, but it sounds like is true for yours). Something like:
// using types from stdint.h to ensure particular size values
// most systems that access hardware registers will have typedefs
// for something similar (for 16-bit values might be uint16_t, INT16U,
// or something)
uint16_t volatile* pReg = (uint16_t volatile*) 0x1234abcd; // whatever the reg address is
uint16_t val = *pReg; // read the 16-bit wide register
Here's a series of articles by Dan Saks that should give you pretty much everything you need to know to be able to effectively use memory mapped registers in C/C++:
"Mapping memory"
"Mapping memory efficiently"
"More ways to map memory"
"Sizing and aligning device registers"
"Use volatile judiciously"
"Place volatile accurately"
"Volatile as a promise"
Each register in this hardware is exposed as a two-byte array; the first element is aligned on a two-byte boundary (its address is even). memcpy() runs a loop and copies one byte per iteration, so it copies from these registers this way (all loops unrolled, char is one byte; reg stands for the register address, since register is a reserved word in C):

*((char*)target) = *((char*)reg);         // evenly aligned - address is always even
*((char*)target + 1) = *((char*)reg + 1); // oddly aligned - address is always odd

However, the second line works incorrectly for some hardware-specific reasons. If you copy two bytes at a time instead of one, it is done this way instead (short int is two bytes):

*((short int*)target) = *((short int*)reg); // evenly aligned

Here you copy two bytes in one operation, and the first byte is evenly aligned. Since there's no separate copy from an oddly aligned address, it works.
The modified memcpy checks whether the addresses are evenly aligned and copies in two-byte chunks if they are.
If you require access to hardware registers of a specific size, then you have two choices:
Understand how your C compiler generates code so you can use the appropriate integer type to access the memory, or
Embed some assembly to do the access with the correct byte or word size.
Reading hardware registers can have side effects, depending on the register and its function, of course, so it's important to access hardware registers with the proper-sized access so you can read the entire register in one go.
Usually it's sufficient to use an integer type that is the same size as your register. On most compilers, a short is 16 bits.
void wordcpy(short *dest, const short *src, size_t bytecount)
{
    size_t i;
    for (i = 0; i < bytecount / 2; ++i)
        *dest++ = *src++;
}
I think all the detail is contained in that thread you posted, so I'll try to break it down a little. Specifically:
If you access a 16-bit hardware register using two 8-bit accesses, the high-order byte doesn't read correctly (it always read as 0xFF for me). This is fair enough since TI's docs state that 16-bit hardware registers must be read and written using 16-bit-wide instructions, and normally would be, unless you're using memcpy() to read them.
So the problem here is that the hardware registers only report the correct value if they are read in a single 16-bit read. This would be equivalent to doing:
uint16 value = *(regAddress);
This reads from the register address into value using a single 16-bit read. On the other hand, you have memcpy, which copies data a single byte at a time. Something like:
while (n--)
{
    *(uint8*)pDest++ = *(uint8*)pSource++;
}
So this causes the registers to be read 8 bits (1 byte) at a time, which produces invalid values.
The solution posted in that thread is to use a version of memcpy that copies the data using 16-bit reads wherever the source and destination are 16-bit aligned.
What do you need to know? You've already found a separate post explaining it. Apparently the CPU documentation requires that 16-bit hardware registers are accessed with 16-bit reads and writes, but your implementation of memcpy uses 8-bit reads/writes. So they don't work together.
The solution is simply not to use memcpy to access this register.
Instead, write your own routine which copies 16-bit values.
Not sure exactly what the question is - I think that post has the right solution.
As you stated, the issue is that the standard memcpy() routine reads a byte at a time, which does not work correctly for memory-mapped hardware registers. That is a limitation of the processor - there's simply no way to get a valid value reading a byte at a time.
The suggested solution is to write your own memcpy() which only works on word-aligned addresses and reads 16-bit words at a time. This is fairly straightforward - the link gives both a C and an assembly version. The only gotcha is to make sure you always do the 16-bit copies from validly aligned addresses. You can do that in two ways: either use linker commands or pragmas to make sure things are aligned, or add a special case for the extra byte at the front of an unaligned buffer.
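A minimal sketch of such a routine (my own, not the code from the linked thread): it assumes both addresses and the byte count are even, and the volatile qualifier keeps the compiler from splitting or merging the 16-bit register reads.

#include <stdint.h>
#include <stddef.h>

void reg_copy16(uint16_t *dest, const volatile uint16_t *src, size_t bytecount)
{
    for (size_t i = 0; i < bytecount / 2; ++i)
        dest[i] = src[i]; // each register access is a single 16-bit read
}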
