How do stack machines efficiently store data types of different sizes?

Suppose I have the following primitive stack implementation for a virtual machine:
unsigned long stack[512];
unsigned short top = 0;
void push(unsigned long qword) {
    stack[top] = qword;
    top++;
}
void pop() {
    top--;
}
unsigned long get() {
    return stack[top - 1];
}
This stack actually works fine (except that it doesn't check for overflow), but I now have the following problem: it is quite inefficient.
Here is an example:
Let's say I want to push a byte onto the stack. I would now have to cast it to a long and then push it onto the stack. But now a whole 7 bytes are not being used. This feels kind of wrong.
So now I have the following question:
How do stack machines efficiently store data types of different sizes? Do they do the same as in this implementation?

There are different metrics of efficiency. Using an eight-byte long to store a single byte raises memory consumption. On the other hand, memory is not the major concern on most of today's machines. Further, a stack is typically a pre-allocated amount of memory. So as long as the entire memory block has not been exhausted, it is entirely irrelevant whether the unused seven bytes are within that long or on the other side of the location marked by top.
In terms of CPU time, you don't gain any advantage from transferring a quantity smaller than the hardware's bus size. In the best case, it makes no difference. In the worst case, transferring a single byte boils down to reading a long from memory, manipulating one byte of it, and writing the long back. In that case, it would be more efficient to expand the byte to a long and overwrite all eight bytes explicitly.
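The widening the answer describes can be sketched against the question's own stack. This is an illustrative fragment, not the asker's code verbatim; the names push_byte, vm_stack and vm_top are mine:

```c
#include <assert.h>

/* Hypothetical sketch: widening a byte to the full slot width before
   pushing. Mirrors the question's 512-slot layout; names are illustrative. */
enum { VM_STACK_CAP = 512 };
static unsigned long vm_stack[VM_STACK_CAP];
static unsigned short vm_top = 0;

/* Zero-extend the byte into a whole slot; the seven unused bytes sit in
   memory the stack pre-allocated anyway. */
static void push_byte(unsigned char b) {
    vm_stack[vm_top++] = (unsigned long)b;
}

static unsigned long vm_get(void) {
    return vm_stack[vm_top - 1];
}
```

Whether the CPU writes one byte or eight, the slot is consumed either way; the explicit widening just turns the store into a single full-width write.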
This is reflected in the design of Java bytecode, for example. Not only does it drop support for pushing and popping quantities smaller than 32 bits, it doesn't even support arithmetic instructions for them¹. So for most use cases, you don't even know that a quantity could be a byte before pushing. Only formal parameter types and array types may refer to byte.
But note that a JVM isn't even a stack engine in the narrowest sense. There is no support for pushing and popping arbitrary numbers of items. As explained in this answer, expressing the intent using a stack allows very compact instructions. But Java bytecode doesn't allow branching to code locations with a different number of items on the stack, so it doesn't support pushing or popping items in a loop. In other words, for each instruction, the actual offset into the stack is predictable and the operand types are known. So it's always possible to transform Java bytecode straightforwardly to an IR that doesn't use a stack. Such transformed code could use instructions with arbitrary operand sizes, if that has a benefit on the particular target architecture.
¹ And that was accounting for hardware in use a quarter century ago

There's no "one true" way of doing this, and the Java VM uses a few different strategies. All types less than 32 bits in size are widened to 32 bits. Pushing 1 byte to the stack effectively pushes 4 bytes to the stack. The benefit is simplicity: there are fewer native value sizes to deal with.
Another strategy is used for 64-bit values. They occupy two stack slots instead of one. The JVM has specific opcodes that indicate which type of value they expect on the stack, and the verifier ensures that no opcode attempts to access a value off the stack that doesn't match the type that should be there.
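The two-slot scheme can be modeled in plain C. This is an illustrative sketch, not actual JVM code; whether the high or low half goes first is an arbitrary choice here:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32-bit-slot stack in which a 64-bit value occupies two slots. */
enum { N_SLOTS = 512 };
static uint32_t slot[N_SLOTS];
static int sp = 0;

static void push64(uint64_t v) {
    slot[sp++] = (uint32_t)(v >> 32); /* high half */
    slot[sp++] = (uint32_t)v;         /* low half */
}

static uint64_t pop64(void) {
    uint64_t lo = slot[--sp];
    uint64_t hi = slot[--sp];
    return (hi << 32) | lo;
}
```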
A third strategy is used for object references. The actual pointer size can be 32 bits or 64 bits, depending on the CPU capabilities, whether the JVM is running in 64-bit mode, etc. The JVM has specific opcodes for handling object references, and the verifier checks this too.

Related

Stack data size

I cannot find an answer to this question: what is the size of data pushed on the stack?
Let's say that you push some data on the stack, for example an int. Then the value of the stack pointer decreases by 4 bytes.
Until today I thought that the biggest data that can be pushed on stack cannot be larger than pointer size. But I did a little experiment. I wrote a simple app in C#:
int i = 0; //0x68 <-- address of variable
int j = 1; //0x64
ulong k = 2; //0x5C
int l = 3; //0x58
According to my reasoning, the ulong should be allocated on the heap, because it needs 8 bytes while I am working on a 32-bit system (so the pointer size is 4 bytes). And that would be very odd, because it is a simple local variable.
But it was pushed on stack.
So I think that there is something wrong with the way I think of the stack. Because how would the stack pointer "know" whether it must change by 4 bytes (if we push an int), 8 bytes (if we push a long or double), or just 1 byte?
Or maybe the values of local variables are on the heap and the addresses of those variables are on the stack. But that would make no sense, because that's how objects are handled: when I create an object (using the new keyword), the object is allocated on the heap and the address of that object is pushed on the stack, right?
So could someone tell me (or give a link to article) how it really works?
The compiler or run-time environment knows the kind of data you are working with, and knows how to structure it into units that will fit onto the stack. So, for example, on an architecture where the stack pointer only moves in 32-bit chunks, the compiler or run-time will know how to format a data element that is longer than 32 bits as multiple 32-bit chunks and push them. It will know, similarly, how to pop them off the stack and reconstruct a variable of the appropriate type.
The problem here is not really different for the stack than for any other kind of memory. The CPU architecture will be optimized to work with a small number of data elements of a particular size, but programming languages typically provide a much wider range of data types. The compiler or the run-time environment has to handle the packing and unpacking of data that this situation requires.
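The packing and unpacking described above can be sketched in C, under the assumption of 32-bit stack chunks; memcpy stands in for whatever instruction sequence the compiler actually emits:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Break an 8-byte double into two 32-bit chunks (as a run-time might when
   pushing it), and reassemble the original value (as on pop). */
static void split_double(double d, uint32_t out[2]) {
    memcpy(out, &d, sizeof d); /* reinterpret the 8 bytes as two words */
}

static double join_double(const uint32_t in[2]) {
    double d;
    memcpy(&d, in, sizeof d); /* put the words back together */
    return d;
}
```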

what does Double word alignment mean?

What is double-word-aligned data? I am working on a TI processor with the c6accel DSP engine. The FFT function requires the input data array of samples to be double-word aligned. What exactly does double-word aligned mean, and how do I generate it?
A word is the amount of data that each register in the CPU is able to hold. This is dependent on the processor being used: on 32-bit systems this will be 32 bits, on 64-bit systems 64 bits, etc... Your TI processor will probably be either 16-bit or 32-bit depending on the model (guess based on this).
The size of a word will generally correspond to the size of a pointer, although technically this is not guaranteed to be true (it doesn't hold on PS3/XB360) and as a result should not be relied on as a rule (source). The correct way of determining the size of a word will depend on which operating system you're using. As quoted from the previous source:
The C header file may define WORD_BIT and/or __WORDSIZE.
The size of a double word is simply the size of a word × 2. Data in memory (RAM) is generally fetched by programs one word at a time, assuming the fetch begins on a word boundary. If it does not, the data on either side of the word boundary must be fetched in two separate operations, which leads to inefficiency: twice as many fetches are needed, and reading/writing RAM is relatively slow. (Side note: this is largely mitigated by caches in modern processors, although that's another topic altogether.)
For TI C6x: double word = 64-bit
Make sure your array starts at an address that is a multiple of 8 bytes.
You can use the assembly directive .align before declaring your array, or you can use the linker to place your array section at an aligned address. If you can't control the address (e.g. you got the memory from malloc), advance your pointer past the first few bytes to the next 8-byte boundary.
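For the malloc case, the round-up can be written as a one-line helper; align_up_8 is a hypothetical name used for illustration (over-allocate by 7 bytes so the rounded pointer stays inside the block):

```c
#include <assert.h>
#include <stdint.h>

/* Round an address up to the next multiple of 8 (no-op if already aligned). */
static uintptr_t align_up_8(uintptr_t addr) {
    return (addr + 7u) & ~(uintptr_t)7u;
}
```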

x86 Can push/pop be less than 4 bytes? [duplicate]

This question already has answers here:
Why is it not possible to push a byte onto a stack on Pentium IA-32?
Hi, I am reading a guide on x86 by the University of Virginia, and it states that pushing and popping the stack either removes or adds a 4-byte data element to the stack.
Why is this set to 4 bytes? Can this be changed? Could you save memory on the stack by pushing smaller data elements?
The guide can be found here if anyone wishes to view it:
http://www.cs.virginia.edu/~evans/cs216/guides/x86.html
Short answer: Yes, 16 or 32 bits. And, for x86-64, 64 bits.
The primary reasons for a stack are to return from nested function calls and to save/restore register values. It is also typically used to pass parameters and return function results. Except for the smallest parameters, these items usually have the same size by the design of the processor, namely the size of the instruction pointer register. For the 8088/8086, that is 16 bits. For the 80386 and its successors, it is 32 bits. Therefore, there is little value in having stack instructions that operate on other sizes.
There is also the consideration of the size of the data on the memory bus. It takes the same amount of time to retrieve or store a word as it does a byte. (Except for 8088 which has 16-bit registers but an 8-bit data bus.) Alignment also comes into play. The stack should be aligned on word boundaries so each value can be retrieved as one memory operation. The trade-off is usually taken to save time over saving memory. To pass one byte as a parameter, one word is usually used. (Or, depending on the optimization available to the compiler, one word-sized register would be used, avoiding the stack altogether.)
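C shows the same effect through its default argument promotions: a char passed to a variadic function is widened to an int before it travels, so the callee must read back a full int. A minimal sketch:

```c
#include <assert.h>
#include <stdarg.h>

/* Return the first variadic argument. Even if the caller passes a char,
   the default argument promotions deliver it as an int. */
static int first_vararg(int count, ...) {
    va_list ap;
    va_start(ap, count);
    int v = va_arg(ap, int); /* reading it as char here would be undefined */
    va_end(ap);
    return v;
}
```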

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs, and then cast it to int. This will give the code 32-bit memory transactions per thread, allowing for memory coalescing and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32-bit word. On non-Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
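The extraction step of the short2 idea can be sketched in plain C; once the 32-bit word has been loaded, the same bit manipulation applies on the device (the function name and the sign-extension choice are mine, not from the answer):

```c
#include <assert.h>
#include <stdint.h>

/* One 32-bit word carries two packed 16-bit values; pick one half and
   sign-extend it to a full int, as a kernel would before computing. */
static int32_t extract_short(uint32_t packed, int index) {
    uint16_t half = (index == 0) ? (uint16_t)(packed & 0xFFFFu)
                                 : (uint16_t)(packed >> 16);
    return (int32_t)(int16_t)half;
}
```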

Purpose of memory alignment

Admittedly, I don't get it. Say you have a memory with a memory word of length 1 byte. Why can't you access a 4-byte-long variable in a single memory access at an unaligned address (i.e. one not divisible by 4), as is the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (i.e., CPU-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
    char c;  // one byte
    int i;   // four bytes
    short s; // two bytes
};
On a 32-bit processor it would most likely be laid out as shown here: c at offset 0x0000 followed by three bytes of padding, i at 0x0004, and s at 0x0008 (followed by two bytes of tail padding).
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this: c at offset 0x0000, i at offsets 0x0001–0x0004, and s at offsets 0x0005–0x0006, with no padding.
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
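The padding discussed above can be checked directly with offsetof. This sketch assumes a typical ABI with a 4-byte int and a 2-byte short, and uses the GCC/Clang packed attribute for the unpadded variant:

```c
#include <assert.h>
#include <stddef.h>

/* Natural layout: the compiler inserts padding so i lands on a 4-byte
   boundary and the struct size is a multiple of the largest alignment. */
struct aligned_s {
    char  c; /* offset 0, then 3 bytes of padding */
    int   i; /* offset 4 */
    short s; /* offset 8, then 2 bytes of tail padding */
};

/* Packed layout: no padding, so i and s end up misaligned. */
struct __attribute__((packed)) packed_s {
    char  c; /* offset 0 */
    int   i; /* offset 1 */
    short s; /* offset 5 */
};
```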
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines), then it can address 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits left over for something like flags. Taking the 2 LSBs off of an address gives a 4-byte alignment, also referred to as a stride of 4 bytes. Each time an address is incremented, it is effectively incrementing bit 2, not bit 0; i.e., the last 2 bits always remain 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data in, out, and between more and faster execution units in a highly reliable way.
Bonus: Caches
Another alignment-for-performance consideration that I alluded to previously is alignment on cache lines, which are (for example, on some CPUs) 64 B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes:
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single-byte fetches rather than one efficient word fetch, but many language designers decided it would be easier just to outlaw unaligned access and force everything to be aligned.
There is much more information in this link that the OP discovered.
You can with some processors (Nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line. Because the bus is 64 bits wide, you had to fetch 64 bits at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2-byte variable that was split across 2 chunks, that required double the memory accesses.
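The mask-and-shift the answer describes looks like this in C for an aligned 64-bit chunk (little-endian byte numbering assumed):

```c
#include <assert.h>
#include <stdint.h>

/* Extract byte `index` (0 = lowest-addressed on little-endian) from a
   64-bit chunk: shift the wanted byte down, then truncate. */
static uint8_t byte_from_chunk(uint64_t chunk, unsigned index) {
    return (uint8_t)(chunk >> (8u * index));
}
```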
So, since everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of the L1 cache in this example (32 KB)
128 bits - size of the bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
@joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments looks like.
In addition, here's a link to a GitHub gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch that @joshperry referenced. The tests were run on a MacBook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16 GB of RAM.
If you have a 32-bit data bus, the address bus lines connected to the memory will start from A2, so only 32-bit-aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary (i.e. A0 for 16/32-bit data, or A1 for 32-bit data, is not zero), two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler-generated unaligned access code requires not just additional bus cycles but additional instructions, making it even less efficient.
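In portable C, the usual way to express a safe unaligned read on such architectures is to go through memcpy; the compiler lowers it to a single load where the hardware allows one, and to byte accesses where it does not:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly misaligned address without
   triggering alignment faults or undefined behavior. */
static uint32_t load32_unaligned(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```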
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
SPARC and x86 and (I think) Itanium raise hardware exceptions when you try this.
One 32-bit load vs. four 8-bit loads isn't going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.
