what does Double word alignment mean? - signal-processing

what is double word aligned data??. i am working on a Ti processor with the c6accel DSP engine.The fft function requires the input data array of samples to be double word aligned.what does double word aligned exactly mean and how do i generate it?

A word is the amount of data which each register in the CPU is able to hold. This is dependant on the processor being used - on 32-bit systems this will be 32 bits, and on 64-bit systems this will be 64 bits, etc... Your TI processor will probably be either 16-bit or 32-bit depending on the model (guess based on this).
The size of a word will generally correspond to the size of a pointer, although technically this is not guarenteed to be true (doesn't work on PS3/XB360) and as a result should not be relied on as a rule (source). The correct way of determining the size of a world will depend on which operating system you're using. As quoted from the previous source:
The C header file may defines WORD_BIT and/or __WORDSIZE.
The size of a double word is just the size of a word * 2. Data in memory (RAM) is generally fetched by programs 1 word at a time assuming the fetch begins on a word boundary. If this is not the case the data on either side of the word boundary will need to be fetched in two separate instructions, which leads to inefficiencies since twice the amount of fetches need to be done and reading/writing to/from RAM is relatively slow (Sidenote: this is largely mitigated by caches in modern processors, although that's another topic altogether).

For TI C6x: double word = 64-bit
Make sure your array starts at an address that is a multiple of 8 bytes.
You can use the assembly directive .align before declaring your array, or you can use the linker to link your array section at an aligned address. If you can't control the address (e.g. you have memory from malloc), make your array start after a few bytes (to the next 8 byte boundary).

Related

How many values can be stored per physical address in Memory?

I've read that you can only store one value per physical address in Ram. Now this data could be an instruction or data. Is this due to when the CPU reads in a Word from Ram, it can only deal with one value at a time? be that an instruction, int or a string. Is there a technical reason you can't fit more than one value per index. I've read about Scalar Processors but aren't they really old. Couldn't you fit two or more values in the width of a 64 bit Word for example? Or am i missing something really obvious here. I guess i'm asking is this a programming concept or is there an actual technical/hardware reason the cpu can't deal with more than one value per read of a Word from Ram..
Thanks
Rob
Most recent computers use addresses that point to a "Byte" location in memory.
Each machine instruction that includes "load (or store) from memory" functionality includes either an implicit or explicit specification of the number of bytes to be loaded/stored, starting at the target byte address. Common sizes are 1, 2, 4, 8 Bytes (corresponding to single data items of the most commonly supported sizes).
It is up to the application program to decide how to interpret the bytes and what operations to perform on them. It is certainly common to store the characters of a string in consecutive byte memory locations and process 4 or 8 characters at a time using 32-bit (4-Byte) or 64-bit (8-Byte) load and store instructions. Operation on the individual bytes (characters) may involves masking, shifting, and copying within the processor's general-purpose registers, but since the late 1990's, many/most microprocessors have included instructions specifically designed to treat the contents of a register as multiple independent (smaller) values.
"Packing" multiple data items into consecutive bytes of memory need not be limited to the sizes of registers for supported arithmetic types (1, 2, 4, 8 Bytes). Since about 2000, many processors have also included "Single Instruction Multiple Data" (SIMD) instructions to load bigger payloads into a set of "SIMD registers". (Common sizes are 16 and 32 Bytes, but some processors support 64 Byte registers.) Systems that include these SIMD load and store instructions typically also include instructions to operate on the SIMD registers "in parallel" -- treating the register contents as multiple independent values. It is common to provide instructions to treat the contents of a 256-bit (32-Byte) register as 32 1-Byte values, 16 2-Byte values, 8 4-Byte values, or 4 8-Byte values. The details vary by processor architecture and generation.

Is it easier to fetch a 4 byte word from a word addressable memory compared to byte addressable?

So i did find some answers related to this in stackvoerflow but non of them clearly answered this
so if our memory is byte addressable and the word size is for example 4 byte, then why not make the memory byte addressable?
if i'm not mistaking CPU will work with words right? so when the cpu tries to get a word from the memory what's the difference between getting a 4 byte word from a byte addressable memory vs getting a word from word addressable memory?
if i'm not mistaking CPU will work with words right?
It depends on the Instruction Set Architecture (ISA) implemented by the CPU. For example, x86 supports operands of sizes ranging from a single 8-bit byte to as much as 64 bytes (in the most recent CPUs). Although the word size in modern x86 CPUs is 8 or 4 bytes only. The word size is generally defined as equal to the size of a general-purpose register. However, the granularity of accessing memory or registers is not necessarily restricted to the word size. This is very convenient from a programmer's perspective and from the CPU implementation perspective as I'll discuss next.
so when the cpu tries to get a word from the memory what's the
difference between getting a 4 byte word from a byte addressable
memory vs getting a word from word addressable memory?
While an ISA may support byte addressability, a CPU that implements the ISA may not necessarily fetch data from memory one byte at a time. Spatial locality of reference is a memory access pattern very common in most real programs. If the CPU was to issue single-byte requests along the memory hierarchy, it would unnecessarily consume a lot of energy and significantly hurt performance to handle single-byte requests and move one-byte data across the hierarchy. Therefore, typically, when the CPU issues a memory request for data of some size at some address, a whole block of memory (known as a cache line, which is usually 64-byte in size and 64-byte aligned) is brought to the L1 cache. All requests to the same cache line can be effectively combined into a single request. Therefore, the address bus between different levels of the memory hierarchy does not have to include wires for the bits that constitute an offset within the cache line. In that case, the implementation would be really addressing memory at the 64-byte granularity.
It can be useful, however, to support byte addressability in the implementation . For example, if only one byte of a cache line has changed and the cache line has to be written back to main memory, instead of sending all the 64 bytes to memory, it would take less energy, bandwidth, and time to send only the byte that changed (or few bytes). Another situation where byte addressability is useful is when providing support for the idea of critical-word first. This is much more to it, but to keep the answer simple, I'll stop here.
DDR SDRAM is a prevalent class of main memory interfaces used in most computer systems today. The data bus width is 8 bytes in size and the protocol supports only transferring aligned 8-byte chunks with byte enable signals (called data masks) to select which bytes to write. Therefore, main memory is typically 8-byte addressable. It is the CPU that provides the illusion of byte addressability.
memory normally is byte-addressable. But whole-word loads are possible, and get 4x as much data in the same time.
There's basically no difference, if the word load is naturally aligned; the low bits of the address are zero instead of being not present.

Why is the smallest value that can be stored is a Byte(8bit) & not a Bit(1bit)?

Why is the smallest value that can be stored a Byte(8bit) & not a Bit(1bit) in memory?
Even booleans are stored as Bytes. Will we ever bump the smallest number to 32 or 64bits like register's on the CPU?
EDIT: To clarify as many answers seemed confused about the nature of questing. This question is about why isn't a byte 7-bit, 1-bit, 32-bit, etc (not why lower bit primitives must fit within the hardware's byte at min). Is the 8-bit byte simply historical as some hardware has 10-bit bytes for example. Or is there a mathematical reason 8-bit is ideal vs say 10-bit for general processing?
The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays the data is most often aligned to dword (32-bits) boundary anyway. Moreover, some hardware (ARM, for example), can't access misaligned multibyte variables, i.e. 16-bit word can't "cross" dword boundary - exception will be thrown.
Because computers address memory at the byte level, so anything smaller than a byte is not addressable.
The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
Some older exotic architectures actually did have different a "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.
Assume that we have native language that consist of 2 character such as a , b
to distinguish two characters we need at least 1 bit for example 0 to represent char a and 1 to represent char b
so that if we count number of characters and special characters and symbols, there are 128 character and to distinguish one character from another, you need log2(128) = 7 bit and 8th bit for transmission

Array not crossing 256 byte boundary

Is it possible to create an array that doesn't cross 256 byte boundary? That is addresses of the individual array items only differ in the lower byte. This is weaker requirement than keeping the array aligned to 256 bytes. The only solution I could think of was aligning to next_power_of_two(sizeof(array)), but I'm not sure about the gaps that would appear this way.
It is for a library for AVR microcontrollers, and this would save me a few precious instructions in an interrupt handler.The array that should have this property is 54 byte long out of about 80 bytes of total static memory used by the library. I'm looking for a way that doesn't increase the memory requirements.
I'm using avr-as gnu assembler and avr-ld linker.
Example:If the array starts at address 0x00f0, then the higher word will change from 0x00 to 0x01 while traversing the array.
I don't really care whether it starts at address 0x0100 or 0x0101 as long as it doesn't cross the boundary.
You only need 64 byte alignment to meet this requirement, so e.g. this should work:
uint8_t a[54] __attribute__ ((aligned(64)));
I don't know anything about AVR microcontrollers, but, generally speaking, static variables are usually placed in the data section of the executable, and, since your static memory requirements are low, all you need to ensure is that the data section is 256 byte aligned. (Which it may be by default. On x86, it usually is.) Check the linker options...

Purpose of memory alignment

Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
you can with some processors (the nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line, because the bus is 64 bits wide, you had to fetch 64 bit at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
#joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments look like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which #joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itatnium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isnt going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.

Resources