Aligning memory on 16-byte and 32-byte boundaries

I'm doing several operations using SIMD instructions (SSE and AVX). As I understand, SSE instructions work best with 16-byte aligned memory, and AVX instructions work best with 32-byte aligned memory.
Would it be safe to always allocate memory aligned to 32-byte boundaries for optimal use with both SSE and AVX?
Are there ever any cases where 32-byte aligned memory is not also 16-byte aligned?

Are there ever any cases where 32-byte aligned memory is not also 16-byte aligned?
Alignment just means that the address is a multiple of 32, and any multiple of 32 is also a multiple of 16.
The first Google hit for "alignment" is Wikipedia, and you can follow the links to https://en.wikipedia.org/wiki/Data_structure_alignment#Definitions, which explains this in lots of detail.
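As a quick sanity check in code, here is a small C sketch (using C11's aligned_alloc; posix_memalign or _mm_malloc would work equally well) that requests 32-byte-aligned storage and verifies that the pointer is also 16-byte aligned:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* C11 aligned_alloc: the size should be a multiple of the alignment. */
    float *buf = aligned_alloc(32, 1024 * sizeof(float));
    if (!buf)
        return 1;

    uintptr_t addr = (uintptr_t)buf;

    /* A multiple of 32 is necessarily a multiple of 16 (and of 8, 4, 2). */
    printf("address %% 32 = %zu\n", (size_t)(addr % 32)); /* always 0 */
    printf("address %% 16 = %zu\n", (size_t)(addr % 16)); /* always 0 */

    free(buf);
    return 0;
}
```

So allocating everything with 32-byte alignment is safe for both SSE (16-byte) and AVX (32-byte) loads and stores; the only cost is a little extra padding per allocation.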

Related

Do SIMD instructions require data to be aligned?

I'm confused about logical and physical memory alignment.
To use vector instructions such as AVX and SSE efficiently, we may need data to be aligned.
Does this mean that data aligned in virtual memory is also aligned properly in physical memory?
If yes, how does the compiler accomplish this?

What's the difference between slab and buddy system?

It seems to me they are quite similar. So what's the relation between slab and buddy system?
A slab is a collection of objects of the same size. It avoids fragmentation by allocating a fairly large block of memory and dividing it into equal-sized pieces. The number of pieces is typically much larger than two, say 128 or so.
There are two ways you can use slabs. First, you could have a slab just for one size that you allocate very frequently. For example, a kernel might have an inode slab. But you could also have a number of slabs in progressive sizes, like a 128-byte slab, a 192-byte slab, a 256-byte slab, and so on. You can then allocate an object of any size from the next slab size up.
Note that in neither case does a slab re-use memory for an object of a different size unless the entire slab is freed back to a global "large block" allocator.
The buddy system is an unrelated method where each object has a "buddy" object which it is coalesced with when it is freed. Blocks are divided in half when smaller blocks are needed. Note that in the buddy system, blocks are divided and coalesced into larger blocks as the primary means of allocation and returning for re-use. This is very different from how slabs work.
Or to put it more simply:
Buddy system: Various sized blocks are divided when allocated and coalesced when freed to efficiently divide a big block into smaller blocks of various sizes as needed.
Slab: Very large blocks are allocated and divided once into equal-sized blocks. No other dividing or coalescing takes place and freed blocks are just held in a list to be assigned to subsequent allocations.
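To make the slab idea concrete, here is a minimal, illustrative sketch in C (not the Linux implementation; all names are made up) of one slab: a large block carved once into equal-sized pieces, with freed pieces kept on a free list and never split or coalesced:

```c
#include <stddef.h>
#include <stdlib.h>

/* One slab: a big block divided once into equal-sized objects.
 * Free objects are chained through their own first bytes,
 * so obj_size must be at least sizeof(void *). */
struct slab {
    void  *block;      /* the large underlying allocation     */
    void  *free_list;  /* singly linked list of free objects  */
    size_t obj_size;   /* fixed object size                   */
};

static int slab_init(struct slab *s, size_t obj_size, size_t nobjs)
{
    s->obj_size = obj_size;
    s->block = malloc(obj_size * nobjs);
    if (!s->block)
        return -1;

    /* Thread every piece onto the free list. */
    s->free_list = NULL;
    for (size_t i = 0; i < nobjs; i++) {
        void *obj = (char *)s->block + i * obj_size;
        *(void **)obj = s->free_list;
        s->free_list = obj;
    }
    return 0;
}

static void *slab_alloc(struct slab *s)
{
    void *obj = s->free_list;
    if (obj)
        s->free_list = *(void **)obj;  /* pop */
    return obj;                        /* NULL when the slab is exhausted */
}

static void slab_free(struct slab *s, void *obj)
{
    *(void **)obj = s->free_list;      /* push; no coalescing, ever */
    s->free_list = obj;
}
```

A kernel-style allocator keeps one such slab (or a chain of them) per object size, so freeing an object just pushes the piece back onto that size's list.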
The Linux kernel's core allocator is a flexible buddy system allocator. This allocator provides the slabs for the various slab allocators.
In general, a slab allocator is a list of fixed-size slabs suited to holding elements of a predefined size. Because all objects in the pool are the same size, there is no fragmentation.
A buddy allocator divides memory into chunks whose sizes double: for example, if the minimum chunk is 1K, the next will be 2K, then 4K, and so on. So if we request an allocation of 100 bytes, the 1K chunk will be chosen. This leads to fragmentation, but it allows allocation of arbitrarily sized objects (so it is well suited for user memory allocations, where the exact object size could be anything).
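As a hedged sketch of the two calculations at the heart of a buddy allocator: rounding a request up to the next power-of-two chunk size, and finding a block's buddy (for a block of size s at offset o from the arena start, the buddy sits at offset o XOR s). The names and the 1K minimum are just illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define MIN_CHUNK 1024u  /* 1K minimum chunk, as in the example above */

/* Round a request up to the next power-of-two chunk size (>= MIN_CHUNK). */
static uint32_t chunk_size_for(uint32_t request)
{
    uint32_t size = MIN_CHUNK;
    while (size < request)
        size *= 2;
    return size;
}

/* The buddy of the block at 'offset' (relative to the arena start) with
 * size 'size' lives at offset ^ size: the two halves of the parent block. */
static uint32_t buddy_offset(uint32_t offset, uint32_t size)
{
    return offset ^ size;
}

int main(void)
{
    printf("100-byte request  -> %u-byte chunk\n", chunk_size_for(100));   /* 1024  */
    printf("3000-byte request -> %u-byte chunk\n", chunk_size_for(3000));  /* 4096  */
    printf("buddy of block at offset 8192 (size 4096) -> %u\n",
           buddy_offset(8192, 4096));                                      /* 12288 */
    return 0;
}
```

When a block and its buddy are both free, they can be merged back into the parent block of twice the size; that coalescing is exactly what a slab never does.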
See also:
https://en.wikipedia.org/wiki/Slab_allocation
https://en.wikipedia.org/wiki/Buddy_memory_allocation
Also worth checking is this presentation: http://events.linuxfoundation.org/images/stories/pdf/klf2012_kim.pdf
The slides from page 22 onward summarize the differences.

Intel 80386 Programmer's Reference Manual: Interpreting bit numbering in example data structure

In illustrations of data structures in memory, smaller addresses appear at the lower-right part of the figure; addresses increase toward the left and upwards. Bit positions are numbered from right to left. Figure 1-1 illustrates this convention.
From my understanding, most modern computers are byte-addressable, meaning each address refers to one byte in memory. My question is how that translates to the example diagram given above.
I see that Fig. 1-1 shows 32 bytes in sequence, and as we move toward the top left, the address of each byte increases, but the bit offsets confuse me. How are the bit offsets related to the bytes? Especially the numbers across the top of the columns: 0, 7, 15, 23, 31.
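Those column headers are bit positions within a unit (bit 0 being the least significant), while the vertical direction is the byte address. A small sketch, assuming a little-endian x86 host, showing that the byte at the lowest address of a 32-bit value holds bits 7..0, the next byte bits 15..8, and so on:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t value = 0x31323334;   /* bits 31..24 = 0x31, ..., bits 7..0 = 0x34 */
    const uint8_t *bytes = (const uint8_t *)&value;

    /* On a little-endian x86 machine the lowest address holds the least
     * significant byte, i.e. bit positions 7..0 of the figure. */
    for (int i = 0; i < 4; i++)
        printf("byte at address+%d holds bits %2d..%2d: 0x%02X\n",
               i, 8 * i + 7, 8 * i, bytes[i]);
    return 0;
}
```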

operating systems main memory fragmentation

Suppose a small computer system has 4 MB of main memory. The system manages it in fixed-size frames. A frames table maintains the status of each frame in memory. How large (how many bytes) should a frame be? You have a choice of one of the following: 1K, 5K, or 10K bytes. Which of these choices will minimize the total space wasted by processes due to fragmentation and frames table storage?
Assume the following: On the average, 10 processes will reside in memory. The average amount of wasted space will be 1/2 frame for each process.
The frames table must have one entry for each frame. Each entry requires 10 bytes.
Here is my answer:
1K would minimize fragmentation; as is known, a small frame size leads to a big table but less wasted space per process.
10 processes, with about 1/2 frame wasted for each.
Am I on the right track?
Yes, you are. I agree that on a system such as this, the smallest size makes the most sense. However, take for example the situation on x86-64, where the options are 4KB, 2MB, and 1GB. Considering modern memory sizes of approximately 4GB, 1GB obviously makes no sense, but because most programs nowadays contain quite a bit of compiled code, or, in the case of interpreted and VM languages, all of the code of the VM, 2MB pages make the most sense. In other words, to determine these things, you have to think about the average memory usage of a program on this system, the number of programs, and most importantly, the ratio of average fragmentation to page table size. While a small page size benefits from low fragmentation, 4KB pages over 4GB of memory mean a very large page table. Very large.
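To put a rough number on that last point, here is a back-of-the-envelope sketch assuming a flat, single-level table with 8-byte entries (real x86-64 avoids exactly this by using multi-level tables) showing how the entry count and table size change with page size:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t mem     = 4ull << 30;   /* 4 GB of mapped memory      */
    const uint64_t entry   = 8;            /* bytes per table entry      */
    const uint64_t pages[] = { 4ull << 10, 2ull << 20, 1ull << 30 };

    for (int i = 0; i < 3; i++) {
        uint64_t n = mem / pages[i];       /* number of entries needed   */
        printf("page size %10llu B -> %8llu entries, table %llu bytes\n",
               (unsigned long long)pages[i],
               (unsigned long long)n,
               (unsigned long long)(n * entry));
    }
    return 0;
}
```

With 4KB pages the flat table for 4GB needs about a million entries (roughly 8MB), while 2MB pages need only a couple of thousand entries.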

The direction of stack growth and heap growth

In some systems the stack grows upward while the heap grows downward, and in other systems the stack grows downward while the heap grows upward. Which is the better design? Are there any programming advantages to either of the two designs? Which is most commonly used, and why has a single approach not been standardized? Are they helpful for, or targeted at, certain specific scenarios? If yes, what are they?
Heaps only "grow" in a direction in very naive implementations. As Paul R. mentions, the direction a stack grows is defined by the hardware - on Intel CPUs, it always grows toward smaller addresses (i.e. "downward").
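A small, non-portable sketch to observe this: compare the address of a local variable in a caller with one in its callee. Optimizers may inline or rearrange frames, so this is only illustrative; on typical x86/x86-64 builds the callee's local ends up at a lower address:

```c
#include <stdint.h>
#include <stdio.h>

/* Ask the compiler not to inline, so each call really gets its own frame.
 * (__attribute__((noinline)) is a GCC/Clang extension.) */
__attribute__((noinline)) static void callee(uintptr_t caller_local_addr)
{
    int callee_local = 0;
    if ((uintptr_t)&callee_local < caller_local_addr)
        puts("stack grows toward smaller addresses (downward)");
    else
        puts("stack grows toward larger addresses (upward)");
}

int main(void)
{
    int caller_local = 0;
    callee((uintptr_t)&caller_local);
    return 0;
}
```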
I have read the works of Miro Samek and various other embedded gurus, and it seems that they are not in favor of dynamic allocation on embedded systems. That is probably due to the complexity and the potential for memory leaks. If you have a project that absolutely can't fail, you will probably want to avoid using malloc, and thus the heap will be small. Other, non-mission-critical systems could be just the opposite. I don't think there is a standard approach.
Maybe it just depends on the processor: whether it supports the stack growing upward or downward?
