It seems to me they are quite similar. So what's the relation between slab and buddy system?
A slab is a collection of objects of the same size. It avoids fragmentation by allocating a fairly large block of memory and dividing it into equal-sized pieces. The number of pieces is typically much larger than two, say 128 or so.
There are two ways you can use slabs. First, you could have a slab just for one size that you allocate very frequently. For example, a kernel might have an inode slab. But you could also have a number of slabs in progressive sizes, like a 128-byte slab, a 192-byte slab, a 256-byte slab, and so on. You can then allocate an object of any size from the next slab size up.
Note that in neither case does a slab re-use memory for an object of a different size unless the entire slab is freed back to a global "large block" allocator.
The buddy system is an unrelated method where each block has a "buddy" block with which it is coalesced when both are free. Blocks are divided in half when smaller blocks are needed. Note that in the buddy system, blocks are divided and coalesced into larger blocks as the primary means of allocation and returning for re-use. This is very different from how slabs work.
Or to put it more simply:
Buddy system: Various sized blocks are divided when allocated and coalesced when freed to efficiently divide a big block into smaller blocks of various sizes as needed.
Slab: Very large blocks are allocated and divided once into equal-sized blocks. No other dividing or coalescing takes place and freed blocks are just held in a list to be assigned to subsequent allocations.
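To make the slab description concrete, here is a minimal sketch in Python. It is a toy model and not how any real kernel implements slabs: the names Slab and pick_slab_size are made up for illustration, offsets stand in for real memory, and the 128/192/256 sizes are just the ones mentioned above.

```python
class Slab:
    """Toy slab: one large block pre-divided into equal-sized slots."""

    def __init__(self, object_size, slot_count=128):
        self.object_size = object_size
        self.slot_count = slot_count
        # Free slots are simply held in a list; nothing is split or coalesced.
        self.free_slots = list(range(slot_count))

    def alloc(self):
        """Return the byte offset of a free slot, or None if the slab is full."""
        if not self.free_slots:
            return None
        return self.free_slots.pop() * self.object_size

    def free(self, offset):
        """Put a slot back on the free list for a later allocation."""
        self.free_slots.append(offset // self.object_size)


def pick_slab_size(request, sizes=(128, 192, 256)):
    """Pick the next slab size up for a request, or None if it is too large."""
    for size in sizes:
        if request <= size:
            return size
    return None
```

A freed slot is handed straight back to the next caller of alloc(), and a 150-byte request would be served from the 192-byte slab.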
The Linux kernel's core allocator is a flexible buddy system allocator. This allocator provides the slabs for the various slab allocators.
In general, a slab allocator is a list of fixed-size slabs, each suited to holding elements of a predefined size. As all objects in a pool are the same size, there is no external fragmentation within the pool.
A buddy allocator divides memory into chunks whose sizes double. For example, if the minimum chunk is 1K, the next will be 2K, then 4K, and so on. So if we request 100 bytes, a 1K chunk will be chosen. This leads to internal fragmentation but allows objects of arbitrary size to be allocated (so it is well suited to user memory allocations, where the object size could be anything).
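The splitting and coalescing can be sketched with a toy buddy allocator. This is only an illustration under assumptions: the class name BuddyAllocator, the 16K total, and the 1K minimum chunk are invented for the example, the total must be the minimum chunk times a power of two, and offsets stand in for real addresses.

```python
class BuddyAllocator:
    """Toy buddy allocator over [0, total) with power-of-two chunk sizes."""

    def __init__(self, total=16 * 1024, min_block=1024):
        self.total = total
        self.min_block = min_block
        # free_lists maps chunk size -> set of free chunk offsets.
        self.free_lists = {}
        size = min_block
        while size <= total:
            self.free_lists[size] = set()
            size *= 2
        self.free_lists[total].add(0)
        self.allocated = {}  # offset -> chunk size

    def block_size_for(self, request):
        """Round a request up to the next chunk size (1K, 2K, 4K, ...)."""
        size = self.min_block
        while size < request:
            size *= 2
        return size

    def alloc(self, request):
        size = self.block_size_for(request)
        # Find the smallest free chunk that is large enough.
        candidate = size
        while candidate <= self.total and not self.free_lists[candidate]:
            candidate *= 2
        if candidate > self.total:
            return None
        offset = self.free_lists[candidate].pop()
        # Split in half repeatedly, leaving the upper half free each time.
        while candidate > size:
            candidate //= 2
            self.free_lists[candidate].add(offset + candidate)
        self.allocated[offset] = size
        return offset

    def free(self, offset):
        size = self.allocated.pop(offset)
        # Coalesce with the buddy for as long as the buddy is also free.
        while size < self.total:
            buddy = offset ^ size
            if buddy not in self.free_lists[size]:
                break
            self.free_lists[size].remove(buddy)
            offset = min(offset, buddy)
            size *= 2
        self.free_lists[size].add(offset)
```

With these defaults, block_size_for(100) returns 1024, so a 100-byte request wastes 924 bytes inside its chunk. Once every chunk has been freed, the allocator is back to a single free 16K block.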
See also:
https://en.wikipedia.org/wiki/Slab_allocation
https://en.wikipedia.org/wiki/Buddy_memory_allocation
It is also worth checking this presentation: http://events.linuxfoundation.org/images/stories/pdf/klf2012_kim.pdf
The slides from page 22 onward summarize the differences.
According to this WWDC iOS Memory Deep Dive talk https://developer.apple.com/videos/play/wwdc2018/416, memory footprint equals the dirty and swapped sizes combined. However, when I use vmmap -summary [memgraph] to inspect my memgraphs, many times they don't add up. In this particular case, vmmap reports memory footprint = 118.5M while the dirty size is 123.3M. How can the footprint be smaller than the dirty size?
In the same WWDC talk, it's mentioned that heap --sortBySize [memgraph] can be used to inspect heap allocations, and I see from my memgraph that the heap size is about 55M (All zones: 110206 nodes (55357344 bytes)), which is much smaller than the MALLOC regions in the vmmap result. Doesn't malloc allocate space on the heap?
Can I say that the 'maximum resident set size of a process' equals the 'minimum RAM size required for the process'?
Am I right? If not, why?
Of sorts, yes. However, the OS may over-allocate memory to the process, or under-allocate (and therefore use swap space). At any rate, it is a good approximation.
See Peak memory usage of a linux/unix process
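As one way to read the value in question from inside a process, here is a small Python sketch using the standard resource module (Unix only). The helper name peak_rss_bytes is made up for this example. The unit handling assumes the usual convention that ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.

```python
import resource
import sys


def peak_rss_bytes():
    """Return this process's maximum resident set size so far, in bytes."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Assumption: ru_maxrss is in kilobytes on Linux but in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale


if __name__ == "__main__":
    print(f"peak RSS: {peak_rss_bytes() / (1024 * 1024):.1f} MiB")
```

The number only reflects what was resident at the peak, so pages that were swapped out or never touched are not counted.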
As the title states, I am trying to add the same image with different offsets, stored in a list, to the accumulating image.
The current implementation performs this on a CPU, and with some intrinsics it can be quite fast.
However, with larger images (2048x2048) and many offsets in the list (~10000), the performance is not satisfactory.
My question is, can the accumulation of the image with different offsets be efficiently implemented on a GPU?
Yes, you can. The result will likely be much faster than on a CPU. The trick is not to send the data for each addition, and not even to launch a new kernel for each addition: each kernel launch should perform a decent number of offset additions at once, at least 16 but possibly a few hundred, depending on your typical list size (and you can have more than one kernel, of course).
Suppose a small computer system has 4 MB of main memory. The system manages it in fixed-size frames. A frames table maintains the status of each frame in memory. How large (how many bytes) should a frame be? You have a choice of one of the following: 1K, 5K, or 10K bytes. Which of these choices will minimize the total space wasted by processes due to fragmentation and frames table storage?
Assume the following: On the average, 10 processes will reside in memory. The average amount of wasted space will be 1/2 frame for each process.
The frames table must have one entry for each frame. Each entry requires 10 bytes.
Here is my answer:
1K would minimize the fragmentation: as is well known, a small frame size leads to a bigger table but less wasted space.
10 processes ~ 1/2 frame wasted on each.
Am I on the right track?
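Yes, you are, in that the answer comes from weighing fragmentation against table size. 1K does minimize fragmentation on its own, but the question asks for the total, so the frames table has to be added in. With 1K frames the table needs about 4096 entries × 10 bytes ≈ 40 KB, plus 10 × 0.5 KB = 5 KB of fragmentation, about 45 KB in all. 5K frames cost about 8 KB of table plus 25 KB of fragmentation (about 33 KB), and 10K frames about 4 KB plus 50 KB (about 54 KB), so under the stated assumptions 5K gives the smallest total. The same reasoning applies to real hardware. Take x86-64, where the options are 4 KB, 2 MB, and 1 GB. With modern memory sizes of approximately 4 GB, 1 GB obviously makes no sense, but because most programs nowadays contain quite a bit of compiled code (or, in the case of interpreted and VM languages, all of the code of the VM), 2 MB pages make the most sense. In other words, to determine these things, you have to think about the average memory usage of a program on the system, the number of programs, and most importantly, the ratio of average fragmentation to page table size. While a small memory like the one in the question benefits from low fragmentation, 4 KB pages on 4 GB of memory make for a very large page table. Very large.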
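The trade-off in the question can be checked with a few lines of Python. This is a sketch under the question's own assumptions (10 resident processes, half a frame wasted per process, 10 bytes per table entry), and it additionally assumes binary units (1K = 1024 bytes, 4 MB = 4 × 1024 × 1024 bytes). The function name wasted_bytes is made up for the example.

```python
MEMORY_BYTES = 4 * 1024 * 1024   # 4 MB of main memory (binary units assumed)
PROCESSES = 10                   # average number of resident processes
ENTRY_BYTES = 10                 # size of one frames table entry


def wasted_bytes(frame_size):
    """Total waste = frames table storage + average internal fragmentation."""
    frames = MEMORY_BYTES // frame_size
    table = frames * ENTRY_BYTES
    fragmentation = PROCESSES * frame_size // 2
    return table + fragmentation


if __name__ == "__main__":
    for kib in (1, 5, 10):
        print(f"{kib}K frames: {wasted_bytes(kib * 1024)} bytes wasted")
```

With these numbers, 1K frames waste 46080 bytes, 5K frames 33790 bytes, and 10K frames 55290 bytes. 1K has the least fragmentation, but once the table is counted, 5K has the smallest total.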
I am learning OpenGL and recently discovered glGenTextures.
Although several sites explain what it does, I feel forced to wonder how it behaves in terms of speed and, particularly, memory.
Exactly what should I consider when calling glGenTextures? Should I consider unloading and reloading textures for better speed? How many textures should a standard game need? What workarounds are there to get around any limitations memory and speed may bring?
According to the manual, glGenTextures only allocates texture "names" (i.e. IDs) with no "dimensionality". So you are not actually allocating texture memory as such, and the overhead here is negligible compared to actual texture memory allocation.
The glTexImage* calls (for example glTexImage2D) actually control the amount of texture memory used per texture. Your application's best usage of texture memory will depend on many factors, including the maximum working set of textures used per frame, the available dedicated texture memory of the hardware, and the bandwidth of texture memory.
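For a rough feel of the per-texture cost, here is a back-of-the-envelope estimate in Python. It assumes an uncompressed format with a fixed number of bytes per pixel and, optionally, a full mipmap chain. The helper name texture_bytes is hypothetical, and what a driver actually allocates may differ (padding, alignment, compression).

```python
def texture_bytes(width, height, bytes_per_pixel=4, mipmaps=False):
    """Estimate the memory for one uncompressed 2D texture, in bytes."""
    total = 0
    while True:
        total += width * height * bytes_per_pixel
        # Stop after the base level, or once the 1x1 mip level is counted.
        if not mipmaps or (width == 1 and height == 1):
            break
        width = max(1, width // 2)
        height = max(1, height // 2)
    return total


if __name__ == "__main__":
    base = texture_bytes(1024, 1024)
    full = texture_bytes(1024, 1024, mipmaps=True)
    print(f"1024x1024 RGBA8: {base} bytes, with mipmaps: {full} bytes")
```

A 1024x1024 texture at 4 bytes per pixel comes to 4194304 bytes, and a full mipmap chain adds roughly another third on top of that.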
As for your question about a typical game: what sort of game are you creating? Console games are starting to fill Blu-ray disc capacity (I've worked on a PS3 title that was initially not projected to fit on a Blu-ray). A large portion of this space is textures. On the other hand, downloadable web games are much more constrained.
Essentially, you need to work with reasonable game design and come up with an estimate of:
1. The total textures used by your game.
2. The maximum textures used at any one time.
Then you need to look at your target hardware and decide how to make it all fit.
Here's a link to an old Game Developer article that should get you started:
http://number-none.com/blow/papers/implementing_a_texture_caching_system.pdf