My Redis version is 3.2.9 and I modified redis.conf:
hash-max-ziplist-entries 256
hash-max-ziplist-value 4096
However, the results do not behave as described in Memory Optimization (the Redis hash structure is supposed to make memory usage more efficient). Capacity assessment also confuses me; I will show the results I get below.
As shown above, for plain Redis string key-values: the first screenshot shows that values of 3085 and 4086 bytes use the same amount of memory. The second shows that 4096 bytes uses more memory (about 1024 extra bytes per key), not 4096 extra per key (jemalloc?).
I hope someone can help me. Thank you.
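One way to see whether a hash actually stays in the compact encoding is to ask Redis for the key's encoding directly. A minimal sketch, assuming the redis-py client and a local server; the key and field names are made up for illustration:

import redis

r = redis.Redis(host="localhost", port=6379)

r.delete("h")
for i in range(200):                       # stays under hash-max-ziplist-entries = 256
    r.hset("h", f"field{i}", "x" * 100)    # values well under hash-max-ziplist-value = 4096

print(r.object("ENCODING", "h"))           # expect b'ziplist' while both limits are respected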
Internally, for optimisation purposes, Redis stores small hashes in a data structure called a ziplist, which packs the entries into one contiguous block of memory.
So the optimisation is really compaction: it reduces the memory wasted on allocating and maintaining per-entry pointers.
ziplist:
+----+----+----+
| a | b | c |
+----+----+----+
Now, let's say we update the value of b and its size grows from, say, 10 to 20 bytes.
There is no way to fit the larger value in place, so Redis resizes the ziplist.
ziplist:
+----+--------+----+
| a | bb | c |
+----+--------+----+
So, when resizing, Redis creates a new, larger block of memory, copies the old data into the newly allocated block, and then deallocates the old memory area.
Since memory is moved around in such cases, this leads to memory fragmentation.
Redis also performs memory defragmentation, which can bring this ratio down, even to less than 1.
This fragmentation ratio is calculated as
(resident memory) / (allocated memory)
How can resident memory be less than allocated memory, you ask?
Normally the allocated memory should be fully contained in the resident memory; nevertheless, there are a few exceptions:
If parts of the virtual memory are paged out to disk, the resident memory can be smaller than the allocated memory.
There are cases of shared memory where the shared memory is marked as used, but not as resident.
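To see this ratio on a live instance, Redis reports both numbers in INFO memory. A rough sketch, assuming the redis-py client:

import redis

r = redis.Redis()
mem = r.info("memory")

# used_memory_rss is the resident set size; used_memory is what the allocator reports
ratio = mem["used_memory_rss"] / mem["used_memory"]
print(ratio, mem.get("mem_fragmentation_ratio"))   # the server reports the same ratio itself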
I am dealing with a legacy system (Ruby 2.7.6), which suffers from a memory leak, that led the previous developers to make use of puma worker killer that overcomes the memory issue by restarting the process every 30 minutes.
As traffic increases, we now need to increase the number of instances and decrease the 30 minutes kill rate to even 20 minutes.
We would like to investigate the source of this memory leak, which apparently originates from one of our many Gem dependencies (information given by a previous developer).
The system is on AWS (Elastic Beanstalk) but can also run on docker.
Can anyone suggest a good tool, and guide me on how to find the source of this memory leak?
Thanks
UPDATE:
I made use of rack-mini-profiler and took some memory snapshots to see the influence of about 100 requests on the server (BEFORE, DURING, AFTER).
Judging by the outputs, there does not seem to be a memory leak on the Ruby side, but the memory usage did increase and stay up, although the extra memory does not seem to be used by us...
BEFORE:
KiB Mem :  2007248 total,   628156 free,   766956 used,   612136 buff/cache
KiB Swap:  2097148 total,  2049276 free,    47872 used.  1064852 avail Mem

Total allocated: 115227 bytes (1433 objects)
Total retained: 21036 bytes (147 objects)
allocated memory by gem
33121 activesupport-6.0.4.7
21687 actionpack-6.0.4.7
14484 activerecord-6.0.4.7
12582 var/app
9904 ipaddr
6957 rack-2.2.4
3512 actionview-6.0.4.7
2680 mysql2-0.5.3
1813 rack-mini-profiler-3.0.0
1696 audited-5.0.2
1552 concurrent-ruby-1.1.10
DURING:
KiB Mem :  2007248 total,    65068 free,  1800424 used,   141756 buff/cache
KiB Swap:  2097148 total,  2047228 free,    49920 used.    58376 avail Mem

Total allocated: 225272583 bytes (942506 objects)
Total retained: 1732241 bytes (12035 objects)
allocated memory by gem
106497060 maxmind-db-1.0.0
58308032 psych
38857594 user_agent_parser-2.7.0
4949108 activesupport-6.0.4.7
3967930 other
3229962 activerecord-6.0.4.7
2154670 rack-2.2.4
1467383 actionpack-6.0.4.7
1336204 activemodel-6.0.4.7
AFTER:
KiB Mem :  2007248 total,    73760 free,  1817688 used,   115800 buff/cache
KiB Swap:  2097148 total,  2032636 free,    64512 used.    54448 avail Mem

Total allocated: 109563 bytes (1398 objects)
Total retained: 14988 bytes (110 objects)
allocated memory by gem
29745 activesupport-6.0.4.7
21495 actionpack-6.0.4.7
13452 activerecord-6.0.4.7
12502 var/app
9904 ipaddr
7237 rack-2.2.4
3128 actionview-6.0.4.7
2488 mysql2-0.5.3
1813 rack-mini-profiler-3.0.0
1360 audited-5.0.2
1360 concurrent-ruby-1.1.10
Where can the leak be, then? Is it Puma?
It seems from the statistics in the question that most objects get freed properly by the memory allocator.
However - when you have a lot of repeated allocations, the system's malloc can sometimes (and often does) hold the memory without releasing it to the system (Ruby isn't aware of this memory that is considered "free").
This is done for 2 main reasons:
Most importantly: heap fragmentation (the allocator is unable to free the memory and unable to use parts of it for future allocations).
The system's memory allocator knows it would probably need this memory again soon (that's in relation to the part of the memory that can be freed and doesn't suffer from fragmentation).
This can be solved by trying to replace the system's memory allocator with an allocator that's tuned for your specific needs (i.e., jemalloc, as suggested here and here and asked about here).
You could also try to use gems that have a custom memory allocator when using C extensions (the iodine gem does that, but you could make other gems do it too).
This approach should help mitigate the issue, but the fact is that some of your gems appear memory hungry... I mean...:
Is the maxmind-db gem using 106,497,060 bytes (106MB) of memory, or did it allocate that number of objects?
And why is psych so hungry? Are there any round trips between data and YAML that could be skipped?
there seems to be a lot of user agent strings stored concurrently... (the user_agent_parser gem)... maybe you could make a cache of these strings instead of having a lot of duplicates. For example, you could make a Set of these strings and replace each String object with the object in the Set. This way equal strings would point at the same object (preventing some object duplication and freeing up some memory).
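As a sketch of that deduplication idea (shown in Python purely for illustration; the same pattern applies to Ruby strings): keep one canonical copy per distinct user-agent string and point every record at it.

canonical = {}

def dedupe(ua: str) -> str:
    """Return the single shared copy for an equal user-agent string."""
    return canonical.setdefault(ua, ua)

a = dedupe("Mozilla/5.0 (X11; Linux x86_64)")
b = dedupe("Mozilla/5.0 (X11; Linux x86_64)")
assert a is b   # equal strings now share one object instead of two copies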
Is it Puma?
Probably not.
Although I am the author of the iodine web server, I really love the work the Puma team did over the years and think it's a super solid server for what it offers. I really doubt the leak is from the server, but you can always switch and see what happens.
Re: the difference between the Linux report and the Ruby profiler
The difference is in the memory held by malloc - "free" memory that isn't returned to the system but Ruby doesn't know about.
Ruby profilers test the memory Ruby allocated ("live" memory, if you will). They have access to the number of objects allocated and the memory held by those objects.
The malloc library isn't part of Ruby. It's part of the C runtime library on top of which Ruby sits.
There's memory allocated for the process by malloc that isn't used by Ruby. That memory is either waiting to be used (retained by malloc for future use) or waiting to be released back to the system (or fragmented and lost for the moment).
That difference between what Ruby uses and what malloc holds should explain the difference between The Linux reporting and the Ruby profiling reporting.
Some gems might be using their own custom made memory allocator (i.e., iodine does that). These behave the same as malloc in the sense that the memory they hold will not show up in the Ruby profiler (at least not completely).
I am using Delphi 7 Enterprise under Windows 7 64 bit.
My computer has 16 GB of RAM.
I try to use kbmMemTable 7.70.00 Professional Edition (http://news.components4developers.com/products_kbmMemTable.html) .
My table has 150,000 records, but when I try to copy the data from the dataset to the kbmMemTable, it only copies 29,000 records and I get this error: EOutOfMemory.
I saw this message:
https://groups.yahoo.com/neo/groups/memtable/conversations/topics/5769,
but it didn't solve my problem.
An out-of-memory error can happen for various reasons:
Your application uses too much memory in general. A 32-bit application typically runs out of memory when it has allocated 1.4GB using the FastMM memory manager. Other memory managers may have worse or better ranges.
Memory fragmentation. There may not be enough contiguous space in memory for a single large allocation that is requested. kbmMemTable will attempt to allocate roughly 200,000 x 4 bytes as its own largest single allocation, and that shouldn't be a problem.
Too many small allocations, leading to the memory fragmentation above. kbmMemTable will allocate from 1 to n blocks of memory per record, depending on the setting of the Performance property.
If Performance is set to fast, then 1 block will be allocated (unless blobs fields exists, in which case an additional allocation will be made per not null blob field).
If Performance is balanced or small, then each string field will allocate another block of memory per record.
best regards
Kim/C4D
Please explain it nicely. Don't just write a definition. Also explain what it does and how it is different from segmentation.
Fragmentation needs to be considered with memory allocation techniques. Paging is basically not a memory allocation technique, but rather a means of providing virtual address spaces.
Considering the comparison with segmentation, what you're probably asking about is the difference between a memory allocation technique using fixed size blocks (like the pages of paging, assuming 4KB page size here) and a technique using variable size blocks (like the segments used for segmentation).
Now, assume that you directly use the page allocation interface to implement memory management, that is you have two functions for dealing with memory:
alloc_page, which allocates a single page and returns a pointer to the beginning of the newly available address space, and
free_page, which frees a single, allocated page.
Now suppose all of your currently available virtual memory is used, but you need to store 1 additional byte. You call alloc_page and get a 4KB block of memory. You only use 1 byte of that huge block, but also the other 4095 bytes are, from the perspective of the allocator, used. If this happens multiple times eventually all pages will be allocated, so further calls to alloc_page will fail. Even if you just need another additional byte (which could be one of the 4095 that got wasted above) the allocator will tell you that you're out of memory. This is internal fragmentation.
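To put rough numbers on that, here is a tiny sketch (the page size and request sizes are made up for illustration):

PAGE = 4096                             # fixed page size in bytes

requests = [1, 1, 1, 1]                 # four 1-byte allocations
pages_used = len(requests)              # each request still consumes a whole page
wasted = sum(PAGE - size for size in requests)
print(pages_used, "pages allocated,", wasted, "bytes lost to internal fragmentation")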
If, on the other hand, you use variable-sized blocks (like in segmentation), then you're vulnerable to external fragmentation. Suppose you manage 6 bytes of memory (F means "free"):
FFFFFF
You first allocate 3 bytes for a, then 1 for b and finally 2 bytes for c:
aaabcc
Now you free both a and c, leaving only b allocated:
FFFbFF
You now have 5 bytes of unused memory, but if you try to allocate a block of 4 bytes (which is less than the available memory) the allocation will fail due to the unfavorable placement of the memory for b. This is external fragmentation.
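The same six-byte scenario as a quick sketch, just to make the failure concrete:

memory = list("FFFbFF")                 # 'F' = free byte, anything else = allocated

def largest_free_run(mem):
    best = run = 0
    for cell in mem:
        run = run + 1 if cell == "F" else 0
        best = max(best, run)
    return best

print(memory.count("F"), "bytes free, largest contiguous run:", largest_free_run(memory))
# -> 5 bytes free, largest contiguous run: 3, so a 4-byte request fails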
Now, if you extend your page allocator to be able to allocate multiple pages and add alloc_multiple_pages, you have to deal with both internal and external fragmentation.
There is no external fragmentation in paging, but internal fragmentation exists.
First, we need to understand what external fragmentation is. External fragmentation occurs when we have enough memory to accommodate a process, but it is not contiguous.
How does it not occur in paging?
Paging divides virtual memory, i.e. all processes, into equal-sized pages, and physical memory into fixed-size frames. So you are fitting equal-sized blocks called pages into equal-sized slots called frames. Try to visualize it and you can conclude that there can never be external fragmentation.
In the case of segmentation, we divide virtual address space into different-sized blocks, which is why some blocks in main memory may have to be compacted to make space for a new process. I hope that helps!
When a process is divided into fixed-size pages, there is generally some leftover space in the last page (internal fragmentation). When there are many processes, the unused areas of their last pages could add up to the size of one page or more. Now, even though the total free space is one page or more, you cannot load a new page, because a page has to be contiguous. External fragmentation has happened. So, I don't think external fragmentation is completely zero in paging.
EDIT: It is all about how external fragmentation is defined. The collection of internal fragmentation does not contribute to external fragmentation. External fragmentation is contributed by the empty space which is EXTERNAL to a partition (or page). Suppose there are only two frames in main memory, say of size 16B, each occupied by only 1B of data. The internal fragmentation in each frame is 15B. The total unused space is 30B. Now if you want to load one new page of some process, you will see that you do not have any frame available. You are unable to load a new page even though you have 30B of unused space. Will you call this external fragmentation? The answer is no, because those 15B of unused space are INTERNAL to the pages. So in paging, internal fragmentation is possible but not external fragmentation.
Paging allows a process to be allocated physical memory in a non-contiguous fashion. I will explain why external fragmentation can't occur in paging.
External fragmentation occurs when a process, which was allocated contiguous memory, is unloaded from physical memory, creating a hole (free space) in the memory.
Now if a new process comes along that requires more memory than this hole, we won't be able to allocate contiguous memory to that process due to the non-contiguous nature of the free memory. This is called external fragmentation.
The problem above originated from the constraint of allocating contiguous memory to the process. This is what paging solved, by allowing a process to get non-contiguous physical memory.
In paging, the probability of having external fragmentation is very low, although internal fragmentation may occur.
In the paging scheme, the whole of main memory and virtual memory is divided into fixed-size slots, which are called pages (in the case of virtual memory) and page frames (in the case of main memory, i.e. RAM or physical memory). So, whenever a process executes in main memory, it occupies the entire space of a page frame. Let us say the main memory has 4096 page frames, with each page frame having a size of 4096 bytes. Suppose there is a process P1 which requires 3000 bytes of space for its execution in main memory. In order to execute P1, it is brought from virtual memory to main memory and placed in a page frame (F1), but P1 requires only 3000 bytes of space for its execution, and as a result 4096 - 3000 = 1096 bytes of space in the page frame F1 are wasted. In other words, this is a case of internal fragmentation in the page frame F1.
Again, external fragmentation may occur if some space in main memory could not be included in a page frame. But this case is very rare, as usually the size of main memory, the size of a page frame, and the total number of page frames in main memory can all be expressed as powers of 2.
As far as I've understood, I would answer your question like so:
Why is there internal fragmentation with paging?
Because a page has a fixed size, but processes may request more or less space. Say a page is 32 units, and a process requests 20 units. Then when a page is given to the requesting process, that page is no longer usable despite having 12 units of free "internal" space.
Why is there no external fragmentation with paging?
Because in paging, a process is allowed to be allocated spaces that are non-contiguous in the physical memory. Meanwhile, the logical representation of those blocks will be contiguous in the virtual memory. This is what I mean:
A process requires 128 units of space. This is 4 pages, as in the previous example. Regardless of the actual frame numbers in physical memory, you give those pages the numbers 0, 1, 2, and 3. This is the virtual representation that is the defining characteristic of paging itself. Those pages may be frames 21, 213, 23, 234 in the actual physical memory. But they can really be anything, contiguous or non-contiguous. Therefore, even if paging leaves small free spaces in between used spaces, those small free spaces can still be used together as if they were one contiguous block of space. That's why external fragmentation won't happen.
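A tiny sketch of that mapping (the frame numbers are the made-up ones from the paragraph above, and the page size is taken from the earlier example):

PAGE = 32                                      # page size in units, as in the earlier example
page_table = {0: 21, 1: 213, 2: 23, 3: 234}    # virtual page -> physical frame

def translate(virtual_address):
    page, offset = divmod(virtual_address, PAGE)
    return page_table[page] * PAGE + offset

print(translate(70))   # virtual address 70 sits in virtual page 2, so it lands in frame 23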
Frames are allocated as units. If the memory requirements of a process do not happen to coincide with page boundaries, the last frame allocated may not be completely full.
For example, if the page size is 2,048 bytes, a process of 72,766 bytes will need 35 pages plus 1,086 bytes. It will be allocated 36 frames, resulting in internal fragmentation of 2,048 - 1,086 = 962 bytes. In the worst case, a process would need 11 pages plus 1 byte. It would be allocated 11 + 1 frames, resulting in internal fragmentation of almost an entire frame.
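The same arithmetic as a quick sketch:

page_size = 2048
process_size = 72766

full_pages, remainder = divmod(process_size, page_size)   # 35 pages, 1086 bytes left over
frames = full_pages + (1 if remainder else 0)             # 36 frames allocated
internal_fragmentation = (page_size - remainder) if remainder else 0
print(frames, internal_fragmentation)                     # -> 36 962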
const char programSource[] =
    "__kernel void vecAdd(__global int *a, __global int *b, __global int *c)"
    "{"
    "    int gid = get_global_id(0);"
    "    for(int i = 0; i < 10; i++){"
    "        a[gid] = b[gid] + c[gid];"
    "    }"
    "}";
The kernel above is a vector addition done ten times in a loop. I have used the programming guide and Stack Overflow to figure out how global memory works, but I still can't figure out by looking at my code whether I am accessing global memory in a good way. I am accessing it in a contiguous fashion and, I am guessing, in an aligned way. Does the card load 128-byte chunks of global memory for arrays a, b, and c? Does it then load the 128-byte chunks for each array once for every 32 gid indexes processed? (4*32 = 128) It seems like I am then not wasting any global memory bandwidth, right?
BTW, the compute profiler shows a gld and gst efficiency of 1.00003, which seems weird. I thought it would just be 1.0 if all my stores and loads were coalesced. How is it above 1.0?
Yes, your memory access pattern is pretty much optimal. Each halfwarp is accessing 16 consecutive 32-bit words. Furthermore, the access is 64-byte aligned, since the buffers themselves are aligned and the start index for each halfwarp is a multiple of 16. So each halfwarp will generate one 64-byte transaction, and you shouldn't be wasting memory bandwidth through uncoalesced accesses.
Since you asked for examples in your last question, let's modify this code for other, less optimal access patterns (since the loop doesn't really do anything, I will ignore it):
kernel void vecAdd(global int* a, global int* b, global int* c)
{
    int gid = get_global_id(0);
    a[gid + 1] = b[gid * 2] + c[gid * 32];
}
First, let's see how this works on compute 1.3 (GT200) hardware.
The writes to a will generate a slightly suboptimal pattern (following the halfwarps identified by their gid range and the corresponding access pattern):
gid | addr. offset | accesses | reasoning
0- 15 | 4- 67 | 1x128B | in aligned 128byte block
16- 31 | 68-131 | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
32- 47 | 132-195 | 1x128B | in aligned 128byte block
48- 63 | 196-259 | 1x64B, 1x32B | crosses 128B boundary, so no 128B access
So basically we are wasting about half our bandwidth (the less-than-doubled access width for the odd halfwarps doesn't help much, because it generates more accesses, which isn't any faster than wasting more bytes, so to speak).
For the reads from b, the threads access only even elements of the array, so for each halfwarp all accesses lie in a 128-byte-aligned block (the first element is at the 128B boundary, since for that element the gid is a multiple of 16, so the index is a multiple of 32, which for 4-byte elements means the address offset is a multiple of 128B). The access pattern stretches over the whole 128B block, so this will do a 128B transfer for every halfwarp, again wasting half the bandwidth.
The reads from c generate one of the worst-case scenarios, where each thread indexes into its own 128B block, so each thread needs its own transfer. On one hand this is a bit of a serialization scenario (although not quite as bad as usual, since the hardware should be able to overlap the transfers). What's worse is the fact that this will transfer a 32B block for each thread, wasting 7/8 of the bandwidth (we access 4B/thread; 32B/4B = 8, so only 1/8 of the bandwidth is utilized). Since this is the access pattern of naive matrix transposes, it is highly advisable to do those using local memory (speaking from experience).
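To make the three patterns concrete, here is a rough sketch that just counts how many aligned 128-byte segments each halfwarp touches for a[gid+1], b[gid*2] and c[gid*32]; it ignores the finer 32B/64B transaction splitting, so it only approximates the tables in this answer:

SEG = 128      # aligned segment size in bytes
ELEM = 4       # sizeof(int)

def segments_touched(indices):
    """Unique aligned 128-byte segments covered by the 4-byte elements at these indices."""
    segs = set()
    for i in indices:
        start = i * ELEM
        segs.add(start // SEG)
        segs.add((start + ELEM - 1) // SEG)
    return segs

patterns = {"a[gid+1]": lambda g: g + 1,
            "b[gid*2]": lambda g: g * 2,
            "c[gid*32]": lambda g: g * 32}

for name, index in patterns.items():
    for first in (0, 16, 32, 48):                      # first gid of each halfwarp
        gids = range(first, first + 16)
        n = len(segments_touched(index(g) for g in gids))
        print(f"{name:10} halfwarp {first:2}-{first + 15:2}: {n} segment(s)")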
Compute 1.0 (G80)
Here the only pattern which will create good accesses is the original one; all patterns in the example will create completely uncoalesced accesses, wasting 7/8 of the bandwidth (32B transfer/thread, see above). For G80 hardware, every access where the nth thread in a halfwarp doesn't access the nth element creates such uncoalesced accesses.
Compute 2.0 (Fermi)
Here every access to memory creates 128B transactions (as many as necessary to gather all the data, so 16x128B in the worst case); however, those are cached, making it less obvious where data will be transferred. For the moment, let's assume the cache is big enough to hold all the data and there are no conflicts, so every 128B cacheline will be transferred at most once. Let's furthermore assume a serialized execution of the halfwarps, so we have a deterministic cache occupation.
Accesses to b will still always transfer 128B blocks (no other thread indexes into the corresponding memory area). Accesses to c will generate 128B transfers per thread (worst access pattern possible).
For accesses to a it is the following (treating them like reads for the moment):
gid | offset | accesses | reasoning
0- 15 | 4- 67 | 1x128B | bringing 128B block to cache
16- 31 | 68-131 | 1x128B | offsets 68-127 already in cache, bring 128B for 128-131 to cache
32- 47 | 132-195 | - | block already in cache from last halfwarp
48- 63 | 196-259 | 1x128B | offsets 196-255 already in cache, bringing in 256-383
So for large arrays the accesses to a will waste almost no bandwidth, theoretically.
For this example the reality is of course not quite as good, since the accesses to c will trash the cache pretty nicely.
For the profiler, I would assume that the efficiencies over 1.0 are simply the result of floating-point inaccuracies.
Hope that helps
I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2147186 114.7 3215540 171.8 2945794 157.4
Vcells 251427223 1918.3 592488509 4520.4 592482377 4520.3
yet R appears to have 4GB allocated in resident memory and 2GB in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, let's say that I don't want to let R OS-allocate more than 4GB, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never OS-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory package makes that fairly easy. Of course, using a 64-bit version of R and ample RAM may make the problem go away too.