my redis version is redis-version 3.2.9 and I Modify redis.conf,
hash-max-ziplist-entries 256
hash-max-ziplist-value 4096
however, the results do not play As descriped in Memory Optimization(redis hash structure can make memory more-efficient),
as well, Capacity assessment also confuse me, I will show the result I get below
As showed above, redis string key-value: the first pic shows that 3085 and 4086 uses the same memory. The second pic shows that 4096 uses more memory(about 1024 byte per key), not 4096 per key. jemalooc
I hope someone can help me, thank you
Redis internally, for optimisation purpose, stores entries in a data-structure called ZipList which directly works with memory addresses.
So the optimisation is actually compaction and reduction of memory wastage in using and maintaining pointers.
ziplist:
+----+----+----+
| a | b | c |
+----+----+----+
now, let's say we did an update in value for b and the value size has increased from let's say 10 to 20 bytes.
We have no way to fit that value in between. So we do a zip-list resizing.
ziplist:
+----+--------+----+
| a | bb | c |
+----+--------+----+
So, when when doing resizing, it will create a new block of memory with the larger size and copy the old data to that newly allocated memory and then it will deallocate the old memory area.
Since memory is moved in such cases it leads to memory fragmentation.
Redis also does memory de-fragmentation which can bring this ratio down to less than 1.
This fragmentation is calculated as,
(resident memory) / (memory allocation)
How is resident memory less than allocated memory you ask!
Normally the allocated memory should be fully contained in the resident memory, nevertheless there are a few exceptions:
If parts of the virtual memory are paged out to disk, the resident memory can be smaller than the allocated memory.
There are cases of shared memory where the shared memory is marked as used, but not as resident.
I am implementing a spiking neural network using the CUDA library and am really unsure of how to proceed with regard to the following things:
Allocating memory (cudaMalloc) to many different arrays. Up until now, simply using cudaMalloc 'by hand' has sufficed, as I have not had to make more than 10 or so arrays. However, I now need to make pointers to, and allocate memory for thousands of arrays.
How to decide how much memory to allocate to each of those arrays. The arrays have a height of 3 (1 row for the postsynaptic neuron ids, 1 row for the number of the synapse on the postsynaptic neuron, and 1 row for the efficacy of that synapse), but they have an undetermined length which changes over time with the number of outgoing synapses.
I have heard that dynamic memory allocation in CUDA is very slow and so toyed with the idea of allocating the maximum memory required for each array, however the number of outgoing synapses per neuron varies from 100-10,000 and so I thought this was infeasible, since I have on the order of 1000 neurons.
If anyone could advise me on how to allocate memory to many arrays on the GPU, and/or how to code a fast dynamic memory allocation for the above tasks I would have more than greatly appreciative.
Thanks in advance!
If you really want to do this, you can call cudaMalloc as many times as you want; however, it's probably not a good idea. Instead, try to figure out how to lay out the memory so that neighboring threads in a block will access neighboring elements of RAM whenever possible.
The reason this is likely to be problematic is that threads execute in groups of 32 at a time (a warp). NVidia's memory controller is quite smart, so if neighboring threads ask for neighboring bytes of RAM, it coalesces those loads into a single request that can be efficiently executed. In contrast, if each thread in a warp is accessing a random memory location, the entire warp must wait till 32 memory requests are completed. Furthermore, reads and writes to the card's memory happen a whole cache line at a time, so if the threads don't use all the RAM that was read before it gets evicted from the cache, memory bandwidth is wasted. If you don't optimize for coherent memory access within thread blocks, expect a 10x to 100x slowdown.
(side note: The above discussion is still applicable with post-G80 cards; the first generation of CUDA hardware (G80) was even pickier. It also required aligned memory requests if the programmer wanted the coalescing behavior.)
I know I can reserve virtual memory using VirtualAlloc.
e.g. I can claim 1GB of virtual memory and then call in the first MB of that to put my a growing array into.
When the array grows beyond 1MB I call in the 2nd MB and so on.
This way I don't need to move the array around in memory when it grows, it just stays in place and the Intel/AMD virtual memory manager takes care of my problems.
However does FastMM support this structure, so I don't have to do my own memory management?
Pseudo code:
type
PBigarray = ^TBigarray;
TBigArray = array[0..0] of SomeRecord;
....
begin
VirtualMem:= FastMM.ReserveVirtualMemory(1GB);
PBigArray:= FastMM.ClaimPhysicalMemory(VirtualMem, 1MB);
....
procedure GrowBigArray
begin
FastMM.ClaimMorePhysicalMemory(PBigArray, 1MB {extra});
//will generate OOM exception when claim exceeds 1GB
Does FastMM support this?
No, FastMM4 (as of the latest version I looked at) does not explicitly support this. It's really not a functionality you would expect in a general purpose memory manager as it's trivially simple to do with VirtualAlloc calls.
NexusMM4 (which is part of NexusDB) does something that gives you a similar result, but without wasting all the address space before it is needed in the background.
If you make an initial large allocation (directly via GetMem, or indirectly via a dynamic array or such) the memory is allocated in just the size needed, via VirtualAlloc.
But if that allocation is then resized to a larger size, NexusMM will use a different way to allocate memory which allows it to simply unmap the allocation from the address space an remap it again, at a larger size, when further reallocs takes place.
This prevents the 2 major problems that most general purpose memory managers have when reallocating:
during a normal realloc the existing and new allocation need to be present in the address space at the same time, temporarily doubling the address space and physical memory requirements
during a normal realloc, the whole contents of the existing allocation needs to be copied
So with NexusMM you would get all the advantages of what you showed in your pseudo code (with the exception that the first realloc will involve a copy, and that growing your array might change it's address) by simply using normal GetMem/ReallocMem/FreeMem calls.
I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2147186 114.7 3215540 171.8 2945794 157.4
Vcells 251427223 1918.3 592488509 4520.4 592482377 4520.3
yet R appears to have 4gb allocated in resident memory and 2gb in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, lets say that I don't want to let R OS-allocate more than 4gb, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never os-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory makes that fairly easy. Of course, using a 64bit version of R and ample ram may make the problem go away too.
Admittedly I don't get it. Say you have a memory with a memory word of length of 1 byte. Why can't you access a 4 byte long variable in a single memory access on an unaligned address(i.e. not divisible by 4), as it's the case with aligned addresses?
The memory subsystem on a modern processor is restricted to accessing memory at the granularity and alignment of its word size; this is the case for a number of reasons.
Speed
Modern processors have multiple levels of cache memory that data must be pulled through; supporting single-byte reads would make the memory subsystem throughput tightly bound to the execution unit throughput (aka cpu-bound); this is all reminiscent of how PIO mode was surpassed by DMA for many of the same reasons in hard drives.
The CPU always reads at its word size (4 bytes on a 32-bit processor), so when you do an unaligned address access — on a processor that supports it — the processor is going to read multiple words. The CPU will read each word of memory that your requested address straddles. This causes an amplification of up to 2X the number of memory transactions required to access the requested data.
Because of this, it can very easily be slower to read two bytes than four. For example, say you have a struct in memory that looks like this:
struct mystruct {
char c; // one byte
int i; // four bytes
short s; // two bytes
}
On a 32-bit processor it would most likely be aligned like shown here:
The processor can read each of these members in one transaction.
Say you had a packed version of the struct, maybe from the network where it was packed for transmission efficiency; it might look something like this:
Reading the first byte is going to be the same.
When you ask the processor to give you 16 bits from 0x0005 it will have to read a word from 0x0004 and shift left 1 byte to place it in a 16-bit register; some extra work, but most can handle that in one cycle.
When you ask for 32 bits from 0x0001 you'll get a 2X amplification. The processor will read from 0x0000 into the result register and shift left 1 byte, then read again from 0x0004 into a temporary register, shift right 3 bytes, then OR it with the result register.
Range
For any given address space, if the architecture can assume that the 2 LSBs are always 0 (e.g., 32-bit machines) then it can access 4 times more memory (the 2 saved bits can represent 4 distinct states), or the same amount of memory with 2 bits for something like flags. Taking the 2 LSBs off of an address would give you a 4-byte alignment; also referred to as a stride of 4 bytes. Each time an address is incremented it is effectively incrementing bit 2, not bit 0, i.e., the last 2 bits will always continue to be 00.
This can even affect the physical design of the system. If the address bus needs 2 fewer bits, there can be 2 fewer pins on the CPU, and 2 fewer traces on the circuit board.
Atomicity
The CPU can operate on an aligned word of memory atomically, meaning that no other instruction can interrupt that operation. This is critical to the correct operation of many lock-free data structures and other concurrency paradigms.
Conclusion
The memory system of a processor is quite a bit more complex and involved than described here; a discussion on how an x86 processor actually addresses memory can help (many processors work similarly).
There are many more benefits to adhering to memory alignment that you can read at this IBM article.
A computer's primary use is to transform data. Modern memory architectures and technologies have been optimized over decades to facilitate getting more data, in, out, and between more and faster execution units–in a highly reliable way.
Bonus: Caches
Another alignment-for-performance that I alluded to previously is alignment on cache lines which are (for example, on some CPUs) 64B.
For more info on how much performance can be gained by leveraging caches, take a look at Gallery of Processor Cache Effects; from this question on cache-line sizes
Understanding of cache lines can be important for certain types of program optimizations. For example, the alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice slower.
It's a limitation of many underlying processors. It can usually be worked around by doing 4 inefficient single byte fetches rather than one efficient word fetch, but many language specifiers decided it would be easier just to outlaw them and force everything to be aligned.
There is much more information in this link that the OP discovered.
you can with some processors (the nehalem can do this), but previously all memory access was aligned on a 64-bit (or 32-bit) line, because the bus is 64 bits wide, you had to fetch 64 bit at a time, and it was significantly easier to fetch these in aligned 'chunks' of 64 bits.
So, if you wanted to get a single byte, you fetched the 64-bit chunk and then masked off the bits you didn't want. Easy and fast if your byte was at the right end, but if it was in the middle of that 64-bit chunk, you'd have to mask off the unwanted bits and then shift the data over to the right place. Worse, if you wanted a 2 byte variable, but that was split across 2 chunks, then that required double the required memory accesses.
So, as everyone thinks memory is cheap, they just made the compiler align the data on the processor's chunk sizes so your code runs faster and more efficiently at the cost of wasted memory.
Fundamentally, the reason is because the memory bus has some specific length that is much, much smaller than the memory size.
So, the CPU reads out of the on-chip L1 cache, which is often 32KB these days. But the memory bus that connects the L1 cache to the CPU will have the vastly smaller width of the cache line size. This will be on the order of 128 bits.
So:
262,144 bits - size of memory
128 bits - size of bus
Misaligned accesses will occasionally overlap two cache lines, and this will require an entirely new cache read in order to obtain the data. It might even miss all the way out to the DRAM.
Furthermore, some part of the CPU will have to stand on its head to put together a single object out of these two different cache lines which each have a piece of the data. On one line, it will be in the very high order bits, in the other, the very low order bits.
There will be dedicated hardware fully integrated into the pipeline that handles moving aligned objects onto the necessary bits of the CPU data bus, but such hardware may be lacking for misaligned objects, because it probably makes more sense to use those transistors for speeding up correctly optimized programs.
In any case, the second memory read that is sometimes necessary would slow down the pipeline no matter how much special-purpose hardware was (hypothetically and foolishly) dedicated to patching up misaligned memory operations.
#joshperry has given an excellent answer to this question. In addition to his answer, I have some numbers that show graphically the effects which were described, especially the 2X amplification. Here's a link to a Google spreadsheet showing what the effect of different word alignments look like.
In addition here's a link to a Github gist with the code for the test.
The test code is adapted from the article written by Jonathan Rentzsch which #joshperry referenced. The tests were run on a Macbook Pro with a quad-core 2.8 GHz Intel Core i7 64-bit processor and 16GB of RAM.
If you have a 32bit data bus, the address bus address lines connected to the memory will start from A2, so only 32bit aligned addresses can be accessed in a single bus cycle.
So if a word spans an address alignment boundary - i.e. A0 for 16/32 bit data or A1 for 32 bit data are not zero, two bus cycles are required to obtain the data.
Some architectures/instruction sets do not support unaligned access and will generate an exception on such attempts, so compiler generated unaligned access code requires not just additional bus cycles, but additional instructions, making it even less efficient.
If a system with byte-addressable memory has a 32-bit-wide memory bus, that means there are effectively four byte-wide memory systems which are all wired to read or write the same address. An aligned 32-bit read will require information stored in the same address in all four memory systems, so all systems can supply data simultaneously. An unaligned 32-bit read would require some memory systems to return data from one address, and some to return data from the next higher address. Although there are some memory systems that are optimized to be able to fulfill such requests (in addition to their address, they effectively have a "plus one" signal which causes them to use an address one higher than specified) such a feature adds considerable cost and complexity to a memory system; most commodity memory systems simply cannot return portions of different 32-bit words at the same time.
On PowerPC you can load an integer from an odd address with no problems.
Sparc and I86 and (I think) Itatnium raise hardware exceptions when you try this.
One 32 bit load vs four 8 bit loads isnt going to make a lot of difference on most modern processors. Whether the data is already in cache or not will have a far greater effect.