Measuring Memory used by a dict in Erlang - memory

To get the item count and memory usage of an ets table, T; we may use
ets:info(T,size) & ets:info(T,memory) respectively.
Similarly, dict:size(D) gives the item count for a dict, D.
How can we determine the amount of memory used by a dict ?
Thanks.

Dict is normal Erlang term so it is stored in process heap and is object of garbage collection. You will be usually more concerned about process memory usage than dict itself. You can determine memory usage using erlang:process_info/2. If you will be still interested in size occupied by dict term you can use erts_debug:size/1 and if you would like to know memory used when send as message erts_debug:flat_size/1. Note both functions returns size in words so multiple by 4 or 8 bytes depending on used VM. (i.e 32 or 64 bit VM, use erlang:system_info(wordsize))

Related

Fetching data from memory, how is it done?

What part of the computer program fetches the relevant data from memory?
For example:
let arr = [1,2,3,4]
let x = arr[3]
What is responsible for fetching the data stored at the address ((the address of array arr) + 3), I know a pointer points to the memory address of interest, but that's an abstraction of what's happening, what's really responsible for fetching the stored data in memory, and storing it in a register for an operation, or in a different memory address?
The processor has an address generator, and a read & write interface.  This is true across all processors but the details vary.
Simple, educational and older processors have an MAR, memory address register, and MBR, memory buffer register, and some control signals. That constitutes the cpu to memory interface.
To write to memory, the CPU puts a value into the MBR and MAR, and then asserts a write control signal.  The memory responds by performing the requested write operation (and may signal the processor that it is done).
To do a read operation, the CPU puts a value into the MAR, and asserts a read control signal; the memory responds puttiing a value into the MBR, (and asserts a ready control signal).  The processor takes the value from the MBR and transfers it to where it should go.
You might look up architectures like LC-3, MARIE, and look for block diagrams having those specific registers. Modern processors introduce caches in between the CPU and memory, so there are multiple memory interfaces.
The machine code program will use available instruction encodings to tell the processor when and what to read or write. The compiler's job is to identify machine code instruction sequences that will accomplish the C program's intent.

Heap - How free bytes are tracked?

I am reading about heap and stack usage and I have a question about the heap and the dynamically allocated memory.
How/where the heap memory used by an application is known to be used or availabe?
What I always see in an explanation of dynamic allocation is something like this :
int *p;
p = (int*)malloc(sizeof(int));/* A space in the heap is allocated to store an int*/
p = (int*)malloc(sizeof(int));/* p now points to another space in the heap ; the first allocated space is lost, causing a memory leak*/
I understand the /* comments */ I wrote in the code but I don't clearly understand that the first space is "lost" so I came up with two ideas:
everytime there is a dynamic allocation, "something" must keep track of the space reserved in memory (address of the first byte and the size of the space allocated) : something like a list that is updated at each dynamic allocation.
or is there some sort of a flag for each byte saying that it is free or used by an application.
In this way, a call to the free function would, for the first idea, update the list by removing the address of the bytes or for the second one change the flags.
Thank you for your time.

how does malloc work in details?

I am trying to find some useful information on the malloc function.
when I call this function it allocates memory dynamically. it returns the pointer (e.g. the address) to the beginning of the allocated memory.
the questions:
how the returned address is used in order to read/write into the allocated memory block (using inderect addressing registers or how?)
if it is not possible to allocate a block of memory it returns NULL. what is NULL in terms of hardware?
in order to allocate memory in heap we need to know which memory parts are occupied. where this information (about the occupied memory) is stored (if for example we use a small risc microcontroller)?
Q3 The usual way that heaps are managed are through a linked list. In the simplest case, the malloc function retains a pointer to the first free-space block in the heap, and each free-space block has a header that points to the next free space block in the heap. So the heap is in-effect self-defining in terms of knowing what is not occupied (and by inference what is therefore occupied); this minimizes the amount of overhead RAM needed to manage the heap.
When new space is needed via a malloc call, a large enough free-space block is found by traversing the linked list. That found free-space block is given to the malloc caller (with a small hidden header), and if needed a smaller free-space block is inserted into the linked list with any residual space between the original free space block and how much memory the malloc call asked for.
When a heap block is released by the application, its block is just formatted with the linked-list header, and added to the linked list, usually with some extra logic to combine consecutive free-space blocks into one larger free-space block.
Debugging versions of malloc usually do more, including retaining linked-lists of the allocated areas too, "guard zones" around the allocated heap areas to help detect memory overflows, etc. These take up extra heap space (making the heap effectively smaller in terms of usable space for the applications), but are extremely helpful when debugging.
Q2 A NULL pointer is effectively just a zero, which if used attempts to access memory starting at location 0 of RAM, which is almost always reserved memory of the OS. This is the cause of a significant quantity of memory violation aborts, all caused by programmer's lack of error checking for NULL returns from functions that allocate memory).
Because accessing memory location 0 by a non-OS application is never what is wanted, most hardware aborts any attempt to access location 0 by non-OS software. Even with page mapping such that the applications memory space (including location 0) is never mapped to real RAM location 0, since NULL is always zero, most CPUs will still abort attempts to access location 0 on the assumption that this is an access via a pointer that contains NULL.
Given your RISC processor, you will need to read its documentation to see how it handles attempts to access memory location 0.
Q1 There are many high-level language ways to use allocated memory, primarily through pointers, strings, and arrays.
In terms of assembly language and the hardware itself, the allocated heap block address just gets put into a register that is being used for memory indirection. You will need to see how that is handled in the RISC processor. However if you use C or C++ or such higher level language, then you don't need to worry about registers; the compiler handles all that.
Since you are using malloc, can we assume you are using C?
If so, you assign the result to a pointer variable, then you can access the memory by referencing through the variable. You don't really know how this is implemented in assembly. That depends on CPU you are using. malloc return 0 if it fails. Since usually NULL is defined as 0, you can test for NULL. You don't care how malloc tracks the free memory. If you really need this information, you should look at the source in glibc/malloc available on the net
char * c = malloc(10); // allocate 10 bytes
if (c == NULL)
// handle error case
else
*c = 'a' // write a in the first character on the block

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32 bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32 bit word. On non Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.

Working with large arrays - OutOfRam

I have an algorithm where I create two bi-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix but sometimes it can go as high as 25000x6000 so for both matrices I need something like 146.5 + 586.2 = 732.8MB of RAM.
The problem is that the two blocks need to be contiguous so in most cases, even if 500-600MB of free RAM doesn't seem much on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot or RAM since it uses integers to store data. Unfortunately, values stored may have sign so I cannot use Word instead of integers. SmallInt is too small (my values are bigger than SmallInt, but smaller than Word). I hope that if there is any other way to implement this, it needs not to add a lot of overhead, since processing a matrix with so many lines/column already takes a lot of time. In other words I hope that decreasing memory requirements will not increase processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in TList. Sound very good. Obviously there will be no problem allocation such small memory blocks. But I am afraid it will have a gigantic impact on speed. I use now
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix= array of array of integer (because of the way data is placed in memory). So, breaking the array in independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to do allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within. You can use a default index property to make this transparent to the calling code.
Naturally a block allocated approach like this does incur some processing overhead but it's your best bet if you are having troubles with getting contiguous blocks.
If the absolute values of elements of CurMx fits word then you can store it in word and use another array of boolean for its sign. It reduces 1 byte for each element.
Have you considered to manually allocate the data structure on the heap?
...and measured how this will affect the memory usage and the performance?
Using the heap might actually increase speed and reduce the memory usage, because you can avoid the whole array to be copied from one memory segment to another memory segment. (Eg. if your FillMatrixWithData are declared with a non-const open array parameter).

Resources