Memory allocation to static variables (compile-time memory allocation)

Memory allocation to static variables is done at compile time.
If I compile my application a number of times, will memory be allocated every time?
If yes, then over time it could consume all of my computer's memory. In practice that never happens. How?
Also, when we run the executable of the same application on some other computer, it runs successfully. How does it find the static variables in the other computer's memory if it was compiled on a different computer?
Also, if I start many instances of the same application, will a copy of the static variables be created for each instance, or will a single static variable be shared among all instances?
I think a copy will be created. But my doubt here is that the memory was allocated at compile time and one instance of the application can use that memory, so how will the other instances allocate memory for those static variables?
Overall, my doubt is: what does "memory allocation at compile time" actually mean?

You've misunderstood the statement 'memory allocation at compile time'. What is meant by that is that the compiler writes data to the binary that it produces that indicates that the memory should be set aside when the program is loaded by the operating system.
In particular, such a variable is usually recorded in a section of the output file called the BSS. The compiler places the static variable declaration in the BSS, the OS's program loader reads the BSS section when loading the program, and sets aside enough memory in the freshly created process to hold it.
Every time the program is launched, that is every time a new process is created, new memory is set aside for that process. This includes the memory needed for the BSS aka static variables.
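To make that concrete, here is a minimal sketch (assuming a typical GCC/Clang toolchain on Linux; the section names are the usual ELF ones):

/* bss_demo.c */
#include <stdio.h>

static int uninitialized_counter;        /* reserved in .bss: no bytes stored in the binary, zeroed at load */
static int initialized_counter = 42;     /* stored in .data: its initial value is written into the binary */

int main(void)
{
    /* every process launched from this binary gets its own fresh copy of both variables */
    printf("%d %d\n", uninitialized_counter, initialized_counter);
    return 0;
}

Compiling this and running something like "size a.out" shows the text/data/bss sizes recorded in the file; the loader reserves that much memory again for each new process, which is why compiling repeatedly consumes no extra RAM and why the same executable works on a different machine.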

Related

Does V8 use the "code" or "text" memory type, or is everything in the heap/stack?

In a typical memory layout there are 4 items:
code/text (where the compiled code of the program itself resides)
data
stack
heap
I am new to memory layouts, so I am wondering if V8, which is a JIT compiler and dynamically generates code, stores this code in the "code" segment of memory, or just stores it in the heap along with everything else. I'm not sure if the operating system gives you access to the code/text segment, so I'm not sure if this is a dumb question.
The below is true for the major operating systems running on the major CPUs in common use today. Things will differ on old or some embedded operating systems (in particular things are a lot simpler on operating systems without virtual memory) or when running code without an OS or on CPUs with no support for memory protection.
The picture in your question is a bit of a simplification. One thing it does not show is that (virtual) memory is made up of pages provided to you by the operating system. Each page has its own permissions controlling whether your process can read, write and/or execute the data in that page.
The text section of a binary will be loaded onto pages that are executable, but not writable. The read-only data section will be loaded onto pages that are neither writable nor executable. All other memory in your picture ((un)initialized data, heap, stack) will be stored on pages that are writable, but not executable.
These permissions prevent security flaws (such as buffer overruns) that could otherwise allow attackers to execute arbitrary code by making the program jump into code provided by the attacker or letting the attacker overwrite code in the text section.
Now the problem with these permissions, with regard to JIT compilation, is that you can't execute your JIT-compiled code: if you store it on the stack or the heap (or within a global variable), it won't be on an executable page, so the program will crash when you try to jump into the code. If you try to store it in the text area (by making use of left-over memory on the last page or by overwriting parts of the JIT compiler's code), the program will crash because you're trying to write to read-only memory.
But thankfully operating systems allow you to change the permissions of a page (on POSIX-systems this can be done using mprotect and on Windows using VirtualProtect). So your first idea might be to store the generated code on the heap and then simply make the containing pages executable. However this can be somewhat problematic: VirtualProtect and some implementations of mprotect require a pointer to the beginning of a page, but your array does not necessarily start at the beginning of a page if you allocated it using malloc (or new or your language's equivalent). Further your array may share a page with other data, which you don't want to be executable.
To prevent these issues, you can use functions, such as mmap on Unix-like operating systems and VirtualAlloc on Windows, that give you pages of memory "to yourself". These functions will allocate enough pages to contain as much memory as you requested and return a pointer to the beginning of that memory (which will be at the beginning of the first page). These pages will not be available to malloc. That is, even if your array is significantly smaller than the size of a page on your OS, the page will only be used to store your array - a subsequent call to malloc will not return a pointer to memory in that page.
So the way that most JIT-compilers work is that they allocate read-write memory using mmap or VirtualAlloc, copy the generated machine instructions into that memory, use mprotect or VirtualProtect to make the memory executable and non-writable (for security reasons you never want memory to be executable and writable at the same time if you can avoid it) and then jump into it. In terms of its (virtual) address, the memory will be part of the heap's area of the memory, but it will be separate from the heap in the sense that it won't be managed by malloc and free.
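Here is a rough sketch of that sequence (assuming 64-bit Linux and the x86-64 bytes for "mov eax, 42; ret"; error handling is trimmed, and hardened systems may impose extra W^X rules):

/* jit_sketch.c */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };  /* mov eax, 42; ret */

    /* 1. get a fresh, page-aligned, writable (not yet executable) mapping */
    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    /* 2. copy the generated machine instructions into it */
    memcpy(mem, code, sizeof code);

    /* 3. flip the page to read+execute, so it is never writable and executable at once */
    if (mprotect(mem, 4096, PROT_READ | PROT_EXEC) != 0) return 1;

    /* 4. jump into the generated code */
    int (*fn)(void) = (int (*)(void))mem;
    printf("%d\n", fn());   /* prints 42 */

    munmap(mem, 4096);
    return 0;
}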
Heap and stack are the memory regions where programs can allocate memory at runtime. This is not specific to V8 or to JIT compilers. For more detail, I humbly suggest that you read whatever book that illustration came from ;-)

Different types of address binding in OS?

There are different ways by which the OS knows how to locate a particular piece of code in physical storage. How does it convert a logical address to a physical location?
Binding is necessary to link logical memory to physical memory. Knowing where the program is stored is necessary in order to access it. The binding may be of three different types:
Compile-time binding: the address where the program is stored is known at compile time.
Load-time binding: the address is not known at compile time but is known when the program is loaded, i.e. before running.
Run-time binding: the address is known only while the executable program is running.
I found a very good explanation here.
Summarising below:
A logical memory address is converted to a physical location/address using the following types of address binding (depending on when the binding/conversion happens):
Compile time binding
Load time binding
Execution time binding
If the program's final location in physical memory is known at compile time, then binding can happen at compile time itself; the only caveat is that the program needs to be recompiled any time its physical memory location changes.
If the program's final location in physical memory is not known at compile time, then the compiler generates relative, or relocatable, addresses as offsets from the starting location of the program (e.g., 32 bytes from the starting location). These relocatable addresses are then bound by the loader to absolute addresses in physical memory when it loads the program into main memory. Now if the starting location changes, the program does not need to be recompiled but only reloaded.
Execution time binding happens only in cases where a process can move from one physical memory segment to another during execution.
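One easy way to see that binding is usually not fixed at compile time on a modern desktop OS is to print the address of a global. This sketch assumes a position-independent executable with ASLR enabled (the default for gcc on many Linux distributions): the loader picks the image base at load time, so the address typically changes from run to run.

/* binding_demo.c - sketch; assumes a PIE build with ASLR (gcc binding_demo.c) */
#include <stdio.h>

int global_value = 7;   /* the compiler and linker fix only its offset within the image */

int main(void)
{
    /* the loader chooses the base address at load time, so this typically
       differs between runs - load/run-time binding rather than compile-time */
    printf("&global_value = %p\n", (void *)&global_value);
    return 0;
}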

How does the CPU decide which data it puts in what memory (RAM, cache, registers)?

When the CPU is executing a program, does it move all data through the memory pipeline? Then any piece of data would be moved from RAM -> cache -> registers, so all data that is processed goes through the CPU registers at some point. Or does it somehow select the data it puts in those faster memory types, or can you as a programmer select specific data you want to keep in, for example, the cache for optimization?
The answer to this question is an entire course in itself! A very brief summary of what (usually) happens is that:
You, the programmer, specify what goes in RAM. Well, the compiler does it on your behalf, but you're in control of this by how you declare your variables.
Whenever your code accesses a variable, the CPU's MMU will check if the value is in the cache, and if it is not, it will fetch the 'line' that contains the variable from RAM into the cache (the sketch after this answer illustrates the performance effect of these line-sized fetches). Some CPU instruction sets may allow you to prevent it from doing so (causing a stall) for specific low-frequency operations, but it requires very low-level code to do so. When you update a value, the MMU will perform a 'cache flush' operation, committing the cached memory to RAM. Again, you can affect how and when this happens with low-level code. It will also depend on the MMU configuration, such as whether the cache is write-through, etc.
If you are going to do any kind of operation on the value that requires it to be used by an ALU (Arithmetic Logic Unit) or similar, then it will be loaded into an appropriate register from the cache. Which register depends on the instruction the compiler generated.
Some CPUs support Direct Memory Access (DMA), which provides a shortcut for operations that do not really require the CPU to be involved. These include memory-to-memory copies and the transfer of data between memory and memory-mapped peripheral control blocks (such as UARTs and other I/O blocks). These will cause data to be moved, read or written in RAM without actually involving the CPU core at all.
At a higher level, some operating systems that support multiple processes will save the RAM allocated to the current process to the hard disk when the process is swapped out, and load it back in again from the disk when the process runs again. (This is why you may find 'Page Files' on your C: drive and the options to limit their size.) This allows all of the running processes to utilise most of the available RAM, even though they can't actually share it all simultaneously. Paging is yet another subject worthy of a course on its own. (Thanks to Leeor for mentioning this.)
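As a small illustration of the cache-line behaviour mentioned above (a sketch, not a rigorous benchmark; sizes and timings vary by machine): walking a large array in the order it is laid out in RAM uses every byte of each fetched line, while walking it column-wise wastes most of each line and runs noticeably slower.

/* cache_demo.c */
#include <stdio.h>
#include <time.h>

#define N 2048
static int matrix[N][N];   /* lives in the zero-initialized data area, i.e. ordinary RAM */

static double time_sum(int by_rows)
{
    clock_t start = clock();
    long long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            /* row order touches memory sequentially, so each cache line fetched from RAM
               is fully used; column order touches a new line on almost every access */
            sum += by_rows ? matrix[i][j] : matrix[j][i];
    printf("(sum=%lld) ", sum);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("row-major:    %.3f s\n", time_sum(1));
    printf("column-major: %.3f s\n", time_sum(0));
    return 0;
}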

How much SRAM will I use on my ARM board?

I am developing for the Arduino Due, which has 96k SRAM and 512k flash memory for code. If I have a program that will compile to, say, 50k, how much SRAM will I use when I run the code? Will I use 50k immediately, or only the memory used by the functions I call? Is there a way to measure this memory usage before I upload the sketch to the Arduino?
You can run
arm-none-eabi-size bin.elf
Where:
bin.elf is the generated binary (look it up in the compile log)
arm-none-eabi-size is a tool included with Arduino for ARM that tells you the memory distribution of your binary. This program can be found inside the Arduino directory. On my Mac, this is /Applications/Arduino.app/Contents/Resources/Java/hardware/tools/g++_arm_none_eabi/bin
This command will output:
text data bss dec hex filename
9648 0 1188 10836 2a54 /var/folders/jz/ylfb9j0s76xb57xrkb605djm0000gn/T/build2004175178561973401.tmp/sketch_oct24a.cpp.elf
data + bss is RAM, text is program memory.
Very important: this doesn't account for dynamic memory (stack and heap usage at runtime); it only covers the RAM used by static and global variables. There are other techniques to check RAM usage dynamically, like this one, but they depend on the linker capabilities of the compiler suite you are using.
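For completeness, a common runtime check on ARM-based Arduinos like the Due looks roughly like the following. This is a sketch that assumes the newlib-style sbrk() provided by the Due (SAM3X) core, and it only estimates the gap between the stack and the heap:

/* free_ram sketch for the Arduino Due - an estimate, not an exact figure */
extern char *sbrk(int incr);

int free_ram(void)
{
    char top;                 /* a local variable, so its address sits near the current top of stack */
    return &top - sbrk(0);    /* bytes between the stack and the current end of the heap */
}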
Your whole program is loaded onto the Arduino, so at least 50K of flash memory will be used. Then, when the code runs, you will allocate some variables, some on the stack and some global, which will also take memory, but in SRAM.
I am not sure if there is a way to measure the required memory exactly, but you can get a rough estimate based on the number and types of variables allocated in the code. Remember, global variables take up space during the entire time the code is running on the Arduino; local variables (the ones declared within a pair of {..}) remain in memory until the closing '}' brace, also known as the scope of the variables. Also remember, the compiled 50K code you mention is just the code portion; it does not include your variables, not even the global ones. The code is stored in flash memory and the variables are stored in SRAM. The variables start taking memory only at runtime.
Also, I am curious to know how you are calculating that your code uses 50K of memory.
Here is a little library to output the available RAM memory.
I used it a lot when my program was crashing with no bug in the code. It turned out that I was running out of RAM.
So it's very handy!
Available Memory Library
Hope it helps! :)

Is device memory allocated using cudaMalloc inaccessible on the device with free?

I cannot deallocate memory on the host that I've allocated on the device, or deallocate memory on the device that I allocated on the host. I'm using CUDA 5.5 with VS2012 and Nsight. Is it because the heap that's on the host is not transferred to the heap that's on the device, or the other way around, so that dynamic allocations are unknown between host and device?
If this is in the documentation, it is not easy to find. It's also important to note that an error wasn't thrown until I ran the program with CUDA debugging and the Memory Checker enabled. The problem did not cause a crash outside of CUDA debugging, but it would've caused problems later if I hadn't checked for memory issues retroactively. If there's a handy way to copy the heap/stack from host to device, that'd be fantastic... hopes and dreams.
Here's an example for my question:
__global__ void kernel(char *ptr)
{
    free(ptr);   /* in-kernel free() of a cudaMalloc'd pointer - this is the call that fails */
}

int main(void)
{
    char *ptr;
    cudaMalloc((void **)&ptr, sizeof(char *));   /* cudaMalloc takes only a pointer and a size */
    kernel<<<1, 1>>>(ptr);
    cudaDeviceSynchronize();
    return 0;
}
No, you can't do this.
This topic is specifically covered in the programming guide here
Memory allocated via malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory).
Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().
It's in section B.18.2 of the programming guide, within section B.18, "Dynamic Global Memory Allocation and Operations".
The basic reason for it is that the mechanism used to reserve allocations using the runtime (e.g. cudaMalloc, cudaFree) is separate from the device code allocator, and in fact they reserve out of logically separate regions of global memory.
You may want to read the entire B.18 section of the programming guide, which covers these topics on device dynamic memory allocation.
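As a hedged sketch of the two allocators used correctly side by side (error checking omitted; in-kernel malloc draws from the device heap, whose size can be adjusted with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)):

__global__ void device_side(void)
{
    char *p = (char *)malloc(64);   /* device-heap allocation ... */
    if (p != NULL) {
        p[0] = 'x';
        free(p);                    /* ... freed by the matching in-kernel free() */
    }
}

int main(void)
{
    char *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 64);   /* runtime allocation ... */
    device_side<<<1, 1>>>();
    cudaDeviceSynchronize();
    cudaFree(d_buf);                   /* ... freed by cudaFree(), never by in-kernel free() */
    return 0;
}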
Here is my solution to mixing dynamic memory allocation on the host using the CRT, the host-side CUDA API, and the kernel memory functions. First off, as mentioned above, they all must be managed separately, using a strategy in which dynamic allocations are never handed directly between the system, host, and device allocators without prior communication and coordination. Manual data copies are required, and they are not validated against the kernel's device heap, as noted in Robert's answer/comments.
I also suggest keeping track of (auditing) the number of bytes allocated and deallocated by the three different memory management APIs. For instance, every time system:malloc, host:cudaMalloc, device:malloc or the associated frees are called, use a variable to hold the number of bytes allocated or deallocated in each heap, i.e. system, host, and device. This helps with tracking leaks when debugging.
The process of dynamically allocating, managing, and auditing memory across the system, host, and device perspectives is complex, especially for deep copies of dynamic structures. Here is a strategy that works; suggestions are welcome:
1. Allocate system memory, using cudaMallocHost or malloc, for a structure type that contains pointers, on the system heap;
2. Allocate device memory for the struct from the host, and copy the structure to the device (i.e. cudaMalloc, cudaMemcpy, etc.);
3. From within a kernel, use malloc to create a memory allocation managed by the device heap, and save the pointer(s) in the structure that exists on the device from step 2;
4. Communicate what was allocated by the kernel to the system by exchanging the size of the allocation for each of the pointers in the struct;
5. The host performs the same allocation on the device using the CUDA API (i.e. cudaMalloc) as was done by the kernel on the device; it is recommended to have a separate pointer variable in the structure for this;
6. At this point, the memory allocated dynamically by the kernel in device memory can be manually copied to the location dynamically allocated by the host in device memory (i.e. not using host:memcpy, device:memcpy or cudaMemcpy, but e.g. a simple device-side copy kernel);
7. The kernel cleans up its memory allocations; and,
8. The host uses cudaMemcpy to move the structure from the device; a strategy similar to the one outlined in the comments on the above answer can be used as necessary for deep copies.
Note: cudaMallocHost and system:malloc both share the same system heap, making the system heap and the host heap the same and interoperable, as mentioned in the CUDA guide referenced above. Therefore, only the system heap and the device heap are mentioned.
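Here is a compact sketch of those steps; the names are illustrative, error checking is omitted, and it assumes a device of compute capability 2.0+ so that in-kernel malloc/free are available:

#include <stdio.h>

struct Node {
    char  *device_heap_ptr;   /* filled in by the kernel via in-kernel malloc (step 3) */
    char  *runtime_ptr;       /* separate pointer for the host-side cudaMalloc copy (step 5) */
    size_t nbytes;            /* size reported back to the host (step 4) */
};

__global__ void build(Node *n)            /* step 3: allocate on the device heap */
{
    n->nbytes = 64;
    n->device_heap_ptr = (char *)malloc(n->nbytes);
    for (size_t i = 0; i < n->nbytes; ++i)
        n->device_heap_ptr[i] = (char)i;
}

__global__ void migrate(Node *n)          /* step 6: manual device-side copy between the two regions */
{
    for (size_t i = 0; i < n->nbytes; ++i)
        n->runtime_ptr[i] = n->device_heap_ptr[i];
}

__global__ void cleanup(Node *n)          /* step 7: only device code may free device-heap memory */
{
    free(n->device_heap_ptr);
}

int main(void)
{
    Node h_node = {0};                                 /* step 1: struct on the system side */
    Node *d_node;
    cudaMalloc((void **)&d_node, sizeof(Node));        /* step 2: struct in device memory */
    cudaMemcpy(d_node, &h_node, sizeof(Node), cudaMemcpyHostToDevice);

    build<<<1, 1>>>(d_node);
    cudaMemcpy(&h_node, d_node, sizeof(Node), cudaMemcpyDeviceToHost);   /* step 4: learn nbytes */

    cudaMalloc((void **)&h_node.runtime_ptr, h_node.nbytes);             /* step 5: matching runtime allocation */
    cudaMemcpy(d_node, &h_node, sizeof(Node), cudaMemcpyHostToDevice);   /* hand the new pointer to the device */

    migrate<<<1, 1>>>(d_node);
    cleanup<<<1, 1>>>(d_node);

    char out[64];
    cudaMemcpy(out, h_node.runtime_ptr, h_node.nbytes, cudaMemcpyDeviceToHost);  /* step 8 */
    printf("out[5] = %d\n", out[5]);

    cudaFree(h_node.runtime_ptr);
    cudaFree(d_node);
    return 0;
}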

Resources