So from what I have read SRAM is volatile and EEPROM is non volatile. If SRAM is volatile, how come I sometimes get values (random and garbage but still values) when I use *ptr.
For example for ptr=&x, *ptr might give me a value. Shouldn't I get NULL because it is volatile and SRAM is wiped out every time the power is off?
Volatile, in the terms of memory, means that values won't get preserved after a power cycle. Given the nature of RAM, it may contain any garbage value at the point of power-up. There is nothing in the hardware that initializes RAM to zero.
So you will have to initialize RAM to zero manually if this is needed.
The C standard actually mandates that such initialization is done on all variables with static storage duration - but those only. That "zero-out" initialization is carried out by some firmware before main() is executed. But local C variables will never get initialized automatically.
Please note that the volatile keyword in C has little to do with volatile memories. Don't confuse those two different terms.
No. You mixing contexts. One thing is memory volatility, it concerns the memory physical construction. Other is your code reading random memory address.
Sometimes hardware can wipe a SRAM on power up, sometimes not, you cannot count on it.
If you read non occupied address of the RAM in your code you will read garbage, either bits generated powering process or old data that was disposed and is no longer used in the same power cycle.
Related
I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.
I have a C++ application which embeds Lua and periodically (~10 times per second) runs scripts that can create new userdata objects that allocate non-trivial amounts of memory (~1MiB each) on C++ side. The memory is correctly freed when these objects are garbage-collected, but the problem is that they don't seem to be collected until it's too late and the process starts consuming inordinate amounts of memory (many GiBs). I guess this happens because Lua GC thinks that these objects are just not worth collecting because they appear small on Lua side, and it has no idea how much memory do they actually consume. If this is correct, is there any way to tell it about the real cost of keeping these objects around?
For people familiar with C#, I'm basically looking for the equivalent of GC.AddMemoryPressure() for Lua.
The closest thing you're going to get at present is LUA_GCSTEP. The manual says:
lua_gc
int lua_gc (lua_State *L, int what, int data);
Controls the garbage collector. […] For more details about these options, see collectgarbage.
collectgarbage ([opt [, arg]])
This function is a generic interface to the garbage collector. It performs different functions according to its first argument, opt:
[…]
"step": performs a garbage-collection step. The step "size" is controlled by arg. With a zero value, the collector will perform one basic (indivisible) step. For non-zero values, the collector will perform as if that amount of memory (in KBytes) had been allocated by Lua. Returns true if the step finished a collection cycle.
(emphasis mine)
This will advance the GC once (as if that much was allocated), but that information isn't kept, meaning that parameters influencing the behavior for future collections (GC pause length, total memory in use, etc.) aren't affected. (So check if that's works well enough for you and if it doesn't maybe try playing with the GC pause and step multiplier.)
No. there is no such mechanism in Lua.
Because of having performance issues when passing a code from static to dynamic allocation, I started to wander about how memory allocation is managed in a Fortran code.
Specifically, in this question, I wander if the order or syntax used for the allocate statement makes any difference. That is, does it make any difference to allocate vectors like:
allocate(x(DIM),y(DIM))
versus
allocate(x(DIM))
allocate(y(DIM))
The syntax suggests that in the first case the program would allocate all the space for the vectors at once, possibly improving the performance, while in the second case it must allocate the space for one vector at a time, in such a way that they could end up far from each other. If not, that is, if the syntax does not make any difference, I wander if there is a way to control that allocation (for instance, allocating a vector for all space and using pointers to address the space allocated as multiple variables).
Finally, I notice now that I don't even know one thing: an allocate statement guarantees that at least a single vector occupies a contiguous space in memory (or the best it can?).
From the language standard point of view both ways how to write them are possible. The compiler is free to allocate the arrays where it wants. It normally calls malloc() to allocate some piece of memory and makes the allocatable arrays from that piece.
Whether it might allocate a single piece of memory for two different arrays in a single allocate statement is up to the compiler, but I haven't heard about any compiler doing that.
I just verified that my gfortran just calls __builtin_malloc two times in this case.
Another issue is already pointed out by High Performance Mark. Even when malloc() successfully returns, the actual memory pages might still not be assigned. On Linux that happens when you first access the array.
I don't think it is too important if those arrays are close to each other in memory or not anyway. The CPU can cache arrays from different regions of address space if it needs them.
Is there a way how to control the allocation? Yes, you can overload the malloc by your own allocator which does some clever things. It may be used to have always memory aligned to 32-bytes or similar purposes (example). Whether you will improve performance of your code by allocating things somehow close to each other is questionable, but you can have a try. (Of course this is completely compiler-dependent thing, a compiler doesn't have to use malloc() at all, but mostly they do.) Unfortunately, this will only works when the calls to malloc are not inlined.
There are (at least) two issues here, firstly the time taken to allocate the memory and secondly the locality of memory in the arrays and the impact of this on performance. I don't know much about the actual allocation process, although the links suggested by High Performance Mark and the answer by Vadimir F cover this.
From your question, it seems you are more interested in cache hits and memory locality given by arrays being next to each other. I would guess there is no guarantee either allocate statement ensures both arrays next to each other in memory. This is based on allocating arrays in a type, which in the fortran 2003 MAY 2004 WORKING DRAFT J3/04-007 standard
NOTE 4.20
Unless the structure includes a SEQUENCE statement, the use of this terminology in no way implies that these components are stored in this, or any other, order. Nor is there any requirement that contiguous storage be used.
From the discussion with Vadimir F, if you put allocatable arrays in a type and use the sequence keyword, e.g.
type botharrays
SEQUENCE
double precision, dimension(:), allocatable :: x, y
end type
this DOES NOT ensure they are allocated as adjacent in memory. For static arrays or lots of variables, a sequential type sounds like it may work like your idea of "allocating a vector for all space and using pointers to address the space allocated as multiple variables". I think common blocks (Fortran 77) allowed you to specify the relationship between memory location of arrays and variables in memory, but don't work with allocatable arrays either.
In short, I think this means you cannot ensure two allocated arrays are adjacent in memory. Even if you could, I don't see how this will result in a reduction in cache misses or improved performance. Even if you typically use the two together, unless the arrays are small enough that the cache will include multiple arrays in one read (assuming reads are allowed to go beyond array bounds) you won't benefit from the memory locality.
I am trying to find some useful information on the malloc function.
when I call this function it allocates memory dynamically. it returns the pointer (e.g. the address) to the beginning of the allocated memory.
the questions:
how the returned address is used in order to read/write into the allocated memory block (using inderect addressing registers or how?)
if it is not possible to allocate a block of memory it returns NULL. what is NULL in terms of hardware?
in order to allocate memory in heap we need to know which memory parts are occupied. where this information (about the occupied memory) is stored (if for example we use a small risc microcontroller)?
Q3 The usual way that heaps are managed are through a linked list. In the simplest case, the malloc function retains a pointer to the first free-space block in the heap, and each free-space block has a header that points to the next free space block in the heap. So the heap is in-effect self-defining in terms of knowing what is not occupied (and by inference what is therefore occupied); this minimizes the amount of overhead RAM needed to manage the heap.
When new space is needed via a malloc call, a large enough free-space block is found by traversing the linked list. That found free-space block is given to the malloc caller (with a small hidden header), and if needed a smaller free-space block is inserted into the linked list with any residual space between the original free space block and how much memory the malloc call asked for.
When a heap block is released by the application, its block is just formatted with the linked-list header, and added to the linked list, usually with some extra logic to combine consecutive free-space blocks into one larger free-space block.
Debugging versions of malloc usually do more, including retaining linked-lists of the allocated areas too, "guard zones" around the allocated heap areas to help detect memory overflows, etc. These take up extra heap space (making the heap effectively smaller in terms of usable space for the applications), but are extremely helpful when debugging.
Q2 A NULL pointer is effectively just a zero, which if used attempts to access memory starting at location 0 of RAM, which is almost always reserved memory of the OS. This is the cause of a significant quantity of memory violation aborts, all caused by programmer's lack of error checking for NULL returns from functions that allocate memory).
Because accessing memory location 0 by a non-OS application is never what is wanted, most hardware aborts any attempt to access location 0 by non-OS software. Even with page mapping such that the applications memory space (including location 0) is never mapped to real RAM location 0, since NULL is always zero, most CPUs will still abort attempts to access location 0 on the assumption that this is an access via a pointer that contains NULL.
Given your RISC processor, you will need to read its documentation to see how it handles attempts to access memory location 0.
Q1 There are many high-level language ways to use allocated memory, primarily through pointers, strings, and arrays.
In terms of assembly language and the hardware itself, the allocated heap block address just gets put into a register that is being used for memory indirection. You will need to see how that is handled in the RISC processor. However if you use C or C++ or such higher level language, then you don't need to worry about registers; the compiler handles all that.
Since you are using malloc, can we assume you are using C?
If so, you assign the result to a pointer variable, then you can access the memory by referencing through the variable. You don't really know how this is implemented in assembly. That depends on CPU you are using. malloc return 0 if it fails. Since usually NULL is defined as 0, you can test for NULL. You don't care how malloc tracks the free memory. If you really need this information, you should look at the source in glibc/malloc available on the net
char * c = malloc(10); // allocate 10 bytes
if (c == NULL)
// handle error case
else
*c = 'a' // write a in the first character on the block
I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32 bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32 bit word. On non Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.