Is device memory allocated using cudaMalloc inaccessible on the device with free()?

I cannot deallocate memory on the host that I've allocated on the device or deallocate memory on the device that I allocated on the host. I'm using CUDA 5.5 with VS2012 and Nsight. Is it because the heap that's on the host is not transferred to the heap that's on the device or the other way around, so dynamic allocations are unknown between host and device?
If this is in the documentation, it is not easy to find. It's also important to note that an error wasn't thrown until I ran the program with CUDA debugging and the Memory Checker enabled. The problem did not cause a crash outside of CUDA debugging, but it would've caused problems later if I hadn't checked for memory issues retroactively. If there's a handy way to copy the heap/stack from host to device, that'd be fantastic... hopes and dreams.
Here's an example for my question:
__global__ void kernel(char *ptr)
{
    free(ptr);  // attempt to free runtime-allocated memory from device code
}

int main(void)
{
    char *ptr;
    cudaMalloc((void **)&ptr, sizeof(char *));  // cudaMalloc takes only a pointer and a size
    kernel<<<1, 1>>>(ptr);
    cudaDeviceSynchronize();
    return 0;
}

No, you can't do this.
This topic is specifically covered in the programming guide here:
Memory allocated via malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory).
Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().
It's in section B.18.2 of the programming guide, within section B.18, "Dynamic Global Memory Allocation and Operations".
The basic reason for it is that the host runtime allocator (e.g. cudaMalloc, cudaFree) and the device-code allocator (in-kernel malloc/free) are separate mechanisms, and in fact they reserve out of logically separate regions of global memory.
You may want to read the entire B.18 section of the programming guide, which covers these topics on device dynamic memory allocation.
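To illustrate the separation, here is a minimal sketch (assuming compute capability 2.0 or higher for in-kernel malloc; names and sizes are illustrative) in which each allocation is released by the API that created it:

__global__ void alloc_on_device(char **slot)
{
    *slot = (char *)malloc(64);   // comes from the device heap
}

__global__ void free_on_device(char **slot)
{
    free(*slot);                  // released by the same device-heap allocator
}

int main(void)
{
    char **slot;
    cudaMalloc((void **)&slot, sizeof(char *));  // runtime allocation made from the host

    alloc_on_device<<<1, 1>>>(slot);
    free_on_device<<<1, 1>>>(slot);
    cudaDeviceSynchronize();

    cudaFree(slot);   // releases the runtime allocation, not the device-heap one
    return 0;
}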

Here is my solution for mixing dynamic memory allocation on the host using the CRT, the host-side CUDA API, and the in-kernel memory functions. First off, as mentioned above, all three must be managed separately; dynamic allocations cannot be handed directly between system and device without prior communication and coordination. Manual data copies are required, which do not validate against the kernel's device heap, as noted in Robert's answer/comments.
I also suggest keeping track of (auditing) the number of bytes allocated and deallocated in the three different memory management APIs. For instance, every time a system malloc, host cudaMalloc, or device malloc (or an associated free) is called, use a variable to hold the running number of bytes allocated or deallocated in each heap, i.e. system, host, and device. This helps with tracking leaks when debugging.
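A minimal sketch of such counters (the wrapper approach and names here are just one illustrative possibility, not a complete audit layer):

static size_t system_bytes = 0;                   // CRT heap (malloc/free)
static size_t host_bytes   = 0;                   // runtime heap (cudaMalloc/cudaFree)
__device__ unsigned long long device_bytes = 0;   // device heap (in-kernel malloc/free)

void *audited_malloc(size_t n)
{
    system_bytes += n;
    return malloc(n);
}

cudaError_t audited_cudaMalloc(void **p, size_t n)
{
    host_bytes += n;
    return cudaMalloc(p, n);
}

__device__ void *audited_device_malloc(size_t n)
{
    atomicAdd(&device_bytes, (unsigned long long)n);  // device-heap counter
    return malloc(n);
}

// The matching free wrappers need the allocation size recorded somewhere,
// since free()/cudaFree() do not report it.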
Dynamically allocating, managing, and auditing memory across the system, host, and device perspectives for deep dynamic structure copies is a complex process. Here is a strategy that works (a code sketch follows after the note below); suggestions are welcome:
1. Allocate system memory using cudaMallocHost or malloc for a structure type that contains pointers, on the system heap;
2. Allocate device memory for the struct from the host, and copy the structure to the device (i.e. cudaMalloc, cudaMemcpy, etc.);
3. From within a kernel, use malloc to create a memory allocation managed by the device heap and save the pointer(s) in the structure that exists on the device from step 2;
4. Communicate what was allocated by the kernel to the system by exchanging the size of the allocations for each of the pointers in the struct;
5. The host performs the same allocation on the device using the CUDA API (i.e. cudaMalloc) as was done by the kernel on the device; a separate pointer variable in the structure is recommended for this;
6. At this point, the memory allocated dynamically by the kernel in device memory can be manually copied to the location dynamically allocated by the host in device memory (i.e. not using host memcpy, device memcpy, or cudaMemcpy);
7. The kernel cleans up its memory allocations; and,
8. The host uses cudaMemcpy to move the structure from the device; a strategy similar to the one outlined in the above answer's comments can be used as necessary for deep copies.
Note, cudaMallocHost and system malloc both allocate from host (system) memory, making the system heap and host heap effectively the same and interoperable, as mentioned in the CUDA guide referenced above. Therefore, only the system heap and device heap are mentioned.
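A minimal sketch of steps 3, 6, and 7, assuming a hypothetical struct with one kernel-allocated buffer and one host-allocated buffer (the names, sizes, and single-thread copy are illustrative only):

struct Box {
    char  *kernel_ptr;  // filled by in-kernel malloc (step 3)
    char  *host_ptr;    // points to memory cudaMalloc'd by the host (step 5)
    size_t n;
};

__global__ void device_alloc(Box *b, size_t n)   // step 3
{
    b->kernel_ptr = (char *)malloc(n);
    b->n = n;
}

__global__ void device_copy(Box *b)              // step 6: manual copy inside device memory
{
    for (size_t i = 0; i < b->n; ++i)
        b->host_ptr[i] = b->kernel_ptr[i];
}

__global__ void device_cleanup(Box *b)           // step 7
{
    free(b->kernel_ptr);
}

On the host side, the Box itself would come from cudaMalloc (step 2), host_ptr would be patched in with a small cudaMemcpy once the size is known (steps 4 and 5), and the struct would finally be copied back with cudaMemcpy (step 8).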

Related

Paged memory vs Pinned memory in memory copy [duplicate]

I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The man page of mlock states that locking the page prevents it from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a few gigs of free memory on my system, so there was never any risk that the memory pages could've been swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.
The CUDA driver checks whether the memory range is locked or not and then uses a different codepath. Locked memory is stored in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, a.k.a. async copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access, and it is not necessarily resident in memory (e.g. it can be in swap), so the driver needs to access every page of the non-locked memory, copy it into a pinned buffer, and pass it to the DMA engine (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I can also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because it may reside on disk.
If the memory is not pinned (i.e. page-locked), it's first copied to a page-locked "staging" buffer and then copied to the GPU through DMA.
So by using pinned memory you save the time of copying from pageable host memory to the page-locked staging buffer.
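A minimal sketch contrasting the two paths (the buffer size and stream usage are illustrative):

#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    const size_t N = 1 << 20;

    // Pageable host memory: the driver stages it through an internal pinned buffer.
    float *pageable = (float *)malloc(N * sizeof(float));

    // Pinned (page-locked) host memory: eligible for direct DMA and truly async copies.
    float *pinned;
    cudaMallocHost((void **)&pinned, N * sizeof(float));

    float *d;
    cudaMalloc((void **)&d, N * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaMemcpy(d, pageable, N * sizeof(float), cudaMemcpyHostToDevice);        // staged copy
    cudaMemcpyAsync(d, pinned, N * sizeof(float), cudaMemcpyHostToDevice, s);  // DMA, can overlap
    cudaStreamSynchronize(s);

    cudaStreamDestroy(s);
    cudaFree(d);
    cudaFreeHost(pinned);
    free(pageable);
    return 0;
}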
If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.
A verbose note on copying non-locked pages to locked pages:
It could be extremely expensive if non-locked pages are swapped out by the OS on a busy system with limited CPU RAM. A page fault will then be triggered to load the pages back into CPU RAM through expensive disk I/O operations.
Pinning pages can also cause virtual-memory thrashing on a system where CPU RAM is precious. If thrashing happens, CPU throughput can be degraded a lot.

Is there device side pointer of host memory for kernel use in OpenCL (like CUDA)?

In CUDA, we can achieve kernel-managed data transfer from host memory to device shared memory via a device-side pointer to host memory. Like this:
int *a,*b,*c; // host pointers
int *dev_a, *dev_b, *dev_c; // device pointers to host memory
…
cudaHostGetDevicePointer(&dev_a, a, 0); // mem. copy to device not need now, but ptrs needed instead
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c, c, 0);
…
//kernel launch
add<<<B,T>>>(dev_a,dev_b,dev_c);
// dev_a, dev_b, dev_c are passed into kernel for kernel accessing host memory directly.
In the above example, the kernel code can access host memory via dev_a, dev_b, and dev_c. The kernel can use these pointers to move data from the host to shared memory directly, without relaying it through global memory.
But it seems that this is mission impossible in OpenCL? (Local memory in OpenCL is the counterpart of shared memory in CUDA.)
You can find an exactly identical API in OpenCL.
How it works on CUDA:
According to this presentation and the official documentation.
The money quote about cudaHostGetDevicePointer:
Passes back device pointer of mapped host memory allocated by
cudaHostAlloc or registered by cudaHostRegister.
CUDA's cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with MapBuffer in OpenCL. Basically, if it's a discrete GPU the results are cached on the device, and if it's an integrated GPU that shares memory with the host it will use the memory directly. So there is no actual 'zero copy' operation with a discrete GPU in CUDA.
The function cudaHostGetDevicePointer does not take raw malloc'd pointers, just like the limitation in OpenCL. From the API user's point of view these are exactly identical approaches, allowing the implementation to do pretty much identical optimizations.
With a discrete GPU, the pointer you get points to an area the GPU can transfer data into directly via DMA. Otherwise the driver would take your pointer, copy the data to the DMA area, and then initiate the transfer.
However, in OpenCL 2.0 that is explicitly possible, depending on the capabilities of your devices. With the finest-granularity sharing you can use arbitrary malloc'd host pointers and even use atomics with the host, so you could even dynamically control the kernel from the host while it is running.
http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
See page 162 for the shared virtual memory spec. Do note that when you write kernels even these are still just __global pointers from the kernel point of view.
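For comparison, a minimal OpenCL host-side sketch of the CL_MEM_ALLOC_HOST_PTR / clEnqueueMapBuffer pattern mentioned above (it assumes ctx, queue, and kernel were created earlier; error checking omitted):

cl_int err;
size_t bytes = 1024 * sizeof(int);

// Ask the implementation to allocate host-accessible backing memory for the buffer.
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);

// Map it to get a host pointer, fill it directly, then unmap.
int *host_view = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, bytes, 0, NULL, NULL, &err);
for (size_t i = 0; i < 1024; ++i)
    host_view[i] = (int)i;
clEnqueueUnmapMemObject(queue, buf, host_view, 0, NULL, NULL);

// From the kernel's point of view the buffer is still just a __global pointer argument.
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

On an integrated device the implementation may use that memory directly; on a discrete device it decides when and whether to copy, much as described above for CUDA.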

What's the difference between CUDA shared and global memory?

I’m getting confused about how to use shared and global memory in CUDA, especially with respect to the following:
- When we use cudaMalloc(), do we get a pointer to shared or global memory?
- Does global memory reside on the host or device?
- Is there a size limit to either one?
- Which is faster to access?
- Is storing a variable in shared memory the same as passing its address via the kernel? I.e., instead of having
__global__ void kernel() {
    __shared__ int i;
    foo(i);
}
why not equivalently do
__global__ void kernel(int *i_ptr) {
    foo(*i_ptr);
}

int main() {
    int *i_ptr;
    cudaMalloc(&i_ptr, sizeof(int));
    kernel<<<blocks, threads>>>(i_ptr);
}
There've been many questions about specific speed issues in global vs shared memory, but none encompassing an overview of when to use either one in practice.
Many thanks
When we use cudaMalloc()
In order to store data on the GPU that can be communicated back to the host, we need allocated memory that lives until it is freed. Think of global memory as heap space that lives until the application closes or the memory is freed; it is visible to any thread and block that has a pointer to that memory region. Shared memory can be considered as stack space that lives until a block of a kernel finishes; its visibility is limited to the threads within the same block. So cudaMalloc is used to allocate space in global memory.
Do we get a pointer to shared or global memory?
You will get a pointer to a memory address residing in the global memory.
Does global memory reside on the host or device?
Global memory resides on the device. However, there are ways to use host memory as "global" memory using mapped memory (see: CUDA Zero Copy memory considerations), though it may be slow due to bus transfer speed limitations.
Is there a size limit to either one?
The size of global memory varies from card to card, anywhere up to 32 GB (V100), while shared memory depends on the compute capability. Anything below compute capability 2.x has a maximum of 16 KB of shared memory per multiprocessor (where the number of multiprocessors varies from card to card), and cards with compute capability 2.x and greater have a minimum of 48 KB of shared memory per multiprocessor.
See https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications
If you are using mapped memory, the only limitation is how much memory the host machine has.
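A minimal sketch for querying these limits on the card you actually have:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("Global memory:           %zu bytes\n", prop.totalGlobalMem);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Compute capability:      %d.%d\n", prop.major, prop.minor);
    return 0;
}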
Which is faster to access?
In terms of raw numbers, shared memory is much faster (shared memory ~1.7 TB/s, while global memory ~XXX GB/s). However, in order to do anything you need to fill the shared memory with something, and you usually pull that from global memory. If the accesses to global memory are coalesced (non-random) and use a large word size, you can achieve speeds close to the theoretical limit of hundreds of GB/s, depending on the card and its memory interface.
Shared memory is used when you need to reuse, within a block of threads, data already pulled or computed from global memory. So instead of pulling from global memory again, you put it in shared memory for other threads within the same block to see and reuse (a sketch follows below).
It is also commonly used as a scratch pad to reduce the register pressure that affects how many work groups can run at the same time.
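A minimal sketch of that reuse pattern: a 3-point 1D stencil where each block stages a tile of global memory into shared memory once, then every thread reads its neighbours from the tile instead of from global memory (the tile size is illustrative; launch with blockDim.x == TILE):

#define TILE 256

__global__ void stencil3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];               // tile plus a one-element halo on each side

    int g = blockIdx.x * blockDim.x + threadIdx.x; // global index
    int l = threadIdx.x + 1;                       // local index inside the tile

    if (g < n) tile[l] = in[g];                    // each element is read from global memory once
    if (threadIdx.x == 0 && g > 0)                  tile[0] = in[g - 1];
    if (threadIdx.x == blockDim.x - 1 && g < n - 1) tile[l + 1] = in[g + 1];
    __syncthreads();                               // the whole tile is now visible to the block

    if (g > 0 && g < n - 1)
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;  // neighbours come from shared memory
}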
Is storing a variable in shared memory the same as passing its address via the kernel?
No. If you pass the address of anything, it is always an address in global memory. From the host you can't set shared memory directly; you either pass a value as a constant kernel argument, which the kernel then writes into shared memory, or you pass an address in global memory from which the kernel pulls the data when needed.
The contents of global memory are visible to all the threads of grid. Any thread can read and write to any location of the global memory.
Shared memory is separate for each block of the grid. Any thread of a block can read and write to the shared memory of that block. A thread in one block cannot access shared memory of another block.
cudaMalloc always allocates global memory.
Global memory resides on the device.
Obviously, every memory has a size limit. The global memory is the total amount of DRAM of the GPU you are using; e.g. I use a GTX460M which has 1536 MB DRAM, therefore 1536 MB of global memory. Shared memory is specified by the device architecture and is measured on a per-block basis. Devices of compute capability 1.0 to 1.3 have 16 KB/block; compute 2.0 onwards have 48 KB/block of shared memory by default.
Shared memory is orders of magnitude faster to access than global memory. It's like a local cache shared among the threads of a block.
No. Only global memory addresses can be passed to a kernel launched from host. In your first example, the variable is read from the shared memory, while in the second one, it is read from the global memory.
Update:
Devices of compute capability 7.0 (Volta architecture) allow allocating up to 96 KB of shared memory per block, provided the following conditions are satisfied.
Shared memory is allocated dynamically
Before launching the kernel, the maximum size of dynamic shared memory is specified using the function cudaFuncSetAttribute as follows.
__global__ void MyKernel(...)
{
    extern __shared__ float shMem[];
}

int bytes = 98304; // 96 KB
cudaFuncSetAttribute(MyKernel, cudaFuncAttributeMaxDynamicSharedMemorySize, bytes);
MyKernel<<<gridSize, blockSize, bytes>>>(...);
CUDA shared memory is memory shared between the threads within a block, i.e. across blocks in a grid the contents of shared memory are undefined. It can be thought of as a manually managed L2 cache.
Usually global memory resides on the device, but recent versions of CUDA (if the device supports it) can map host memory into the device address space, triggering an in-situ DMA transfer from host to device memory on such occasions.
There's a size limit on shared memory, depending on the device. It's reported in the device capabilities, retrieved when enumerating CUDA devices. Global memory is limited by the total memory available to the GPU. For example, a GTX680 offers 48 KiB of shared memory and 2 GiB of device memory.
Shared memory is faster to access than global memory, but access patterns must be aligned carefully (for both shared and global memory) to be efficient. If you can't make your access patterns properly aligned, use textures (also global memory, but accessed through different circuitry and caches that can deal better with unaligned access).
Is storing a variable in shared memory the same as passing its address via the kernel?
No, definitely not. The code you proposed would be a case where you'd use in-situ transferred global memory. Shared memory cannot be passed between kernels, as the contents of a shared block are defined only within an executing block of threads.

CUDA Global Memory, Where is it?

I understand that in CUDA's memory hierarchy we have things like shared memory, texture memory, constant memory, registers, and of course the global memory which we allocate using cudaMalloc().
I've been searching through whatever documentation I can find, but I have yet to come across any that explicitly explains what global memory is.
I believe that the global memory allocated is on the GDDR of the graphics card itself and not the RAM shared with the CPU, since one of the documents did state that the pointer cannot be dereferenced by the host side. Am I right?
Global memory is a virtual address space that can be mapped to device memory (memory on the graphics card) or to page-locked (pinned) host memory. The latter requires CC > 1.0.
Local, constant, and texture memory are allocated in global memory but accessed through different address spaces and caches.
On CC 2.0 and above, the generic address space allows shared memory to be accessed through generic (unqualified) pointers alongside global addresses; however, shared memory always resides in per-SM on-chip memory.
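A minimal sketch of what that generic addressing permits (a device function taking an unqualified pointer that can be handed either a shared or a global address; assumes CC 2.0 or higher):

__device__ float sum2(const float *p)   // generic pointer: may point to shared or global memory
{
    return p[0] + p[1];
}

__global__ void demo(const float *g_in, float *g_out)
{
    __shared__ float s[2];
    if (threadIdx.x == 0) {
        s[0] = g_in[0];
        s[1] = g_in[1];
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        g_out[0] = sum2(s);      // pointer into shared memory
        g_out[1] = sum2(g_in);   // pointer into global memory
    }
}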
Global memory is off-chip but on the graphics card.
Local memory is stored in global memory, but addresses are interleaved in such a way that when arrays are stored there, accesses are coalesced when each thread in the warp reads from the same index in its array.
Constant and texture memory is also (initially) stored in global memory, but it is cached in on-chip caches.
Shared memory and the L1 and L2 caches are on-chip.
This is discussed in Section 3.2.2 of the CUDA C Programming Guide. In short, all types of memory, i.e. shared, constant, texture and global, reside in the memory of the device, i.e. the GPU, itself.
You can, however, specifically declare parts of memory to be "Mapped", i.e. memory on the host to be accessible from the device. For this, see Section 3.2.4 of the Programming Guide.

Inside dynamic memory management

I am a student and want to know more about dynamic memory management. In C++, calling operator new() allocates a memory block on the heap (free store). In fact, I don't have a full picture of how this is achieved.
There are a few questions:
1) What is the mechanism by which the OS allocates a memory block? As far as I know, there are some basic memory allocation schemes like first-fit, best-fit, and worst-fit. Does the OS use one of them to allocate memory dynamically for the heap?
2) Do different platforms like Android, iOS, Windows and so on use different memory allocation algorithms to allocate a memory block?
3) In C++, when I call operator new() or malloc(), does the memory allocator allocate a memory block at an arbitrary position in the heap?
Hope anyone can help me.
Thanks
malloc is not a system call; it is a library (libc) routine which goes through some of its internal structures to give you the address of a free piece of memory of the required size. It only makes a system call if the process's data segment (i.e. the virtual memory it can use) is not "big enough" according to the logic of the malloc in question. (On Linux, the system call to enlarge the data segment is brk.)
Simply said, malloc provides fine-grained memory management, while the OS manages the coarser, big chunks of memory made available to the process.
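To illustrate that division of labour, a toy sketch of a malloc-like allocator that requests a large chunk from the OS (via sbrk here, purely for illustration) and carves out small pieces itself; real allocators are far more sophisticated and also handle freeing:

#include <stddef.h>
#include <unistd.h>   // sbrk (POSIX); illustration only

#define CHUNK (1 << 20)           // ask the OS for 1 MiB at a time

static char  *pool_start = NULL;  // current chunk obtained from the OS
static size_t pool_used  = 0;

// Toy allocator: never frees, 16-byte alignment, not thread-safe.
void *toy_malloc(size_t n)
{
    n = (n + 15) & ~(size_t)15;   // round up to 16-byte alignment
    if (n > CHUNK)
        return NULL;              // oversized requests not handled in this sketch
    if (pool_start == NULL || pool_used + n > CHUNK) {
        pool_start = (char *)sbrk(CHUNK);   // coarse request to the OS
        if (pool_start == (char *)-1)
            return NULL;
        pool_used = 0;
    }
    void *p = pool_start + pool_used;       // fine-grained carving in user space
    pool_used += n;
    return p;
}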
Not only different platforms but also different libraries use different malloc implementations; some programs (e.g. Python) use their own internal allocator instead, since they know their own usage patterns and can increase performance that way.
There is a lengthy article about malloc on Wikipedia.
