How does a kernel's working set, for example the number of registers it uses, affect the GPU's ability to hide memory latencies?
The GPU hides memory latency by spreading it across a group of parallel threads (a warp) and switching among resident warps: while one warp waits on memory, another can execute. The more registers each thread of a kernel needs, the fewer warps can be resident at once, so there is less opportunity to hide latency. Refer to the CUDA Programming Guide in the CUDA SDK for details.
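As a rough illustration, the occupancy API reports how many blocks of a given kernel can be resident per SM, which is what register pressure ultimately limits. The kernel and block size below are just placeholders:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; its actual register usage can be seen at compile time
// with "nvcc -Xptxas -v".
__global__ void sampleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main()
{
    const int blockSize = 256;  // assumed launch configuration

    // How many blocks of this kernel fit on one SM, given its register and
    // shared-memory usage. Fewer resident blocks means fewer resident warps,
    // and therefore less ability to hide memory latency.
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, sampleKernel,
                                                  blockSize, 0);
    printf("max resident blocks per SM: %d\n", maxBlocksPerSM);
    return 0;
}
```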
I am running TensorFlow on an NVIDIA Jetson TX1 and run into memory shortages when I train a large network such as GoogLeNet.
The CPU and GPU in the TX1 do not have separate memories; they share one physical memory. However, it seems that TensorFlow tries to allocate a separate memory space and copy from the CPU side to the GPU side, so it requests twice the memory it really needs.
In my opinion, this situation could be handled by something like DMA access between the CPU and GPU. As far as I know, TensorFlow uses DMA between GPUs (I am not sure which component handles this: TensorFlow or the GPU driver). Can I also use DMA between the CPU and GPU in TensorFlow? Or do you have any other suggestions?
EDIT: I just found that there is a Zero Copy feature in CUDA, which is exactly what I wanted. However, can I use this feature in TensorFlow?
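For reference, here is roughly what the zero-copy feature looks like at the plain CUDA level (a minimal sketch with a made-up kernel and size, independent of TensorFlow; it assumes the device reports canMapHostMemory, which integrated GPUs such as the TX1 do):

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *hostPtr = nullptr;
    float *devPtr  = nullptr;

    // Enable mapping of pinned host memory into the device address space.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Allocate pinned host memory that the GPU can access directly.
    cudaHostAlloc((void **)&hostPtr, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hostPtr[i] = 1.0f;

    // Device-side alias of the same physical memory: no cudaMemcpy needed.
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    scale<<<(n + 255) / 256, 256>>>(devPtr, n);
    cudaDeviceSynchronize();

    printf("hostPtr[0] = %f\n", hostPtr[0]);  // 2.0, written by the GPU
    cudaFreeHost(hostPtr);
    return 0;
}
```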
In the CPU world one can do this via a memory map. Can similar things be done for the GPU?
If two processes could share the same CUDA context, I think it would be trivial: just pass the GPU memory pointer around. Is it possible to share the same CUDA context between two processes?
Another possibility I can think of is to map device memory into memory-mapped host memory. Since it is memory mapped, it could be shared between two processes. Does this make sense, is it possible, and is there any overhead?
CUDA MPS effectively allows CUDA activity emanating from 2 or more processes to behave as if they share the same context on the GPU. (For clarity: CUDA MPS does not cause two or more processes to share the same context. However, the work scheduling behavior appears similar to what you would observe if the work were emanating from the same process, and therefore the same context.) However, this by itself won't provide what you are asking for:
can two processes share the same GPU memory?
One method to achieve this is via CUDA IPC (interprocess communication) API.
This will allow you to share an allocated device memory region (i.e. a memory region allocated via cudaMalloc) between multiple processes. This answer contains additional resources to learn about CUDA IPC.
However, according to my testing, this does not enable sharing of host pinned memory regions (e.g. a region allocated via cudaHostAlloc) between multiple processes. The memory region itself can be shared using ordinary IPC mechanisms available for your particular OS, but it cannot be made to appear as "pinned" memory in two or more processes.
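For illustration, here is a minimal sketch of the CUDA IPC flow on Linux; the fork/pipe scaffolding is just one way to get the handle into another process, and error checking is omitted:

```
#include <cstdio>
#include <unistd.h>
#include <cuda_runtime.h>

int main()
{
    int fd[2];
    pipe(fd);  // used only to pass the 64-byte IPC handle to the child

    if (fork() == 0) {
        // Child process: import the handle and map the parent's allocation.
        cudaIpcMemHandle_t handle;
        read(fd[0], &handle, sizeof(handle));

        float *devPtr = nullptr;
        cudaIpcOpenMemHandle((void **)&devPtr, handle,
                             cudaIpcMemLazyEnablePeerAccess);
        // ... launch kernels on devPtr here; it refers to the parent's allocation ...
        cudaIpcCloseMemHandle(devPtr);
        return 0;
    }

    // Parent process: allocate device memory and export a handle for it.
    float *devPtr = nullptr;
    cudaMalloc((void **)&devPtr, 1 << 20);

    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, devPtr);
    write(fd[1], &handle, sizeof(handle));

    sleep(1);  // crude: give the child time to finish before freeing
    cudaFree(devPtr);
    return 0;
}
```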
I'm a bit confused about the difference between shared memory and distributed memory. Can you clarify?
Is shared memory for one processor and distributed memory for many (over a network)?
Why do we need distributed memory, if we have shared memory?
Short answer
Shared memory and distributed memory are low-level programming abstractions that are used with certain types of parallel programming. Shared memory allows multiple processing elements to share the same location in memory (that is, to see each other's reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.
Detailed answer
There are two issues to consider regarding the terms shared memory and distributed memory. One is what they mean as programming abstractions, and the other is what they mean in terms of how the hardware is actually implemented.
In the past there were true shared-memory, cache-coherent multiprocessor systems. The processors communicated with each other and with shared main memory over a shared bus. This meant that any access from any processor to main memory had equal latency. Today these types of systems are no longer manufactured. Instead there are various point-to-point links between processing elements and memory elements (this is the reason for non-uniform memory access, or NUMA). However, the idea of communicating directly through memory remains a useful programming abstraction. So in many systems this is handled by the hardware and the programmer does not need to insert any special directives. Some common programming techniques that use these abstractions are OpenMP and Pthreads.
Distributed memory has traditionally been associated with processors performing computation on local memory and then, once that computation is done, using explicit messages to transfer data to remote processors. This adds complexity for the programmer, but simplifies the hardware implementation because the system no longer has to maintain the illusion that all memory is actually shared. This type of programming has traditionally been used with supercomputers that have hundreds or thousands of processing elements. A commonly used technique is MPI.
However, supercomputers are not the only systems with distributed memory. Another example is GPGPU programming, which is available for many desktop and laptop systems sold today. Both CUDA and OpenCL require the programmer to explicitly manage sharing between the CPU and the GPU (or other accelerator in the case of OpenCL). This is largely because, when GPU programming started, GPU and CPU memory were separated by the PCI bus, which has a very long latency compared to computing on locally attached memory. So the programming models were developed assuming that the memory was separate (or distributed), and moving data between the two processing elements (CPU and GPU) required explicit communication. Now that many systems have GPU and CPU elements on the same die, there are proposals to allow GPGPU programming to have an interface that is more like shared memory.
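As an example of this explicit style, a bare-bones CUDA program (the kernel here is a made-up placeholder) stages data into device memory, computes, and copies the result back:

```
#include <cuda_runtime.h>
#include <vector>

__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    std::vector<float> host(n, 0.0f);
    float *dev = nullptr;

    cudaMalloc((void **)&dev, n * sizeof(float));
    // Explicit transfer: CPU memory -> GPU memory over the PCIe bus (or NVLink).
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(dev, n);

    // Explicit transfer back: GPU memory -> CPU memory.
    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return 0;
}
```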
In modern x86 terms, for example, all the CPUs in one physical computer share memory, e.g. a 4-socket system with four 18-core CPUs. Each CPU has its own memory controllers, but they talk to each other, so all the CPUs are part of one coherency domain. The system is NUMA shared memory, not distributed memory.
A room full of these machines forms a distributed-memory cluster that communicates by sending messages over a network.
Practical considerations are one major reason for distributed memory: it is impractical to have thousands or millions of CPU cores sharing the same memory with any kind of coherency semantics that would make it worth calling shared memory.
Does anybody know of a simulator that I can use to measure statistics of memory access latencies for multicore processors?
Are there such statistics (for any kind of multicore) already published somewhere?
You might try CodeAnalyst from AMD, which monitors the performance registers during program execution on AMD processors. It covers multi-core, too, where applicable.
I don't know the name of Intel's equivalent product.
I have gone through the CUDA Programming Guide, but it is still not clear to me where a CUDA kernel resides on the GPU. In other words, in which memory segment does it reside?
Also, how do I know the maximum kernel size supported by my device? Does the maximum kernel size depend on the number of kernels loaded on the device simultaneously?
The instructions are stored in global memory that is inaccessible to the user, but they are prefetched into an instruction cache during execution.
The maximum kernel size is stated in the Programming Guide in section G.1: 2 million instructions.
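As far as I know, the runtime does not report an instruction count per kernel, but the sketch below (with a placeholder kernel) shows the per-kernel resources it does expose via cudaFuncGetAttributes:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() {}

int main()
{
    // Per-kernel resource usage reported by the runtime for a compiled kernel.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread : %d\n",  attr.numRegs);
    printf("constant memory      : %zu\n", attr.constSizeBytes);
    printf("local memory/thread  : %zu\n", attr.localSizeBytes);
    printf("static shared memory : %zu\n", attr.sharedSizeBytes);
    return 0;
}
```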