I have gone through the CUDA programming guide, but it is still not clear to me where a CUDA kernel resides on the GPU. In other words, in which memory segment does it live?
Also, how do I find out the maximum kernel size supported by my device? Does the maximum kernel size depend on the number of kernels loaded on the device simultaneously?
The instructions are stored in global memory that is inaccessible to the user but are prefetched into an instruction cache during execution.
The maximum kernel size is stated in the Programming Guide in section G.1: 2 million instructions.
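You can see this indirectly with the CUDA driver API: loading a module copies the kernel's machine code onto the device before you ever obtain a function handle. A minimal sketch only (the file name kernel.cubin and the kernel name myKernel are placeholders):

    #include <cuda.h>
    #include <cstdio>

    int main() {
        cuInit(0);
        CUdevice dev;  cuDeviceGet(&dev, 0);
        CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

        // cuModuleLoad places the compiled kernel code into GPU (global) memory;
        // the instruction cache is filled from it when the kernel actually runs.
        CUmodule mod;
        if (cuModuleLoad(&mod, "kernel.cubin") != CUDA_SUCCESS) {
            printf("failed to load module\n");
            return 1;
        }

        CUfunction fn;
        cuModuleGetFunction(&fn, mod, "myKernel");  // handle to the kernel inside that module

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }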
What is the usual recommended heap memory setting for a production environment for a single microservice written in Java, compiled as a native image using GraalVM Community Edition? Should I specify both -Xms and -Xmx to keep the minimum and maximum heap sizes the same?
There is no recommendation in the documentation; it says:
Java Heap Size
When executing a native image, suitable Java heap settings will be determined automatically based on the system configuration and the used GC. To override this automatic mechanism and to explicitly set the heap size at run time, the following command-line options can be used:
-Xmx - maximum heap size in bytes
-Xms - minimum heap size in bytes
-Xmn - the size of the young generation in bytes
See here for the full document. JVM memory management is an old topic that has been discussed many times before:
how to choose the jvm heap size?
java - best way to determine -Xmx and -Xms required for any webapp & standalone app
Long story short: for any application there is no single "right" number. Some applications may require 4 GB, some 64 GB; it depends on the load, the data used per request (for a web app), and the OS (Windows/Linux) the app runs on. After monitoring the app for some time you can decide. It is not easy, which is why people have been moving to serverless lately.
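As a hedged illustration only (my-service is a placeholder binary name), pinning the minimum and maximum heap to the same value at run time with the -Xms/-Xmx options quoted above would look like:

    ./my-service -Xms4g -Xmx4g

Whether pinning both to the same value is worthwhile still comes down to the monitoring described above.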
I've written an LKM which writes to a kernel data structure (poolinfo_table). If I insmod this LKM on kernel 2.4, I assume it writes to this data structure, but when I do the same on kernel 3.10 my system restarts, as I expect. What's wrong with kernel 2.4? Is its kernel memory not protected, or am I not actually writing to it? I would expect some kind of kernel crash when I try to write to its memory, so I doubt I actually wrote to kernel 2.4's memory. In fact, I tried the same code on my host system (Fedora 18) with kernel 3.10 and on my guest (Red Hat 9) with kernel 2.4 (I'm using the Xen hypervisor).
In userland, if you write somewhere you are not supposed to (somewhere in your address space that is not writable, for example because it is not mapped or is write-protected), the MMU raises a fault and a bus error or segmentation violation is delivered to your process.
This is less likely but not impossible for threads in kernel space - you can easily cause memory corruption or mess up memory mapped devices without triggering an instant crash. The most likely crash that you'll generate in kernel space is by stomping on someone else's memory pointer and having them inadvertently step through it into unmapped space.
The major difference between userland and kernel really only relates to the scope of the damage you can do. Obviously in the kernel you can mess up a whole lot more.
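As a small userland illustration (not from the original answer): writing through a pointer into a page that is not writable makes the MMU fault and the kernel turns that into a signal for the process, whereas a stray write inside the kernel has no such safety net and may simply corrupt memory, as described above.

    #include <cstdio>

    int main() {
        // String literals are typically placed in a read-only mapping.
        const char *ro = "read-only data";
        char *p = const_cast<char *>(ro);
        p[0] = 'X';   // expected: SIGSEGV delivered to this process
        printf("unexpectedly survived the write\n");
        return 0;
    }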
My system has 32 GB of RAM, but the device information for the Intel OpenCL implementation says "CL_DEVICE_GLOBAL_MEM_SIZE: 2147352576" (~2 GB).
I was under the impression that on a CPU platform the global memory is the "normal" RAM, and thus something like ~30+ GB should be available to the OpenCL CPU implementation. (Of course I'm using the 64-bit version of the SDK.)
Is there some sort of secret setting to tell the Intel OpenCL driver to increase the global memory and use all of the system memory?
SOLVED: Got it working by recompiling everything as 64-bit. As stupid as it seems, I thought that OpenCL worked similarly to OpenGL, where you can easily allocate e.g. 8 GB of texture memory from a 32-bit process and the driver handles the details for you (of course you can't allocate 8 GB in one sweep, but you can e.g. transfer multiple textures that add up to more than 4 GB).
I still think that limiting the OpenCL memory abstraction to the address space of the process (at least for the Intel/AMD drivers) is irritating, but maybe there are some subtle details or performance tradeoffs behind why this implementation was chosen.
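For reference, a minimal sketch of how the reported size can be checked from a (64-bit) host program with the plain OpenCL C API; error handling is omitted and the first CPU device of the first platform is assumed:

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);

        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);

        // Query how much global memory the driver exposes for this device.
        cl_ulong globalMem = 0;
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(globalMem), &globalMem, NULL);
        printf("CL_DEVICE_GLOBAL_MEM_SIZE: %llu bytes\n",
               (unsigned long long)globalMem);
        return 0;
    }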
Is it possible to use peer-to-peer memory transfer on GeForce cards, or is it allowed only on Teslas? Assume the cards are two GTX 690s (each of which has two GPUs on board).
I have tried copying between a Quadro 4000 and a Quadro 600, and it failed. I was transferring 3D arrays using cudaMemcpy3DPeer, filling the cudaMemcpy3DPeerParms struct.
Peer-to-peer memory copy should work on GeForce and Quadro as well as Tesla; see the programming guide for more details:
Memory copies can be performed between the memories of two different devices. When a unified address space is used for both devices (see Unified Virtual Address Space), this is done using the regular memory copy functions mentioned in Device Memory. Otherwise, this is done using cudaMemcpyPeer(), cudaMemcpyPeerAsync(), cudaMemcpy3DPeer(), or cudaMemcpy3DPeerAsync().
Peer-to-peer memory access, where one GPU can directly read from another GPU's memory, requires UVA (which implies a 64-bit OS), Tesla hardware, and compute capability 2.0 or higher:
[…] Tesla Compute Cluster Mode for Windows), on Windows XP, or on Linux, devices of compute capability 2.0 and higher from the Tesla series may address each other's memory (i.e., a kernel executing on one device can dereference a pointer to the memory of the other device).
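A minimal sketch of the distinction (device indices 0 and 1 and the buffer size are placeholders): peer copies via cudaMemcpyPeer work even without peer access enabled, because the driver can stage them through the host, while direct peer access must first be queried and enabled.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        // Can device 0 directly access device 1's memory (and vice versa)?
        int can01 = 0, can10 = 0;
        cudaDeviceCanAccessPeer(&can01, 0, 1);
        cudaDeviceCanAccessPeer(&can10, 1, 0);
        printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

        if (can01) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);  // needed for direct access, not for peer copies
        }

        // A peer copy: works even when peer access is unavailable (staged via the host).
        const size_t bytes = 1 << 20;
        void *src = NULL, *dst = NULL;
        cudaSetDevice(0); cudaMalloc(&src, bytes);
        cudaSetDevice(1); cudaMalloc(&dst, bytes);
        cudaError_t err = cudaMemcpyPeer(dst, 1, src, 0, bytes);
        printf("cudaMemcpyPeer: %s\n", cudaGetErrorString(err));
        return 0;
    }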
How does a kernel's working set (for example, the number of registers it uses) affect the GPU's ability to hide memory latencies?
By spreading the lookup latency across groups of parallel threads (warps). Refer to the CUDA Programming Guide in the CUDA SDK for details.
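The tie-in to register count, which the short answer above leaves implicit, is occupancy: the more registers each thread needs, the fewer warps can be resident on a multiprocessor, so there are fewer other warps to switch to while one warp waits on memory. A rough sketch of how to inspect this (myKernel and the block size of 256 are placeholders):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void myKernel(float *out, const float *in) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i] * 2.0f;
    }

    int main() {
        // Registers used per thread by this kernel.
        cudaFuncAttributes attr;
        cudaFuncGetAttributes(&attr, myKernel);
        printf("registers per thread: %d\n", attr.numRegs);

        // How many blocks of 256 threads can be resident per SM given those resources;
        // more resident warps means more opportunities to hide memory latency.
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
        printf("max resident blocks per SM at 256 threads: %d\n", blocksPerSM);
        return 0;
    }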