I have some questions related to OpenMP offloading in clang.
1. When clang offloads a code segment to an NVIDIA GPU, how is the data mapped to the GPU?
2. How does it decide which data to map to the "shared memory" region of the NVIDIA GPU?
3. Will the constants in the code segment be mapped to constant memory on the GPU?
I tried to find answers to these questions but couldn't find any reference. Thanks in advance.
1. The question is too general; please clarify it.
2. clang-ykt tries to use shared memory first; when the compiler sees that the preallocated buffer is completely used, it falls back to global memory. Clang trunk currently uses only global memory.
3. No.
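To make question 1 a bit more concrete, here is a minimal sketch of how data movement is normally requested in the source: the map clauses say which arrays are copied to the device and which are copied back. The function name scale_on_device and the variable names are made up for illustration, and nothing in this code controls which GPU memory space the buffers end up in. If the compiler is run without OpenMP offloading enabled, the pragma is ignored and the loop simply runs on the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: multiplies every element by a constant inside an
// OpenMP target region. The map clauses describe the host<->device copies.
std::vector<float> scale_on_device(const std::vector<float>& in) {
    const float factor = 2.0f;          // plain scalar constant used in the region
    const std::size_t n = in.size();
    std::vector<float> out(n, 0.0f);
    const float* src = in.data();       // raw pointers so array sections can be mapped
    float* dst = out.data();

    // map(to: ...)   copies src[0..n) to the device before the region runs
    // map(from: ...) copies dst[0..n) back to the host after the region ends
    #pragma omp target teams distribute parallel for map(to: src[0:n]) map(from: dst[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = factor * src[i];
    }
    return out;
}
```

With clang this would typically be built with -fopenmp plus an -fopenmp-targets=... option for the NVIDIA target; the exact target value depends on the toolchain version, so treat that as an assumption to check against your install.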
I am interested in how the OpenCL memory transfer functions operate underneath (migration, reading/writing a buffer, mapping/unmapping). I could not find any open-source OpenCL implementation (Intel's would be fine for me), and the explanations in the documentation don't tell me what actually happens when, for example, I call clEnqueueMigrateMemObjects: which calls are made during the migration, which modules are active, what mechanisms it uses underneath, and whether it uses any caching.
Is there a good source to read about it?
I am now exploring how OpenCL passes data to FPGAs. Xilinx currently uses the native OpenCL implementation present on the machine, plus some extensions.
If you're looking for low-level information (how a particular implementation implements those calls), probably the only source is the implementation.
There are a few open-source OpenCL implementations for GPUs:
Raspberry Pi 3 (beta): https://github.com/doe300/VC4CL
OpenCL on Vulkan (beta): https://github.com/kpet/clvk
Mesa Clover (supports only 1.1): https://cgit.freedesktop.org/mesa/mesa/log/?qt=grep&q=clover
AMD ROCm: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime
Intel NEO (their new OpenCL implementation): https://github.com/intel/compute-runtime
I'm not aware of Xilinx providing sources for their implementation, so if you want to know what exactly happens on Xilinx, your best chance is probably to ask on Xilinx forums or via some official support.
I have been through the documentation and didn't find a clear, detailed description of UMat; however, I think it has something to do with the GPU and CPU. Please help me out.
Thank you.
Perhaps section 3 of this document will help: [link now broken]
https://software.intel.com/sites/default/files/managed/2f/19/inde_opencv_3.0_arch_guide.pdf
Specifically, section 3.1:
A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call OpenCL accelerated version explicitly. These functions use an OpenCL-enabled GPU if exists in the system, and automatically switch to CPU operation otherwise.
and section 3.3:
Generally, the cv::UMat is the C++ class, which is very similar to cv::Mat. But the actual UMat data can be located in a regular system memory, dedicated video memory, or shared memory.
Link to usage suggested in the comments by @BourbonCreams:
https://docs.opencv.org/3.0-rc1/db/dfa/tutorial_transition_guide.html#tutorial_transition_hints_opencl
So I have a GPU memory leak in certain scenarios in my application. However, I am not aware of any detailed memory profiler for the GPU like those for the CPU. Is there anything out there that can achieve this? I am using D3D (since it's WPF, there are d3d9, d3d10, d3d11 components...)
Thanks!
Are you using the debug setting in Dx control panel? This helps you dump the id of the leaking allocation. You can then proceed to set a HKLM registry value and break on the leaking allocation, as is explained here:
http://legalizeadulthood.wordpress.com/2009/06/28/direct3d-programming-tip-5-use-the-debug-runtime/
http://www.gamedev.net/topic/313718-tracking-down-a-directx-leak/
You can also try Nsight, which you can download for free from NVIDIA. For Maximus cards there is also a specific GPU Debugger; otherwise you can use the Graphics Debugger and try to isolate the memory bump there. In the Performance Debugger you can detect both OpenGL and DirectX events, though this is more performance oriented.
Depending on your GPU's vendor (as you have not provided us with that information), here are the possible solutions:
Intel: Use the Intel Media SDK's GPU Utilization Utility. This comes packaged in Intel INDE (Integrated Developer Environment).
AMD: CodeXL provides an on-the-fly debugger and an extensive memory profiling tool, and is now provided as part of their GPUOpen initiative.
NVIDIA: Use the NVIDIA Visual Profiler (NVVP) combined with traces from NVIDIA Nsight. Both utilities are provided with the standard NVIDIA CUDA installer.
Notes:
With NVIDIA, you must also install the provided GPU driver (from the CUDA SDK) to enable any form of GPU-based driver profiling and debugging. Keep this limitation in mind if you use your development rig for other purposes such as gaming, as the bundled driver is often much older than the stock Game Ready drivers.
Thanks and regards,
Brainiarc7.
I'm completely new to OpenCL and GPU programming in general. Right now I am working on a project where I'm trying to see the performance gains from making use of the GPU in a game. With this, however, I have run into a snag: how do I set up my DirectX project to speak to the OpenCL code base?
I've been googling this for about a week and haven't been able to find anything. If someone could point me in the right direction, I would be grateful.
OpenCL does not have anything to do with DirectX; it's simply another library.
For OpenCL you'll need an implementation ('SDK'), as Khronos don't provide those (they only provide the specifications).
Intel, AMD and Nvidia all provide one, but they have different requirements and limitations. See here for some of the existing implementations.
After installing one of these, you'll have the necessary headers and libraries to code against the OpenCL API and link with OpenCL.dll.
There are lots of sample sources in the SDKs or online. You have to write the kernel; the rest is mostly boilerplate code for initialization and kernel compilation.
The specific OpenCL extension that allows sharing of OpenCL buffers as textures and vice versa is cl_khr_d3d10_sharing: http://www.khronos.org/registry/cl/extensions/khr/cl_khr_d3d10_sharing.txt
OpenCL has extensions for sharing memory between DirectX and OpenCL (and also between OpenGL and OpenCL). This allows you to read or write DirectX buffers, including textures, from within OpenCL. Ani's answer mentioned the extension for DirectX 10, but since the question is about DirectX 9, the extension you'll actually be using is cl_khr_dx9_media_sharing.
This extension has just 4 functions:
clGetDeviceIDsFromDX9MediaAdapterKHR
This function allows you to get the OpenCL device IDs of the OpenCL device(s) that can share memory with a given Direct3D 9 device.
clCreateFromDX9MediaSurfaceKHR
This function gets an OpenCL cl_mem memory object for a given Direct3D 9 memory object.
clEnqueueAcquireDX9MediaSurfacesKHR
This function locks the specified shared memory object so that you can read and/or write to it from OpenCL.
clEnqueueReleaseDX9MediaSurfacesKHR
This function unlocks the specified memory object from OpenCL, so that Direct3D can read/write it again.
Once you've used the above functions to share and synchronize access to the memory buffers, everything else on both the Direct3D 9 side and the OpenCL side works as it would otherwise with those particular APIs.
Note that your GPU will need to support the cl_khr_dx9_media_sharing extension in order for this to work. You can check the extensions property of the OpenCL platform and device in order to confirm that this extension is supported.
Some NVIDIA GPUs support a different extension instead, called cl_nv_d3d9_sharing. The basic idea is the same as with the cl_khr_dx9_media_sharing extension, but the exact details are a bit different. The biggest difference is that it has separate functions for getting cl_mem objects for the different types of Direct3D 9 buffers, rather than one function that covers all of them.
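As a rough sketch of the extension check: the extensions property is a space-separated string (it is what clGetDeviceInfo returns for CL_DEVICE_EXTENSIONS), so the test comes down to looking for the name as a whole token. The helper name has_extension is made up for this example, and the code only handles the string; fetching it from the OpenCL runtime is left out so that the snippet builds without an OpenCL SDK.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true when `name` appears as a whole token in a
// space-separated extension list. A plain substring search is avoided because
// it would also match longer extension names that merely start with `name`.
bool has_extension(const std::string& extension_list, const std::string& name) {
    std::istringstream tokens(extension_list);
    std::string token;
    while (tokens >> token) {           // operator>> splits on whitespace
        if (token == name) {
            return true;
        }
    }
    return false;
}
```

In practice you would check for cl_khr_dx9_media_sharing first and, on NVIDIA hardware, fall back to checking for cl_nv_d3d9_sharing before deciding which set of functions to call.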
Assume that we have enough global memory. Does replacing int with short improve performance in CUDA (for example, because short reduces the usage of shared memory, registers, etc.)?
Any advice is welcome. Thanks.
Using short in shared memory will most likely reduce performance due to bank conflicts, unless you use short2.
Also, as far as I know, all registers on the GPU are 32-bit, so it's unlikely that using short would reduce register usage.
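To illustrate the bank-conflict point, here is a simplified host-side model, not real CUDA code. It assumes 4-byte-wide banks and that thread t reads element t of a shared array, then counts how many threads land on the same bank. It ignores the broadcast and multicast behaviour of real hardware, so it only approximates the older architectures this answer is about; newer GPUs generally do not serialize accesses that fall within the same 32-bit word. The function name max_threads_per_bank is made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Simplified model: thread t accesses byte offset t * element_bytes, and
// consecutive 4-byte words map to consecutive banks. Returns the largest
// number of threads that hit a single bank (1 means conflict-free here).
std::size_t max_threads_per_bank(std::size_t element_bytes,
                                 std::size_t threads,
                                 std::size_t banks) {
    if (banks == 0) {
        return 0;                       // nothing to model without banks
    }
    const std::size_t bank_width_bytes = 4;
    std::vector<std::size_t> hits(banks, 0);
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t byte_offset = t * element_bytes;
        const std::size_t bank = (byte_offset / bank_width_bytes) % banks;
        ++hits[bank];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under this model, 16 threads over 16 banks give 2 threads per bank for 2-byte elements (short) and 1 thread per bank for 4-byte elements (int or short2), which is the reason short2 is suggested above.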
It depends:
If your program is memory bound, then yes, transferring the input as shorts could be beneficial.
If your kernel is compute bound, the answer is more likely no, because the kernel has to do an extra operation to convert from short to int and then back to short each time.
Tesla-class hardware (SM 1.x) has surprisingly rich support for "half registers," so you might get some mileage from using short instead of int on those platforms. You can confirm by using cuobjdump to look at the microcode in the cubin. But Fermi removed that support.
With SM 2.1, NVIDIA added support for "video" instructions that implement 32-bit-wide SIMD operations on 32-bit registers - see section 8.7.9 of the PTX 2.1 spec.
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/ptx_isa_2.1.pdf