Does replacing int with short help the performance in CUDA - memory

assume that we have enough global memory. Does replacing int with short improve the performance in CUDA? (like short saves the usage of shared memory, registers, etc)
Advices are welcomed. Thanks.

Using short in shared memory will most likely reduce performance due to bank-conflicts, until you use short2.
Also, as far as I know, all registers on GPU are 32-bit, so it's unlikely that using short would reduce register usage.

Depends:
If your program is memory bound then Yes transferring the input as shorts could be beneficial.
If your kernel is computation bound is more likely to be No because the kernel have to do an extra operation to convert from short to int and then back to short each time.

Tesla-class hardware (SM 1.x) has surprisingly rich support for "half registers," so you might get some mileage from using short instead of int on those platforms. You can confirm by using cuobjdump to look at the microcode in the cubin. But Fermi removed that support.
With SM 2.1, NVIDIA added support for "video" instructions that implement 32-bit-wide SIMD operations on 32-bit registers - see section 8.7.9 of the PTX 2.1 spec.
http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/ptx_isa_2.1.pdf

Related

Memory Mapping in Clang OpenMP offloading to GPU

I have some questions related to OpenMP offloading in clang.
1.When clang offloads a certain code segment to a NVIDIA GPU how the data will be mapped to the GPU?
2.How it will decide which data to be mapped to the "shared memory" region in NVIDIA GPU?
3.Will the constants in the code segment be mapped to the constant memory in GPU?
I tried to find answers for these question but i couldn't find any reference. Thanks in advance.
Too general question, please clarify it.
clang-ykt tries to use shared memory at first, when the compiler see that the preallocated buffer is completely used, it uses the global memory. Clang trunk currently uses only the global memory.
No

What is the difference between UMat and Mat in OpenCV?

I have been through the documentation and didn't get a clear detailed description about UMat; however I think it has something to relate with GPU and CPU. Please help me out.
Thank you.
Perhaps section 3 of this document will help: [link now broken]
https://software.intel.com/sites/default/files/managed/2f/19/inde_opencv_3.0_arch_guide.pdf
Specifically, section 3.1:
A unified abstraction cv::UMat that enables the same APIs to be implemented using CPU or OpenCL code, without a requirement to call OpenCL accelerated version explicitly. These functions use an OpenCL-enabled GPU if exists in the system, and automatically switch to CPU operation otherwise.
and section 3.3:
Generally, the cv::UMat is the C++ class, which is very similar to cv::Mat. But the actual UMat data can be located in a regular system memory, dedicated video memory, or shared memory.
Link to usage suggested in the comments by #BourbonCreams:
https://docs.opencv.org/3.0-rc1/db/dfa/tutorial_transition_guide.html#tutorial_transition_hints_opencl

OpenCL subbuffers, why is important?

I try to implement a multi-gpu OpenCL code. In my model, GPUs have to communicate and
exchange data.
I found (I don't remember where, it is been some time) that one solution is to deal with
subbuffers. Can anybody explain, as simple as possible, why subbuffers are important
in OpenCL? As far as I can understand, one can do exactly the same using only buffers.
Thanks a lot,
Giorgos
Supplementary Question:
What is the best way to exchange data between GPUs?
I am not sure(or I do not know) how sub-buffers will provide solutions to your problem when dealing with Multiple GPU's. AFAIK sub buffers provide a view into a buffer i.e a single buffer can be divided into chunks of smaller buffers(sub buffers) providing a layer of software abstraction, Sub-buffers are advantageous in same cases where in you need keep an offset first element to be zero.
To address multiGPU or MultiDevice problem OpenCL 1.2 provides API from where you can copy memory objects directly from One GPU to other using clEnqueueMigrateMemObjects OpenCL API call http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clEnqueueMigrateMemObjects.html

What language and compiler to write ATI GPU code for?

I know Nvidia has CUDA, but what does ATI have? I dont want to use OpenCL because I want to keep as low level to the hardware as possible.
Is it brook, or stream?
The documentation available is pretty pathetic! CUDA seems easy to get programming, but I want to use ATI specifically because of their hardware.
OpenCL is AMD's currently preferred GPU/compute language.
Brook is deprecated.
However, you can write code at a very low level, using AMD's
shader and kernel analyzer
http://developer.amd.com/tools/shader/Pages/default.aspx.
http://developer.amd.com/tools/AMDAPPKernelAnalyzer/Pages/default.aspx
E.g. http://developer.amd.com/tools/shader/PublishingImages/GSA.png
shows OpenCL code, and the Radeon 5870 assembly produced.
You can actually code directly in several forms of "assembly".
Or at least you could - the webpages no longer mention this.
(I used to have this installed for tuning and testing, but do not at the moment.)
More usually, you can code in any of several forms of AMD IL, Intermediate Language,
which is closer to the machine than OpenCL. The kernel analyzer web page says
"If your kernel is an IL kernel Stream, KernelAnalyzer will automatically compile the IL..."
I would recommend that you use OpenCL, and then look at the disassembly and tweak the OpenCL code to be better tuned. But you can work in IL, and probably still can work at an even lower level.

BlackBerry memory usage

I am looking for some advice on memory usage on mobile devices, BlackBerry in particular. Using some profiling tools we have calculated a working set size in RAM of 525kb. Problem is we don't really know whether this is acceptable or too high ?
Can anyone give any insight into their own experience with memory usage on BlackBerry? What sort of number should we be aiming for?
I am also wondering what sort of things we should be looking out for in particular to reduce memory usage.
512KB is perfectly acceptable on the current generation of BlackBerrys devices. You can take a look at JBenchmark to see the exact JVM heap you can expect for each model, but none of the current devices out there go below 20MB of heap. Most are much larger than that.
On JBenchmark you can choose the device you are interested from a drop down on the right side of the page. Then, navigate to the JVM Tab for the device.
When it comes to reducing memory usage I wouldn't worry about the total bytes used for this application if you are truly inline with 525K, just about how often allocation/reallocation is required. Try to pool/reuse objects as much as possible, avoiding any unneeded allocation. For instance, use the StringBuffer class to concatenate strings instead of operators as multiple String objects will be created for each concatenation using the operator, where a StringBuffer will just put the characters in an array and only expand when needed. Google is a good way to find more tips.
Finally, relying on profiling tools, which the BlackBerry JDE has, is a very important part of understanding exactly how you can optimize heap memory usage.
If I'm not mistaken, Blackberry apps are written in Java... which is a managed environment, which means really the only surefire way to use less memory is to create fewer objects. There's not a whole lot you can do about your working set, I think, since it's managed by the runtime (which is actually probably the point of using Java on devices like this).

Resources