I use OpenCL for image processing. For example, I have a 1000*800 image.
I use a 2D global work size of 1000*800, and a local work size of 10*8.
In that case, will the GPU automatically provide 100*100 computing units?
And do these 10,000 units work at the same time, so that it is parallel?
If the hardware doesn't have 10,000 units, will one unit do the same work more than once?
I tested the local size and found that a very small size (1*1) and a big size (100*80) are both very slow, but a middle value (10*8) is faster. So, last question: why?
Thanks!
Work group sizes can be a tricky concept to grasp.
If you are just getting started and you don't need to share information between work items, ignore local work size and leave it NULL. The runtime will pick one itself.
Hardcoding a local work size of 10*8 is wasteful and won't utilize the hardware well. Some hardware, for example, prefers work group sizes that are multiples of 32.
OpenCL doesn't specify what order the work will be done in, just that it will be done. It might do one work group at a time, or it may do them in groups, or (for small global sizes) all of them together. You don't know and you can't control it.
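To make the decomposition concrete: inside the kernel, each work item can ask where it sits in the NDRange. A minimal OpenCL C sketch (the kernel name, arguments, and trivial copy body are just placeholders, not your code):

    // Minimal sketch: how a work item locates itself in a 1000*800 NDRange
    // enqueued with a 10*8 local size.
    __kernel void process(__global const uchar *src, __global uchar *dst,
                          int width, int height)
    {
        int x = get_global_id(0);   // 0 .. 999  (column)
        int y = get_global_id(1);   // 0 .. 799  (row)

        // With a 10*8 local size, get_num_groups(0) == 100 and
        // get_num_groups(1) == 100, i.e. 10,000 work groups in total.
        // get_group_id() says which group this item belongs to and
        // get_local_id() gives its position inside that group.

        if (x < width && y < height)
            dst[y * width + x] = src[y * width + x];   // trivial copy as a stand-in
    }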
To your question "why?": the hardware may run work groups in SIMD (single instruction multiple data) and/or in "Wavefronts" (AMD) or "Warps" (NVIDIA). Too small of a work group size won't leverage the hardware well. Too large and your registers may spill to global memory (slow). "Just right" will run fastest, but it is hard to pick this without benchmarking. So for now, leave it NULL and let the runtime pick for you. Later, when you become an OpenCL expert and understand more about how the hardware works, you can try specifying the work group size. However, be aware that the optimal size may be different for different hardware, and there are other rules (like global size must be a multiple of local size).
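On the host side, "leave it NULL" simply means passing NULL as the local_work_size argument. A minimal sketch, assuming queue and kernel have already been created in the usual way:

    #include <CL/cl.h>

    /* Minimal host-side sketch: passing NULL for local_work_size lets the
     * runtime choose a work-group size for you. */
    static cl_int enqueue_image_kernel(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global[2] = { 1000, 800 };   /* one work item per pixel */

        return clEnqueueNDRangeKernel(queue, kernel,
                                      2,       /* work_dim            */
                                      NULL,    /* global_work_offset  */
                                      global,  /* global_work_size    */
                                      NULL,    /* local_work_size: let the runtime pick */
                                      0, NULL, NULL);
    }

If you later decide to hand-tune, remember the rule mentioned above: prior to OpenCL 2.0, each element of the global size must be a multiple of the corresponding local size element.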
I am a mathematician, not a programmer. I have a grasp of the basics of programming and am a fairly advanced power user on both Linux and Windows.
I know some C and some Python, but not much.
I would like to make an overlay so that when I start a game it can get info from AMD and NVIDIA GPUs, like frame time and FPS. I am quite certain the current system of benchmarks used to compare two GPUs is flawed: small instances and scenes that bump up the FPS momentarily (but are totally irrelevant in terms of user experience) result in a higher average FPS number and mislead the market, either unintentionally or intentionally. For example (I can't remember the name of the game, probably a COD title), there was a highly tessellated entity on the map that wasn't even visible to the player, which led AMD GPUs to seemingly underperform when roaming through that area, producing a lower average FPS count.
I have an idea of how to calculate GPU performance in theory, but I don't know how to harvest the data from the GPU. Could you refer me to API manuals or references to help me make such an overlay possible?
I would like to study as little as possible (by that I mean I would like to learn only what I absolutely have to in order to get the job done; I don't intend to become a coder).
I thank you in advance.
This is generally what the Vulkan layer system is for: it allows you to intercept API commands and inject your own. But it is nontrivial to code yourself. Here are some pre-existing open-source options for you:
To get at the timing info and draw your custom overlay, you can use (and modify) a tool like OCAT. It supports Direct3D 11, Direct3D 12, and Vulkan apps.
To just get the timing (and other interesting info) as a CSV, you can use a command-line tool like PresentMon. It should work with D3D, and I have been using it with Vulkan apps too and it seems to accept them.
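Once you have the per-frame times as a CSV, you can post-process them yourself to address exactly the "misleading average" issue you describe, e.g. by reporting a high-percentile frame time alongside the average. A rough sketch (not part of PresentMon; it assumes you have already extracted the per-frame milliseconds column, called MsBetweenPresents in the versions I have seen, but check yours, and feed one value per line on stdin):

    /* Sketch: compare average FPS with a 99th-percentile frame time,
     * which short bursts of very fast frames cannot inflate. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        double *ms = NULL, v, sum = 0.0;
        size_t n = 0, cap = 0;

        while (scanf("%lf", &v) == 1) {
            if (n == cap) {                        /* grow the array as needed */
                cap = cap ? cap * 2 : 1024;
                double *tmp = realloc(ms, cap * sizeof *ms);
                if (!tmp) { free(ms); return 1; }
                ms = tmp;
            }
            ms[n++] = v;
            sum += v;
        }
        if (n == 0)
            return 1;

        qsort(ms, n, sizeof *ms, cmp_double);      /* sort frame times ascending */

        double avg_fps = 1000.0 * (double)n / sum;
        double p99_ms  = ms[(size_t)(0.99 * (double)(n - 1))];

        printf("average FPS        : %.1f\n", avg_fps);
        printf("99th-percentile ms : %.2f (~%.1f FPS)\n", p99_ms, 1000.0 / p99_ms);

        free(ms);
        return 0;
    }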
I've successfully built the camera demo from the iOS examples and it's running perfectly. The problem is that the binary size is comparatively large (around an 11MB binary footprint per CPU architecture). What I'm trying to do now is shrink the binary size as much as possible.
There is a section named 'Reducing the binary size' in the official documentation: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/ios_examples . The last paragraph says:
After that, you can manually look at modifying the list of kernels included in tensorflow/contrib/makefile/tf_op_files.txt to reduce the number of implementations to the ones you're actually using in your own model.
So I removed a bunch of items from tf_op_files.txt and rebuilt the iOS binary by executing compile_ios_tensorflow.sh, hoping it would reduce the generated binary size. However, the size didn't change at all, not even by a single byte. I've tried several times, and I also tried clearing out the entire content of tf_op_files.txt, but I still got the same result.
I guess I was doing something wrong somewhere. Does anyone know how to do this correctly? Or is there any other way to reduce the binary size besides the ones in the official documentation?
Any information is appreciated. Thanks!
TF already has a tool to do 8-bit quantization. It can greatly decrease the model size, but it may hurt precision.
I'm currently working on an app that loads a blob of tightly packed data containing different integer types (sized from char to int) that might not be properly aligned.
So, can I use simple *(short*)ptr or similar accesses to that data? A test on my iPhone 5 shows no problem with that, but I'm not sure about all cases on all newer processors.
I did find some related information, like this:
ARMv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.
but in the case of words it seems that a word is 32 bits on 32-bit ARM and 64 bits on 64-bit ARM, which would mean a short might require proper alignment on a 64-bit machine.
So, can I assume this is safe, or should I use a keyword like __packed?
Or should I rather avoid it completely and restructure my data so that it always has proper alignment (or always use memmove when the data comes from an external source and cannot be permanently modified)?
It's been ages since I tried it. It worked, but every single access to unaligned memory caused a trap, which took considerable time. I'd suggest you measure how long it takes to add a million aligned shorts vs. a million unaligned shorts. If you only have a few hundred or a few thousand unaligned numbers, there's nothing to worry about.
__packed works reasonably fast. ARM has some clever instructions that handle unaligned access in very few operations. Again, I'd measure how long that takes. My experience with this is not current.
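If you want to do that measurement, something like the following sketch will do. The memcpy() form is the portable, well-defined way in C to read a value from an arbitrary byte offset, and compilers targeting ARMv6+ usually lower it to a plain load; the buffer size and clock()-based timing here are purely illustrative:

    /* Sketch: timing a million aligned vs. unaligned 16-bit reads.
     * memcpy() has no alignment requirement, so it is the portable way
     * to read a short from any byte offset. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    static int16_t read_s16(const unsigned char *p)
    {
        int16_t v;
        memcpy(&v, p, sizeof v);    /* well-defined for any alignment */
        return v;
    }

    int main(void)
    {
        enum { N = 1000000 };
        static unsigned char buf[2 * N + 1];    /* zero-initialized */

        long sum = 0;
        clock_t t0 = clock();
        for (long i = 0; i < N; i++)            /* aligned: even offsets */
            sum += read_s16(buf + 2 * i);
        clock_t t1 = clock();
        for (long i = 0; i < N; i++)            /* unaligned: odd offsets */
            sum += read_s16(buf + 2 * i + 1);
        clock_t t2 = clock();

        printf("aligned:   %ld clock ticks\n", (long)(t1 - t0));
        printf("unaligned: %ld clock ticks\n", (long)(t2 - t1));
        return (int)(sum & 1);   /* keep the compiler from discarding the loops */
    }

If the unaligned loop turns out to be dramatically slower on your target devices, that is the cue to repack the data or fall back on memmove as you suggested.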
Can anyone comment on the decision to use sprites for images or not? I see the following benefits/trade-offs (some of which can be mitigated):
Sprites over individual images
Pros:
Fewer images to manage
Easier to implement themed images
Image swaps (JS/CSS) happen faster (because they do not require additional image loads)
Faster image loads due to fewer HTTP requests
Fewer images to cache (although virtually no difference in overall KB)
Cons:
More background positions to manage
Image payload may be over-inflated (the sprite may contain unused images), which can make the page load slower
Slower image loads because they cannot be downloaded synchronously
I don't think there's one definitive answer to this. Opinions will differ according to need and individual preference.
My guideline is to always evaluate the benefit for the end user vs. the benefit for the developers, i.e. what the real value is of the work you're doing as a developer.
Reducing the number of HTTP requests is always one of the first things to fix when optimizing a web page, but proper use of caching can achieve much the same thing as sprites do. After all, graphics can very often be cached for a really long time.
There might be more benefit in minimizing scripts and stylesheets than in combining graphics into a sprite.
Your code for managing sprites may increase complexity and developer overhead, especially as the number of developers increases.
Learning the proper use of cache headers and configuring your web server or code correctly is often a more robust way of improving performance, in my opinion.
If you've got a decent number of menu entries for which you want roll-over images, I'd recommend going with a sprite system as opposed to multiple images, all of which need to be downloaded separately. My reasons are pretty much in line with what you've mentioned in your post, with a couple of modifications:
The image swaps wouldn't be done with JavaScript; most of the sprites I've seen just use :hover on the link itself within an unordered list.
Depending on the file type/compression, the download of the image file itself will be negligible. Downloading one image as opposed to multiple is generally faster in overall download and load time.
I am looking for some advice on memory usage on mobile devices, BlackBerry in particular. Using some profiling tools we have calculated a working set size in RAM of 525KB. The problem is we don't really know whether this is acceptable or too high.
Can anyone give any insight into their own experience with memory usage on BlackBerry? What sort of number should we be aiming for?
I am also wondering what sort of things we should be looking out for in particular to reduce memory usage.
512KB is perfectly acceptable on the current generation of BlackBerry devices. You can take a look at JBenchmark to see the exact JVM heap you can expect for each model, but none of the current devices out there have less than 20MB of heap, and most have much more than that.
On JBenchmark you can choose the device you are interested in from a drop-down on the right side of the page. Then navigate to the JVM tab for that device.
When it comes to reducing memory usage, I wouldn't worry about the total bytes used by this application if you really are in line with 525K; worry instead about how often allocation/reallocation is required. Try to pool/reuse objects as much as possible, avoiding any unneeded allocation. For instance, use the StringBuffer class to concatenate strings instead of the + operator: the operator creates multiple String objects for each concatenation, whereas a StringBuffer just puts the characters in an array and only expands it when needed. Google is a good way to find more tips.
Finally, relying on profiling tools, which the BlackBerry JDE has, is a very important part of understanding exactly how you can optimize heap memory usage.
If I'm not mistaken, BlackBerry apps are written in Java... which is a managed environment, which means that really the only surefire way to use less memory is to create fewer objects. There's not a whole lot you can do about your working set, I think, since it's managed by the runtime (which is actually probably the point of using Java on devices like this).