Should I programly put computation-heavy tasks on a separate thread on IOS to utilize multi-core? - ios

I am making a real-time image processing app on IOS with my team. I am handling the custom computation kernel (mostly on CPU rather than GPU) and my teammates deals with the GUI. When I tested my kernel on a toy app, the core (ignoring any IO overhead ) runs steadily at 100ms per image. However, when put into the full-functioning one, it is slowed down to 500ms per image.
I have checked that the data is pretty much the same and I am only measuring time consumed within the kernel, on the same iphone6. There are hardly any other computation in the full-functioning app so I am not sure what is pulling behind. Though GPU-processing is definitely an alternative and I am working on it, I would like to know if there is any tricks to use for now.
Currently, there is no explicit multi-threading in the computation part, so my simple guess is: should I programly put the computation part on a separate thread so the second core can be utilized?
[Update]
It turns out that I made some mistakes in packing my code as library, as the copying over the source code works out nicely. I have not figured out my problem yet and am going to post it on a separate question.

GPU Acceleration
This massively depends on the tasks you're performing, the GPU is good a specific subset of tasks and simply utilising it can sometimes even slow things down. Check this out
A lot of image based tasks that are part of the Quartz framework e.t.c are GPU accelerated (like blurring). Also if you use a library like OpenCV you get GPU acceleration on certain tasks out the box.
Unless you're a real pro I would avoid using the GPU specifically and let the frameworks and libraries you use do that for you.
Concurrency
It will certainly help to put intensive tasks on a background thread. Just be aware of what it entails (i.e. you can't make any UIKit calls from a background thread.

The answer heavily depends on how you do the processing. Some methods in the SDK perform their job in a background thread, while others require the caller to create and use one.
In general, in the case of drawing, most methods require you to create one explicitly. This is important especially for the ones that perform their work on the CPU (e.g. using CoreGraphics to draw within a drawRect method). If you're using methods that use GPU for the processing, then creating threads won't be much of use since CPU won't be the cause of the bottleneck.
If you want to determine why your app slows down, use Instruments. (Time Profiler for CPU and Core Animation for drawing)

Related

Should CUDA stream be waited to be complete even if the output data are to be sent to OpenGL instead of CPU?

This is a general question, and although I use OpenCV as a framework, the question is broader than OpenCV's realm.
I am developing an image processing tool that will effectively get image from a webcam (yielding a host-memory located cv::Mat), upload it to a GPU device memory in CUDA (i.e. cv::GpuMat), do some processing using CUDA and get a result finalCudaMat, and finally send the result to OpenGL (i.e. cv::ogl::Buffer::mapDevice + finalCudaMat.copyTo(mappedOglBuffer)). Everything works as intended.
Since the whole process involves multiple steps, I use a CUDA stream object (cv::cuda::Stream) to be able to make CUDA calls asynchronous and not wait on every single operation to be finished on CPU side. Now if someone instead is to eventually copy the result to a CPU matrix (i.e. finalCudaMat.download(finalCpuMat)), as in a customary situation, typically a wait on the stream is required (cudaStream.waitForCompletion()) to ensure the result is ready before using the CPU side matrix.
In my case, the the result never gets back to the CPU as it continues to be rendered on the screen (a bit of OpenGL operations and shaders are also involved).
One way, it might be appropriate to wait for CUDA work to finish before starting to copy the GpuMat to OpenGL Buffer. So if I add the wait on stream, everything is working fine and the CUDA operations take ~2.5ms.
Another way, it feels like I don't need to wait for completion of the stream (all the results are consumed by the GPU anyway -- CPU is never invovled again). Therefore I can remove the cudaStream.waitForCompletion() call before performing finalCudaMat.copyTo(mappedOglBuffer), and everything seems to be working fine. The whole CUDA processing operation (basically any GPU task minus OpenGL related) apparently takes ~1.8ms for me.
In the past I have had bad experience of not properly synchronization GPU work if two different APIs were involved (e.g. do something on Direct3D 9, do not wait for it to finish, and then copy the resulting texture to a Direct3D 10 texture, and clearly on some frames the image becomes empty or torn).
At this point, the difference is tiny and doesn't affect my 60 FPS throughput. But I wonder if I am technically doing a correct work by removing the wait-on-stream operation. Any thoughts on this? Or maybe a document regarding OpenGL/CUDA interop that could help me?
The rules are defined in this document: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#graphics-interoperability
In particular it says that
Accessing a resource through OpenGL, Direct3D, or another CUDA context while it is mapped produces undefined results.
That's a very strong hint that the needed synchronization is performed by cudaGraphicsUnmapResources, which is confirmed by its documentation:
This function provides the synchronization guarantee that any CUDA work issued in stream before cudaGraphicsUnmapResources() will complete before any subsequently issued graphics work begins.
So you won't need to make the CPU wait on CUDA completion, but you must call cudaGraphicsUnmapResources which will put the appropriate barrier in the asynchronous instruction stream. Note that unlike your CPU transfer code, this call goes after CUDA copies data into the OpenGL buffer.
As Ben Voigt already pointed out, CUDA requires explicit synchronization with OpenGL (or any other graphics API that interoperates with it). Now this used the be kind of a chore, where one had to submit callbacks to the compute stream and use them to manually work with e.g. OpenGL fences.
However due to the advent of Vulkan and with it the support for external resources (and OpenGL extensions for that) you can in fact synchonize between CUDA and OpenGL command streams, by having both sides import platform native semaphores (cudaImportExternalSemaphore, GL_EXT_semaphore) and use them for mutual synchronization. It usually still involves a whole round trip through the CPU side driver, but since that part has to manage the command streams anyway it's not really an issue of efficiency.

Understanding hardware / OS level thread CPU throttling in iOS

I am trying to understand under what circumstances and how iOS may throttle my application threads due to excessive CPU consumption. The results I'm getting are kind of strange.
I have an application with OpenGL / GLKViewController rendering a view and a separate logic thread, started in the background using NSThread.detachNewThreadSelector, performing calculations. I find that if I (for purposes of discussion) let my computation thread run flat out as fast as it can, iOS quickly throttles it down. e.g. I monitor the FPS of both the view and my thread and I see that the view maintains e.g. 60fps and my logic thread is humming along but then suddenly drops after a few seconds.
So that makes sense to me that perhaps iOS tries to limit thread consumption. What is weird is that it doesn't just slow down gradually but it seems to "quantize" my logic thread's FPS at approximately some multiple of the GPU frame rate (i.e. 30 or 60fps)!
Now, keep in mind that there is no synchronization between these threads and the logic loop is self contained hard loop equivalent to while(true) so I have no idea how it's even possible for iOS to accomplish this magic unless it is somehow aware of my top level loop and interjecting itself into it.
In case you don't believe me that there is no synchronization point I will tell you that I have created a test case that literally just has an empty GLKViewController loop and an dumb logic thread that churns some numbers and it exhibits the behavior. Screenshots are below and I can post the code if anyone is interested.
The screenshots below are for two different "loads" of the logic thread, printed at intervals of a second, running on an iPad Air with iOS 8.
What's even stranger is - sometimes setting a lower preferred GLK frame rate (e.g. 30fps) can actually make my logic thread run slower. I'd have expected that reducing the work done by the GPU would free up (resources / heat dissipation) and reduce the need for throttling, but it doesn't always seem to be the case.
Does anyone have an explanation for this behavior? Is it documented? Thanks.
EDIT: My only guess at this point is that if the GPU runs to hot they shut down the second core and migrate threads back to the first... and then somehow thread prioritization accounts for the implicit synchronization, although I still can't envision exactly how this happens.

Cost of switching between many EAGLContexts?

I’m working on some code that has a grid view (~20 child views on screen at once). Each child view draws its content in GL, and has its own drawing thread and EAGLContext.
The advantage of this is that each view is relatively insulated from other GL usage in the app, though with 20 such views on screen, we have to glFlush+setCurrentContext: 20 times per frame. My gut tells me this is not the most efficient use of GL.
My questions:
What's the cost of switching contexts?
Does having to glFlush for each context actually slow it down, or does glFlush only stall the current context?
• Does having to glFlush for each context actually slow it down, or does glFlush only stall the current context?
Contexts have their own individual command streams.
All of this stuff eventually has to be serialized for drawing on a single GPU, so flushing the command stream for 20 concurrent contexts is going to put some pressure on whatever part of the driver does that.
Luckily, GL does not guarantee any sort of synchronization between different contexts so GL itself is not going to spend a whole lot of effort making sure commands from different contexts are executed in a particular order relative to one another. However, if you were waiting on a fence sync. object associated with another context in one of the command streams then it would introduce some interesting GL-related overhead.
• What's the cost of switching contexts?
Why are you switching contexts?
You said that each view has its own thread and context, so I am having trouble understanding why you would ever change the context current to a thread.
The cost of switching contexts is very hardware dependent. Newer hardware generations tend to have more efficient context switching support. But it's generally a pretty heavy weight operation in any case.
The cost of a glFlush is neither very small nor very large. Not something you want to do more often than needed, but not very harmful when used in moderation. I would be much more worried about the context switch than the flush. As Andon mentioned in his response, a glFlush will not be enough if you need synchronization between the contexts/threads. That will require either a glFinish, or some kind of fence.
With your setup, you'll also have the price for thread switches on the CPU.
I think your gut feeling is absolutely right. The way I understand your use case, it would probably be much more efficient to render your sub-views in sequence, with a single rendering thread and context. It might make the state management a little more cumbersome, but you should be able to handle that fairly cleanly.
To make this answer more self contained, with credit to Andon: You don't have to make calls to set the current context, since the current context is maintained per thread. But on the GPU level, there will still be context switches as soon as work from multiple contexts is submitted.

NSThread, NSOperation or GCD for CoreMotion and accurate timing purposes?

I'm looking to do some high precision core motion reading (>=100Hz if possible) and motion analysis on the iPhone 4+ which will run continuously for the duration of the main part of the app. It's imperative that the motion response and the signals that the analysis code sends out are as free from lag as possible.
My original plan was to launch a dedicated NSThread based on the code in the metronome project as referenced here: Accurate timing in iOS, along with a protocol for motion analysers to link in and use the thread. I'm wondering whether GCD or NSOperation queues might be better?
My impression after copious reading is that they are designed to handle a quantity of discrete, one-off operations rather than a small number of operations performed over and over again on a regular interval and that using them every millisecond or so might inadvertently create a lot of thread creation/destruction overhead. Does anyone have any experience here?
I'm also wondering about the performance implications of an endless while loop in a thread (such as in the code in the above link). Does anyone know more about how things work under the hood with threads? I know that iPhone4 (and under) are single core processors and use some sort of intelligent multitasking (pre-emptive?) which switches threads based on various timing and I/O demands to create the effect of parallelism...
If you have a thread that has a simple "while" loop running endlessly but only doing any additional work every millisecond or so, does the processor's switching algorithm consider the endless loop a "high demand" on resources thus hogging them from other threads or will it be smart enough to allocate resources more heavily towards other threads in the "downtime" between additional code execution?
Thanks in advance for the help and expertise...
IMO the bottleneck are rather the sensors. The actual update frequency is most often not equal to what you have specified. See update frequency set for deviceMotionUpdateInterval it's the actual frequency? and Actual frequency of device motion updates lower than expected, but scales up with setting
Some time ago I made a couple of measurements using Core Motion and the raw sensor data as well. I needed a high update rate too because I was doing a Simpson integration and thus wnated to minimise errors. It turned out that the real frequency is always lower and that there is limit at about 80 Hz. It was an iPhone 4 running iOS 4. But as long as you don't need this for scientific purposes in most cases 60-70 Hz should fit your needs anyway.

What is the fastest way of loading and re-sizing an image?

I need to display thumbnails of images in a given directory. I use TFileStream to read the image file before loading the image into an image component. The bitmap is then resized to the thumbnail size, and assigned to a TImage component on a TScrollBox.
It seems to work ok, but slows down quite a lot with larger images.
Is there a faster way of loading (image) files from disk and resizing them?
Thanks, Pieter
Not really. What you can do is resize them in a background thread, and use a "place holder" image until the resizing is done. I would then save these resized images to some sort of cache file for later processing (windows does this, and calls the cache thumbs.db in the current directory).
You have several options on the thread architecture itself. A single thread that does all images, or a thread pool where a thread only knows how to process a single image. The AsyncCalls library is even another way and can keep things fairly simple.
I'll complement the answer by skamradt with an attempt to design this for being as fast as possible. For this you should
optimize I/O
use multiple threads to make use of multiple CPU cores, and to keep even a single CPU core working while you read (or write) files
The use of multiple threads implies that using VCL classes for the resizing isn't going to work, as the VCL isn't thread-safe, and all hacks around that don't scale well. efg's Computer Lab has links for image processing code.
It's important to not cause several concurrent I/O operations when using multiple threads. If you choose to write the thumbnail images back to files, then once you have started reading a file you should read it completely, and once you have started writing a file you should also write it completely. Interleaving both operations will kill your I/O, because you potentially cause a lot of seeking operations of the hard disc head.
For best results the reading (and writing) of files should also not happen in the main (GUI) thread of your application. That would suggest the following design:
Have one thread read files into TGraphic objects, and put these into a thread-safe list.
Have a thread pool wait on the list of files in original size, and have one thread process one TGraphic object, resize it into another TGraphic object, and add this to another thread-safe list.
Notify the GUI thread for each thumbnail image added to the list, so it can be displayed.
If thumbnails are to be written to file, do this in the reading thread as well (see above for an explanation).
Edit:
On re-reading your question I notice that you maybe only need to resize one image, in which case a single background thread is of course enough. I'll leave my answer in place anyway, maybe it will be of use to someone else some time. It's what I learned from one of my latest projects, where the final program could have needed a little more speed but was only using about 75% of the quad core machine at peak times. Decoupling I/O from processing would have made the difference.
I often use TJPEGImage with Scale:=jsEighth (in Delphi 7). This is really fast because the JPEG de-compression can skip a lot of the data to fill a bitmap of only an eighth of width and height.
Another option is to use the shell's method to extract a thumbnail, which is pretty fast as well
I'm in the vision business, and I simply upload the images to the GPU using OpenGL. (typically 20x 2048x2000x8bpp per second), a bmp per texture, and let the videocard scale (win32, Mike Lischke's opengl headers)
Upload of such an image costs 5-10ms depending on exact videocard (if not integrated and nvidia 7300 series or newer. Very recent integrated GPUs might be doable also). Scaling and displaying costs 300us. Which means customers can pan and zoom like crazy without touching the app. I draw an overlay (which used to be a tmetafile but is now an own format) on top of it.
My biggest picture is 4096x7000x8bpp which shows and scales in under 30ms. (GF 8600)
A limitation of this technology is max texture size. It can be resolved by fragmenting the picture into multiple textures, but I haven't bothered yet because I deliver the systems with the software.
(some typical sizes:
nv6x00 series: 2k*2k but uploading is just about break even compared to GDI
nv7x00 series: 4k*4k For me the baseline cards. GF7300's are like $20-40
nv8x00 series: 8k*8k
)
Note that this might not be for everybody. But if you are in the lucky situation to specify hardware limits, it might work. The main problem are laptops like Thinkpads, the GPUs of which are older than the avg laptop, which are in turn often a generation behind Desktops.
I chose OpenGL over DirectX because it is more static in time, and easier to find non-game related examples.
Try to look at the Graphics32 library : it's very good at drawing things and works great with Bitmaps. They are Thread - Safe with good example, and it's totally free.
Exploit windows capacity to create thumbnails. Remember that hidden Thumbs.db files in folders that contain images?
I have implemented something like this feature but in VB. My software is able to build thumbnails of 100 files (mixed size) in around 10 seconds.
I am not able to convert it to Delphi though.

Resources