NSThread, NSOperation or GCD for CoreMotion and accurate timing purposes?

I'm looking to do some high-precision Core Motion reading (>=100 Hz if possible) and motion analysis on the iPhone 4+ which will run continuously for the duration of the main part of the app. It's imperative that the motion response and the signals the analysis code sends out are as free from lag as possible.
My original plan was to launch a dedicated NSThread based on the code in the metronome project referenced here: Accurate timing in iOS, along with a protocol for motion analysers to link in and use the thread. I'm wondering whether GCD or NSOperation queues might be a better fit.
My impression after copious reading is that they are designed to handle many discrete, one-off operations rather than a small number of operations repeated at a regular interval, and that dispatching work every millisecond or so might inadvertently create a lot of thread creation/destruction overhead. Does anyone have any experience here?
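For what it's worth, GCD's repeating timer sources run their handler on pooled threads rather than creating a thread per tick, so the creation/destruction worry may not apply. A minimal Swift sketch, where the queue label, interval, and leeway are placeholder values:

    import Dispatch

    // Sketch: a repeating GCD timer source. GCD runs the handler on pooled
    // threads, so nothing is created or destroyed per tick.
    let analysisQueue = DispatchQueue(label: "motion.analysis", qos: .userInteractive)
    let timer = DispatchSource.makeTimerSource(queue: analysisQueue)
    timer.schedule(deadline: .now(), repeating: .milliseconds(1), leeway: .microseconds(100))
    timer.setEventHandler {
        // ... per-tick motion analysis would go here ...
    }
    timer.resume()  // keep a strong reference to `timer` in real code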
I'm also wondering about the performance implications of an endless while loop in a thread (such as in the code in the above link). Does anyone know more about how things work under the hood with threads? I know that the iPhone 4 (and earlier) has a single-core processor and uses some sort of intelligent multitasking (pre-emptive?) which switches threads based on various timing and I/O demands to create the effect of parallelism...
If a thread runs a simple "while" loop endlessly but only does additional work every millisecond or so, does the processor's scheduling algorithm treat the endless loop as a "high demand" on resources, hogging them from other threads? Or is it smart enough to allocate resources more heavily to other threads in the "downtime" between bursts of work?
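For reference, a minimal Swift sketch of the kind of dedicated-thread loop in question, with placeholder names; sleeping until the next tick rather than spinning is what gives the scheduler that "downtime" to run other threads:

    import Foundation

    // Sketch of the dedicated-thread pattern under discussion: a long-lived
    // thread that wakes roughly every millisecond. All names are placeholders.
    final class MotionWorker {
        private var running = true

        func start() {
            let thread = Thread { [weak self] in self?.run() }
            thread.qualityOfService = .userInteractive
            thread.start()
        }

        func stop() { running = false }

        private func run() {
            let interval: TimeInterval = 0.001   // ~1 kHz tick
            var next = Date()
            while running {
                // ... the per-tick work would go here ...
                next.addTimeInterval(interval)
                // Sleeping (not spinning) between ticks is what lets the
                // scheduler hand the core to other threads.
                Thread.sleep(until: next)
            }
        }
    }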
Thanks in advance for the help and expertise...

IMO the bottleneck is rather the sensors. The actual update frequency most often doesn't match what you have specified. See "update frequency set for deviceMotionUpdateInterval it's the actual frequency?" and "Actual frequency of device motion updates lower than expected, but scales up with setting".
Some time ago I made a couple of measurements using Core Motion and the raw sensor data as well. I needed a high update rate too, because I was doing a Simpson integration and thus wanted to minimise errors. It turned out that the real frequency is always lower and that there is a limit at about 80 Hz. That was on an iPhone 4 running iOS 4. But as long as you don't need this for scientific purposes, 60-70 Hz should fit your needs in most cases anyway.
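A minimal Swift sketch of that kind of measurement: request 100 Hz, then derive the delivered rate from the gaps between CMDeviceMotion timestamps (main-queue delivery is just for brevity; a dedicated OperationQueue works the same way):

    import CoreMotion

    // Sketch: request 100 Hz and measure what Core Motion actually delivers.
    let manager = CMMotionManager()
    manager.deviceMotionUpdateInterval = 1.0 / 100.0   // request 100 Hz

    var lastTimestamp: TimeInterval?
    manager.startDeviceMotionUpdates(to: .main) { motion, _ in
        guard let motion = motion else { return }
        if let last = lastTimestamp {
            print(String(format: "delivered rate: %.1f Hz", 1.0 / (motion.timestamp - last)))
        }
        lastTimestamp = motion.timestamp
    }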

Related

Multitasking techniques

I have 2 tasks that need to be performed:
One is synchronized with refresh rate via the Present call; does fancy graphics.
The other does a bunch of computations on a virtually infinite workload; does not need to be synchronous with the first task; really does not like being interrupted (encourages coarser workload granularity).
Is there a way to optimally use the GPU in this situation with DirectX?
Perhaps the solution would:
issue Dispatch (or Draw) calls in a way that allows them to run/finish asynchronously.
signal the current shader to stop.
use hardware or driver scheduling.
Right now my solution is to try to predict how long it would take to run the shaders, which is unreliable, unless I add a bunch of downtime...
Trying to avoid the th**ad word as it means a different thing on GPUs
Create two separate D3D11 devices. Use one for the rendering, and another one (driven from another CPU thread with lower priority) for the computations.
Rework your low-priority computations so that each Dispatch() takes a couple of milliseconds of GPU time to complete. Don't submit many compute calls at once: use 2 queries or a single fence so that you never have more than 2 pending compute calls. Dispatch 2 calls initially; when the first completes, dispatch the third, and so on (see the sketch after these steps).
While rendering 3D on your main thread, lock a std::mutex and release it once you have rendered the scene, before Present. On the background thread, lock that mutex when submitting more compute tasks, but keep it unlocked while waiting on a query or fence.
You're still going to have some interference between these two tasks, but it might be good enough for your use case.
Ideally, consider using timestamp queries to measure the GPU time spent on your background tasks. Then adjust the size of each task dynamically based on those numbers; this should achieve a good granularity for these tasks regardless of GPU performance. Don't forget to apply a rolling average over the last 5-10 completed tasks before using the number for these adjustments.
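The answer above is D3D11-specific; here is the same "at most two pending compute submissions" idea sketched with Metal in Swift, since the rest of this page is iOS-focused. The pipeline state and dispatch sizes are placeholders; a D3D11 version would use Dispatch() plus event queries or a fence instead of completion handlers:

    import Metal
    import Dispatch

    // Sketch: keep at most two compute command buffers in flight, submitting
    // the next one only when an earlier one completes.
    func runBackgroundCompute(queue: MTLCommandQueue, pipeline: MTLComputePipelineState) {
        let inFlight = DispatchSemaphore(value: 2)   // at most 2 pending dispatches
        while true {
            inFlight.wait()                          // blocks until a slot frees up
            guard let cmdBuf = queue.makeCommandBuffer(),
                  let encoder = cmdBuf.makeComputeCommandEncoder() else { return }
            encoder.setComputePipelineState(pipeline)
            // ... bind the kernel's buffers/textures here; size each dispatch so
            // it takes only a couple of milliseconds of GPU time ...
            encoder.dispatchThreadgroups(MTLSize(width: 64, height: 1, depth: 1),
                                         threadsPerThreadgroup: MTLSize(width: 64, height: 1, depth: 1))
            encoder.endEncoding()
            cmdBuf.addCompletedHandler { _ in inFlight.signal() }
            cmdBuf.commit()
        }
    }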

Should I explicitly put computation-heavy tasks on a separate thread on iOS to utilize multiple cores?

I am making a real-time image processing app on iOS with my team. I am handling the custom computation kernel (mostly on the CPU rather than the GPU) and my teammates deal with the GUI. When I tested my kernel in a toy app, the kernel (ignoring any I/O overhead) ran steadily at 100 ms per image. However, when put into the full-functioning app, it slows down to 500 ms per image.
I have checked that the data is pretty much the same, I am only measuring time consumed within the kernel, and the tests run on the same iPhone 6. There is hardly any other computation in the full-functioning app, so I am not sure what is dragging it down. Though GPU processing is definitely an alternative and I am working on it, I would like to know if there are any tricks I can use for now.
Currently there is no explicit multi-threading in the computation part, so my simple guess is: should I explicitly put the computation part on a separate thread so the second core can be utilized?
[Update]
It turns out that I made some mistakes in packaging my code as a library; copying the source code over directly works fine. I have not figured out the underlying problem yet and am going to post it as a separate question.
GPU Acceleration
This massively depends on the tasks you're performing: the GPU is good at a specific subset of tasks, and simply using it can sometimes even slow things down. Check this out.
A lot of image-based tasks in the Quartz framework etc. are GPU-accelerated (like blurring). Also, if you use a library like OpenCV, you get GPU acceleration for certain tasks out of the box.
Unless you're a real pro, I would avoid targeting the GPU directly and let the frameworks and libraries you use do that for you.
Concurrency
It will certainly help to put intensive tasks on a background thread. Just be aware of what that entails (e.g. you can't make any UIKit calls from a background thread).
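A minimal Swift sketch of that pattern, with hypothetical names (processImage stands in for the custom kernel):

    import UIKit

    // Hypothetical stand-in for the custom CPU kernel.
    func processImage(_ image: UIImage) -> UIImage {
        // ... the heavy per-image kernel would run here ...
        return image
    }

    func process(_ image: UIImage, into imageView: UIImageView) {
        DispatchQueue.global(qos: .userInitiated).async {
            let result = processImage(image)        // heavy work off the main thread
            DispatchQueue.main.async {
                imageView.image = result            // UIKit only on the main thread
            }
        }
    }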
The answer heavily depends on how you do the processing. Some methods in the SDK perform their job in a background thread, while others require the caller to create and use one.
In general, in the case of drawing, most methods require you to create one explicitly. This is especially important for the ones that do their work on the CPU (e.g. using CoreGraphics to draw within a drawRect: method). If you're using methods that do the processing on the GPU, then creating threads won't be of much use, since the CPU won't be the bottleneck.
If you want to determine why your app slows down, use Instruments. (Time Profiler for CPU and Core Animation for drawing)

Understanding hardware / OS level thread CPU throttling in iOS

I am trying to understand under what circumstances and how iOS may throttle my application threads due to excessive CPU consumption. The results I'm getting are kind of strange.
I have an application with an OpenGL / GLKViewController rendering a view, and a separate logic thread, started in the background using NSThread.detachNewThreadSelector, performing calculations. I find that if I (for purposes of discussion) let my computation thread run flat out as fast as it can, iOS quickly throttles it down. E.g. I monitor the FPS of both the view and my thread and see that the view maintains, say, 60 fps while my logic thread hums along, but then suddenly drops after a few seconds.
So it makes sense to me that perhaps iOS tries to limit thread consumption. What is weird is that it doesn't just slow down gradually: it seems to "quantize" my logic thread's FPS at approximately some multiple of the GPU frame rate (i.e. 30 or 60 fps)!
Now, keep in mind that there is no synchronization between these threads, and the logic loop is a self-contained hard loop equivalent to while(true), so I have no idea how iOS could even accomplish this magic unless it is somehow aware of my top-level loop and interjecting itself into it.
In case you don't believe me that there is no synchronization point: I have created a test case that literally just has an empty GLKViewController loop and a dumb logic thread that churns some numbers, and it exhibits the behavior. Screenshots are below, and I can post the code if anyone is interested.
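A hypothetical Swift reconstruction of roughly that shape of test case (all names are placeholders; the original code is not shown here):

    import Foundation

    // Hypothetical reconstruction: a detached thread that spins on some
    // arithmetic and prints its own iteration rate once per second.
    final class LogicLoop: NSObject {
        func start() {
            Thread.detachNewThreadSelector(#selector(run), toTarget: self, with: nil)
        }

        @objc private func run() {
            var iterations = 0
            var x = 1.0
            var lastReport = Date()
            while true {                       // hard loop, no synchronization
                x = (x * 1.000001).truncatingRemainder(dividingBy: 10.0)
                iterations += 1
                if Date().timeIntervalSince(lastReport) >= 1.0 {
                    print("logic thread: \(iterations) iterations/s (x = \(x))")
                    iterations = 0
                    lastReport = Date()
                }
            }
        }
    }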
The screenshots below are for two different "loads" of the logic thread, printed at intervals of a second, running on an iPad Air with iOS 8.
What's even stranger is that sometimes setting a lower preferred GLK frame rate (e.g. 30 fps) can actually make my logic thread run slower. I'd have expected that reducing the work done by the GPU would free up resources (and heat headroom) and reduce the need for throttling, but that doesn't always seem to be the case.
Does anyone have an explanation for this behavior? Is it documented? Thanks.
EDIT: My only guess at this point is that if the GPU runs too hot they shut down the second core and migrate threads back to the first... and then somehow thread prioritization accounts for the implicit synchronization, although I still can't envision exactly how this happens.

Which factors affect the speed of cpu tracing?

When I use YJP (YourKit Java Profiler) to do CPU-tracing profiling of our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads) with about 7-10 steps during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API), and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling run often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, each step finishes within 500 ms.
So... besides problems in the product being profiled, is there anything else that affects the performance of the CPU tracing process itself?
Last time I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the former being much faster and less accurate. Since tracing is supposed to be more accurate, I used it myself and also observed a huge slowdown, in spite of the claim that the slowdown would be "average". Yet it was far less than in your case: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine with virtually no I/O and no waits on anything, just reading input, calculating, and printing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs. tracing, will affect performance, so you may want to try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like taking snapshots periodically or on low memory, profiling memory usage, or recording object allocations; each of these will make profiling slower. Perhaps you should run an interactive session instead of a script-controlled one, to see what it is really doing.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.

iOS 4+: lag in CMDeviceMotion time intervals

I'm working on a calculation-intensive app that happens to listen to sensor data (acceleration, but also angular velocity). After a couple filters, these vectors are integrated to track displacement.
I have noticed that the timestamps associated with CMDeviceMotion and CMGyroData are late, because my CMMotionManager's handlers aren't fired at the 100 Hz specified by its accelerometerUpdateInterval and gyroUpdateInterval. The rate starts around 60 Hz and goes up and down. This majorly affects the integrations.
The same code in a stand-alone app does 100Hz like a charm.
So it looks like computation peaks from other modules of the big app make the sensor updates lag. This surprises me, since the sensor manager is on a thread of its own and I understood from the docs that the sensor events were triggered by the hardware.
My question is: when the timestamp is unreliable as described, can the data still be used? Can it be extrapolated using another clock?
And I'm confused as to why big, asynchronous computation on other threads can lag the accelerometer updates.
Thanks,
Antho
Bad timestamps are just as bad as inaccurate data, since they have the same effect on the integration.
About 50 Hz is enough to track orientation. I was wondering how you track displacement, because that is impossible with current sensors.
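Whatever the viability of displacement tracking, one mitigation for irregular delivery is to integrate with the per-sample dt taken from CMDeviceMotion.timestamp rather than assuming the nominal 1/100 s interval. A minimal Swift sketch that ignores filtering and sensor-bias drift; names are placeholders:

    import CoreMotion

    // Sketch: integrate acceleration using the actual per-sample interval.
    final class VelocityIntegrator {
        private var lastTimestamp: TimeInterval?
        private(set) var velocity = (x: 0.0, y: 0.0, z: 0.0)

        func ingest(_ motion: CMDeviceMotion) {
            defer { lastTimestamp = motion.timestamp }
            guard let last = lastTimestamp else { return }
            let dt = motion.timestamp - last       // actual interval, not nominal
            let a = motion.userAcceleration        // gravity removed, in units of g
            let g = 9.81
            velocity.x += a.x * g * dt
            velocity.y += a.y * g * dt
            velocity.z += a.z * g * dt
        }
    }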
