Is "Running Time", "CPU Usage" a useful metric under Instruments to draw any conclusions? - ios

I have profiled an app on an iPhone 4 using "Time Profiler" and "CPU Monitor" and am trying to make sense of the results.
Over an execution time of 8 minutes, the CPU "Running Time" is around 2 minutes.
About 67% of that is on the main thread, of which 52% comes from my own code.
Now, I can see the majority of time being spent in enumerating over arrays (and associated work), UIKit operations, etc.
The problem is: how do I draw any meaningful conclusions from this data, i.e. decide whether something is going wrong here that needs fixing?
I can see a lot of CPU load over that running time (median at 70%) that isn't "justifiable" given the nature of the app.
Having said that, some things do stand out: parsing HTTP responses on the main thread and creating objects eagerly (backed up by memory profiling as well).
However, what I am looking for here is a way to identify offending code and draw useful conclusions based solely on CPU running time, i.e. "too much" time is being spent here.
Update
Let me try and elaborate in order to give a better picture.
Based on the functional requirements of this app, I can't see why it shouldn't be able to run on an iPhone 3G. A median CPU usage of around 70%, with a peak of 97%, can only look like a red flag on an iPhone 4.
The most obvious response to this is to investigate the code and draw conclusions from that.
What I am hoping for is a categorical answer of the following form:
- if you spend anywhere between 25% and 50% of your time on CA, there is something wrong with your animations
- if you spend 1000ms on anything related to UIKit, better check your processing
Then again, maybe there aren't any answers, only indications that things are off when it comes to running time and CPU usage.

The answer to the question "is there something wrong going on here that needs fixing" is simple: do you see a problem while using the application? If yes (you see glitches in animation, or the app hangs for a while), you probably want to fix it. If not, you may be looking at premature optimization.
Nonetheless, parsing HTTP responses on the main thread may be a bad idea.

In dev presentations Apple have pointed out that, whilst CPU usage is not an accurate indicator in the simulator, it is something to take stock of when profiling on a device. Personally I would consider any thread that takes significant CPU time without good reason a problem that needs to be resolved.
Find the time sinks, prioritise by percentage, and start working through them. These may not be visible problems now but they will begin to, if they have not already, degrade the user's experience of the app and potentially the device too.
Check out their documentation on how to effectively use CPU profiling for some handy hints.
If enumeration of arrays is taking a lot of time, then I would suggest that dictionaries or other more efficient lookup caches could be appropriate, assuming you can spare some memory to ease the CPU load.
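As a rough illustration of that trade-off, here is a minimal sketch in Python (used only for brevity; on iOS the same idea applies with an NSDictionary or Swift Dictionary keyed by an identifier). The Item type and its fields are hypothetical, not taken from the asker's app. The index is built once, costs extra memory, and turns each repeated lookup from a full scan into a hash lookup.

```python
from collections import namedtuple

# Hypothetical record type, for illustration only.
Item = namedtuple("Item", ["id", "name"])


def find_by_scan(items, item_id):
    """Linear search: walks the list until a match is found (O(n) per lookup)."""
    for item in items:
        if item.id == item_id:
            return item
    return None


def build_index(items):
    """Build a dictionary keyed by id once; later lookups are O(1) on average."""
    return {item.id: item for item in items}


items = [Item(i, f"item-{i}") for i in range(10_000)]
index = build_index(items)

# Both approaches return the same element; only the cost per lookup differs.
same_result = find_by_scan(items, 9_999) == index.get(9_999)
```

Whether this helps depends on how often the same collection is searched; for a one-off pass over the array the index costs more than it saves.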
An effective approach may be to remove all business logic from the main thread (a given) and make a good boundary layer between the app and the parsing / business logic. From here you can better hook in some test suites that could better tell you if the code is at fault or if it's simply the significant requirements of the app UI itself...

Eight minutes?
Without beating around the bush, you want to make your application faster, right?
Forget looking at CPU load and wondering if it's the right amount.
Forget guessing if it's HTTP parsing. Maybe it is, but guessing won't tell you.
Forget rummaging around in the code timing things in hopes that you will find the problem(s).
You can find out directly why it is spending so much time.
Here's the method I use,
and here's an (amateurish) video of it.
Here's what will happen if you do that.
First you will find something you would never have guessed, and when you fix it you will lop a big chunk off that 8 minutes, like maybe down to 6 minutes.
Then you do it again, and lop off another big chunk.
You repeat until you can't find anything to fix, and then it will be much faster than your 8 minutes.
OK, now the ball is in your court.

Related

Alternatives to D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS?

This is a follow-on to this question about using the DX11VideoRenderer sample (a replacement for EVR that uses DirectX11 instead of EVR's DirectX9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The callstack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems to be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?

The same operation takes different execution times in two identical projects

I have an app which I am trying to migrate to a new project. There is a heavy operation which I am handling asynchronously on the main thread. In my older project, it takes just a second to complete this task, but in my new project, it takes 6-7 seconds for the same task.
I observed the CPU usage and it looks like the new app is using less CPU and getting very few threads while the old one gets lots of threads for the same task. PS: I am using the same device.
What could cause this? Any ideas or suggestions to find out?
Thanks.
Finally, I found the problem. It was caused by the Optimization Level setting in the Xcode Build Settings. When a new project is created, the default Debug optimization level is None, and the Release optimization level is Fastest, Smallest [-Os]. So when I changed Debug to Fastest, Smallest [-Os], my task completion time dropped to 1 second.
From Apple:
The Xcode compiler supports optimization options that let you choose
whether you prefer a smaller binary size, faster code, or faster build
times. For new projects, Xcode automatically disables optimizations
for the debug build configuration and selects the Fastest, Smallest
option for the release build configuration. Code optimizations of any
kind result in slower build times because of the extra work involved
in the optimization process. If your code is changing, as it does
during the development cycle, you do not want optimizations enabled.
As you near the end of your development cycle, though, the release
build configuration can give you an indication of the size of your
finished product, so the Fastest, Smallest option is appropriate.
If you want to read more about optimization levels and performance: Tuning for Performance and Responsiveness
Side note: Changing the optimization level to Fastest, Smallest [-Os] in debug mode might affect debugger breakpoints, and the debugger may behave erratically.
Cheers.
This probably isn't really a response to your question, you answered it yourself quite nicely, but nonetheless I feel like it's needed.
I'd like to stress that you should NOT run long-running operations on the main thread, for any reason. If you want the screen to refresh 60 times per second (which should always be your goal), every block of code you submit to the main thread must take less than about 0.016 seconds (1/60) to avoid losing frames. If, in the meantime, you also need the main thread to do some complex animation and other work, you probably need to stay well below that 0.016-second mark.
If you block the main thread for much longer than that (like 1 second in this case), then users will experience a stuck interface: they can't scroll a scrollView or navigate the app. They may well close your app entirely, since they may feel like it's stuck.
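To put numbers on the frame budget, here is a small back-of-the-envelope sketch (in Python, purely as arithmetic; the function names are made up for illustration). It computes the time available per frame at a given refresh rate and roughly how many frame intervals a blocking call of a given length spans.

```python
def frame_budget_seconds(fps=60):
    """Time available per frame at the given refresh rate."""
    return 1.0 / fps


def frame_intervals_blocked(block_seconds, fps=60):
    """Approximate number of whole frame intervals covered by a blocking call."""
    return int(block_seconds * fps)


budget = frame_budget_seconds(60)               # roughly 0.0167 s per frame
blocked_1s = frame_intervals_blocked(1.0, 60)   # a 1 s task on the main thread
blocked_7s = frame_intervals_blocked(7.0, 60)   # a 7 s task on the main thread
```

At 60 frames per second, a 1-second block covers about 60 frame intervals and a 7-second block about 420, which is why even the "fast" case is worth moving off the main thread.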
In your case, for example, you may want to add some nice loading animation, like the ActivityIndicator or some nicer animation, to express you're actually working at that moment and you didn't freeze. That's really expected by users nowadays.
You may (or may not, it's up to you) also wanna add a cancel button, if the user wants to cancel the long running operation and do something else with your app.
To avoid the performance loss you describe (the task slowing to 6-7 seconds), you may want to use a serial queue with a high quality of service.
Probably userInitiated is what you want.
This way the task is still prioritized by the OS, but you won't block the main thread in the meantime, which allows you to add that loading animation, for example.
If performance is still too low, you could think about splitting the task into subtasks and having them performed in parallel using DispatchQueue.concurrentPerform(iterations:execute:) (but I don't know if that's doable in your case).
I hope this helps you.
Cheers

How can I investigate failing calibration on Spartan 6 MIG DDR

I’m having problems with a Spartan 6 (XC6SLX16-2CSG225I) and DDR (IS43R86400D) memory interface on some custom hardware. I've tried on a SP601 dev board and all works as expected.
Using the example project, when I enable soft_calibration, it never completes and calib_done stays low.
If I disable calibration I can write to the memory perfectly as far as I can see. But when I try to read from it, I get a variable number of successful read commands before the Xilinx memory controller stops implementing the commands. Once this happens, the command fifo fills up and stays full. The number of successful commands varies from 8 to 300.
I'm fairly convinced it's a timing issue, probably related to DQS centering. But because I can't get calibration to complete when enabled, I don't have continuous DQS Tuning. So I'm assuming it works with calibration disabled until the timing drifts.
Are there any obvious places I should be looking for why calibration fails?
I know this isn't a typical Stack Overflow question, so if it's an inappropriate place then I'll withdraw it.
Thanks
Unfortunately, the calibration process just tries to write and read content successively while adjusting taps internally. It finds one end of the range where reads succeed, then goes in the other direction to find the last successful tap there, and finally settles somewhere in the middle.
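The general idea of that sweep can be sketched as follows. This is only an illustration of "find both edges of the passing window, then take the middle", written in Python; it is not the actual Xilinx calibration logic, and the list of pass/fail results per tap is a made-up input.

```python
def find_passing_window(results):
    """Return (first_tap, last_tap) of the longest contiguous run of passing taps.

    `results` is a list of booleans, one per tap setting (True = the
    write/read-back test passed). Returns None if no tap passes.
    """
    best = None
    start = None
    for tap, ok in enumerate(results):
        if ok:
            if start is None:
                start = tap  # left edge of a new passing run
            if best is None or tap - start > best[1] - best[0]:
                best = (start, tap)
        else:
            start = None  # run ended; wait for the next passing tap
    return best


def center_tap(results):
    """Pick the tap in the middle of the widest passing window, or None."""
    window = find_passing_window(results)
    if window is None:
        return None
    first, last = window
    return (first + last) // 2


# Hypothetical sweep: taps 3..7 pass, everything else fails.
sweep = [False] * 3 + [True] * 5 + [False] * 2
chosen = center_tap(sweep)
```

In this picture, a sweep in which no tap passes gives no window to centre on, and a very narrow window leaves little margin, which is why the hardware checks that follow matter.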
This is probably more HW centric as well, so I post what I think and let someone else move the thread.
1. Is it just this board, or are all of them doing it? Have you checked? If it's one board, and the RAM is BGA style, it could be a bad solder job. Push your finger down slightly on the chip and see if you get different results... After this it gets more HW-centric.
2. Does the FPGA image you are running on your custom board have the ability to work on your devkit? A lot of times that isn't practical, I know, but I thought I would ask, as it rules out that the image you are using on the devkit has FPGA constraints you aren't getting in your custom image.
3. Check your length tolerances on the traces. There should have been a length constraint, plus or minus 50 mils or something like that. No one likes to hear they need a board re-spin, but if those are out, it explains a lot.
4. Signal integrity. Did you get your termination resistors in there, and are they the right values? Don't suppose you have an active probe?
5. Did you get the right DDR memory? Sometimes they use a different speed grade, and that can cause all sorts of issues.
Slowing down the interface will usually help items 4 and 5, so if you are just trying to get work done, you might ask for a new FPGA image with a slower clock.

iOS app slows down after a few seconds, speeds back up if paused and resumed in debugger

First of all, there's not a lot of detail I can offer, so I realize this question may seem incomplete. At this stage, I'm really looking for any ideas. Frankly, I'm just baffled by this one.
I'm building a graphics-heavy app that really maxes out the CPU. CPU utilization on the devices tends to be around 150% according to Xcode (I know that sounds weird; it seems to be out of a possible 200% because the device has two cores). I've instrumented the tasks that do the most processing so I can see how long they take in the debug output. Also note that I am compiling with -Ofast (aggressive optimizations), even for debug builds.
Here's the weird thing. About 5-10 seconds into running the CPU intensive mode of the app, everything slows down. It's very visible. Because of my instrumentation, I can see that suddenly everything takes about 3 times as long as it did before. It's pretty uniform across all tasks, and it doesn't speed back up. Here's the really weird thing. If I break execution in the debugger and resume, I get another 5-10 seconds of fast execution before it slows down again.
Looking at the CPU and memory usage reported by Xcode, everything stays about the same. The app uses no more than 90MB of memory at any point.
Is there a feature of iOS that slows down CPU intensive apps or underclocks the device to conserve battery life? I realize I'm sharing resources with the OS, but this is behavior I can reliably reproduce every time.
Again, I realize my question is vague, and there's no relevant code I can post. Any ideas about causes or even debugging methods are welcome.
First of all, thank you @thst for trying really hard to help me out. My question was kind of dumb since it really could have been anything.
I eventually solved the problem by rendering (via OpenGL, a detail I forgot to include, again showing how bad my question was) only when there is actually a change to the state of the objects and textures being rendered. It used to render at 60FPS all the time.
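The "render only on change" idea is often implemented with a dirty flag. Here is a minimal sketch of that pattern in Python (the class and method names are hypothetical; the real app would do this around its OpenGL draw call and frame callback). It is one common way to do it, not necessarily what was done here.

```python
class OnDemandRenderer:
    """Redraws only when something has marked the scene as changed."""

    def __init__(self, draw):
        self._draw = draw    # callable that performs the actual rendering
        self._dirty = True   # start dirty so the first frame is drawn

    def invalidate(self):
        """Call whenever object or texture state changes."""
        self._dirty = True

    def on_frame_tick(self):
        """Called at the display refresh rate; returns True if a frame was drawn."""
        if not self._dirty:
            return False     # nothing changed: skip the expensive draw
        self._draw()
        self._dirty = False
        return True


# Usage: count how many draws happen across several ticks.
draw_calls = []
renderer = OnDemandRenderer(lambda: draw_calls.append("frame"))
renderer.on_frame_tick()   # draws (initial frame)
renderer.on_frame_tick()   # skipped, nothing changed
renderer.invalidate()      # state changed
renderer.on_frame_tick()   # draws again
```

The tick still fires every frame, but the expensive work only happens when state has actually changed, which frees up CPU and GPU time for everything else.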
The app also uses CIDetector to detect faces. I think, but I am not sure, that CIDetector uses the GPU to perform its detection. If so, there might have been some contention for GPU resources. CIDetector blocking on a wait may have caused slowdown throughout the app.
Thanks everyone.

How to improve accuracy of profiling

I want to improve the running time of some code.
In order to do that, I first time the running time of all relevant code, using code like this:
before:= rdtsc;
myobject.run;
after:= rdtsc;
Then I zoom in and time a relevant part, like so:
procedure myobject.part;
begin
  StartTime:= rdtsc;
  ...
  EndTime:= rdtsc;
  inc(TotalTime, (EndTime- StartTime));
end;
I have some code to copy paste the timings into Excel, a typical outcome would look like:
(the 89.8% and 10.2% adding up to 100% is a coincidence and has nothing to do with the data or the question)
(when the data shows 1 it means 0 to avoid divide by zero errors)
Note the difference between run A and run B.
I have not changed anything yet so run A and B should give the same running time.
Further note that I know that on both runs procedure part was invoked exactly the same number of times (the data is the same and the algorithm is deterministic).
The running time of procedure part is very short (it is just called many times).
If there was some way to block out other processes during these short bursts of runtime (less than 700 CPU cycles) my timings would be much more accurate.
How do I get these timings to be more reliable?
Is there a way to monopolize the CPU to only run my task when timing and nothing else?
Note that I'm not looking for obvious answers like:
- Close other running programs
- Disable the virusscanner etc...
I've tagged the question Delphi because I'm using Delphi right now (and there may be some Delphi specific option to achieve this result).
I've also tagged it language-agnostic because there may be some more general way.
Update
Because I'm using the CPU instruction RDTSC I'm not affected by CPU throttling. If the CPU slows down, the number of cycles stays the same.
Update2
I have 2 answers, but neither answers the question...
The question is how do I prevent these changes in running time?
Do I have to run the code 20x and always compare the lowest running time out of the 20 runs?
Or do I set my program priority to realtime?
Or is there some other trick to use so my code sample does not get interrupted?
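For reference, the "run it N times and keep the lowest time" idea mentioned above can be sketched like this. It is written in Python with time.perf_counter_ns rather than Delphi and RDTSC, and the helper name best_of is made up. The usual rationale is that interruptions from other processes can only add time, so the minimum is the run least disturbed by them.

```python
import time


def best_of(func, runs=20):
    """Time `func` several times; return (minimum, all timings) in nanoseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        func()
        timings.append(time.perf_counter_ns() - start)
    return min(timings), timings


def workload():
    """Stand-in for the short, deterministic routine being measured."""
    return sum(range(1000))


best, all_runs = best_of(workload, runs=20)
```

Comparing the minimum before and after a change tends to be more stable than comparing single runs or averages, although it does not stop the interruptions from happening.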
I want to improve the running time of some code.
In order to do that, I first time the running time of all relevant code, ...
OK, I'm a bit of a stuck record on this subject, but lots of people think that to improve running time requires first measuring it accurately.
Not So.
Improving running time requires finding out what's taking a large fraction of time (the exact fraction does not matter) and doing it differently or maybe not at all.
What it's doing is often not revealed by timing individual routines.
Here's the method I use,
and here's a very amateur video of it.
The problem with profiling your code like that, by sticking special statements into it, is that those special statements themselves take time to run. And since the things taking the most time are likely to be things happening in tight loops, the more they run, the more they distort your timings. What you need for good information is something that will observe your program from outside, without modifying the executing code.
In other words, you need a sampling profiler. And there just happens to be a very good one for Delphi available for free, by the rather descriptive name of Sampling Profiler. It runs your program and watches what it's doing, then correlates that against the map file (make sure to set up your project options to generate a Detailed map file) to give you an intelligible readout on what your program is spending its time on.
And if you want to narrow things down, you can use OutputDebugString to output profiling commands to make it only pay attention to specific parts of your code. It's got instructions in the help file.
I've used a lot of different methods, and this is the most useful way I've found to figure out what Delphi programs are spending their time on. And it's free. Give it a try.

Resources