We have added custom code to run on Z1 motes, simulate everything in Cooja and are tracking power consumption using Powertrace.
Since we added additional code to run we would expect the power consumption to go up, but instead it goes down.
Based on the CPU Time reported by Powertrace the CPU seems to be idle for longer time periods.
So we have some questions for a better understanding:
When enabling Powertrace, will the CPU Time of custom code be tracked automatically?
Is it possible, that our code requires to many resources and Powertrace is thus not capable to report correctly?
Related
This is a followon to this question about using the DX11VideoRenderer sample (a replacement for EVR that uses DirectX11 instead of EVR's DirectX9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The callstack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?
When I use YJP to do cpu-tracing profile on our own product, it is really slow.
The product runs in a 16 core machine with 8GB heap, and I use grinder to run a small load test (e.g. 10 grinder threads) which have about 7~10 steps during the profiling. I have a script to start the product with profiler, start profiling (using controller api) and then start grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save snapshot.
During the profiling, for each step in the grinder test, it takes more than 1 million ms to finish. The whole profiling often takes more than 10 hours with just 10 grinder threads, and each runs the test 10 times. Without profiler, it finishes within 500 ms.
So... besides the problems with the product to be profiled, is there anything else that affects the performance of the cpu tracing process itself?
Last I used YourKit (v7.5.11, which is pretty old, current version is 12) it had two CPU profiling settings: sampling and tracing, the latter being much faster and less accurate. Since tracing is supposed to be more accurate I used it myself and also observed huge slowdown, in spite of the statement that the slowdown were "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine, virtually no IO, no waits on whatever, just reading a input, calculating and output the result into the console - so the whole slowdown comes from the profiler, no external influences.
Back to your question: the option mentioned - samping vs tracing, will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be setup such that it does things automatically, like making snapshots periodically or on low memory, profiling memory usage, object allocations, each of this measures will make profiling slowlier. Perhaps you should make an online session instead of script controlled, to see what it really does.
According to some Yourkit Doc:
Although tracing provides more information, it has its drawbacks.
First, it may noticeably slow down the profiled application, because
the profiler executes special code on each enter to and exit from the
methods being profiled. The greater the number of method invocations
in the profiled application, the lower its speed when tracing is
turned on.
The second drawback is that, since this mode affects the execution
speed of the profiled application, the CPU times recorded in this mode
may be less adequate than times recorded with sampling. Please use
this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of
running threads to estimate the slowest parts of the code. No method
invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and
discover performance bottlenecks. With sampling, the profiler adds
virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for
I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other
presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
I'm trying to put together a model of a computer and run some simulations on it (part of a school assignment). It's a very simple model - a CPU, a disk and a process generator that generates user processes that take turns in using the CPU and accessing the disk (I've decided to omit the various system processes, because according to Process Explorer they use next to no CPU time - I'm basing this on the Microsoft Process Explorer tool, running on Windows 7). And this is where I've stopped at.
I have no idea how to get relevant data on how often do various processes read/write to disk and how much data at once, and how much time they spend using the CPU. Let's say I want to get some statistics for some typical operations on a PC - playing music/movies, browsing the internet, playing games, working with Office, video editing and so on...is there even a way to gather such data?
I'm simulating preemptive multitasking using RR with a time quantum of 15ms for switching processes, and this is how it looks:
->Process gets to CPU
->Process does its work in 0-15ms, gives up the CPU or is cut off
And now, two options arise:
a)process just sits and waits before it gets the CPU again or before it gets some user input if there is nothing to do
b)process requested data from disk, and does not rejoin the queue until said data is available
And i would like the decision between a) and b) in the model be done based on a probability, for example 90% for a) and 10% for b). But I do not know how to get those percentages to be at least a bit realistic for a certain type of process. Also, how much data can and does a process typically access at once?
Any hints, sources, utilities available for this?
I think I found an answer myself, albeit an unreliable one.
The Process Explorer utility for Windows measures disk I/O - by volume and by occurences. So there's a rough way to get the answer:
say a process performs 3 000 reads in 30 minutes, whilst using 2% of CPU during that time (assuming a single core CPU). So the process has used 36000ms of CPU time, divided into ~5200 blocks (this is the unreliable part - the process in all proabbility does not use the whole of the time slot, so I'll just divide by half the time slot). 3000/5200 gives a 57% chance of reading data after using the CPU.
I hope I did not misunderstand the "reads" statistic in Process Explorer.
What coding tricks, compilation flags, software-architecture considerations can be applied in order to keep powerconsumption low in an AIR for iOS application (or to reduce powerconsumption in an existing application that burns too much battery)?
One of the biggest things you can do is adjust the framerate based off of the app state.
You can do this by adding handlers inside your App.mxml
<s:ViewNavigatorApplication xmlns:fx="http://ns.adobe.com/mxml/2009"
xmlns:s="library://ns.adobe.com/flex/spark"
activate="activate()" deactivate="close()" />
Inside your activate and close methods
//activate
FlexGlobals.topLevelApplication.frameRate = 24;
//deactivate
FlexGlobals.topLevelApplication.frameRate = 2;
You can also play around with this number depending on what your app is currently doing. If you're just displaying text try lowering your fps. This should give you the most bang for your buck power savings.
Generally, high power consumption can be the result of:
intensive network usage
no sleep mode for display while app idle
un-needed location services usage
continously high usage of cpu
Regarding (flex/flash) AIR I would suggest that:
First you use the Flex profiler + task-manager and monitor CPU and Memory usage. Try to reduce them as much as possible. As soon as you have this low on windows/mac they will go lower (theoretically on mobile devices)
Next step would be to use a network monitor and reduce the amount and size of the network (webservice) calls. Try to identify unneeded network activity and eliminate it.
Try to detect any idle state of the app (possible in flex, not sure about flash) and maybe put the whole app in an idle mode (if you have fireworks animation running then just call stop())
Also I am not sure about it, but will reduce for sure cpu and use more gpu: by using Stage3D (now available with air 3.2 also for mobile) when you do complex anymations. It may reduce execution time since HW accel is there, therefore power consumption may be lower.
If I am wrong about something please comment/downvote (as you like) but this is my personal impression.
Update 1
As prompted in the comments, there is not a 100% link between cpu usage on a desktop and on a mobile device, but "theoreticaly" on the low level we should have at least the same cpu usage trend.
My tips:
Profile your App in a first step with the profiler from the Flash Builder
If you have a Mac, profile your app with Instruments from XCode
And important:
behaviors of Simulator, IPA-Interpreter packages and IPA-Test builds are different.
Simulator - pro forma optimizations
IPA-Interpreter - Get a feeling of performance
IPA-Test - "real" performance behavior
And finally test the AppStore-Build, it is the fastest (in meaning of performance) package mode.
Additional we saw, that all this modes can vary.
This is a bit strange. When I execute the same app without changing anything in my code (just clicking "Record" in the Activity Monitor Instrument), I will get varying CPU on different runs - and it always varies by 10%.
This isn't switching back and forth in any systematic way, so my app can run at 30% CPU or 40% CPU (allowing the CPU to reach an equilibrium after a few seconds of startup).
What's causing this if nothing is changed in my code. Is it due to internal processes on the device?
EDIT:
Also, I don't retain any information or to my knowledge use any time varying functions (apart from the seed in some random functions...)
I think, CPU % will vary a little, for each execution.
The application should be executed for say 10 times, and the average CPU % should be considered as its actual load.