What are the specifics of the relation between CPU usage and power consumption on iOS?

I've read through Apple's documentation on CPU power consumption, but I still have some lingering questions.
Does 1% CPU usage have the same amount of overhead to keep the CPU powered on as 100% CPU usage?
Does each core power up and down independently?
How long does it take before the CPU powers down after it starts idling?
Will the system commonly be using the CPU for its own tasks, thus keeping the CPU on regardless?
For my app in particular, I'm finding it hard to get the CPU to 0% for any decent amount of time. In a typical session there's at least going to be a steady stream of UIScrollView delegate calls and UITableViewCell recycling calls as the user moves around the app, let alone interacts with the content. However, based on this post, https://apple.stackexchange.com/questions/2876/why-does-gps-on-the-iphone-use-so-much-power, the CPU sounds like a major energy culprit. I'm hoping that even a small pause while the user is in the app would let the CPU save some power, so that I can focus on getting rid of long-running tasks.

This is probably too broad a question to be really answered in this format. All of it "depends" and is subject to change as the OS and hardware change. However, you should start by watching the various energy-related videos from WWDC (there are a lot of them).
In principle, the difference between 0% CPU usage and 1% CPU usage is enormous. However, the difference between 1% and 100% is also enormous. The system absolutely does turn on and off different parts of the CPU (including a whole core) depending on how busy it is, and if it can get down to zero for even a few tens of milliseconds periodically, it can dramatically improve battery life (this is why recent versions of iOS allow you to specify timer tolerances, so it can batch work together and then get back to low-power mode).
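For illustration, here is a minimal sketch of the timer-tolerance API mentioned above; the 60 s interval and 10 s tolerance are illustrative values, not recommendations:

    import Foundation

    // A repeating timer with a generous tolerance, so the system can batch
    // its wake-up with other pending work and return to low-power mode sooner.
    let timer = Timer(timeInterval: 60, repeats: true) { _ in
        // periodic housekeeping goes here
    }
    timer.tolerance = 10  // give the system room to coalesce wake-ups
    RunLoop.main.add(timer, forMode: .common)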
All that said, you shouldn't expect to be at zero CPU while the user is actually interacting with your app. Obviously, responding to the user is going to take work. You should explore your energy usage with Instruments to look for places where it is excessive, but I would expect the CPU to be "doing things" while the user is scrolling (how could it not be?). You shouldn't try to artificially inject pauses or anything like that to let the CPU sleep. As a general rule of thumb, it is best to use the CPU fully if it helps you get done faster, and then do nothing. By that I mean it is much better to parse all of a massive download (as an example) as fast as you can at 100% CPU than to spread it out and use 20% CPU for 5x as long.
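As a sketch of that "race to idle" idea, assuming some chunked parsing workload (the chunk data and parseChunk below are made-up stand-ins):

    import Foundation

    // Made-up stand-ins for a chunked download and its parser.
    let chunks = Array(repeating: [UInt8](repeating: 0, count: 1_000_000), count: 8)

    func parseChunk(_ chunk: [UInt8]) {
        _ = chunk.reduce(0) { $0 &+ Int($1) }  // stand-in for real parsing
    }

    // Use every core to finish the batch as fast as possible, then go idle,
    // instead of pacing the work to keep CPU usage artificially "low".
    DispatchQueue.concurrentPerform(iterations: chunks.count) { i in
        parseChunk(chunks[i])
    }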
The CPU is there to be used. You just don't want to waste it.

Related

% cpu, memory and network usage android application

Is there any benchmark for comparing the CPU, memory, and network usage recorded by the Android Profiler when using an application? Or is there any best practice, e.g. that CPU usage should be kept below 25% at all times?
Thank you
There's no definitive benchmark. It all depends on your app. If it's a high-fidelity game, hitting 100% CPU is not uncommon. But if it's just a simple calculator app, hitting above 10% CPU may be a bit too high. It's also device-dependent, so 50% on a Pixel 5 doesn't compare to 50% on a Nexus 5.
As an app developer, you should pay more attention to unexpected CPU usage spikes and investigate what's causing them. Similarly for memory usage, you should keep an eye out for memory usage spikes and memory leaks, as opposed to absolute values.
Network usage, on the other hand, is more about what's expected vs. what's actually being sent over the wire. Take a look at both the bandwidth and the number of requests.

Software memory bit-flip detection for platforms without ECC

Most cheap desktop x86 platforms still have no ECC (Error Checking & Correction) memory support, but the rate of memory bit-flip errors is still growing (not the best SO thread; the large-scale CERN 2007 study "Data integrity": "Bit Error Rate of 10^-12 for their memory modules ... observed error rate is 4 orders of magnitude lower than expected"; Google's 2009 "DRAM Errors in the Wild: A Large-Scale Field Study"). For current hardware with a data-intensive load (8 GB/s of reading), this means that a single bit flip may occur every minute (at the 10^-12 vendor BER from CERN07) or once in two days (at the 10^-16 observed BER from CERN07). Google09 says that there can be up to 25,000-75,000 one-bit FIT per Mbit (failures in time per billion hours of operation), which is equal to 1-5 bit errors per hour for 8 GB of RAM ("mean correctable error rates of 2000-6000 per GB per year").
So, I want to know: is it possible to add some kind of software error detection in a system-wide manner (checking both user and kernel memory)? For example, create a patch for the Linux kernel and/or the system compiler that adds some checksumming of every memory page, and try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit flips early, before they are reused in later computations or stored to the hard drive.
I also understand that the better way to protect data from memory bit flips is to switch to ECC hardware, but most PCs out there are still non-ECC.
The thing is, ECC is dirt cheap compared to "software ECC countermeasures". You can easily detect whether the machine has ECC modules and complain (or print a warning) when it doesn't.
http://www.cyberciti.biz/faq/ecc-memory-modules/
For example, can we see all writes to memory (both from user and kernel space), to distinguish intended memory changes from in-memory bit flips? Or can we somehow instrument all code with some helper?
Er, you will never "see" the bit flips on the bus. They are literally caused by a particle hitting RAM and flipping a bit. Only much later can you notice that you read out something different than you wrote in. To detect this only via the bus, you would need a duplicate copy of all your RAM (i.e. a shadow copy of what is in your real RAM, so you can verify that every read returns what was written to that location).
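To illustrate why that is so expensive, here is a toy sketch of the shadow-copy idea in Swift; it doubles memory use and only helps as long as both copies don't flip in the same place:

    // Keep a duplicate of every value and compare the copies on each read.
    struct ShadowedValue<T: Equatable> {
        private var primary: T
        private var shadow: T

        init(_ value: T) { primary = value; shadow = value }

        mutating func write(_ value: T) { primary = value; shadow = value }

        func read() -> T {
            precondition(primary == shadow, "bit flip detected")
            return primary
        }
    }

    var counter = ShadowedValue(0)
    counter.write(42)
    print(counter.read())  // traps if the two copies ever disagree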
try to detect silent memory corruption (bit flips) by regularly recomputing the checksums?
The Redis guy has a nice write-up on an algorithm for testing RAM for problems. http://antirez.com/news/43 But this is really looking for RAM errors, not random bit-flips.
If "recompute checksums" only works when you are NOT writing to the memory. That might be "good enough" but you'll need to figure out which pages are not being written to.
To catch 100% of the errors, every write must be preceded by computing the checksum of that block of memory and comparing it to the recorded checksum (to make sure that block hasn't degraded in RAM). Only then is it safe to do the write and update the checksum. As you can imagine, the performance of this will be horrible (at least 100x slower).
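The cheaper read-only variant is easy to sketch in user space (nothing like the system-wide kernel scheme the question asks about; readOnlyTable is a made-up buffer that your code initializes once and never writes to again):

    import Foundation

    // FNV-1a over a byte buffer: fast and simple; fine for detecting flips,
    // not cryptographic.
    func fnv1a(_ data: Data) -> UInt64 {
        var hash: UInt64 = 0xcbf29ce484222325
        for byte in data {
            hash ^= UInt64(byte)
            hash = hash &* 0x100000001b3
        }
        return hash
    }

    let readOnlyTable = Data(repeating: 0xAB, count: 1_024 * 1_024)
    let expected = fnv1a(readOnlyTable)  // checksum once, after initialization

    // Later, on a timer or right before using the data:
    if fnv1a(readOnlyTable) != expected {
        fatalError("silent corruption detected in read-only data")
    }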
I understand that any kind of software memory ECC may cost a lot of performance and will not catch all errors, but I think it can be useful to detect at least some memory bit-flips early, before they will be reused in later computations or stored to hard drive.
Well, there is a simple method to detect 100% of the errors, at a cost of 50% of your performance: just run the computation on two boxes at once (or on one box at two different times, maybe with a RAM test in between if you are paranoid). If the results differ, you have detected an error.
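In code, the single-box variant is as trivial as it sounds; compute below is a made-up deterministic computation:

    // Run the same deterministic computation twice and compare the results.
    func compute(_ input: [Int]) -> Int {
        input.reduce(0, &+)
    }

    let input = Array(1...1_000_000)
    let first = compute(input)
    let second = compute(input)
    precondition(first == second, "results differ: possible memory error")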
See also:
https://www.linuxquestions.org/questions/linux-hardware-18/how-to-detect-ecc-memory-errors-under-linux-886011/
The answer to the question is yes, and a proof of that is the software SoftECC posted in the comments!
Just a note that SoftECC is a kernel-level solution. If a user-land app were used, it would be a third stage of redundancy, which seems unnecessary.

Apple Instruments slows down app when analyzing memory allocations

When I run my app in the simulator and analyze its memory allocations using Instruments, the app runs very slowly: at less than 1/30 of its normal speed.
The app uses about 50 MB of RAM and has approximately 900,000 live objects (according to Instruments).
Could this be the reason for the slow performance?
When running the app on the device, or in the simulator without Instruments, it performs well (except for the memory issue I am trying to debug).
Do you have any idea on how to solve this issue?
Did you encounter slow performance using the Memory Allocation instrument?
Would you consider having more than 900,000 live objects "concerning"?
Considering your Analyzer performance issue
In your specific case monitoring the app over a long period of time will not be necessary, as you reach the state of high memory consumption very soon. You could simply stop recording at this point. Then you won't have problems navigating through the different views and statistics to find the cause of the memory issue.
Analyzing the memory issue
Slowing down is normal, but 1/30 sounds quite alarming.
You should probably track how the number of live objects and the memory usage change while you use the app.
It is difficult to say whether a certain number of live objects at a specific point in time is critical (though 900,000 seems very high).
In general: if live objects and memory usage grow continuously and never shrink, that is a bad sign.
If you take a look at Statistics -> Object Summary (screenshot), Live Bytes should be a lot smaller than Overall Bytes, and the number of #Living objects should be a lot smaller than the number of #Transitory objects.
The second thing you can look at is the Call Tree view. It gives you a nice overview of which parts of the application are responsible for reserving large amounts of memory.
Possible solutions
Once you detect the parts of your code that are responsible for reserving the large amounts of memory, you can look for retain cycles, or you could try to use more autorelease pools in those spots.
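For example, the autorelease-pool suggestion usually looks like this (the loop body is a made-up stand-in for work that creates many short-lived objects):

    import Foundation

    // Drain temporary objects once per iteration instead of letting them
    // accumulate for the entire loop.
    for i in 0..<1_000 {
        autoreleasepool {
            let formatter = DateFormatter()  // creates autoreleased temporaries
            formatter.dateStyle = .medium
            _ = formatter.string(from: Date().addingTimeInterval(Double(i)))
        }
    }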
Check that you have enough available disk space. I had 8 GB left, and it seems that was too little. Instruments was extremely slow; it took a minute just to start and hardly got going at all.
I cleared out more disk space and then it suddenly went back to being fast as before.

Which factors affect the speed of cpu tracing?

When I use YJP to run a CPU-tracing profile on our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads) with about 7-10 steps during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API), and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, each step finishes within 500 ms.
So... besides the problems with the product to be profiled, is there anything else that affects the performance of the cpu tracing process itself?
Last I used YourKit (v7.5.11, which is pretty old; the current version is 12), it had two CPU profiling settings: sampling and tracing, the former being much faster and less accurate. Since tracing is supposed to be more accurate, I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine with virtually no I/O and no waits on anything, just reading an input, calculating, and outputting the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs. tracing, will affect performance, so you may try sampling.
Now that I think of it: YourKit can be set up so that it does things automatically, like making snapshots periodically or on low memory, profiling memory usage, or tracking object allocations; each of these measures will make profiling slower. Perhaps you should run an interactive session instead of a script-controlled one, to see what it really does.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods. "Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.

cooperative memory usage across threads?

I have an application with multiple threads processing work from a to-do queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime, and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
1. Plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather, the total memory allocated by all threads) before starting a new item. Only start when more than 2 GB of memory are left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering: what standard ways of handling situations like this exist? The operating system must do something very similar at the process level, after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically, you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput: definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
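If you do go with option 2, a more reliable variant than querying the OS is to make the threads track their own usage against a fixed, self-imposed budget. A minimal sketch, assuming you can estimate a worst-case byte count per item (the budget size and estimates are yours to tune):

    import Foundation

    // A shared byte budget: workers block until enough budget is free,
    // reserve their estimated worst-case share, and release it when done.
    final class MemoryBudget {
        private let condition = NSCondition()
        private var available: Int

        init(bytes: Int) { available = bytes }

        func acquire(_ bytes: Int) {
            condition.lock()
            while available < bytes { condition.wait() }
            available -= bytes
            condition.unlock()
        }

        func release(_ bytes: Int) {
            condition.lock()
            available += bytes
            condition.broadcast()  // wake all waiters to re-check the budget
            condition.unlock()
        }
    }

    let budget = MemoryBudget(bytes: 6_000_000_000)  // leave ~2 GB for the OS
    // Per worker thread:
    // budget.acquire(estimatedBytes)
    // defer { budget.release(estimatedBytes) }
    // ... process the work item ...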
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
