Which factors affect the speed of CPU tracing? - profiler

When I use YJP to do CPU-tracing profiling on our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads), each with about 7-10 steps, during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API), and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling run often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, it finishes within 500 ms.
So... besides problems in the product being profiled, is there anything else that affects the performance of the CPU tracing process itself?

The last time I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the former being much faster but less accurate. Since tracing is supposed to be more accurate I used it myself, and I also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading an input, calculating, and printing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs. tracing, will affect the performance, so you may try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like making snapshots periodically or on low memory, profiling memory usage, or recording object allocations; each of these measures will make profiling slower. Perhaps you should run an interactive session instead of a script-controlled one, to see what it really does.
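To make the sampling idea concrete, here is a toy sampler in plain Java: it periodically captures a thread's stack and counts the topmost frames. This is only a sketch of the technique, not how YourKit is implemented, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of how a sampling profiler works: capture a thread's
// stack at a fixed interval and count which method is on top. Methods that
// dominate the sample counts are the likely hotspots.
public class ToySampler {

    // Take `samples` stack snapshots of thread t, `periodMs` apart,
    // and tally the topmost frame of each snapshot.
    static Map<String, Integer> sample(Thread t, int samples, long periodMs)
            throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            StackTraceElement[] stack = t.getStackTrace();
            if (stack.length > 0) {
                StackTraceElement top = stack[0];
                hits.merge(top.getClassName() + "." + top.getMethodName(),
                           1, Integer::sum);
            }
            Thread.sleep(periodMs);
        }
        return hits;
    }

    public static void main(String[] args) throws InterruptedException {
        // Workload thread that spends all its time in a busy loop.
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7;
            }
        }, "worker");
        worker.setDaemon(true);
        worker.start();

        Map<String, Integer> hits = sample(worker, 50, 10);
        worker.interrupt();
        hits.forEach((m, n) -> System.out.println(m + ": " + n + " samples"));
    }
}
```

The key property, which the YourKit docs quoted below also state, is that the profiled code runs unmodified; the only cost is the periodic stack capture.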

According to the YourKit documentation:
Although tracing provides more information, it has its drawbacks.
First, it may noticeably slow down the profiled application, because
the profiler executes special code on each enter to and exit from the
methods being profiled. The greater the number of method invocations
in the profiled application, the lower its speed when tracing is
turned on.
The second drawback is that, since this mode affects the execution
speed of the profiled application, the CPU times recorded in this mode
may be less adequate than times recorded with sampling. Please use
this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of
running threads to estimate the slowest parts of the code. No method
invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and
discover performance bottlenecks. With sampling, the profiler adds
virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for
I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other
presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
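The CPU-time vs. wall-clock distinction can be observed directly with the standard ThreadMXBean: a blocked or sleeping thread accumulates wall time but almost no CPU time, which is exactly what a CPU-only profile hides. A minimal sketch (assumes the JVM supports thread CPU time measurement, as HotSpot does):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Shows why "CPU time" and "wall-clock time" diverge: a sleep costs
// ~200 ms of wall time but almost no CPU time, so CPU-only sampling
// would report this code as essentially free.
public class WallVsCpu {

    // Returns {wallNanos, cpuNanos} spent running the task on this thread.
    // getCurrentThreadCpuTime may return -1 if measurement is disabled.
    static long[] measure(Runnable task) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long wall0 = System.nanoTime();
        long cpu0 = bean.getCurrentThreadCpuTime();
        task.run();
        long wall = System.nanoTime() - wall0;
        long cpu = bean.getCurrentThreadCpuTime() - cpu0;
        return new long[] { wall, cpu };
    }

    public static void main(String[] args) {
        long[] blocked = measure(() -> {
            try { Thread.sleep(200); } catch (InterruptedException e) { }
        });
        System.out.printf("sleep: wall=%d ms, cpu=%d ms%n",
                blocked[0] / 1_000_000, blocked[1] / 1_000_000);
        // wall is ~200 ms; cpu is close to 0
    }
}
```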

Related

What are the specifics of the relation between CPU usage and power consumption on iOS?

I've read through Apple's documentation on CPU power consumption, but I still have some lingering questions.
Does 1% CPU usage have the same amount of overhead to keep the CPU powered on as 100% CPU usage?
Does each core power up and down independently?
How long does it take before the CPU powers down after it starts idling?
Will the system commonly be using the CPU for its own tasks, thus keeping the CPU on regardless?
For my app in particular, I'm finding it hard to get the CPU to 0% for any decent amount of time. In a typical session, there's at least going to be a steady stream of UIScrollView delegate calls and UITableViewCell recycle calls as the user moves around the app, let alone interacts with the content. However, based on this post, https://apple.stackexchange.com/questions/2876/why-does-gps-on-the-iphone-use-so-much-power, the CPU sounds like a major energy culprit. I'm hoping even a small pause while they use the app would let the CPU save some power, so then I can just work on getting rid of long-running tasks.
This is probably too broad a question to be really answered in this format. All of it "depends" and is subject to change as the OS and hardware change. However, you should start by watching the various energy-related videos from WWDC (there are a lot of them).
In principle, the difference between 0% CPU usage and 1% CPU usage is enormous. However, the difference between 1% and 100% is also enormous. The system absolutely does turn on and off different parts of the CPU (including a whole core) depending on how busy it is, and if it can get down to zero for even a few tens of milliseconds periodically, it can dramatically improve battery life (this is why recent versions of iOS allow you to specify timer tolerances, so it can batch work together and then get back to low-power mode).
All that said, you shouldn't expect to be at zero CPU while the user is actually interacting with your app. Obviously responding to the user is going to take work. You should explore your energy usage with Instruments to look for places that are excessive, but I would expect the CPU to be "doing things" while the user is scrolling (how could it not?). You shouldn't try to artificially inject pauses or anything to let the CPU sleep. As a general rule of thumb, it is best to use the CPU fully if it helps you get done faster, and then do nothing. By that I mean, it is much better to parse all of a massive download (as an example) as fast as you can and use 100% CPU, than to spread it out and use 20% CPU for 5x as long.
The CPU is there to be used. You just don't want to waste it.
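The batching idea behind timer tolerances (coalesce wake-ups so the CPU gets longer idle windows between bursts) can be sketched in plain Java, leaving the iOS APIs aside. The scheduler below is a hypothetical illustration of the pattern, not an Apple API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Instead of waking up for each task as it arrives, collect tasks that
// fall within the same window and run them in a single burst, leaving
// longer contiguous idle periods for the CPU to power down in.
public class CoalescingScheduler {
    private final List<Runnable> pending = new ArrayList<>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "coalescing-timer");
                t.setDaemon(true);
                return t;
            });

    // Tasks submitted within the same window share one wake-up.
    synchronized void submit(Runnable task, long windowMs) {
        pending.add(task);
        if (pending.size() == 1) {  // first task in the window arms the timer
            timer.schedule(this::flush, windowMs, TimeUnit.MILLISECONDS);
        }
    }

    synchronized void flush() {
        pending.forEach(Runnable::run);  // one burst of work...
        pending.clear();                 // ...then back to idle
    }

    public static void main(String[] args) throws InterruptedException {
        CoalescingScheduler s = new CoalescingScheduler();
        for (int i = 0; i < 3; i++) {
            s.submit(() -> System.out.println("task ran"), 50);
        }
        Thread.sleep(200);  // all three run together after ~50 ms
    }
}
```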

Do CPU usage numbers/percentages (e.g. Task Manager) include memory I/O times?

This is something I've always wondered but never looked up.
When an OS reports 100% CPU usage, does that necessarily mean that the bottleneck is calculations performed by the CPU, or does that include stall times, loading data from L1, L2, L3 and RAM?
If it does include stall times, is there a tool that can break the figure down into its components?
The CPU usage reported by the OS includes time stalled waiting for memory accesses (as well as stalls from data dependencies on high latency computational operations like division).
I suspect that one could use performance counters to get a better handle on what is taking the time, but I am not familiar with the details of using performance monitoring counters.

Detailed multitasking monitoring

I'm trying to put together a model of a computer and run some simulations on it (part of a school assignment). It's a very simple model: a CPU, a disk, and a process generator that generates user processes, which take turns using the CPU and accessing the disk. (I've decided to omit the various system processes, because according to the Microsoft Process Explorer tool, running on Windows 7, they use next to no CPU time.) And this is where I've got stuck.
I have no idea how to get relevant data on how often various processes read/write to disk and how much data at once, and how much time they spend using the CPU. Let's say I want to get some statistics for some typical operations on a PC - playing music/movies, browsing the internet, playing games, working with Office, video editing and so on. Is there even a way to gather such data?
I'm simulating preemptive multitasking using RR with a time quantum of 15ms for switching processes, and this is how it looks:
->Process gets to CPU
->Process does its work in 0-15ms, gives up the CPU or is cut off
And now, two options arise:
a) the process just sits and waits before it gets the CPU again, or waits for user input if there is nothing to do
b) the process requested data from the disk, and does not rejoin the queue until said data is available
And I would like the decision between a) and b) in the model to be made based on a probability, for example 90% for a) and 10% for b). But I do not know how to get those percentages to be at least a bit realistic for a given type of process. Also, how much data can and does a process typically access at once?
Any hints, sources, utilities available for this?
I think I found an answer myself, albeit an unreliable one.
The Process Explorer utility for Windows measures disk I/O, both by volume and by number of occurrences. So there's a rough way to get the answer:
say a process performs 3,000 reads in 30 minutes while using 2% of the CPU during that time (assuming a single-core CPU). The process has therefore used 36,000 ms of CPU time, divided into ~4,800 blocks (this is the unreliable part: a process in all probability does not use its whole time slot, so I'll just divide by half the 15 ms slot, i.e. 7.5 ms per block). 3,000/4,800 gives a ~63% chance of reading data after using the CPU.
I hope I did not misunderstand the "reads" statistic in Process Explorer.
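The estimate above can be reproduced mechanically. The inputs are taken from the text (3,000 reads, 30 minutes, 2% CPU, 15 ms quantum, slots assumed half-used); note that with these inputs the arithmetic works out to roughly 63%:

```java
// Back-of-the-envelope estimate of the probability that a process
// requests a disk read after a CPU burst, given Process Explorer stats.
public class IoProbability {

    static double readProbability(int reads, long wallMs,
                                  double cpuFraction, double quantumMs) {
        double cpuMs = wallMs * cpuFraction;        // total CPU time used
        double blocks = cpuMs / (quantumMs / 2.0);  // assume half-used time slots
        return reads / blocks;                      // chance a burst ends in a read
    }

    public static void main(String[] args) {
        // 3,000 reads, 30 minutes of wall time, 2% CPU, 15 ms quantum
        double p = readProbability(3000, 30L * 60 * 1000, 0.02, 15);
        System.out.printf("P(read after CPU burst) = %.1f%%%n", p * 100);  // 62.5%
    }
}
```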

Why does the VisualVM Sampler not provide full information about CPU load (method execution time)?

The problem is: the VisualVM sampler shows the call tree by time. For some methods the sampler shows only "Self time", so I can't see what makes the method slow. Here is an example.
How can I increase the depth of profiling?
Unfortunately sampling profilers are rather limited when it comes to in-depth profiling, for a number of reasons:
Samplers are limited by the sampling period: For example, VisualVM currently has a minimum sampling period of 20ms. Modern processors can execute several million instructions in that time - certainly more than enough to call several short methods and return from them.
While an obvious solution would be to decrease the sampling period, this would also increase the impact of the profiler on your application, presenting a nice example of the uncertainty principle.
Samplers are easily confused by inlined code: both the JVM and any decent compiler will inline trivial and/or frequently called methods, thus incorporating their code into the code of their caller. Sampling profilers have no way to tell which parts of each method actually belong to it and which belong to inlined calls.
In the case of VisualVM, Self time actually includes the execution time of both the method and any inlined code.
Samplers can get confused by an advanced VM: For example, in modern JVM implementations methods do not have a stable representation. Imagine for example the following method:
void A() {
...
B();
...
}
When the JVM starts, B() is interpreted straight from the bytecode, thus taking quite a bit of time, which makes it visible to the sampler. Then, after a while, the JVM decides that B() is a good candidate for optimization and compiles it to native code, thus making it much faster. And after another while, the JVM might decide to inline the call to B(), incorporating its code into A().
At best, a sampling profiler will show the cost of those first runs and then the cost of any subsequent runs will be included in the time spent by the caller. This, unfortunately, can confuse an inexperienced developer into underestimating the cost of the method that was inlined.
At worst, that cost may be assigned to a sibling call, rather than the caller. For example, I am currently profiling an application using VisualVM, where a hotspot seems to be the ArrayList.size() method. In my Java implementation that method is a simple field getter that any JVM should quickly inline. Yet the profiler shows it as a major time consumer, completely ignoring a bunch of nearby HashMap calls that are obviously far more expensive.
The only way to avoid these weaknesses is to use an instrumenting profiler, rather than a sampling one. Instrumenting profilers, such as the one provided by the Profiler tab in VisualVM essentially record each and every method entry and exit in the selected code. Unfortunately, instrumenting profilers have a rather heavy impact on the profiled code:
They insert their monitoring code around each method, which completely changes the way a method is treated by the JVM. Even simple field getter/setter methods may not be inlined any more due to the extra code, thus skewing any results. The profiler usually tries to account for these changes, but it is not always successful.
They cause massive slow-downs to the profiled code, which makes them completely unsuitable for monitoring complete applications.
For these reasons instrumenting profilers are mostly suitable for analyzing hotspots that have already been detected using another method such as a sampling profiler. By instrumenting only a selected set of classes and/or methods it is possible to restrict the profiling side-effects to specific parts of an application.
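The entry/exit recording that instrumenting profilers inject can be hand-rolled for a single hotspot you have already identified by sampling. A sketch of the idea (the method names are hypothetical; real profilers do this via bytecode instrumentation rather than manual wrappers):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal hand-rolled instrumentation: record every entry and exit of one
// selected method. This is the same enter/exit hook pattern instrumenting
// profilers inject, and it illustrates why they distort results: the
// wrapper itself adds code and may prevent inlining of the wrapped method.
public class ManualInstrumentation {
    static final AtomicLong calls = new AtomicLong();
    static final AtomicLong totalNanos = new AtomicLong();

    // Wrap only the hotspot you already found by sampling.
    static int instrumentedHotspot(int n) {
        long t0 = System.nanoTime();  // "method enter" hook
        try {
            return hotspot(n);
        } finally {                   // "method exit" hook
            calls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - t0);
        }
    }

    static int hotspot(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) sum += i;
        return sum;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) instrumentedHotspot(10_000);
        System.out.println("calls=" + calls
                + " avgNanos=" + totalNanos.get() / calls.get());
    }
}
```

Unlike sampling, this yields exact invocation counts, at the cost of per-call overhead on every wrapped method.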
There is nothing wrong in the example. It looks like updateInfoInDirection() calls new SequenceInfo() and SequenceInfo.next(). "Self time" means that the time is spent in the code of the method itself (the method updateInfoInDirection() is at the bottom of the stack at the time the thread sample was taken).

cooperative memory usage across threads?

I have an application with multiple threads processing work from a to-do queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
make each thread check available memory (or rather, the total memory allocated by all threads) before starting a new item. Only start when more than 2 GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
more ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar on a process level, after all...
regards,
Sören
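For what it's worth, option 2 can be made less dependent on querying the OS by tracking a budget of your own estimated allocations: each worker reserves a pessimistic estimate before starting an item and releases it afterwards. A sketch (the 6 GB usable figure and the per-item estimate are assumptions, not measurements):

```java
import java.util.concurrent.Semaphore;

// Option 2 with a twist: rather than asking the OS how much memory is free
// (which is racy), track a budget of the application's own reservations.
// Permits are denominated in MB; acquiring blocks until the budget allows.
public class MemoryBudget {
    private final Semaphore budgetMb;

    MemoryBudget(int totalMb) {
        this.budgetMb = new Semaphore(totalMb, true);  // fair: big items aren't starved
    }

    // Blocks until the estimated memory is available.
    void acquire(int estimateMb) throws InterruptedException {
        budgetMb.acquire(estimateMb);
    }

    void release(int estimateMb) {
        budgetMb.release(estimateMb);
    }

    int availableMb() {
        return budgetMb.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        MemoryBudget budget = new MemoryBudget(6 * 1024);  // assume ~6 GB usable
        int estimate = 2 * 1024;  // worst case per item, from the question
        budget.acquire(estimate);
        try {
            // ... process the work item ...
        } finally {
            budget.release(estimate);
        }
        System.out.println("remaining MB: " + budget.availableMb());
    }
}
```

This only helps to the extent that the per-item estimate is honest, which is exactly the hard part option 3 points at.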
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors could intervene. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
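The progressive-loading idea can be sketched as processing the input in fixed-size chunks, so peak extra memory is bounded by the buffer rather than the item size. The checksum below is a stand-in for whatever per-chunk work actually applies:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Progressive loading: stream the data through a small fixed buffer
// instead of materializing the whole work item in memory at once.
public class ChunkedProcessor {

    // Returns a running checksum; stands in for real per-chunk processing.
    static long process(InputStream in, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long checksum = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) checksum += buf[i] & 0xFF;
        }
        return checksum;
    }

    public static void main(String[] args) throws IOException {
        // The demo holds the data in memory for convenience; in practice
        // the stream would come from disk or the network.
        byte[] data = new byte[10_000_000];
        long sum = process(new ByteArrayInputStream(data), 64 * 1024);
        // Peak extra memory during processing: 64 KB, not 10 MB.
        System.out.println("checksum=" + sum);
    }
}
```

This only works when the algorithm can consume the data sequentially; random-access patterns need the whole item (or an on-disk index) available.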
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
