This question pertains to xperf and xperfview, utilities that are part of the Windows Performance Toolkit (in turn part of Windows SDK 7.1).
Comparing two charts, "CPU sampling by thread" and "CPU Usage by thread", there are several differences I don't understand. I'll use audiodg.exe as an example.
In the Threads pulldown, there is only one thread for audiodg on the CPU Sampling chart; the CPU Usage chart shows several audiodg threads.
Both graphs have a Y-axis marked "% Usage", but the measurements differ. Typically the % usage for a given thread is lower on the CPU Sampling chart than on the CPU Usage chart.
The CPU Sampling summary table shows Weight and % weight for each module/process. If I load symbols, I can dig pretty deep into the audiodg process. The CPU Scheduling Aggregate Summary table (launched from the CPU Usage graph) shows CPU Usage and % CPU usage -- Weight is not available. (Conversely, CPU Usage is not available on the CPU Sampling summary table.) I cannot dig as deep into audiodg -- I only see the main thread and a few ntdll.dll threads.
The numbers for any process in the % CPU usage and % Weight columns are always different. Sometimes they differ by more than 75%.
So my questions: what is the reliable measure of CPU usage here? Aren't the CPU Usage numbers derived from the CPU samples? Shouldn't the two sets of numbers relate somehow?
Xperf does make this a bit confusing. Here is my understanding of what's going on; there are two kinds of data involved:
CPU sample data, enabled with the PROFILE kernel flag. Sample data is collected at a regular interval and records what the CPU was doing at that moment (e.g. the process, thread ID, and call stack at the time of the sample).
Context switch data, enabled with the CSWITCH kernel flag. This records data about every context switch that happens (e.g. which threads were switched in and out, and their call stacks).
CPU sampling by thread shows the number of profile events recorded for each thread, aggregated over small intervals for the duration of the trace. For example, if audiodg was executing 10% of the time over a 2-second span, we would expect the chart to show about 10% usage over that span. However, because this is based on sampling, it's possible that at each sample event threads from another process happened to be executing; in other words, the 10% was 'missed' by the sample events.
CPU Usage by thread is calculated using the context switch data. The 'usage' is the amount of time between being context switched in and then out later (and of course, this data is aggregated over some small interval).
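To make the difference concrete, here is a small Python sketch (not xperf itself, just an illustration with made-up numbers) that builds a synthetic thread timeline and computes "usage" both ways: exactly from the run intervals, as the context-switch data allows, and by taking periodic samples, as the PROFILE data does.

    import random

    # Synthetic timeline for one thread that runs in short periodic bursts.
    # All numbers are made up purely for illustration.
    TRACE_MS = 2000            # 2-second trace
    BURST_MS = 0.3             # the thread runs for 0.3 ms at a time...
    GAP_MS = 2.7               # ...then waits 2.7 ms, so true usage is 10%
    SAMPLE_INTERVAL_MS = 1.0   # profile interrupt every 1 ms (roughly xperf's default rate)

    random.seed(0)
    phase = random.uniform(0, GAP_MS)   # random offset between sampler and workload

    # Intervals during which the thread is on the CPU.
    intervals = []
    t = phase
    while t < TRACE_MS:
        intervals.append((t, min(t + BURST_MS, TRACE_MS)))
        t += BURST_MS + GAP_MS

    # "CPU Usage by thread" view: sum the time between switch-in and switch-out.
    cswitch_usage = sum(end - start for start, end in intervals) / TRACE_MS

    # "CPU sampling by thread" view: at every sample tick, check whether the
    # thread happened to be running, and count the hits.
    ticks = int(TRACE_MS / SAMPLE_INTERVAL_MS)
    hits = sum(
        any(start <= i * SAMPLE_INTERVAL_MS < end for start, end in intervals)
        for i in range(ticks)
    )
    sampled_usage = hits / ticks

    print(f"context-switch based usage: {cswitch_usage:.1%}")
    print(f"sampling based usage:       {sampled_usage:.1%}")
    # Because this workload is exactly periodic, the bursts can fall entirely
    # between sample ticks: sampling may report 0% while the context-switch
    # data correctly reports ~10%. Real workloads are rarely this periodic,
    # but the two views never have to agree exactly.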
Each kind of data has its benefits:
CPU sampling will actually tell you what the thread is doing at the time of the sample, because it collects call stacks during the execution of the thread. The context switch information will only tell you when a thread gets switched in or out, but nothing about what happens in between.
Context switch information will tell you exactly how much time every thread got to execute. This data is exact; sampling, of course, is only probabilistic.
So to answer your question, the CPU Usage chart is "more accurate" for understanding how much time each thread was executing. However, don't rule out the use of the sampling data because it can be much more helpful for understanding where your threads were actually spending their time! For the CPU sampling data, the summary table is more valuable because it will show you the stacks. For the CPU usage data, the chart is probably more helpful than the summary table.
Hope that helps!
I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my estimate by the number of physical cores (if the pipeline is shared) or by the number of hardware threads (if memory access is parallel per thread)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecked on latency or branch misses) and low computational intensity (i.e. ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
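As a rough worked example of the "divide by cores or by hardware threads" question, here is a back-of-the-envelope Python sketch comparing the arithmetic intensity of a dot product against a machine's compute-to-bandwidth ratio. The peak-FLOP and bandwidth figures are placeholders, not measurements of any particular CPU.

    # Back-of-the-envelope roofline check: is a dot product compute-bound or
    # memory-bound?  Every hardware number below is a hypothetical placeholder;
    # substitute figures for your own CPU.

    peak_gflops_per_core = 100.0   # per-core peak (depends on SIMD width, FMA ports, clock)
    dram_bandwidth_gbs   = 40.0    # off-core (DRAM) bandwidth, shared by ALL cores and threads
    physical_cores       = 8

    # Dot product on f64 data: each pair of elements needs 16 bytes of loads
    # and does 2 FLOPs (one multiply, one add).
    flops_per_byte = 2 / 16        # arithmetic intensity = 0.125 FLOP/byte

    # Ceiling if limited by the shared memory bus (does not grow with logical
    # cores, and barely with physical cores once DRAM is saturated):
    memory_bound_gflops = dram_bandwidth_gbs * flops_per_byte      # 5 GFLOP/s total

    # Ceiling if every physical core could run purely from registers/L1:
    compute_bound_gflops = peak_gflops_per_core * physical_cores   # 800 GFLOP/s total

    print(f"memory-bound ceiling : {memory_bound_gflops:.1f} GFLOP/s (whole chip)")
    print(f"compute-bound ceiling: {compute_bound_gflops:.1f} GFLOP/s (whole chip)")
    # The memory ceiling is far lower, so a streaming dot product saturates DRAM
    # with roughly one thread; extra hyperthreads (or even extra cores) do not
    # help, because the bandwidth is a single shared budget.  High-intensity
    # code such as cache-blocked matmul is the case where the compute ceiling,
    # and therefore the core count, matters instead.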
I recently started using the wandb module with my PyTorch script, to ensure that the GPUs are operating efficiently. However, I am unsure as to what exactly the charts indicate.
I have been following the tutorial in this link, https://lambdalabs.com/blog/weights-and-bias-gpu-cpu-utilization/ , and was confused by this plot:
I am uncertain about the GPU % and the GPU Memory Access % charts. The descriptions in the blog are as follows:
GPU %: This graph is probably the most important one. It tracks the percent of the time over the past sample period during which one or more kernels was executing on the GPU. Basically, you want this to be close to 100%, which means GPU is busy all the time doing data crunching. The above diagram has two curves. This is because there are two GPUs and only one of them (blue) is used for the experiment. The Blue GPU is about 90% busy, which means it is not too bad but still has some room for improvement. The reason for this suboptimal utilization is due to the small batch size (4) we used in this experiment. The GPU fetches a small amount of data from its memory very often, and can not saturate the memory bus nor the CUDA cores. Later we will see it is possible to bump up this number by merely increasing the batch size.
GPU Memory Access %: This is an interesting one. It measures the percent of the time over the past sample period during which GPU memory was being read or written. We should keep this percent low because you want GPU to spend most of the time on computing instead of fetching data from its memory. In the above figure, the busy GPU has around 85% uptime accessing memory. This is very high and caused some performance problem. One way to lower the percent here is to increase the batch size, so data fetching becomes more efficient.
I had the following questions:
The aforementioned values do not sum to 100%. It seems as though our GPU can either be spending time on computation or spending time on reading/writing memory. How can the sum of these two values be greater than 100%?
Why does increasing batch size decrease the time spent accessing GPU Memory?
GPU utilization and GPU memory access would only add up to 100% if the hardware did the two things sequentially. Modern hardware doesn't work like that: the GPU is busy computing numbers at the same time as it is accessing memory.
GPU % is really GPU utilization %. We want this to be close to 100%, so that the GPU spends its time on the desired computation.
GPU memory access % is the fraction of time the GPU is either reading from or writing to its memory. We want this number to be low: if it is high, the GPU may sit waiting for data before it can compute on it. But that still doesn't mean the two activities happen sequentially.
W&B lets you monitor both metrics and make decisions based on them. Recently I implemented a data pipeline using tf.data.Dataset. GPU utilization was close to 0%, and memory access was close to 0% as well: I was reading three different image files and stacking them on the fly, so the CPU was the bottleneck. To counter this, I created the dataset from pre-stacked images. The ETA went from 1 h per epoch to 3 min.
From the plot, you can see that GPU memory access increased while GPU utilization sat close to 100%, and CPU utilization, which had been the bottleneck, decreased.
Here's a nice article by Lukas answering this question.
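If you want to reproduce the batch-size effect on your own system charts, a minimal sketch along these lines works (toy model and data, hypothetical project name; wandb records the GPU utilization and memory-access percentages automatically as system metrics while a run is active), and you can compare the panels between the two runs:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset
    import wandb

    def run(batch_size: int):
        # wandb logs GPU utilization / memory-access % automatically as
        # system metrics while the run is active, so no extra code is
        # needed for the charts discussed above.
        wandb.init(project="gpu-utilization-demo", config={"batch_size": batch_size})

        device = "cuda" if torch.cuda.is_available() else "cpu"
        # Toy data and model, purely illustrative.
        x = torch.randn(10_000, 256)
        y = torch.randint(0, 10, (10_000,))
        loader = DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=True)

        model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.CrossEntropyLoss()

        for epoch in range(5):
            for xb, yb in loader:
                xb, yb = xb.to(device), yb.to(device)
                opt.zero_grad()
                loss = loss_fn(model(xb), yb)
                loss.backward()
                opt.step()
            wandb.log({"epoch": epoch, "loss": loss.item()})

        wandb.finish()

    # Compare the GPU % and GPU memory access % panels between these two runs:
    run(batch_size=4)
    run(batch_size=256)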
I've read through Apple's documentation on CPU power consumption, but I still have some lingering questions.
Does 1% CPU usage have the same amount of overhead to keep the CPU powered on as 100% CPU usage?
Does each core power up and down independently?
How long does it take before the CPU powers down after it starts idling?
Will the system commonly be using the CPU for its own tasks, thus keeping the CPU on regardless?
For my app in particular, I'm finding it hard to get the CPU to 0% for any decent amount of time. In a typical session, there's at least going to be a steady stream of UIScrollView delegate calls and UITableViewCell recycle calls as the user moves around the app, let alone interacts with the content. However, based on this post, https://apple.stackexchange.com/questions/2876/why-does-gps-on-the-iphone-use-so-much-power, the CPU sounds like a major energy culprit. I'm hoping even a small pause while they use the app would let the CPU save some power, so then I can just work on getting rid of long-running tasks.
This is probably too broad a question to be really answered in this format. All of it "depends" and is subject to change as the OS and hardware change. However, you should start by watching the various energy-related videos from WWDC (there are a lot of them).
In principle, the difference between 0% CPU usage and 1% CPU usage is enormous. However, the difference between 1% and 100% is also enormous. The system absolutely does turn on and off different parts of the CPU (including a whole core) depending on how busy it is, and if it can get down to zero for even a few tens of milliseconds periodically, it can dramatically improve battery life (this is why recent versions of iOS allow you to specify timer tolerances, so it can batch work together and then get back to low-power mode).
All that said, you shouldn't expect to be at zero CPU while the user is actually interacting with your app. Obviously responding to the user is going to take work. You should explore your energy usage with Instruments to look for places that are excessive, but I would expect the CPU to be "doing things" while the user is scrolling (how could it not?). You shouldn't try to artificially inject pauses or anything to let the CPU sleep. As a general rule of thumb, it is best to use the CPU fully if it helps you get done faster, and then do nothing. By that I mean, it is much better to parse all of a massive download (as an example) as fast as you can and use 100% CPU, than to spread it out and use 20% CPU for 5x as long.
The CPU is there to be used. You just don't want to waste it.
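As a back-of-the-envelope illustration of the "use the CPU fully, then do nothing" advice, compare finishing a job at 100% CPU for 1 second with spreading it out at 20% for 5 seconds. The power figures below are invented for the sake of the arithmetic; they are not Apple's numbers and real values vary by device.

    # Hypothetical power figures in watts, purely to illustrate race-to-idle.
    P_IDLE_AWAKE = 0.3    # cost of merely keeping the CPU powered up and awake
    P_PER_PERCENT = 0.01  # additional cost per percent of CPU load

    def energy_joules(cpu_percent: float, seconds: float) -> float:
        # While awake you pay the fixed overhead plus the load-dependent part.
        return (P_IDLE_AWAKE + P_PER_PERCENT * cpu_percent) * seconds

    burst  = energy_joules(100, 1.0)  # run flat out for 1 s, then the CPU can sleep
    spread = energy_joules(20, 5.0)   # same total work spread over 5 s at 20%

    print(f"burst : {burst:.2f} J")   # (0.3 + 1.0) * 1 = 1.30 J
    print(f"spread: {spread:.2f} J")  # (0.3 + 0.2) * 5 = 2.50 J
    # Same amount of work, but staying awake 5x as long pays the fixed
    # "powered on" overhead five times over.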
This is something I've always wondered but never looked up.
When an OS reports 100% CPU usage, does that necessarily mean that the bottleneck is calculations performed by the CPU, or does that include stall times, loading data from L1, L2, L3 and RAM?
If it does include stall times, is there a tool that allows to break the figure down into its components?
The CPU usage reported by the OS includes time stalled waiting for memory accesses (as well as stalls from data dependencies on high latency computational operations like division).
I suspect that one could use performance counters to get a better handle on what is taking the time, but I am not familiar with the details of using performance monitoring counters.
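For example, on Linux the perf tool exposes counters that let you make that split yourself (whether events such as stalled-cycles-backend exist depends on the CPU). The sketch below uses invented counter values just to show the arithmetic:

    # Rough breakdown of "100% busy" into productive vs stalled cycles using
    # hardware performance counters.  On Linux you might collect them with:
    #   perf stat -e cycles,instructions,stalled-cycles-backend ./your_program
    # (event availability varies by CPU).  The numbers below are invented
    # placeholders, only to show the arithmetic.
    cycles                 = 10_000_000_000
    instructions           = 6_000_000_000
    stalled_cycles_backend = 4_500_000_000   # mostly waiting on memory / long ops

    ipc = instructions / cycles
    stall_fraction = stalled_cycles_backend / cycles

    print(f"IPC              : {ipc:.2f}")   # well below the core's peak retire rate
    print(f"back-end stalled : {stall_fraction:.0%} of 'busy' cycles")
    print(f"doing useful work: {1 - stall_fraction:.0%} at most (front-end stalls not counted)")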
When I use YJP (YourKit Java Profiler) to run a CPU-tracing profile on our own product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads), each of which has about 7-10 steps, during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API), and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During profiling, each step in the Grinder test takes more than 1,000,000 ms (over 16 minutes) to finish. The whole profiling run often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, a step finishes within 500 ms.
So, besides problems in the product being profiled, is there anything else that affects the performance of the CPU tracing itself?
Last I used YourKit (v7.5.11, which is pretty old; the current version is 12), it had two CPU profiling settings: sampling and tracing, the latter being much slower and more accurate. Since tracing is supposed to be more accurate, I used it myself and also observed a huge slowdown, in spite of the claim that the slowdown would be "average". Still, it was far less than in your case: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading input, calculating, and writing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned above, sampling vs. tracing, will affect performance, so you may want to try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like taking snapshots periodically or on low memory, profiling memory usage, and recording object allocations; each of these will make profiling slower. Perhaps you should run an interactive session instead of a script-controlled one, to see what it is really doing.
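The per-call overhead that makes tracing slow is easy to reproduce outside YourKit. The sketch below is in Python rather than Java and uses cProfile, a deterministic profiler that, like tracing, runs extra code on every call and return; a workload made of many tiny calls slows down dramatically, which matches the pattern described in the question.

    import cProfile
    import time

    def tiny(x):
        # A trivial method, representative of code that makes huge numbers
        # of cheap calls (the worst case for tracing/deterministic profilers).
        return x + 1

    def workload(n=2_000_000):
        total = 0
        for _ in range(n):
            total = tiny(total)
        return total

    # Plain run.
    t0 = time.perf_counter()
    workload()
    plain = time.perf_counter() - t0

    # Same run under a deterministic profiler, which executes extra code on
    # every call and return, just as tracing-mode profilers do.
    profiler = cProfile.Profile()
    t0 = time.perf_counter()
    profiler.runcall(workload)
    traced = time.perf_counter() - t0

    print(f"plain : {plain:.2f} s")
    print(f"traced: {traced:.2f} s  (~{traced / plain:.0f}x slower)")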
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.