Memory profiler tool to get an estimate of the improvement from enabling NUMA - low latency

I work on a low-latency application that, in my opinion, would greatly benefit from enabling NUMA awareness (or from improving memory locality in general).
Is there a profiling tool that would give me an estimate of the possible improvement, perhaps as a percentage or factor of reduction in execution time?
I was considering using Cachegrind. I would expect a lot of LL cache misses, but that still wouldn't give me an idea of the expected improvement.
Thanks a lot.
Edit:
The goal here is to reduce latency. Currently there is a single thread that runs at startup and does all the allocations. A better implementation, I believe, would be to pin the threads to CPU cores and have each thread perform the allocations it needs. Before doing that I'd like to have, somehow, an estimate of the benefit in terms of latency.
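For illustration, this is roughly the change I have in mind; a minimal sketch assuming Linux with pthreads and the default first-touch NUMA policy (core numbers and buffer sizes are placeholders):

```cpp
// Sketch: pin each worker to a core and let it allocate and touch its own
// working set, so pages end up on that core's local NUMA node (first-touch).
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

static void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void worker(int core) {
    pin_to_core(core);
    // Allocate from the pinned thread instead of a central startup thread:
    // the first write to each page places it on the local node.
    std::vector<char> working_set(64 * 1024 * 1024, 0);  // placeholder size
    // ... latency-critical loop over working_set ...
}

int main() {
    std::vector<std::thread> workers;
    for (int core = 0; core < 4; ++core)   // placeholder core count
        workers.emplace_back(worker, core);
    for (auto& t : workers) t.join();
}
```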

Related

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my estimate by the number of physical cores (if the memory pipeline is shared) or by the number of hardware threads (if memory access is parallel)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2-miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
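As a rough illustration (not from the question, just a hypothetical kernel): tiling an array so that repeated passes reuse data while it is still in L1d/L2, instead of streaming the whole array from DRAM on every pass.

```cpp
// Sketch of cache-blocking / loop tiling: do all passes over one small tile
// before moving on, so later passes hit cache instead of DRAM.
#include <cstddef>

void process_tiled(float* data, std::size_t n, int passes) {
    const std::size_t tile = 8 * 1024;  // 8K floats = 32 KiB, roughly L1d-sized
    for (std::size_t start = 0; start < n; start += tile) {
        const std::size_t end = (start + tile < n) ? start + tile : n;
        for (int p = 0; p < passes; ++p)          // reuse the tile while hot
            for (std::size_t i = start; i < end; ++i)
                data[i] = data[i] * 1.0001f + 1.0f;
    }
}
```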

Why is pthread_cond_wait a big portion of total perf samples in my program?

I use perf to do performance profiling and got the following flame graph. Notice that a big portion of the total samples is in pthread_cond_wait. I use boost::asio but am not sure where pthread_cond_wait is being called. Can anyone give me a clue why this is happening? Thanks.
That means you have a lot of lock contention. Without code there's little useful we can say, except generalities:
keep locks short
use atomics where possible (see the sketch after this list)
minimize resource sharing, so there is less need for synchronization of any kind, locking or otherwise
e.g. moving resources with their tasks is a good pattern to remove sharing
you might have too many threads; above a point you risk performance degradation due to increased lock contention
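For the atomics point, a minimal sketch with a hypothetical shared counter: the mutex version can block the calling thread under contention, while the atomic version never does.

```cpp
// Sketch: shared counter protected by a mutex vs. a lock-free atomic.
#include <atomic>
#include <cstdint>
#include <mutex>

std::mutex counter_mutex;
uint64_t counter_locked = 0;
std::atomic<uint64_t> counter_atomic{0};

void hit_locked() {
    std::lock_guard<std::mutex> lock(counter_mutex);  // may block under contention
    ++counter_locked;
}

void hit_atomic() {
    counter_atomic.fetch_add(1, std::memory_order_relaxed);  // never blocks
}
```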
Going by the flame graph alone, I will add the tip that it is possible to optimize Asio in the case of a single-threaded application. See Concurrency Hint.
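A minimal sketch of what that looks like, assuming a Boost version where io_context accepts a concurrency hint in its constructor:

```cpp
// Sketch: a concurrency hint of 1 tells Asio that only one thread will run
// the io_context, allowing it to skip some of its internal locking.
#include <boost/asio.hpp>
#include <chrono>

int main() {
    boost::asio::io_context io(1);  // single-threaded: reduced internal locking

    boost::asio::steady_timer timer(io, std::chrono::seconds(1));
    timer.async_wait([](const boost::system::error_code&) {
        // handler runs on the single thread calling io.run()
    });

    io.run();
}
```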

What are the specifics of the relation between CPU usage and power consumption on iOS?

I've read through Apple's documentation on CPU power consumption, but I still have some lingering questions.
Does 1% CPU usage have the same amount of overhead to keep the CPU powered on as 100% CPU usage?
Does each core power up and down independently?
How long does it take before the CPU powers down after it starts idling?
Will the system commonly be using the CPU for its own tasks, thus keeping the CPU on regardless?
For my app in particular, I'm finding it hard to get the CPU to 0% for any decent amount of time. In a typical session, there's at least going to be a steady stream of UIScrollView delegate calls and UITableViewCell recycle calls as the user moves around the app, let alone interacts with the content. However, based on this post, https://apple.stackexchange.com/questions/2876/why-does-gps-on-the-iphone-use-so-much-power, the CPU sounds like a major energy culprit. I'm hoping even a small pause while they use the app would let the CPU save some power, so then I can just work on getting rid of long-running tasks.
This is probably too broad a question to be really answered in this format. All of it "depends" and is subject to change as the OS and hardware change. However, you should start by watching the various energy-related videos from WWDC (there are a lot of them).
In principle, the difference between 0% CPU usage and 1% CPU usage is enormous. However, the difference between 1% and 100% is also enormous. The system absolutely does turn on and off different parts of the CPU (including a whole core) depending on how busy it is, and if it can get down to zero for even a few tens of milliseconds periodically, it can dramatically improve battery life (this is why recent versions of iOS allow you to specify timer tolerances, so it can batch work together and then get back to low-power mode).
All that said, you shouldn't expect to be at zero CPU while the user is actually interacting with your app. Obviously responding to the user is going to take work. You should explore your energy usage with Instruments to look for places that are excessive, but I would expect the CPU to be "doing things" while the user is scrolling (how could it not?). You shouldn't try to artificially inject pauses or anything to let the CPU sleep. As a general rule of thumb, it is best to use the CPU fully if it helps you get done faster, and then do nothing. By that I mean, it is much better to parse all of a massive download (as an example) as fast as you can at 100% CPU than to spread it out and use 20% CPU for five times as long.
The CPU is there to be used. You just don't want to waste it.

Do CPU usage numbers/percentages (e.g. Task Manager) include memory I/O times?

This is something I've always wondered but never looked up.
When an OS reports 100% CPU usage, does that necessarily mean that the bottleneck is calculations performed by the CPU, or does that include stall times, loading data from L1, L2, L3 and RAM?
If it does include stall times, is there a tool that would let me break the figure down into its components?
The CPU usage reported by the OS includes time stalled waiting for memory accesses (as well as stalls from data dependencies on high latency computational operations like division).
I suspect that one could use performance counters to get a better handle on what is taking the time, but I am not familiar with the details of using performance monitoring counters.
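As a sketch of what that can look like on Linux (using the perf_event_open syscall; the backend-stall event is not supported on every CPU, in which case opening it fails or it reads zero):

```cpp
// Sketch: count total cycles and backend stall cycles for the current thread.
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static int open_counter(uint32_t type, uint64_t config) {
    perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0 /*this thread*/, -1, -1, 0);
}

int main() {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int stalls = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_STALLED_CYCLES_BACKEND);

    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(stalls, PERF_EVENT_IOC_ENABLE, 0);

    volatile double sink = 0;                       // placeholder for the real workload
    for (long i = 0; i < 100000000; ++i) sink = sink + 1.0 / (i + 1);

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(stalls, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t c = 0, s = 0;
    if (read(cycles, &c, sizeof(c)) != sizeof(c)) c = 0;
    if (read(stalls, &s, sizeof(s)) != sizeof(s)) s = 0;
    printf("cycles=%llu backend-stalled=%llu (%.1f%%)\n",
           (unsigned long long)c, (unsigned long long)s,
           c ? 100.0 * s / c : 0.0);
}
```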

Cooperative memory usage across threads?

I have an application that has multiple threads processing work from a todo queue. I have no influence over what gets into the queue and in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time I run out of memory. I'm wondering about the best way to work around this.
1. Plan conservatively and run only 4 threads. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
2. Make each thread check available memory (or rather the total memory allocated by all threads) before starting on a new item. Only start when more than 2GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs and we may eventually start.
3. Try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
4. More ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways of handling situations like this exist. The operating system must do something very similar on a process level after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (e.g. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
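If you do end up with something like your option 2, tracking your own allocations against a fixed budget is more robust than querying the OS for free memory. A minimal sketch (hypothetical class and numbers, condition-variable based):

```cpp
// Sketch of a self-managed memory budget: each worker reserves the worst case
// for its item before starting and blocks until the reservation fits, instead
// of asking the OS how much memory is "free".
#include <condition_variable>
#include <cstddef>
#include <mutex>

class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t total) : free_(total) {}

    void acquire(std::size_t bytes) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] { return free_ >= bytes; });  // block until it fits
        free_ -= bytes;
    }

    void release(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(m_);
            free_ += bytes;
        }
        cv_.notify_all();  // wake workers waiting for budget
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::size_t free_;
};

// Usage in a worker thread, assuming a 2GB worst case per item:
//   budget.acquire(2ull << 30);
//   process(item);
//   budget.release(2ull << 30);
```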
I have continued the discussion on Herb Sutter's blog and provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
Difficult to propose solutions without knowing exactly what you're doing, but how about considering:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!
