I am in a situation where I need to write a large dataset to disk and am having trouble preventing the workers from "freezing": memory keeps increasing until it reaches the pause fraction (0.5), and then nothing happens anymore (according to the dashboard).
I tried to put together a minimal example but cannot reproduce this behaviour, as I am reading a large dataset.
How could I proceed with debugging in such a situation?
The platform is HPC.
distributed is configured as follows:
memory:
  target: false     # target fraction to stay below
  spill: false      # fraction at which we spill to disk
  pause: 0.50       # fraction at which we pause worker threads
  terminate: 0.95   # fraction at which we terminate the worker
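For reference, the same thresholds can also be set programmatically; a minimal sketch, assuming a recent dask.distributed where these keys live under distributed.worker.memory:

import dask
from dask.distributed import Client

# Mirror of the distributed.yaml snippet above.
dask.config.set({
    "distributed.worker.memory.target": False,     # never proactively move data out of RAM
    "distributed.worker.memory.spill": False,      # never spill to disk
    "distributed.worker.memory.pause": 0.50,       # pause worker threads at 50% memory
    "distributed.worker.memory.terminate": 0.95,   # restart the worker at 95% memory
})

client = Client()  # workers started after this point pick up the settings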
Here is an overview of the graph for writing a subset of 2 files:
[screenshot: dask graph]
Your workers are running out of memory. Unfortunately I don't know why without knowing a lot more about your program. Some common solutions to this:
Use smaller chunks
Use Dask workers with more memory
Turn off the memory pause feature, and hope that your computation can finish in the provided space
Add more Dask workers
Restructure your computation differently so that Dask can clear out data as it creates it
...
There are many things that could be going on here. Unfortunately it is difficult to tell without knowing much more about what you are trying to do (which is of course hard to do while respecting everyone's time).
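For example, the first suggestion (smaller chunks) might look like the sketch below; the array, chunk sizes, and output path are made up for illustration, and writing to zarr assumes the zarr package is installed:

import dask.array as da

# ~3.2 GB chunks: each worker thread needs several GB just to hold one chunk.
x = da.random.random((200_000, 20_000), chunks=(20_000, 20_000))

# ~200 MB chunks: the same data, but each task now fits comfortably in memory.
x = x.rechunk((5_000, 5_000))

x.to_zarr("output.zarr", overwrite=True)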
Related
I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my estimate by the number of physical cores (if the pipeline is shared) or by the number of hardware threads (if memory access is parallel)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
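To make the "divide by cores or by hardware threads" question concrete, here is a back-of-the-envelope sketch; every number below is illustrative, not a measurement of any particular CPU:

cores = 8
freq_hz = 3.0e9
flops_per_core = freq_hz * 2 * 4 * 2    # 2 FMA ports * 4 float64 lanes (AVX2) * 2 FLOPs per FMA
peak_compute = cores * flops_per_core   # ~384 GFLOP/s if every core is fed

dram_bw = 50e9                          # ~50 GB/s DRAM bandwidth, shared by all cores and threads

# A float64 dot product does 2 FLOPs (mul + add) per 16 bytes streamed from memory:
intensity = 2 / 16                      # FLOP per byte
bw_limited = dram_bw * intensity        # ~6.25 GFLOP/s, no matter how many threads you run

# For streaming code like this you divide by neither cores nor hardware threads:
# the shared memory bus, not the number of execution contexts, sets the ceiling.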
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
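As a concrete (if toy) illustration of loop tiling, a pure NumPy sketch; the block size and matrix sizes are arbitrary, and a real implementation would live in C/Fortran or an optimized BLAS:

import numpy as np

def blocked_matmul(A, B, bs=64):
    # Each bs-by-bs tile of A, B and C is reused many times while it is still hot in cache.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

A = np.random.rand(512, 512)
B = np.random.rand(512, 512)
assert np.allclose(blocked_matmul(A, B), A @ B)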
When calling another library with dask, such as scikit-image's contrast stretch, I realise that dask creates a result for each block, storing it either in memory or spilling it to disk separately, and then attempts to merge all the results. That's fine if you're on a cluster, or on a single computer where the array is small and everything is fairly controlled. The problems start when you work with datasets that are much larger than your RAM or disk. Is there a way to mitigate this, or to use the zarr file format to save and update values as you go along? Maybe that's too fanciful. Any other ideas short of buying more RAM would be helpful.
Edit:
I was looking at the dask documentation and its suggestion on chunk sizes, which is roughly 100 MB. I ended up reducing this significantly, to 30-70 MB depending on file size. I then ran a contrast stretch (not from a library, but with a numpy ufunc) and didn't have any issues! In fact I played with the way the computation is done. Since I start with a uint8 3-dimensional array, multiplying by the ratio for the contrast stretch inevitably promotes each chunk to float64, which takes up significant memory and computation. So what I have been doing is treating the da.array as np.asarray(float64) only just prior to the multiplication by a float, then returning to uint8 to finish the computation. The stretch time has come down to just under 5 minutes for a 20 GB file, so I think that's a positive step. It just means image processing without libraries. I will have a look at rechunker though.
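Roughly what I mean, as a sketch (function and variable names are just for illustration; I use float32 here, which is enough precision for a stretch and halves the size of the intermediates):

import numpy as np
import dask.array as da

def stretch_chunk(block, lo, hi):
    # Work in float32 only inside the chunk, then drop straight back to uint8.
    scale = np.float32(255.0 / (hi - lo))
    out = (block.astype(np.float32) - np.float32(lo)) * scale
    return np.clip(out, 0, 255).astype(np.uint8)

img = (da.random.random((4, 4096, 4096), chunks=(1, 2048, 2048)) * 255).astype("uint8")
stretched = img.map_blocks(stretch_chunk, 5, 250, dtype=np.uint8)
stretched.to_zarr("stretched.zarr", overwrite=True)   # needs the zarr package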
The image processing pipeline I am building will inevitably be used for a merged dataset of about 250-300 GB (definitely outside the limits of my laptop). I also don't have time to get to grips with cloud or parallel processing in the cloud; that's for a few months down the line. Right now it's about getting through this analysis.
Yes, you can do the kind of thing you are talking about. I encourage you to check out the rechunker project, which is specialized around changing the layout of data in zarr storage, but shows the idea of saving temporary intermediates to mitigate memory and communication issues.
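For example (a hedged sketch; the paths and chunk shapes are made up, and you should check the rechunker docs for the current API), rechunker builds a plan that streams data from the source layout to the target layout through an intermediate store, keeping per-task memory under max_mem:

import zarr
from rechunker import rechunk

source = zarr.open("input.zarr")          # e.g. an array chunked (1, 4096, 4096)
plan = rechunk(
    source,
    target_chunks=(16, 512, 512),
    max_mem="500MB",
    target_store="rechunked.zarr",
    temp_store="rechunk-tmp.zarr",
)
plan.execute()                            # runs as an ordinary dask computation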
My Dask computation is slow. When I look at the status page of the diagnostics dashboard I see that most of the time is spent in disk-read-* and disk-write-* tasks.
What does this mean?
How do I diagnose this issue?
When Dask workers start to run out of memory they write extra data to disk. This is recorded on the status page as a disk-write-* task. When that data is needed again it is read from disk, and a disk-read-* task is shown on the status page. You can confirm this by looking at the upper-left plot that shows memory use per worker, or at the solid portion of the progress bars that shows the number of tasks of each particular type that are still in memory.
Ways you can address this:
Figure out why Dask needs to keep data in memory. Common causes:
when you persist a lot of data
when Dask has to keep a lot of intermediate results, such as in the case of a full shuffle, or computations that have a high cardinality of results
Get more memory
Get a faster disk. Disk bandwidth has improved considerably in the last few years; it's possible to get drives in consumer-grade laptops with 1-2 GB/s of bandwidth.
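As an illustration of the first point (avoiding keeping a lot of persisted data in memory), restructuring so Dask does not need to hold every intermediate at once often means writing results straight to storage instead of persisting them; a sketch with made-up sizes and paths (writing to zarr assumes the zarr package):

import dask.array as da

x = da.random.random((100_000, 10_000), chunks=(10_000, 10_000))
y = (x - x.mean(axis=0)) / x.std(axis=0)

# y = y.persist()    # keeps every chunk of y in worker memory -> spilling likely
y.to_zarr("normalized.zarr", overwrite=True)   # finished chunks can be released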
When I use YJP (YourKit Java Profiler) to do a CPU-tracing profile of our product, it is really slow.
The product runs on a 16-core machine with an 8 GB heap, and I use Grinder to run a small load test (e.g. 10 Grinder threads), which has about 7-10 steps, during the profiling. I have a script that starts the product with the profiler, starts profiling (using the controller API) and then starts Grinder to emulate user operations. When all the operations finish, the script tells the profiler to stop profiling and save a snapshot.
During the profiling, each step in the Grinder test takes more than 1 million ms to finish. The whole profiling run often takes more than 10 hours with just 10 Grinder threads, each running the test 10 times. Without the profiler, it finishes within 500 ms.
So... besides problems with the product being profiled, is there anything else that affects the performance of the CPU tracing process itself?
Last time I used YourKit (v7.5.11, which is pretty old; the current version is 12) it had two CPU profiling settings: sampling and tracing, the former being much faster but less accurate. Since tracing is supposed to be more accurate I used it myself and also observed a huge slowdown, in spite of the statement that the slowdown would be "average". Yet it was far less than your results: from 2 seconds to 10 minutes. My code is a fragment of a calculation engine: virtually no I/O, no waits on anything, just reading input, calculating and writing the result to the console, so the whole slowdown comes from the profiler, with no external influences.
Back to your question: the option mentioned, sampling vs. tracing, will affect performance, so you may want to try sampling.
Now that I think of it: YourKit can be set up to do things automatically, like making snapshots periodically or on low memory, profiling memory usage, or tracking object allocations; each of these measures will make profiling slower. Perhaps you should run an interactive session instead of a script-controlled one, to see what it really does.
According to the YourKit docs:
Although tracing provides more information, it has its drawbacks. First, it may noticeably slow down the profiled application, because the profiler executes special code on each enter to and exit from the methods being profiled. The greater the number of method invocations in the profiled application, the lower its speed when tracing is turned on.
The second drawback is that, since this mode affects the execution speed of the profiled application, the CPU times recorded in this mode may be less adequate than times recorded with sampling. Please use this mode only if you really need method invocation counts.
Also:
When sampling is used, the profiler periodically queries stacks of running threads to estimate the slowest parts of the code. No method invocation counts are available, only CPU time.
Sampling is typically the best option when your goal is to locate and discover performance bottlenecks. With sampling, the profiler adds virtually no overhead to the profiled application.
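To make the quoted description concrete, here is a toy sampling loop (in Python rather than Java, purely to show the idea; a real profiler such as YourKit does this natively with far lower overhead):

import sys
import threading
import time
import traceback
from collections import Counter

samples = Counter()

def sampler(interval=0.01, duration=2.0):
    # Periodically grab every thread's stack and count the innermost function.
    end = time.time() + duration
    while time.time() < end:
        for tid, frame in sys._current_frames().items():
            if tid == threading.get_ident():
                continue                      # skip the sampler thread itself
            top = traceback.extract_stack(frame)[-1]
            samples[top.name] += 1
        time.sleep(interval)

def busy():
    return sum(i * i for i in range(10_000_000))

t = threading.Thread(target=sampler)
t.start()
busy()
t.join()
print(samples.most_common(5))                 # functions that were on-stack most often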
Also, it's a little confusing what the doc means by "CPU time", because it also talks about "wall-clock time".
If you are doing any I/O, waits, sleeps, or any other kind of blocking, it is important to get samples on wall-clock time, not CPU-only time, because it's dangerous to assume that blocked time is either insignificant or unavoidable.
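A tiny illustration of the difference (Python used only to show the idea): time spent blocked shows up in wall-clock time but is invisible to CPU-only time.

import time

wall_start = time.perf_counter()
cpu_start = time.process_time()

time.sleep(1.0)                    # stand-in for an I/O wait or lock wait
sum(i * i for i in range(10**6))   # a little real CPU work

print("wall:", time.perf_counter() - wall_start)   # ~1.1 s
print("cpu :", time.process_time() - cpu_start)    # ~0.1 s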
Fortunately, that appears to be the default (though it's still a little unclear):
The default configuration for CPU sampling is to measure wall time for I/O methods and CPU time for all other methods.
"Use Preconfigured Settings..." allows to choose this and other presents. (sic)
If your goal is to make the code as fast as possible, don't be concerned with invocation counts and measurement "accuracy"; do find out which lines of code are on the stack a large fraction of the time, and why.
More on all that.
I have an application with multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item may take anywhere from a couple of seconds to several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2 GB of memory. The memory consumption is my problem. I'm running as a 64-bit process on an 8 GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.
plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
make each thread check available memory (or rather the total memory allocated by all threads) before starting a new item. Only start when more than 2 GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs so we can eventually start.
try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
more ideas?
I'm currently tending towards number 2 because it seems simple to implement and solves most cases. However, I'm still wondering what standard ways exist for handling situations like this; the operating system must do something very similar at the process level, after all...
regards,
Sören
So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?
Basically you can either reduce memory consumption or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.
Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to increase the cache size, a background process could start up and swap in its working set, and any number of other factors. For these reasons, the smaller the available memory, the less you can rely on it. Also, over time memory fragmentation can cause problems too.
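If you do go down the option-2 route, tracking your own reservations rather than querying the OS for free memory avoids some of those pitfalls. A rough sketch of the idea (Python just for illustration; the per-item estimate and the 6 GB budget are placeholders):

import threading
import queue

class MemoryBudget:
    def __init__(self, total_bytes):
        self.total = total_bytes
        self.used = 0
        self.cond = threading.Condition()

    def acquire(self, nbytes):
        with self.cond:
            while self.used + nbytes > self.total:
                self.cond.wait()          # block until another worker releases memory
            self.used += nbytes

    def release(self, nbytes):
        with self.cond:
            self.used -= nbytes
            self.cond.notify_all()

budget = MemoryBudget(6 * 1024**3)        # leave headroom below the 8 GB of RAM

def worker(todo: "queue.Queue"):
    while True:
        item = todo.get()
        estimate = item.estimated_bytes   # hypothetical per-item estimate
        budget.acquire(estimate)
        try:
            item.process()
        finally:
            budget.release(estimate)
            todo.task_done()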
Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.
So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.
I have continued the discussion on Herb Sutter's blog and it provoked some very helpful reader comments. Head over to Sutter's Mill if you are interested.
Thanks for all the suggestions so far!
Sören
It's difficult to propose solutions without knowing exactly what you're doing, but how about considering the following:
See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.
Hope that helps?!