Dask performance: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (each with its own key) and I need to run a function my_fun on each of them. One way to solve this with pandas is
df = list(df.groupby("key")) and then apply my_fun
with multiprocessing. The performance, despite the huge RAM usage, is pretty good on my machine and terrible on Google Cloud Compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
df.groupby("key").apply(my_fun).to_frame().compute(get=get)
As I didn't set an index, df.known_divisions is False.
The resulting task graph shows a many-to-many shuffle between the original partitions and the aggregation tasks, and I don't understand whether what I see is a bottleneck or not.
Questions:
Is it better to have df.npartitions be a multiple of ncpu, or does it not matter?
From this it seems that it is better to set the index to key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.

For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing: you can run with any number of processes/threads you like, and you have much more information available via the diagnostics dashboard.
For your specific questions: if you are grouping over a column that is not nicely split between partitions and applying anything other than simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as an explicit step; otherwise you get the implicit shuffle apparent in your task graph. This is a many-to-many operation: each aggregation task needs input from every original partition, hence the bottleneck. There is no getting around that.
As for the number of partitions: yes, you can have sub-optimal conditions like 9 partitions on 8 cores (you will compute 8 tasks, and then perhaps block on the final task on one core while the others are idle), but in general you can depend on Dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases it will not matter much.
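
A minimal sketch of that advice, assuming a local distributed cluster, a placeholder S3 path, and a stand-in my_fun over a hypothetical value column:

import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster; diagnostics dashboard at http://localhost:8787/status

df = dd.read_csv("s3://my-bucket/series-*.csv")  # placeholder path; 14 files -> 14 partitions
df = df.set_index("key")  # one explicit shuffle; df.known_divisions is now True

def my_fun(group):
    # stand-in for the real per-series function; "value" is a hypothetical column
    return group["value"].mean()

result = df.groupby("key").apply(my_fun, meta=("value", "f8")).compute()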

Related

Dask utilising all ram when saving to parquet

I am having issues using Dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. The dataset grows to 100+ columns, mainly float64, after some additional processing. This is quite slow, especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading').
I think I could live with the delay, even if frustrating. However, when I try to save the data to Parquet with df.to_parquet('my data frame', engine="fastparquet"), it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory reported by free -h goes up from about 40 MB to 40+ GB.
I am confused how this is possible, given that Dask does not load everything into memory. I use 100 partitions for the dataset in Dask.
Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
you read a 40GB dataset
you run grouping / sorting
you join with other datasets
you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition; it's hard to say given the information provided.
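
A minimal sketch of those knobs, assuming a local distributed cluster and placeholder S3 paths (the worker sizing is hypothetical):

import dask.dataframe as dd
from dask.distributed import Client

client = Client(n_workers=8, memory_limit="12GB")  # hypothetical sizing for a ~110 GB box

df = dd.read_parquet("s3://my-bucket/raw/")  # placeholder input path

# ... grouping / sorting / joining happens here ...

df = df.persist()  # compute the expensive intermediate once, if it fits in cluster memory
df = df.repartition(partition_size="100MB")  # keep output partitions a manageable size
df.to_parquet("s3://my-bucket/output/", engine="fastparquet")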

Dask DataFrame MemoryError when calling to_csv

I'm currently using Dask in the following way...
there are a list of files on S3 in the following format:
<day1>/filetype1.gz
<day1>/filetype2.gz
<day2>/filetype1.gz
<day2>/filetype2.gz
...etc
my code reads all files of filetype1, builds up a dataframe, and sets the index (e.g. df1 = ddf.read_csv('<day1>/filetype1.gz', blocksize=None, compression='gzip').set_index(index_col))
reads through all files of filetype2 and builds up a big dataframe (similar to above).
merges the two dataframes together via merged_df = ddf.merge(df1, df2, how='inner', left_index=True, right_index=True).
Writes the results out to S3 via: merged_df.to_csv(<s3_output_location>)
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant. I thought Dask would manage the larger join in a memory-aware way based on the following line from the docs (https://docs.dask.org/en/latest/dataframe-joins.html):
If enough memory can not be found then Dask will have to read and write data to disk, which may cause other performance costs.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way. Any guidance or suggestions on how I should be using Dask to accomplish what I am trying to do? Thanks in advance.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way
In general Dask does chunk things up and operate in the way that you expect. Doing distributed joins in a low-memory way is hard, but generally doable. I don't know how to help more here without more information, which I appreciate is hard to deliver concisely on Stack Overflow. My usual recommendation is to watch the dashboard closely.
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant
In general your intuition is correct that giving more work to Dask at once is good. However in this case it looks like you know something about your data that Dask doesn't know. You know that each file only interacts with one other file. In general joins have to be done in a way where all rows of one dataset may interact with all rows of the other, and so Dask's algorithms have to be pretty general here, which can be expensive.
In your situation I would use Pandas along with Dask delayed to do all of your computation at once.
import dask
import pandas as pd

lazy_results = []
for fn in filenames:
    left = dask.delayed(pd.read_csv)(fn + "type-1.csv.gz")
    right = dask.delayed(pd.read_csv)(fn + "type-2.csv.gz")  # the second file type for each day
    merged = left.merge(right)
    out = merged.to_csv(...)
    lazy_results.append(out)

dask.compute(*lazy_results)

Can Dask computational graphs keep intermediate data so re-compute is not necessary?

I am very impressed with Dask and I am trying to determine if it is the right tool for my problem. I am building a project for interactive data exploration where users can interactively change parameters of a figure. Sometimes these changes requires re-computing the entire pipeline to make the graph (e.g. "show data from a different time interval"), but sometimes not. For instance, "change the smoothing parameter" should not require the system to reload the raw unsmoothed data, because the underlying data is the same, only the processing changes. The system should instead use the existing raw data that has already been loaded. I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed. It looks like the caching system in Dask is close to what I need, but was designed with a bit of a different use-case in mind. I see there is a persist method, but I'm not sure if that would work either. Is there an easy way to accomplish this in Dask, or is there another project that would be more appropriate?
"change the smoothing parameter" should not require the system to reload the raw unsmoothed data
Two options:
The builtin functools.lru_cache will cache every unique input. Memory use is bounded by the maxsize parameter, which controls how many input/output pairs are stored.
Using persist in the right places will compute and hold that object, as mentioned at https://distributed.dask.org/en/latest/manage-computation.html#client-persist. Later computations won't need to re-run the work to get that object; functionally, it's the same as lru_cache.
For example, this code will read from disk twice:
>>> import dask.dataframe as dd
>>> df = dd.read_csv(...)
>>> # df = df.persist() # uncommenting this line → only read from disk once
>>> df[df.x > 0].mean().compute()
24.9
>>> df[df.y > 0].mean().compute()
0.1
Uncommenting that line means this code only reads from disk once, because the task graph for the CSV is computed and the value is stored in memory. For your application, it sounds like you would want to use persist intelligently: https://docs.dask.org/en/latest/best-practices.html#persist-when-you-can
What if you want to visualize two smoothing parameters? In that case, I'd avoid calling compute repeatedly: https://docs.dask.org/en/latest/best-practices.html#avoid-calling-compute-repeatedly
lower, upper = dask.compute(df.x.min(), df.x.max())
This shares the task graph between min and max, so unnecessary computation is not performed.
I would like my system to be able to keep around the intermediate data objects and intelligently determine what tasks in the graph need to be re-run based on what parameters of the data visualization have been changed.
Dask has a smart opportunistic caching ability: https://docs.dask.org/en/latest/caching.html#automatic-opportunistic-caching. Part of the documentation says:
Another approach is to watch all intermediate computations, and guess which ones might be valuable to keep for the future. Dask has an opportunistic caching mechanism that stores intermediate tasks that show the following characteristics:
Expensive to compute
Cheap to store
Frequently used
I think this is what you're looking for; it'll store values depending on those attributes.
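
A minimal sketch of turning that cache on (the opportunistic cache works with the single-machine schedulers and needs the cachey package installed; the 2 GB budget is an arbitrary example):

from dask.cache import Cache

cache = Cache(2e9)  # keep up to ~2 GB of intermediate task results
cache.register()    # subsequent computations on the local schedulers consult this cache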

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story: you should do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
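
A minimal sketch of loading the data within the workers, assuming the per-lot data can be filtered at read time (the Parquet path and the filters-based read are hypothetical; LOT and lots come from the question):

import pandas as pd
from functools import reduce
import dask
from dask.distributed import Client

client = Client()  # mix of processes and threads; sidesteps the GIL issue above

@dask.delayed
def lot_matrix(lot):
    # load only this lot's rows inside the worker instead of slicing a big
    # client-side DataFrame and shipping it to every task
    lot_data = pd.read_parquet("data.parquet", filters=[("LOTID", "==", lot)])
    return LOT(lot, lot_data).transition_matrix(lot)

parts = [lot_matrix(lot) for lot in lots]
merged = dask.delayed(reduce)(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), parts)
result = merged.compute()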

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backing numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten then there are no race conditions, but I'm worried that I may be breaking an API contract by changing back data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to be unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.
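
A minimal sketch of the setup described above (the sizes, chunking, and element-wise f are hypothetical; whether overwriting the sources mid-computation is safe is exactly the open question):

import numpy as np
import dask.array as da

# hypothetical in-memory data collected from the instrument
backing = [np.random.rand(1_000_000) for _ in range(4)]

# name=False so arrays with identical contents are not unified in the graph
sources = [da.from_array(b, chunks=250_000, name=False) for b in backing]

f = da.sqrt  # stand-in for the element-wise operation f

# write f(X) back into the original numpy arrays, chunk by chunk
da.store([f(s) for s in sources], backing)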

Resources