Dask DataFrame MemoryError when calling to_csv - dask

I'm currently using Dask in the following way...
there is a list of files on S3 in the following format:
<day1>/filetype1.gz
<day1>/filetype2.gz
<day2>/filetype1.gz
<day2>/filetype2.gz
...etc
my code reads all files of filetype1, builds up a dataframe, and sets the index (e.g. df1 = ddf.read_csv('day1/filetype1.gz', blocksize=None, compression='gzip').set_index(index_col)),
reads through all files of filetype2 and builds up a big dataframe (similar to above),
merges the two dataframes together via merged_df = ddf.merge(df1, df2, how='inner', left_index=True, right_index=True),
and writes the results out to S3 via merged_df.to_csv(<s3_output_location>).
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant. I thought Dask would manage the larger join in a memory-aware way based on the following line from the docs (https://docs.dask.org/en/latest/dataframe-joins.html):
If enough memory can not be found then Dask will have to read and write data to disk, which may cause other performance costs.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way. Any guidance or suggestions on how I should be using Dask to accomplish what I am trying to do? Thanks in advance.
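For reference, here is a minimal sketch of the workflow described above (the bucket paths, glob patterns, and index column name are hypothetical placeholders, not my real values):

import dask.dataframe as ddf

df1 = ddf.read_csv("s3://bucket/*/filetype1.gz", blocksize=None,
                   compression="gzip").set_index("index_col")
df2 = ddf.read_csv("s3://bucket/*/filetype2.gz", blocksize=None,
                   compression="gzip").set_index("index_col")

merged_df = ddf.merge(df1, df2, how="inner",
                      left_index=True, right_index=True)
merged_df.to_csv("s3://bucket/output/*.csv")  # this is where the MemoryError appears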

I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way
In general Dask does chunk things up and operate in the way that you expect. Doing distributed joins in a low-memory way is hard, but generally doable. I don't know how to help more here without more information, which I appreciate is hard to deliver concisely on Stack Overflow. My usual recommendation is to watch the dashboard closely.
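If you are not already running the distributed scheduler, a minimal sketch of how to bring up the dashboard (a plain local Client here, just the simplest possible setup):

from dask.distributed import Client

client = Client()              # local cluster using the distributed scheduler
print(client.dashboard_link)   # open this URL in a browser to watch memory and task progress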
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant
In general your intuition is correct that giving more work to Dask at once is good. However in this case it looks like you know something about your data that Dask doesn't know. You know that each file only interacts with one other file. In general joins have to be done in a way where all rows of one dataset may interact with all rows of the other, and so Dask's algorithms have to be pretty general here, which can be expensive.
In your situation I would use Pandas along with Dask delayed to do all of your computation at once.
import dask
import pandas as pd

lazy_results = []
for fn in filenames:
    # Read each day's pair of files lazily with plain pandas
    left = dask.delayed(pd.read_csv)(fn + "type-1.csv.gz")
    right = dask.delayed(pd.read_csv)(fn + "type-2.csv.gz")
    # Join the two files for this day and write the result out
    merged = left.merge(right)
    out = merged.to_csv(...)
    lazy_results.append(out)

dask.compute(*lazy_results)
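One note on the sketch above: each iteration is independent, so all of the per-day joins run in parallel under the single dask.compute call at the end; just make sure each merged.to_csv(...) writes to its own output path so the days don't overwrite one another.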

Related

How do Dask bag partitions and workers correlate?

I'm using a vanilla Dask-Kubernetes setup with two workers and one scheduler to iterate over the lines of some JSON file (and apply some functions which don't appear here, for simplicity). I see only one worker ever working, where I'd expect to see both of them.
Hoping that repartitioning would help, I've experimented with different values for bag.repartition(num), which return different numbers of lines, but they don't change anything about the worker imbalance, and memory consumption still concentrates on only one worker.
I think I don't understand the correlation between partitions and workers, and I could not find anything in the Dask documentation about it. Any help or pointers are highly welcome!
import dask.bag as db

def grep_buildings():
    base = "https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/"
    b = db.text.read_text(f"{base}/Alabama.zip")
    # b = b.repartition(2)
    lines = b.take(3_000_000)
    return lines

len(grep_buildings())
In your example, you are opening one file, and it is compressed:
db.text.read_text(f"{base}/Alabama.zip")
Dask is able to open and process multiple files in parallel, with at least one partition per file. Dask is also able to split a single file into chunks (the blocksize parameter); but this only works for uncompressed data. The reason is that, for whole-file compression methods, the only way to get to some point in the uncompressed stream is to read from the start, so every partition would end up reading most of the data.
Finally, repartition doesn't help you when you start with a single partition: you need to read that whole file before splitting the data into pieces for downstream tasks.
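As a rough sketch of both options (the file names and block size below are illustrative, not taken from the question):

import dask.bag as db

# Option 1: many files (compressed or not) -> at least one partition per file
b = db.read_text(["day1.json.gz", "day2.json.gz", "day3.json.gz"])

# Option 2: a single *uncompressed* file split into blocks -> one partition per block
b = db.read_text("big-file.json", blocksize="16MiB")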

Why dask's read_sql_table requires a index_col parameter?

I'm trying to use read_sql_table from dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric value and I don't know what to give to the index_col parameter.
I read in the documentation that if the index_col is of type "object" I have to provide the "divisions" parameter, but I don't know what the values in my index_col are before reading the table.
I'm really confused. I don't know why I have to give an index_col when using read_sql_table but don't have to when using read_csv.
I've found in certain situations it's easiest to handle this by scattering DataFrame objects out to the cluster by way of pd.read_sql and its chunksize argument:
import pandas as pd
from dask import bag as db
# `client` is an existing dask.distributed Client; `connect` comes from your DB driver

sql_text = "SELECT ..."
sql_meta = {"column0": "object", "column1": "uint8"}
sql_conn = connect(...)

dfs_futs = map(client.scatter,            # Scatter each chunk to the cluster
               pd.read_sql(sql_text,
                           sql_conn,
                           chunksize=10_000,   # Iterate in chunks of 10,000 rows
                           columns=list(sql_meta.keys())))

# Now join our chunks (remotely) into a single frame.
df = db.from_sequence(list(dfs_futs)).to_dataframe(meta=sql_meta)
This is nice since you don't need to handle any potential drivers/packages that would be cumbersome to manage on distributed nodes and/or situations where it's difficult to easily partition your data.
Just a note on performance: for my use case, we leverage our database's external table operations to spool data out to a CSV and then read that with pd.read_csv (it's pretty much the same deal as above), while a SELECT ... FROM ... WHERE, compared to the way Dask parallelizes and chunks up queries, can still be acceptable performance-wise, since there is a cost to performing the chunking inside the database.
Dask needs a way to read the partitions of your data independently of one another. This means being able to phrase the query for each part with a clause like "WHERE index_col >= val0 AND index_col < val1". If you have nothing numerical, dask can't guess reasonable values for you; you can still do this if you can determine a way to provide reasonable delimiters, like list(string.ascii_letters). You can also provide your own complete WHERE clauses if you must.
Note that OFFSET/LIMIT does not work for this task, because:
the result is not in general guaranteed for any given inputs (this behaviour is database-implementation specific)
getting to some particular offset is done by paging through the results of a whole query, so the server has to do many times the amount of work necessary
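For example, a hedged sketch of passing string divisions to read_sql_table (the table name, connection URI, and column below are placeholders):

import string
import dask.dataframe as dd

# Partition boundaries for a text index; adjust to the actual distribution of your values
divisions = list(string.ascii_lowercase)

df = dd.read_sql_table(
    "my_table",                        # placeholder table name
    "postgresql://user:pass@host/db",  # placeholder connection URI
    index_col="name",                  # the non-numeric column to partition on
    divisions=divisions,
)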

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to run this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

# df is still lazy here; df.compute() triggers the work
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
Visualized graph of the operations
General rule: if your data comfortably fits into memory (including the base size times a small factor for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.
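As a rough sketch of that suggestion (the distributed scheduler, plus loading each lot's data inside the task rather than slicing one big frame on the client; the read_parquet path and layout are hypothetical):

from functools import reduce
import pandas as pd
from dask import delayed
from dask.distributed import Client

client = Client()  # distributed scheduler: processes + threads, plus the dashboard

@delayed
def load_lot(lot):
    # Hypothetical: each task reads only its own lot's rows,
    # instead of the client slicing one big in-memory frame and shipping pieces out
    return pd.read_parquet(f"data/lot={lot}.parquet")

@delayed
def to_matrix(lot, lot_data):
    # LOT and transition_matrix come from the question's code
    return LOT(lot, lot_data).transition_matrix(lot)

parts = [to_matrix(lot, load_lot(lot)) for lot in lots]
merged = delayed(reduce)(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), parts)
result = merged.compute()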

Dask performances: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (each one has its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves
df = list(df.groupby("key")) and then applying my_fun
with multiprocessing. The performance, despite the huge RAM usage, is pretty good on my machine and terrible on Google Cloud Compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
df.groupby("key").apply(my_fun).to_frame().compute(get=get)
As I didn't set the index, df.known_divisions is False
The resulting task graph is shown in the visualization, and I don't understand whether what I see is a bottleneck or not.
Questions:
Is it better to have df.npartitions be a multiple of ncpu, or does it not matter?
From this it seems that it is better to set the index to key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions, if you are grouping over a column that is not nicely split between partitions and applying anything other than simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as an explicit step, or else you get the implicit shuffle apparent in your task graph. This is a many-to-many operation: each aggregation task needs input from every original partition, hence the bottleneck. There is no getting around that.
As for number of partitions, yes you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle); but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.
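A sketch of the set-index-first variant under the distributed scheduler (the input path and the meta spec are assumptions about your data, not something taken from the question):

import dask.dataframe as dd
from dask.distributed import Client

client = Client()   # distributed scheduler instead of dask.multiprocessing.get

df = dd.read_csv("s3://bucket/part-*.csv")   # hypothetical input: 14 files -> 14 partitions

# Pay for the shuffle once, explicitly; afterwards the divisions are known
df = df.set_index("key")

# meta describes my_fun's output so dask doesn't have to guess
result = df.groupby("key").apply(my_fun, meta=("my_fun", "f8")).compute()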

Computing in-place with dask

Short version
I have a dask array whose graph is ultimately based on a bunch of numpy arrays at the bottom, and which applies elementwise operations to them. Is it safe to use da.store to compute the array and store the results back into the original backing numpy arrays, making the whole thing an in-place operation?
If you're thinking "you're using dask wrong" then see the long version below for why I feel the need to do this.
Long version
I'm using dask for an application where the original data is sourced from in-memory numpy arrays that contain data collected from a scientific instrument. The goal is to fill most of the RAM (say 75%+) with the original data, which means that there isn't enough to make an in-memory copy. That makes it semantically a bit like an out-of-core problem, in that any derived value can only be realised in memory in chunks rather than all at once.
Dask is well-suited to this, except for one wrinkle. I'm simplifying a lot, but on most of the data (call it X), we need to apply an element-wise operation f, compute some summary statistics s(f(X)), and use that to compute another result over the data, say t(s(f(X)), f(X)). While all the functions are dask-friendly (can be done on a per-chunk basis), trying to simply run this dask graph would cause f(X) to all be held in memory at once because the chunks are all needed for the second pass. An alternative is to explicitly compute s before asking for t (as suggested by https://github.com/dask/dask/issues/874), and thus pay to compute f(X) twice, but it's a somewhat expensive operation so I'd like to avoid that.
However, once f has been applied, the original data are no longer needed. So I'd like to run da.store(f(X)) and have it store the results in the original backing numpy arrays. Technically I think I know how to set that up, and as long as I can be sure that each piece of data is fully consumed before it is overwritten, there are no race conditions, but I'm worried that I may be breaking an API contract by changing the backing data underneath dask and that it might go wrong in some way. Is there any way to guarantee that it is safe?
One way I can immediately see this going wrong is if several of the input arrays have the same contents and hence get given the same name in dask, causing them to be unified in the graph. I'm using name=False in da.from_array though, so that shouldn't be an issue.
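For concreteness, here is roughly the pattern being described, with placeholder arrays and a placeholder element-wise f (this is not a claim that it is guaranteed safe, which is exactly the question):

import numpy as np
import dask.array as da

# Stand-ins for the in-memory buffers filled by the instrument
backing = [np.random.rand(1_000_000) for _ in range(4)]

# name=False so arrays with identical contents are not collapsed into one graph node
sources = [da.from_array(a, chunks=250_000, name=False) for a in backing]

# Placeholder for the element-wise operation f
transformed = [da.exp(x) for x in sources]

# The in-place step in question: overwrite each backing array with f of itself,
# chunk by chunk; after this, the original X must never be read again
da.store(transformed, backing)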
