How do Dask bag partitions and workers correlate? - dask

I'm using a vanilla Dask-Kubernetes setup with two workers and one scheduler to iterate over the lines of some JSON file (and apply some functions which don't appear here, for simplicity). I only ever see one worker doing work, where I'd expect to see two of them instead.
Hoping that repartitioning would help, I've experimented with different values for bag.repartition(num), which return different numbers of lines, but they change nothing about the worker imbalance: memory consumption still concentrates on a single worker.
I think I don't understand the correlation between partitions and workers, and I could not find anything in the Dask documentation about it. Any help or pointers are highly welcome!
import dask.bag as db

def grep_buildings():
    base = "https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/"
    b = db.text.read_text(f"{base}/Alabama.zip")
    # b = b.repartition(2)
    lines = b.take(3_000_000)
    return lines

len(grep_buildings())

In your example, you are opening one file, and it is compressed:
db.text.read_text(f"{base}/Alabama.zip")
Dask is able to open and process multiple files in parallel, with at least one partition per file. Dask is also able to split a single file into chunks (the blocksize parameter), but this only works for uncompressed data. The reason is that, for whole-file compression methods, the only way to get to some point in the uncompressed stream is to read from the start, so every partition would end up reading most of the data.
Finally, repartition doesn't help you when you start with a single partition: you need to read that whole file before splitting the data into pieces for downstream tasks.
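To get parallelism at read time you therefore need either several input files or a single uncompressed file that Dask can split. A minimal sketch, assuming the archive has first been extracted to plain text (the file names here are placeholders):

import dask.bag as db

# A single uncompressed file: blocksize cuts it into many partitions
b = db.read_text("Alabama.geojson", blocksize=64_000_000)

# Or many files (compressed or not): at least one partition per file
b = db.read_text("states/*.geojson")

Either way, each partition becomes a separate task that the scheduler can place on any worker.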

Related

Dask DataFrame MemoryError when calling to_csv

I'm currently using Dask in the following way...
There is a list of files on S3 in the following format:
<day1>/filetype1.gz
<day1>/filetype2.gz
<day2>/filetype1.gz
<day2>/filetype2.gz
...etc
My code:
1. reads all files of filetype1 and builds up a dataframe, setting the index (e.g. df1 = ddf.read_csv('day1/filetype1.gz', blocksize=None, compression='gzip').set_index(index_col));
2. reads through all files of filetype2 and builds up a big dataframe (similar to the above);
3. merges the two dataframes together via merged_df = ddf.merge(df1, df2, how='inner', left_index=True, right_index=True);
4. writes the results out to S3 via merged_df.to_csv(<s3_output_location>).
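Condensed, the workflow above looks roughly like this (bucket paths and the index column are placeholders):

import dask.dataframe as ddf

df1 = ddf.read_csv("s3://bucket/*/filetype1.gz", blocksize=None, compression="gzip").set_index("index_col")
df2 = ddf.read_csv("s3://bucket/*/filetype2.gz", blocksize=None, compression="gzip").set_index("index_col")
merged_df = ddf.merge(df1, df2, how="inner", left_index=True, right_index=True)
merged_df.to_csv("s3://bucket/output/part-*.csv")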
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant. I thought Dask would manage the larger join in a memory-aware way based on the following line from the docs (https://docs.dask.org/en/latest/dataframe-joins.html):
If enough memory can not be found then Dask will have to read and write data to disk, which may cause other performance costs.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way. Any guidance or suggestions on how I should be using Dask to accomplish what I am trying to do? Thanks in advance.
I see that a MemoryError happens in the call to to_csv. I'm guessing this is because to_csv calls compute, which tries to compute the full result of the join, then tries to store that result. The full file contents certainly cannot fit in memory, but I thought (hoped) that Dask would run the computations and store the resulting Dataframe in a memory-aware way
In general Dask does chunk things up and operate in the way that you expect. Doing distributed joins in a low-memory way is hard, but generally doable. I don't know how to help more here without more information, which I appreciate is hard to deliver concisely on Stack Overflow. My usual recommendation is to watch the dashboard closely.
Note: The goal here really is to merge within a particular day (that is, merge filetype1 and filetype2 for a given day), repeat for every day, and store the union of all those joins, but it seemed like doing the join one day at a time would not leverage parallelism, and that letting Dask manage a larger join would be more performant
In general your intuition is correct that giving more work to Dask at once is good. However in this case it looks like you know something about your data that Dask doesn't know. You know that each file only interacts with one other file. In general joins have to be done in a way where all rows of one dataset may interact with all rows of the other, and so Dask's algorithms have to be pretty general here, which can be expensive.
In your situation I would use Pandas along with Dask delayed to do all of your computation at once.
import dask
import pandas as pd

lazy_results = []
for fn in filenames:
    # Read each day's pair of files with plain Pandas, lazily
    left = dask.delayed(pd.read_csv)(fn + "type-1.csv.gz")
    right = dask.delayed(pd.read_csv)(fn + "type-2.csv.gz")
    # Merge only within that day, then write the result
    merged = left.merge(right)
    out = merged.to_csv(...)
    lazy_results.append(out)

dask.compute(*lazy_results)

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    # Build each lot's transition matrix lazily
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

# Merge all of the lazy dataframes pairwise
df = delayed(reduce)(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), d)
[Image: visualized graph of the operations]
General rule: if your data comfortably fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
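As a sketch of that last point, you could move the per-lot work into a function that loads its own data and run it under a distributed Client with processes (load_lot_data is a hypothetical loader; LOT, lots, and the merge come from the code above):

from functools import reduce
from dask import delayed
from dask.distributed import Client

client = Client(processes=True)  # process-based workers sidestep the GIL

def build_matrix(lot):
    lot_data = load_lot_data(lot)  # hypothetical: read/filter on the worker, not the client
    return LOT(lot, lot_data).transition_matrix(lot)

d = [delayed(build_matrix)(lot) for lot in lots]
df = delayed(reduce)(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), d)
result = df.compute()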
Short story: do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"
Your assessment is correct. As of 2018-06-16 there is not currently any way to add in a final tie breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.
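For what it's worth, a minimal sketch of that priority= keyword (load_chunk and file_offsets are hypothetical), keeping in mind that it overrides rather than merely tie-breaks the default ordering:

from dask.distributed import Client

client = Client()

# Earlier on-disk offsets get higher priority, so reads stay roughly sequential
futures = [
    client.submit(load_chunk, offset, priority=-i)
    for i, offset in enumerate(sorted(file_offsets))
]
results = client.gather(futures)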

large data shuffle causes timeouts

I'm trying to read 100,000 data records of about 100 kB each simultaneously from 50 disks, shuffle them, and write them to 50 output disks at disk speed. What's a good way of doing that with Dask?
I've tried creating 50 queues and submitting 50 reader/writer functions using 100 workers (all on different machines, this is using Kubernetes). I ramp up first the writers, then the readers gradually. The scheduler gets stuck at 100% CPU at around 10 readers, and then gets timeouts when any more readers are added. So this approach isn't working.
Most dask operations have something like 1ms of overhead. As a result Dask is not well suited to be placed within innermost loops. Typically it is used at a coarser level, parallelizing across many Python functions, each of which is expected to take 100ms.
In a situation like yours I would push data onto a shared message system like Kafka, and then use Dask to pull off chunks of data when appropriate.
Data transfer
If your problem is in the bandwidth limitation of moving data through dask queues then you might consider turning your data into dask-reference-able futures before placing things into queues. See this section of the Queue docstring: http://dask.pydata.org/en/latest/futures.html#distributed.Queue
Elements of the Queue must be either Futures or msgpack-encodable data (ints, strings, lists, dicts). All data is sent through the scheduler so it is wise not to send large objects. To share large objects scatter the data and share the future instead.
So you probably want something like the following:
from dask.distributed import get_client

def f(queue):
    client = get_client()
    for fn in local_filenames():
        data = read(fn)
        # Keep the bulk data on the worker; only the small future goes through the queue
        future = client.scatter(data)
        queue.put(future)
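On the client side you might then launch the fifty readers along these lines (a sketch; the scheduler address is a placeholder and f is the function above):

from dask.distributed import Client, Queue

client = Client("scheduler-address:8786")  # placeholder address
queue = Queue()

# pure=False so fifty identical calls become fifty distinct tasks
readers = [client.submit(f, queue, pure=False) for _ in range(50)]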
Shuffle
If you're just looking to shuffle data then you could read it with something like dask.bag or dask.dataframe
df = dd.read_parquet(...)
and then sort your data using the set_index method
df = df.set_index('my-column')
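Put together with a write step (paths are placeholders), that is roughly:

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/input/")
df = df.set_index("my-column")        # shuffles the data by that column
df.to_parquet("s3://bucket/output/")  # writes back out in the new order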

Processing distributed dask collections with external code

I have input data stored as a single large file on S3.
I want Dask to chop the file automatically, distribute to workers and manage the data flow. Hence the idea of using distributed collection, e.g. bag.
On each worker I have a command-line tool (Java) that reads the data from a file (or files). Therefore I'd like to write a whole chunk of data to a file, call the external CLI/code to process it, and then read the results from an output file. This looks like processing batches of data instead of record-at-a-time.
What would be the best approach to this problem? Is it possible to write partition to disk on a worker and process it as a whole?
PS. It is not necessary, but desirable, to stay in a distributed-collection model, because other operations on the data might be simpler Python functions that process data record by record.
You probably want the read_bytes function. This breaks the file into many chunks, cleanly split by a delimiter (such as a newline). It gives you back a list of dask.delayed objects that point to those blocks of bytes.
There is more information on this documentation page: http://dask.pydata.org/en/latest/bytes.html
Here is an example from the docstring:
>>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\n')
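One way to combine that with your external tool is to have each task write its block to a local temporary file on the worker, shell out, and return the tool's output. This is only a sketch; the Java command line and the output handling are assumptions:

import os
import subprocess
import tempfile

import dask
from dask.bytes import read_bytes

def process_block(block):
    # Write this partition to a local file on the worker
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        f.write(block)
        path = f.name
    try:
        # Call the external CLI on the whole chunk (command is a placeholder)
        proc = subprocess.run(
            ["java", "-jar", "tool.jar", path],
            check=True, capture_output=True,
        )
        return proc.stdout
    finally:
        os.remove(path)

sample, blocks = read_bytes("s3://bucket/large-input-file", delimiter=b"\n")
# read_bytes groups the delayed blocks by input file; flatten them into one task list
tasks = [dask.delayed(process_block)(blk) for file_blocks in blocks for blk in file_blocks]
results = dask.compute(*tasks)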
