How to improve Dask read_parquet performance while reading 20000 parquet files (very few are corrupted)?

I have 20000 parquet files stored in Azure Blob and partitioned by name. I tested Spark on this dataset and it was able to do spark.read.parquet(folder_path).count(). However, when I call read_parquet in Dask it takes forever, and neither CPU nor memory usage ever peaks.
This is the command I tried:
df = dd.read_parquet(folder_path, columns=["Name", "PhoneNumber", "CallRecords"], engine="pyarrow", ignore_metadata_file=True)
I thought read_parquet was lazily evaluated in Dask, and reading a single parquet file (including count) takes less than a second. I also tried passing in an explicit list of parquet files and saw performance deteriorate as the number of files grew from 0 to 20000.
My guess is that Dask is trying to find column information by reading each parquet file. How can I improve this?
Note: I do have some corrupted files and in Spark I ignore them using Spark configuration. Is it possible to ignore them in Dask?
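One possible workaround for the corrupted files (a sketch, not a built-in Dask feature): as far as I know Dask has no direct equivalent of Spark's ignoreCorruptFiles option, but you can validate each file's footer with pyarrow up front and pass only the readable paths to read_parquet. The credentials, container path, and adlfs-backed "abfs" filesystem below are assumptions to adapt to your setup:

import dask.dataframe as dd
import fsspec
import pyarrow.parquet as pq

# Hypothetical Azure credentials; "abfs://" requires the adlfs package.
storage_options = {"account_name": "<account>", "account_key": "<key>"}
fs = fsspec.filesystem("abfs", **storage_options)

# Placeholder glob pattern for the partitioned layout.
paths = fs.glob("<container>/<folder_path>/*.parquet")

def is_readable(path):
    # Reading only the footer is cheap; a corrupted file raises here.
    try:
        with fs.open(path, "rb") as f:
            pq.read_metadata(f)
        return True
    except Exception:
        return False

good_paths = ["abfs://" + p for p in paths if is_readable(p)]

df = dd.read_parquet(
    good_paths,
    columns=["Name", "PhoneNumber", "CallRecords"],
    engine="pyarrow",
    ignore_metadata_file=True,
    storage_options=storage_options,
)

Checking 20000 footers serially is itself slow, so in practice you would parallelize is_readable (for example with a local thread pool or dask.bag) before building the list.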

Related

How do Dask bag partitions and workers correlate?

I'm using a vanilla Dask-Kubernetes setup with two workers and one scheduler to iterate over the lines of some JSON file (and apply some functions which don't appear here for simplicity). I see only one worker ever working, where I'd expect to see two of them, instead.
Hoping that repartitioning would help, I've experimented with different values for bag.repartition(num), which return different numbers of lines, but they don't change anything about the worker imbalance, and memory consumption still concentrates on only one worker.
I think I don't understand the correlation between partitions and workers, and I could not find anything in the Dask documentation about it. Any help or pointers are highly welcome!
import dask.bag as db

def grep_buildings():
    base = "https://usbuildingdata.blob.core.windows.net/usbuildings-v1-1/"
    b = db.text.read_text(f"{base}/Alabama.zip")
    # b = b.repartition(2)
    lines = b.take(3_000_000)
    return lines

len(grep_buildings())
In your example, you are opening one file, and it is compressed:
db.text.read_text(f"{base}/Alabama.zip")
Dask is able to open and process multiple files in parallel, with at least one partition per file. Dask is also able to split a single file into chunks (the blocksize parameter), but this only works for uncompressed data. The reason is that, for whole-file compression methods, the only way to get to some point in the uncompressed stream is to read from the start, so every partition would end up reading most of the data.
Finally, repartition doesn't help you when you start with a single partition: you need to read that whole file before splitting the data into pieces for downstream tasks.
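A minimal sketch of the difference, using hypothetical local copies of the data: a whole-file-compressed input stays as a single partition, while an uncompressed one can be split on newlines with blocksize.

import dask.bag as db

# Whole-file compression (e.g. gzip): cannot be split, so blocksize must
# stay None and the whole file becomes a single partition.
compressed = db.read_text("Alabama.geojson.gz", compression="gzip")

# Uncompressed text: Dask chops it on newlines into many partitions,
# which is what lets several workers run at once.
uncompressed = db.read_text("Alabama.geojson", blocksize="64MiB")

print(compressed.npartitions, uncompressed.npartitions)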

Why does categorizing a Dask DataFrame constructed from a Parquet file drastically increase its size?

Here is the archetypal scenario:
I construct a Dask DataFrame from a set of Parquet files written by FastParquet
I run categorize() on the DataFrame. Quite a few categories become newly "known."
I save the DataFrame to a new Parquet file-set via FastParquet
The new Parquet files now take up several times more disk space than the original set! Now, it's not that I care about disk space—I have enough—but rather I seek understanding:
Even if the original file-set's categories were not "known," they still had to have been in the file-set's disk space somewhere. If anything, I might expect a decrease in disk usage, if the original file-set's categorical columns were not using a dictionary to begin with.
So, yeah, just trying to understand. What gives?
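For reference, a minimal sketch of the workflow being described (paths are placeholders):

import dask.dataframe as dd

# 1. Read the original fastparquet-written file-set.
df = dd.read_parquet("original_data/", engine="fastparquet")

# 2. Make the categories "known" across all partitions.
df = df.categorize()

# 3. Write the result to a new file-set, again with fastparquet.
df.to_parquet("categorized_data/", engine="fastparquet")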

Dask distributed: perform computations without returning data

I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2 GB) from Gcloud storage into a Dask DataFrame, process them with apply(), and then save the result back to Gcloud storage as parquet.
While loading the files from Gcloud storage, cluster memory usage increases to about 3-4 GB. Then the workers (each with 2 GB of RAM) are terminated/restarted and some tasks get lost, so the cluster starts computing the same things in a circle.
I removed the apply() operation and left only read_parquet() to test whether my custom code was causing the trouble, but the problem was the same, even with just a single read_parquet() operation. This is the code:
from dask.distributed import Client, get_client
import dask.dataframe as dd

client = Client('<ip>:8786')
client.restart()

def command():
    client = get_client()
    df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token': 'cloud'}, engine='fastparquet')
    df = df.compute()

x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all the necessary commands, to avoid problems with gcsfs authentication inside the cluster.
After some investigation, I understood that the problem could be in .compute(), which returns all the data to a single process, and that process (my command function) is running on a worker. Because of that, the worker doesn't have enough RAM, crashes, and loses all the computed tasks, which triggers the tasks to re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and write it back to Gcloud storage in parquet format, without ever returning the data from the cluster.
So, simply put, I want to keep the data on the cluster and not bring it back: just compute and save the data somewhere else.
After reading the Dask distributed docs, I found the client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you please help me with the client.persist() and client.compute() methods for my example,
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines
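For the stated goal (read, apply, write back, never pulling the data to one machine), a rough sketch is below: to_parquet writes partition by partition from the workers, so nothing needs to be returned to the client. The apply step, the column name, and the output path are placeholders; if gcsfs authentication only works from inside the cluster, the same lines can also be run from the submitted command function, with .compute() replaced by .to_parquet().

import dask.dataframe as dd
from dask.distributed import Client

client = Client('<ip>:8786')

df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                     storage_options={'token': 'cloud'},
                     engine='fastparquet')

# Placeholder for the real per-row computation.
df['result'] = df['some_column'].apply(lambda x: x, meta=('result', 'object'))

# to_parquet runs the whole graph on the workers and streams each partition
# straight to Gcloud storage; the full dataset never sits on one machine.
df.to_parquet('gcs://<bucket>/output/',
              storage_options={'token': 'cloud'},
              engine='fastparquet')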

How to store large compressed CSV on S3 for use with Dask

I have a large dataset (~1 terabyte of data) spread across several csv files that I want to store (compressed) on S3. I have had issues reading compressed files into dask because they are too large, and so my initial solution was to split each csv into manageable sizes. These files are then read in the following way:
import dask.dataframe as dd

ddf = dd.read_csv('s3://bucket-name/*.xz', encoding="ISO-8859-1",
                  compression='xz', blocksize=None, parse_dates=[6])
Before I ingest the full dataset - is this the correct approach, or is there a better way to accomplish what I need?
This seems sensible to me.
The only challenge that arises here is due to compression. If a compression format doesn't support random access, then Dask can't break up large files into multiple smaller pieces. This can also be true for formats that do support random access, like xz, but are not configured for it in that particular file.
Breaking up the file manually into many small files and using blocksize=None as you've done above is a good solution in this case.
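To make the trade-off concrete, here is a small sketch (bucket names and the date column index are placeholders): uncompressed CSV lets Dask split big files itself via blocksize, while whole-file xz compression forces blocksize=None and one partition per file, which is why the manual pre-splitting matters.

import dask.dataframe as dd

# Uncompressed: Dask can split each large file into ~256 MB partitions.
ddf_plain = dd.read_csv('s3://bucket-name/*.csv',
                        encoding='ISO-8859-1',
                        blocksize='256MB',
                        parse_dates=[6])

# Whole-file xz compression: one partition per file, so each file must be
# small enough to be decompressed comfortably in a worker's memory.
ddf_xz = dd.read_csv('s3://bucket-name/*.xz',
                     encoding='ISO-8859-1',
                     compression='xz',
                     blocksize=None,
                     parse_dates=[6])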

Processing distributed dask collections with external code

I have input data stored as a single large file on S3.
I want Dask to chop the file up automatically, distribute the pieces to workers, and manage the data flow. Hence the idea of using a distributed collection, e.g. a bag.
On each worker I have a command-line tool (Java) that reads the data from a file (or files). Therefore I'd like to write a whole chunk of data to a file, call the external CLI/code to process it, and then read the results from an output file. This looks like processing batches of data instead of record-at-a-time.
What would be the best approach to this problem? Is it possible to write partition to disk on a worker and process it as a whole?
PS. It is not necessary, but desirable, to stay in a distributed collection model, because other operations on the data might be simpler Python functions that process data record by record.
You probably want the read_bytes function. This breaks the file into many chunks cleanly split by a delimiter (like a newline). It gives you back a list of dask.delayed objects that point to those blocks of bytes.
There is more information on this documentation page: http://dask.pydata.org/en/latest/bytes.html
Here is an example from the docstring:
>>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\n')
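A rough sketch of the "write a chunk to local disk, run the external tool, read the result back" idea, with the Java CLI name and its arguments as placeholders:

import os
import subprocess
import tempfile

import dask
from dask.bytes import read_bytes

def process_block(block):
    # Write the chunk of bytes to a temporary file on the worker's local disk.
    with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as f:
        f.write(block)
        in_path = f.name
    out_path = in_path + ".out"
    try:
        # Hypothetical external command; replace with the real tool and flags.
        subprocess.run(["java", "-jar", "mytool.jar", in_path, out_path],
                       check=True)
        with open(out_path, "rb") as f:
            return f.read()
    finally:
        for p in (in_path, out_path):
            if os.path.exists(p):
                os.remove(p)

sample, blocks = read_bytes("s3://bucket/2015-*-*.csv", delimiter=b"\n")
delayed_results = [dask.delayed(process_block)(b)
                   for per_file in blocks
                   for b in per_file]
results = dask.compute(*delayed_results)

Each delayed call runs on whichever worker receives the task, so the temporary files only ever live on that worker's local disk.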
