What's the pattern for processing a large number of small files in dask distributed - dask

I'm trying to find a good pattern for processing a large number of files using Dask. Each file is a distinct piece of the dataset and can be processed in an embarrassingly parallel manner. One idea is to call client.upload_file on every file and pass the filename to each worker, but that would distribute all files to all workers, which isn't great. Another idea is to read each file sequentially on the client node and send its contents to a dask worker, but since the entire dataset doesn't fit into memory, I'd have to handle partitioned reads myself, and it's, well, sequential. Is there a good way to distribute files across a dask cluster and then process those files on the workers in parallel?
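To make the kind of pattern I'm after concrete: something like the sketch below, which assumes the files live on storage every worker can reach (shared filesystem, S3, GCS), so only paths travel over the wire. process_file, the scheduler address and the paths are placeholders.
from dask.distributed import Client

def process_file(path):
    # placeholder per-file logic: each worker opens and parses its own file
    with open(path, 'rb') as f:
        data = f.read()
    return len(data)  # placeholder result

client = Client('scheduler-address:8786')  # placeholder scheduler address
paths = ['/shared/data/part-%04d.bin' % i for i in range(1000)]  # placeholder paths
futures = client.map(process_file, paths)
results = client.gather(futures)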

Related

dask scheduler OOM opening a large number of avro files

I'm trying to run a pipeline via dask on a cluster on GCP. The pipeline loads a lot of avro files from cloud storage (~5300 files of around 300MB each) like this:
import dask.bag as db
bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point it runs out of memory (I have tried scaling my master node running the scheduler up to 64GB of RAM but that still does not suffice), while the workers are idling. I assume that the problem is that it has to create an excessive number of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern that I'm using when trying to open a very large number of files? If so, is there perhaps a built-in way to better cope with this or would I have to split the avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see whether large objects are being serialised into it, or whether it is simply generating a massive number of tasks. If the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.
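A rough sketch of that sub-batching idea, assuming all_files holds the full list of avro paths and transform is your per-record function (both names are placeholders), so only one modest-sized graph exists at any time:
import dask.bag as db

batch_size = 200  # assumption: tune to what your scheduler copes with
for i in range(0, len(all_files), batch_size):
    batch = all_files[i:i + batch_size]
    bag = db.read_avro(batch, blocksize=5000000)
    df = bag.map(transform).to_dataframe()
    # to_parquet blocks until this batch is written, so each graph is
    # built, executed and released before the next batch is submitted
    df.to_parquet('gcs://mybucket/output/batch-%05d' % i)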

Orchestration/notification of processing events

I have the following SCDF use case.
I have a couple hundred files to process and put in the db
A producer will get a single file, read the first N rows and send them to the source (RabbitMQ), then read the next N rows and send them again, and so on until done.
A consumer will receive these file chunks (from RabbitMQ), do some minor enriching, and write them to the DB (sink).
I will have some number of streams > 1 running (say 4, for example) for some parallel processing of these files.
My question is: does SCDF have a mechanism to know when all consumers have completed (and hence the queue(s) are exhausted), so I know when to start some other process (could be another stream/task/anything) that needs the DB fully populated before it can begin?
Yes, sink1 is the only consumer of source1. In a streaming application there is no concept of "COMPLETED". By definition, stream processing is logically unbounded, and stream apps (sources and sinks) are designed to run forever. Tasks, on the other hand, are short-lived, finite processes that exit when they are complete. The application logic defines when the task is complete. Processing a file, or a chunk of a file, is the most common use case. A stream can monitor a file system, or a remote file source such as SFTP or S3, and launch a task whenever a new file appears. The task processes the file and marks the execution as COMPLETE.
This type of use case is better suited for task/batch. See https://dataflow.spring.io/docs/recipes/batch/sftp-to-jdbc/ which details the recommended architecture. You can define a composed task to run the ingest and then the next task.

Dask client runs out of memory loading from S3

I have an S3 bucket with a lot of small files, over 100K, that add up to about 700GB. When I load the objects into a bag and then persist it, the client always runs out of memory, consuming gigabytes very quickly.
Limiting the scope to a few hundred objects will allow the job to run, but a lot of memory is being used by the client.
Shouldn't only futures be tracked by the client? How much memory do they take?
Martin Durant's answer on Gitter:
The client needs to do a glob on the remote file-system, i.e., download the full definition of all the files, in order to be able to make each of the bag partitions. You may want to structure the files into sub-directories, and make separate bags out of each of those.
The original client code was using a glob (*, **) to load objects from S3. With this knowledge, fetching all of the objects first with boto and then passing the explicit list of objects, with no globs, resulted in very minimal memory use by the client and a significant speed improvement.
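For illustration, the boto-based listing looked roughly like this (bucket name and prefix are made up, and db.read_text stands in for however the bag is actually built); the point is that dask receives an explicit list of URLs and never has to glob the bucket itself:
import boto3
import dask.bag as db

s3 = boto3.client('s3')
keys = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket', Prefix='data/'):
    keys.extend(obj['Key'] for obj in page.get('Contents', []))

urls = ['s3://mybucket/' + key for key in keys]
bag = db.read_text(urls)  # explicit paths, no glob for the client to expand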

Does Google Cloud Dataflow optimize around IO bound processes?

I have a Python beam.DoFn which is uploading a file to the internet. This process uses 100% of one core for ~5 seconds and then uploads the file for 2-3 minutes (using a very small fraction of the CPU during the upload).
Is DataFlow smart enough to optimize around this by spinning up multiple DoFns in separate threads/processes?
Yes, Dataflow will spin up multiple instances of a DoFn using Python multiprocessing.
However, keep in mind that if you use a GroupByKey, the ParDo will process elements for a particular key serially. You still achieve parallelism on the worker, since you are processing multiple keys at once; however, if all of your data lands on a single "hot key" you may not achieve good parallelism.
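If you do end up with a hot key, one common workaround is to salt the key so its values are spread over several groups; a minimal sketch, where the pipeline contents are made up and only the salting step is the point:
import random
import apache_beam as beam

class UploadFiles(beam.DoFn):
    """Placeholder DoFn: a short CPU burst, then a slow upload per file."""
    def process(self, element):
        key, paths = element  # after GroupByKey: (key, iterable of paths)
        for path in paths:
            # ... prepare locally (CPU-bound), then upload (IO-bound) ...
            yield (key, path)

with beam.Pipeline() as p:
    files = p | beam.Create([('tenant-a', 'f1'), ('tenant-a', 'f2'),
                             ('tenant-b', 'f3')])
    # salting splits a hot key into several smaller groups, so more
    # workers/threads can work on it in parallel after the GroupByKey
    salted = files | beam.Map(lambda kv: ((kv[0], random.randint(0, 9)), kv[1]))
    grouped = salted | beam.GroupByKey()
    uploaded = grouped | beam.ParDo(UploadFiles())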
Are you using TextIO.Write in a batch pipeline? I believe the files are prepared locally and then uploaded after your main DoFn is processed. That is, a file is not uploaded until the PCollection is complete and will not receive more elements.
I don't think it streams out the files as you produce elements.

Dask array to HDF5 parallel write fails with multiprocessing scheduler

Dask is a well-documented, scalable library for parallel processing, and its graph-based workflows are extremely useful for writing applications with inherent parallelism. However, writing to HDF5 files in parallel is rather difficult, especially with the multiprocessing scheduler. The following code works fine if the default multi-threaded scheduler is used:
import dask.array as da
x = da.arange(25000, chunks=(1000,))
da.to_hdf5('hdfstore.h5', '/store', x)
But if you set the multiprocessing scheduler globally:
import dask.multiprocessing  # also binds the top-level dask namespace
dask.set_options(get=dask.multiprocessing.get)
and run the code again, it fails with:
TypeError: can't pickle _thread.lock objects
The multithreaded scheduler is OK, but it is too slow when reading from a single large CSV file and converting it to an HDF5 file. With the multiprocessing scheduler it is fast and able to use all CPUs at maximum load, but the HDF5 write fails with the error above (HDF5 files do support simultaneous write access with the h5py MPI driver, I think). If you directly do
x.compute()
everything is fine, but it loads the entire dataset into memory, which does not work well with large arrays and files. Has anybody come across such a scenario? Please do share your suggestions.
Dask version 0.13.0 in a conda virtual env.
I think the problem with this code is that it writes different chunks of data to the same HDF5 file simultaneously when you use the multiprocessing scheduler.
As far as I know, the HDF5 format supports SWMR (single writer, multiple readers). So when one Python process has the right to write a chunk, a locking mechanism prevents other processes from writing at the same time.
If you want simultaneous writes, maybe this could help.
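One sketch of a workaround along those lines: write each chunk to its own small HDF5 file, so no two processes ever share a file handle. The names and layout below are made up, and scheduler='processes' is the newer spelling of what get=dask.multiprocessing.get did on 0.13:
import os
import h5py
import dask
import dask.array as da

@dask.delayed
def write_block(block, path):
    # each process writes its own file, so there is no shared handle to lock
    with h5py.File(path, 'w') as f:
        f.create_dataset('data', data=block)

x = da.arange(25000, chunks=(1000,))
os.makedirs('blocks', exist_ok=True)
tasks = [write_block(blk, 'blocks/block-%05d.h5' % i)
         for i, blk in enumerate(x.to_delayed().ravel())]
dask.compute(*tasks, scheduler='processes')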