Dask is a well documented, scalable library for parallel processing with graph-based workflows, and it is extremely useful for writing applications that have inherent parallelism. However, writing to HDF5 files in parallel is difficult, especially with the multiprocessing scheduler. The following code works fine if the default multi-threaded scheduler is used:
import dask.array as da

x = da.arange(25000, chunks=(1000,))
da.to_hdf5('hdfstore.h5', '/store', x)
But if you set the multiprocessing scheduler globally:
import dask.multiprocessing

dask.set_options(get=dask.multiprocessing.get)
and run the code again, it fails with:
TypeError: can't pickle _thread.lock objects
The multithreaded scheduler is OK, but it is too slow when reading from a single large CSV file and converting it to an HDF5 file. With the multiprocessing scheduler it is fast and keeps all CPUs at full load, but the HDF5 write fails with the error above (HDF5 files support simultaneous write access with the h5py MPI driver, I think). If you directly do
x.compute()
everything is fine, but it loads the entire result into memory, which does not work well with large arrays and files. Has anybody come across this scenario? Please share any suggestions.
Dask version '0.13.0' on a conda virtual env
I think the problem with this code is that it writes different chunks of data to the same HDF5 file simultaneously when you use the multiprocessing scheduler.
As far as I know, the HDF5 format supports SWMR (single-writer, multiple-reader). So when one Python process has the right to write a chunk, a locking mechanism prevents other processes from writing at the same time.
If you want simultaneous writes, maybe this could help.
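Building on that explanation, here is one possible workaround, as a rough sketch rather than a definitive recipe: compute batches of chunks with the multiprocessing scheduler, but perform every HDF5 write from the client process, so the file only ever has a single writer. It assumes a dask version that provides Array.to_delayed; the batch size is arbitrary.

import h5py
import dask
import dask.array as da
import dask.multiprocessing

dask.set_options(get=dask.multiprocessing.get)

x = da.arange(25000, chunks=(1000,))
blocks = x.to_delayed().ravel()              # one delayed object per chunk

with h5py.File('hdfstore.h5', 'w') as f:
    dset = f.create_dataset('/store', shape=x.shape, dtype=x.dtype)
    offset = 0
    batch = 8                                # arbitrary: chunks computed per round
    for i in range(0, len(blocks), batch):
        # compute a batch of chunks in parallel on the process pool ...
        results = dask.compute(*blocks[i:i + batch])
        # ... then write them from this single process, so the file only
        # ever has one writer (in line with HDF5's locking/SWMR behaviour).
        for data in results:
            dset[offset:offset + data.shape[0]] = data
            offset += data.shape[0]

The write loop itself is serial, but the per-chunk computation still runs on the process pool, which is usually where the CSV parsing time goes.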
Related
I'm trying to find a good pattern for processing a large number of files with Dask. Each file is a distinct piece of the dataset and can be processed in an embarrassingly parallel manner. One idea is to call client.upload_file on every file and pass the filename to each worker, but that would distribute all files to all workers, which isn't great. Another idea is to read each file sequentially on the client node and send its contents to a dask worker, but since the entire dataset doesn't fit into memory, I'd have to handle partitioned reads myself, and it's, well, sequential. Is there a good way to distribute files across a dask cluster and then process them in the workers in parallel?
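One pattern that may fit here, sketched under assumptions (the scheduler address and file paths below are invented, and the files are assumed to sit on storage every worker can reach, such as a shared filesystem or object store): ship only the paths and let each worker open its own files, for example with client.map:

from dask.distributed import Client, as_completed

client = Client('scheduler-address:8786')        # hypothetical scheduler address

def process_file(path):
    # Runs on a worker: open and process one file, return a small result.
    with open(path, 'rb') as f:
        data = f.read()
    return len(data)                             # stand-in for real processing

paths = ['/shared/data/part-000.bin',            # hypothetical paths visible
         '/shared/data/part-001.bin']            # to every worker

futures = client.map(process_file, paths)        # one task per file
for future in as_completed(futures):
    print(future.result())

Because only the path strings travel through the scheduler, the dataset itself never has to pass through the client node.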
I'm trying to run a pipeline via dask on a cluster on GCP. The pipeline loads a lot of Avro files from cloud storage (~5300 files of around 300MB each) like this:
import dask.bag as db

bag = db.read_avro(
    'gcs://mybucket/myfiles-*.avro',
    blocksize=5000000
)
It then applies some transformations and saves the data back to cloud storage (as parquet files).
I've tested this pipeline with a fraction of the Avro files and it works perfectly, but when I tell it to ingest all the files, the scheduler process sits at 100% CPU for a long time and at some point runs out of memory (I have tried scaling the master node running the scheduler up to 64GB of RAM, but that still does not suffice), while the workers sit idle. I assume the problem is that it has to create an excessive number of tasks that are all held in RAM before being distributed to the workers.
Is this some sort of antipattern I'm running into by trying to open a very large number of files? If so, is there perhaps a built-in way to cope with this better, or would I have to split the Avro files manually?
Avro with Dask at scale is not particularly well-trodden territory. There is no theoretical reason it should not work. You could inspect the contents of the graph to see whether anything large is being serialised into it, or whether a massive number of tasks is simply being generated. If it is the former, it may be solvable, and you could raise an issue.
As you say, you may be able to keep the load on the scheduler down by processing sub-batches out of the total set of files at a time and waiting for completion.
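A minimal sketch of that batching idea, with the file listing, naming scheme, output layout and batch size all invented for illustration (it assumes a client/cluster is already set up as in the original pipeline):

import dask.bag as db

# Hypothetical explicit listing; gcsfs/fsspec could be used to glob the bucket instead.
files = [f'gcs://mybucket/myfiles-{i:05d}.avro' for i in range(5300)]

batch_size = 200                                     # arbitrary illustration
for start in range(0, len(files), batch_size):
    bag = db.read_avro(files[start:start + batch_size], blocksize=5000000)
    # ... apply the same transformations as in the full pipeline ...
    bag.to_dataframe().to_parquet(
        f'gcs://mybucket/output/batch-{start:05d}')  # blocks until this batch is written

Only one batch's task graph lives on the scheduler at a time, which keeps its memory use bounded.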
I ported my code that uses pandas to dask (basically replaced pd.read_csv with dd.read_csv). The code essentially applies a series of filters/transformations on the (dask) dataframe. Similarly, I use dask bag/array instead of numpy array-like data. The code works great locally (i.e., on my workstation or on a virtual machine).
However, when I run the code on a cluster (SGE), my jobs get killed by the scheduler saying that memory/CPU usage has exceeded the allocated limit. It looks like dask is trying to consume all memory/threads available on the node rather than what has been allocated by the scheduler. There seem to be two approaches to fix this issue: (a) set memory/CPU limits for dask from within the code as soon as we load the library (just like we set matplotlib.use("Agg") as soon as we load matplotlib when we need to set the backend), and/or (b) have dask understand the memory limits set by the scheduler.
Could you please advise on how to go about this issue? My preference would be to set mem/cpu limits for dask from within the code. PS: I understand there is a way to spin up dask workers in a cluster environment and specify these limits, but I am interested in running my code, which works great locally, on the cluster with minimal additional changes. Thanks!
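One way to attempt (a), as a rough sketch rather than a definitive fix (the worker count, thread count, and memory figure are placeholders to be matched to whatever the SGE job actually requested): start a local distributed cluster from inside the script with explicit limits.

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4,           # e.g. one worker per allocated core
                       threads_per_worker=1,
                       memory_limit='4GB')    # cap per worker, placeholder value
client = Client(cluster)

# dd.read_csv(...), dask bag/array work, .compute(), etc. now run inside these limits.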
I have a Python beam.DoFn that uploads a file to the internet. This process uses 100% of one core for ~5 seconds and then uploads the file for 2-3 minutes (using a very small fraction of the CPU during the upload).
Is DataFlow smart enough to optimize around this by spinning up multiple DoFns in separate threads/processes?
Yes, Dataflow will spin up multiple instances of a DoFn using Python multiprocessing.
However, keep in mind that if you use a GroupByKey, the ParDo will process elements for a particular key serially. You still achieve parallelism on the worker, since you are processing multiple keys at once, but if all of your data lands on a single "hot key" you may not achieve good parallelism.
Are you using TextIO.Write in a batch pipeline? I believe the files are prepared locally and then uploaded after your main DoFn is processed. That is, a file is not uploaded until the PCollection is complete and will not receive more elements.
I don't think it streams out the files as you are producing elements.
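To make that concrete, here is a small sketch of the shape being discussed (the DoFn body is a stand-in, not the real upload code): Dataflow is free to run many instances of the ParDo in parallel, while elements behind a single key of a GroupByKey would be handled serially.

import time
import apache_beam as beam

class UploadFile(beam.DoFn):
    # Hypothetical stand-in: a short CPU burst, then a long mostly-idle wait.
    def process(self, element):
        _ = sum(i * i for i in range(10 ** 6))   # pretend CPU-heavy preparation
        time.sleep(1)                            # pretend network upload
        yield element

with beam.Pipeline() as p:
    # On Dataflow, many UploadFile instances can run at once; if a GroupByKey
    # preceded this step, elements sharing a key would be processed serially.
    (p
     | 'files' >> beam.Create(['file-a', 'file-b', 'file-c'])
     | 'upload' >> beam.ParDo(UploadFile()))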
I have a 200GB RDF file in .nt format that I want to load into Virtuoso (Virtuoso Open-Source Edition 6.1.6). I used the Virtuoso bulk loader from the command line, but the load hangs after a couple of hours of running. Do you have any idea how I can load this large file into Virtuoso efficiently? I want to load it fast.
I also tried to query my 200GB RDF graph from Apache Jena, but after running for 30 minutes it gives me a heap-space error. If you have a solution for either problem, kindly let me know.
Jena TDB has a bulk loader which has been used on large data inputs (hundreds of millions of triples).
What is the actual dataset you are loading? Is it actually just one file? We would recommend splitting into files of about 1GB max, and loading multiple files at a time with the bulk loader.
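For the splitting step, here is a rough sketch of one way to do it in Python (file names and paths are placeholders). N-Triples is line-oriented, one triple per line, so splitting on line boundaries keeps every piece valid:

CHUNK_BYTES = 1_000_000_000   # ~1GB per output file

def split_nt(src_path='dataset.nt', prefix='dataset-part'):
    # Stream the source file and start a new output file roughly every 1GB.
    part, written, out = 0, 0, None
    with open(src_path, 'rb') as src:
        for line in src:
            if out is None or written >= CHUNK_BYTES:
                if out is not None:
                    out.close()
                part += 1
                written = 0
                out = open(f'{prefix}-{part:04d}.nt', 'wb')
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()

split_nt()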
Have you done any performance tuning of the Virtuoso Server for the resources available on the machine in use, as detailed in the RDF Performance Tuning guide?
Please check, with the status(''); command, how many buffers are in use; if you run out during a load you will be swapping to disk continuously, which will lead to the sort of apparent hangs you report.
Note you can also load the Virtuoso LD Meter functions to monitor the progress of the dataset loads.