Dask utilising all RAM when saving to parquet

I am having issues using dask. It is very slow compared to pandas, especially when reading large datasets of up to 40 GB. After some additional processing the dataset grows to about 100+ columns, mainly float64. This processing is quite slow, especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading').
I think I could live with the delay, even if it is frustrating. However, when I try to save the data to parquet with df.to_parquet('my data frame', engine="fastparquet"), it runs out of memory on a server with about 110 GB of RAM. I notice that the buff/cache memory reported by free -h goes up from about 40 MB to 40+ GB.
I am confused about how this is possible given that dask does not load everything into memory. I use 100 partitions for the dataset in dask.

Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
you read a 40GB dataset
you run grouping / sorting
you join with other datasets
you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition; it's hard to say given the information provided.

Related

How to load a huge model on Dask with limited RAM?

I want to load a model (ANNOY model) on Dask. The size of the model is 60GB and Dask RAM is 2GB only. Is there a way to load the model in distributed manner as well?
If by "load" you mean: "store in memory", then obviously there is no way to do this. If you need access to the whole dataset in memory at once, you'll need a machine that can handle this.
However, you very probably meant that you want to do some processing to the data and get a result (prediction, statistical score...) which does fit in memory.
Since I don't know what ANNOY is (array? dataframe? something else?), I can only give you general rules. For dask to work, it needs to be able to split a job into tasks. For data IO, this commonly means that the input is in multiple files, or that the files have some natural internal structure such that they can be loaded chunk-wise. For example, zarr (for arrays) stores each chunk of a logical dataset as a separate file, parquet (for dataframes) chunks up data into pages within columns within groups within files, and even CSV can be loaded chunkwise by looking for newline characters.
I suspect annoy ( https://github.com/spotify/annoy ?) has a complex internal storage structure, and you may need to raise an issue on their repo asking about dask support.
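As a rough illustration of the "CSV can be loaded chunkwise by looking for newline characters" idea, here is a standard-library sketch. It is not dask's actual implementation; the function name and block size are made up, and it assumes no quoted field contains a newline.

```python
import io

def iter_line_aligned_blocks(stream, block_size):
    """Yield byte blocks of roughly block_size that always end on a newline."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        # Extend to the next newline so that no record is split across blocks.
        if not block.endswith(b"\n"):
            block += stream.readline()
        yield block

csv_bytes = b"a,b\n1,2\n3,4\n5,6\n"
blocks = list(iter_line_aligned_blocks(io.BytesIO(csv_bytes), block_size=6))
```

Each block can then be parsed independently, which is the property a task-based system needs in order to split the load into separate tasks.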

Does xarray.Dataset.to_array() load the array into memory and how efficiently sample mini batches from an xarray?

I am currently trying to load a big multi-dimensional array (>5 GB) into a python script. Since I use the array as training data for a machine learning model, it is important to load the data efficiently in mini batches and avoid loading the whole data set into memory at once.
My idea was to use the xarray library.
I load the data set with X=xarray.open_dataset("Test_file.nc"). To the best of my knowledge, this command does not load the data set in memory - so far, so good. However, I want to convert X to an array with the command X=X.to_array().
My first question is: Does X=X.to_array() load it into memory or not?
If that is done, I wonder how to best load minibatches in memory. The shape of the array is (variable,datetime,x1_position,x2_position). I want to load minibatches per datetime, which would lead to:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[:,ind]
The other approach would be to transpose the array before with X.transpose("datetime","variable","x1_position","x2_position") and then sample via:
ind=np.random.randint(low=0,high=n_times,size=(BATCH_SIZE))
mini_batch=X[ind,:]
My second question is:
Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
My first question is: Does X=X.to_array() load it into memory or not?
xarray makes use of dask to chunk (load) parts of the data into memory. You can compare X through
X = xarray.open_dataset("Test_file.nc")
# or
X = xarray.open_dataset("Test_file.nc",
                        chunks={'datetime': 1, 'x1_position': x1_count, 'x2_position': x2_count})
and print both (print(X)) to see the differences between the loaded datasets, or specify the chunks accordingly.
The latter way loads only one datetime slice per chunk into memory. I don't think you need X=X.to_array(), but you can also compare the results after to_array(). My experience is that to_array() does not change the actual chunking (loading), only the view of the data.
My second question is: Does transposing an xarray affect the efficiency of indexing? More specifically, does X[ind,:] take as long as X[:,ind]?
I think one goal of xarray is to let users forget the details of the underlying implementation (based on numpy). Transposing may only modify the view rather than the underlying structure of the data. There certainly are some efficiency differences between the two indexing approaches, depending on which one accesses data along contiguous memory. But such a difference should not be a significant overhead. Feel free to use either.
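The "view" claim can be checked on a plain numpy array, used here as a stand-in for the data underlying the xarray object (the shapes and batch size are made up, and a lazily loaded or dask-backed xarray object may behave differently). A sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Axes: (variable, datetime, x1_position, x2_position); sizes are illustrative.
X = rng.random((3, 10, 4, 5))
Xt = X.transpose(1, 0, 2, 3)  # (datetime, variable, x1_position, x2_position)

ind = rng.integers(low=0, high=X.shape[1], size=4)

batch_a = X[:, ind]    # shape (3, 4, 4, 5)
batch_b = Xt[ind, :]   # shape (4, 3, 4, 5)

# The transpose shares memory with the original array: no data was copied.
same_memory = np.shares_memory(X, Xt)
# Both indexing styles select the same values, only with the axes swapped.
same_values = np.array_equal(batch_a.transpose(1, 0, 2, 3), batch_b)
```

For data read lazily from disk, the on-disk chunk layout is likely to matter more than the in-memory axis order, so timing both variants on the real file is the safer test.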

Merging a huge list of dataframes using dask delayed

I have a function which returns a dataframe to me. I am trying to use this function in parallel by using dask.
I append the delayed objects of the dataframes into a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
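from functools import reduce
from dask import delayed

d = []
for lot in lots:
    # rows belonging to this lot
    lot_data = data[data["LOTID"] == lot]
    # both the LOT(...) construction and the method call stay lazy
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)
# merge all per-lot results in one final task
df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', "to"]), d)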
[Image: visualized graph of the operations]
General rule: if your data comfortably fits into memory (including the base size times a small number for possible intermediates), then there is a good chance that Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can have any mix of multiple threads and processes; however, using processes comes at a cost for moving data between client and processes, and you may find that the extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story, you should do some experimentation, measure well, and read the data-frame and distributed scheduler documentation carefully.
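As a starting point for that experimentation, here is a small self-contained sketch of the same delayed-then-merge pattern. make_frame and merge_all are hypothetical stand-ins for the asker's LOT(...).transition_matrix(...) and reduce call, and the data is made up. The scheduler is passed explicitly so that it can be swapped when timing.

```python
from functools import reduce

import pandas as pd
from dask import compute, delayed

def make_frame(lot):
    # Hypothetical stand-in for the per-lot transition matrix.
    return pd.DataFrame({"from": ["a", "b"], "to": ["b", "a"], f"n_{lot}": [lot, lot * 2]})

def merge_all(frames):
    # Sequential outer merge on the shared key columns.
    return reduce(lambda x, y: x.merge(y, how="outer", on=["from", "to"]), frames)

parts = [delayed(make_frame)(lot) for lot in [1, 2, 3]]
merged = delayed(merge_all)(parts)

# "threads" is only useful if the tasks release the GIL.
(result,) = compute(merged, scheduler="threads")
```

Timing the same graph with a process-based or distributed scheduler instead could indicate whether the GIL is the limiting factor. Note that the final merge is a single task and stays sequential either way.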

large data shuffle causes timeouts

I'm trying to read 100,000 data records of about 100 kB each simultaneously from 50 disks, shuffle them, and write them to 50 output disks at disk speed. What's a good way of doing that with Dask?
I've tried creating 50 queues and submitting 50 reader/writer functions using 100 workers (all on different machines, this is using Kubernetes). I ramp up first the writers, then the readers gradually. The scheduler gets stuck at 100% CPU at around 10 readers, and then gets timeouts when any more readers are added. So this approach isn't working.
Most dask operations have something like 1ms of overhead. As a result Dask is not well suited to be placed within innermost loops. Typically it is used at a coarser level, parallelizing across many Python functions, each of which is expected to take 100ms.
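A back-of-the-envelope sketch of why granularity matters, using the rough 1 ms per task figure above and the 100,000 records from the question. The batch size of 1000 and the helper names are assumptions chosen for illustration.

```python
import math

N_RECORDS = 100_000        # from the question
OVERHEAD_MS_PER_TASK = 1   # rough per-task overhead quoted above

def scheduling_overhead_ms(n_records, records_per_task):
    """Total scheduler overhead if each task handles records_per_task records."""
    n_tasks = math.ceil(n_records / records_per_task)
    return n_tasks * OVERHEAD_MS_PER_TASK

def batched(items, batch_size):
    """Group items into lists of at most batch_size elements."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

one_per_task = scheduling_overhead_ms(N_RECORDS, 1)
thousand_per_task = scheduling_overhead_ms(N_RECORDS, 1000)
batches = batched(list(range(10)), 4)
```

With one record per task, the overhead alone adds up to about 100 seconds of scheduler time; with 1000 records per task it drops to about 0.1 seconds.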
In a situation like yours I would push data onto a shared message system like Kafka, and then use Dask to pull off chunks of data when appropriate.
Data transfer
If your problem is in the bandwidth limitation of moving data through dask queues then you might consider turning your data into dask-reference-able futures before placing things into queues. See this section of the Queue docstring: http://dask.pydata.org/en/latest/futures.html#distributed.Queue
Elements of the Queue must be either Futures or msgpack-encodable data (ints, strings, lists, dicts). All data is sent through the scheduler so it is wise not to send large objects. To share large objects scatter the data and share the future instead.
So you probably want something like the following:
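from distributed import get_client

def f(queue):
    client = get_client()
    for fn in local_filenames():
        data = read(fn)
        # scatter the data and send only the small future through the queue
        future = client.scatter(data)
        queue.put(future)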
Shuffle
If you're just looking to shuffle data then you could read it with something like dask.bag or dask.dataframe
df = dd.read_parquet(...)
and then sort your data using the set_index method
df.set_index('my-column')

Dask performances: workflow doubts

I'm confused about how to get the best from dask.
The problem
I have a dataframe which contains several timeseries (each one has its own key) and I need to run a function my_fun on each of them. One way to solve it with pandas involves
df = list(df.groupby("key")) and then applying my_fun
with multiprocessing. The performance, despite the huge usage of RAM, is pretty good on my machine and terrible on Google Cloud Compute.
On Dask my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
Read data from S3. 14 files -> 14 partitions
df.groupby("key").apply(my_fun).to_frame().compute(get=get)
As I didn't set the index, df.known_divisions is False.
The resulting graph is
and I don't understand whether what I see is a bottleneck or not.
Questions:
Is it better to have df.npartitions as a multiple of ncpu, or doesn't it matter?
From this it seems that it is better to set the index as key. My guess is that I can do something like
df["key2"] = df["key"]
df = df.set_index("key2")
but, again, I don't know if this is the best way to do it.
For questions like "what is taking time" in Dask, you are generally recommended to use the "distributed" scheduler rather than multiprocessing - you can run with any number of processes/threads you like, but you have much more information available via the diagnostics dashboard.
For your specific questions: if you are grouping over a column that is not nicely split between partitions and applying anything other than simple aggregations, you will inevitably need a shuffle. Setting the index does this shuffle for you as an explicit step; otherwise you get the implicit shuffle apparent in your task graph. This is a many-to-many operation: each aggregation task needs input from every original partition, hence the bottleneck. There is no getting around that.
As for number of partitions, yes you can have sub-optimal conditions like 9 partitions on 8 cores (you will calculate 8 tasks, and then perhaps block for the final task on one core while the others are idle); but in general you can depend on dask to make reasonable scheduling decisions so long as you are not using a very small number of partitions. In many cases, it will not matter much.
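The 9-partitions-on-8-cores case can be put into numbers with a simplified model. It assumes every task takes the same time and ignores scheduling overhead, which real workloads will not match exactly; the function name is made up.

```python
import math

def waves_and_utilisation(n_tasks, n_cores):
    """Return the number of task 'waves' and the fraction of core time used."""
    waves = math.ceil(n_tasks / n_cores)
    utilisation = n_tasks / (waves * n_cores)
    return waves, utilisation

nine_on_eight = waves_and_utilisation(9, 8)         # two waves, cores mostly idle in the second
sixty_five_on_eight = waves_and_utilisation(65, 8)  # nine waves, only the last one is sparse
```

Under this model 9 tasks on 8 cores keep the cores busy about 56% of the time, while 65 tasks on 8 cores reach roughly 90%, which is why the exact multiple matters less as the number of partitions grows.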
