Computation of two Dask arrays freezing at the end - dask

I have generated two random dask arrays of length 450,000,000 that I want to divide by each other. When I compute them, the calculation always freezes at the end.
I am running the code on an 8-core, 32 GB instance.
I have tried the code below; one modification I've also tried is not persisting the data in x or y.
import dask.array as da
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster on the 8-core machine

x = da.random.random(450000000, chunks=(10000,))
x = client.persist(x)
z1 = dd.from_array(x)
y = da.random.random(450000000, chunks=(10000,))
y = client.persist(y)
z2 = dd.from_array(y)
flux_ratio_sq = z1.div(z2)
flux_ratio_sq.compute()
The actual result is that persist holds x and y in memory (about 8 GB in total), which is expected, and then compute adds even more to memory. Some of the errors I'm getting are below.
I see a lot of these errors:
distributed.core - INFO - Event loop was unresponsive in Scheduler for
3.74s. This is often caused by long-running GIL-holding functions
or moving large chunks of data. This can cause timeouts and instability.
tornado.application - ERROR - Exception in callback <bound method
BokehTornado._keep_alive of <bokeh.server.tornado.BokehTornado
object at 0x7fb48562a4a8>>
raise StreamClosedError(real_error=self.error)
tornado.iostream.StreamClosedError: Stream is closed
I want the final result to be in a dask Series so I can merge it with my existing data.

I'll try to expand my comment here. First: given that numpy performs better than pandas (DataFrame or Series), it is better to do your calculation with numpy and then append the result to a DataFrame if needed. With Dask it is exactly the same. Second: following the documentation, you should persist only when you need to use the same dataframe several times.
So for your specific problem, what you could do is:
import dask.array as da
N = int(4.5e8)  # 450,000,000 elements, as in the question
x = da.random.random(N, chunks=(10000,))
y = da.random.random(N, chunks=(10000,))
flux_ratio_sq = da.divide(x, y).compute()
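If the final result is needed as a dask Series for a later merge (as the question mentions), here is a minimal sketch, reusing x and y from above, that converts the lazy array into a Series via from_dask_array:
import dask.dataframe as dd

flux_ratio_sq = da.divide(x, y)                  # still lazy
flux_series = dd.from_dask_array(flux_ratio_sq)  # 1-D dask array -> dask Series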
Addendum: with a dask.dataframe you could use to_parquet() instead of compute() and have the results stored to file. In embarrassingly parallel problems like this one, the impact on RAM is lower than with compute(). It would be interesting to know whether something similar applies to dask.array.
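On that last point: dask.array does have analogous writers, for example to_zarr, which computes and writes one chunk at a time. A minimal sketch, assuming the zarr package is installed and an arbitrary output path, reusing x and y defined above:
result = da.divide(x, y)
da.to_zarr(result, "flux_ratio_sq.zarr")  # each chunk is computed and written in turn, so the full array never sits in RAM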

Related

How can I sort a big text file with Dask?

I have a text file which is way bigger than my memory. I want to sort the lines of that file lexicographically. I know how to do it manually:
Split into chunks which fit into memory
Sort the chunks
Merge the chunks
I wanted to do it with dask; I thought dealing with large amounts of data would be one of its use cases. How can I sort the whole file with Dask?
My Try
You can execute generate_numbers.py -n 550_000_000 which will take about 30 minutes and generate a 20 GB file.
import dask.dataframe as dd
filename = "numbers-large.txt"
print("Create ddf")
ddf = dd.read_csv(filename, sep = ',', header = None).set_index(0)
print("Compute ddf and sort")
df = ddf.compute().sort_values(0)
print("Write")
with open("numbers-large-sorted-dask.txt", "w") as fp:
    for number in df.index.to_list():
        fp.write(f"{number}\n")
When I execute this, I get:
Create ddf
Compute ddf and sort
[2] 2437 killed python dask-sort.py
I guess the process is killed because it consumes too much memory?
Try the following code:
import dask
import dask.dataframe as dd
inpFn = "numbers-large.txt"
outFn = "numbers-large-sorted-dask.txt"
blkSize = 500 # For test on a small file - increase it
print("Create ddf")
ddf = dd.read_csv(inpFn, header = None, blocksize=blkSize)
print("Sort")
ddf_sorted = ddf.set_index(0)
print("Write")
fut = ddf_sorted.to_csv(outFn, compute=False, single_file=True, header=None)
dask.compute(fut)
print("Stop")
Note that I set such a low blkSize just for testing. In the target version, either increase its value or drop blocksize=blkSize altogether to accept the default.
Since set_index already sorts the data, there is no need to call sort_values(); another detail is that dask does not support that method anyway.
As far as writing is concerned, I noticed that you want to generate a single output file instead of a sequence of files (one per partition), so I passed single_file=True.
I also added header=None to suppress writing the column name, which in this case is the not very meaningful 0.
The last detail to mention is compute=False, so that dask builds a sequence of delayed objects without executing (computing) them yet.
All operations so far only constructed the computation tree, without executing it.
Only the final compute(...) call runs the whole computation tree.
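As a side illustration of this build-then-compute pattern, here is a minimal, self-contained sketch (not tied to the file above):
import dask

@dask.delayed
def inc(x):
    return x + 1

a = inc(1)                   # nothing runs yet; a is a Delayed object
b = inc(2)
total = a + b                # still only building the graph
print(dask.compute(total))   # (5,) -- execution happens only here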
Edit
Your code probably failed due to:
df = ddf.compute().sort_values(0)
Note that you:
first call compute(), which materializes the whole result as a pandas DataFrame,
and only then, at the pandas level, attempt to sort it.
The problem is most likely that your computer's memory is not big enough to hold the whole result of compute(), so your code was killed at exactly that point, without ever getting the chance to sort the DataFrame.

How do I get xarray.interp() to work in parallel?

I'm using xarray.interp on a large 3D DataArray (weather data: lat, lon, time) to map the values (wind speed) to new values based on a discrete mapping function f.
The interpolation method seems to utilise only one core, making the process horribly inefficient. I cannot figure out how to make xarray use more than one core for this task.
I did monitor the computation via htop and a dask dashboard for xarray.interp.
htop shows only one core in use, and the dashboard doesn't show any activity in any of the workers. The only dask activity I can observe comes from loading the netcdf data file from disk. If I preload the data using .load(), this dask activity is gone.
I also tried using a scipy.interpolate.interp1d function with xarray.apply_ufunc() to achieve the equivalent result, but did not observe any parallel utilisation (htop) or activity (dask dashboard) either.
The fastest approach for me right now is using numpy.interp and then recasting the result back to an xr.DataArray with the coordinates of the original DataArray, but that is not parallelised either and only a few percent faster.
In the following MWE I don't see any dask activity after the da.load() statement in block 4.
edit:
The code has to be run in the separate blocks 1 - 4 when evaluating with e.g. htop. Because load() causes multi-core activity and happens either explicitly (block 2) or implicitly (triggered by block 4), it is easy to misattribute the multi-core activity to .interp() when it is actually caused by data loading, if you run the script as a whole.
# 1: For the dask dashboard
from dask.distributed import Client
client = Client()
display(client)
import xarray as xr
import numpy as np
da = xr.tutorial.open_dataset("air_temperature", chunks={})['air']
# 2: Preload data into memory
da.load()
# 3: Dummy interpolation function
xp = np.linspace(0,400,21)
fp = -1*(xp-300)**2
xr_interp_da = xr.DataArray(fp, [('xp', xp)], name='interpolation function')
# 4: I expect this to run in parallel but it does not
f = xr_interp_da.interp({'xp':da})
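For reference, a sketch along the lines of the apply_ufunc attempt mentioned above; the chunk size and the dask="parallelized" arguments are assumptions rather than the exact code, not a verified fix for the single-core behaviour:
# 5: Hypothetical apply_ufunc variant (see note above); chunk size is a guess
from scipy.interpolate import interp1d
interp_func = interp1d(xp, fp, bounds_error=False, fill_value="extrapolate")
f_ufunc = xr.apply_ufunc(
    interp_func,
    da.chunk({"time": 200}),      # chunking gives dask something to parallelise
    dask="parallelized",
    output_dtypes=[da.dtype],
)
f_ufunc.compute()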

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure whether this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed)? Can someone provide a simple example of that? Say for simplicity we have these functions:
def change_row(r):
    # Takes 10ms
    r = some_computation(r)
    return r

def send_to_bigquery(rows):
    # Ideally, in large-ish groups, say 10,000 rows at a time
    make_network_request(rows)
# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv') # 20 GB
# run change_row(r) for each r in df
# run send_to_bigquery(rows) for each appropriately sized group of changed rows
Thanks!
The easiest thing that you can do is provide a blocksize parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data, or experiment, to get this right.
The rest of your task will work the same way as any other "do this generic thing to blocks of a data-frame" problem: the map_partitions method (docs).
def alter_and_send(df):
    # iterrows yields (index, row) pairs, so unpack the index away
    rows = [change_row(r) for _, r in df.iterrows()]
    send_to_bigquery(rows)
    return df

df.map_partitions(alter_and_send)
Basically, you are running the function on each piece of the logical dask dataframe, each of which is a real pandas dataframe.
You may actually want map, apply or other dataframe methods inside the function.
This is one way to do it - you don't really need the "output" of the map, and you could have used to_delayed() instead.
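A minimal sketch of that to_delayed() alternative, reusing the functions above; the blocksize value is only an illustrative assumption:
import dask
import dask.dataframe as dd

df = dd.read_csv('my_large_dataset.csv', blocksize='64MB')  # roughly controls rows per batch
parts = df.to_delayed()                        # one delayed pandas DataFrame per block
tasks = [dask.delayed(alter_and_send)(part)    # reuse the per-block function above
         for part in parts]
dask.compute(*tasks)                           # run all batches; nothing needs to be returned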

Parallelization on cluster dask

I'm looking for the best way to parallelize the following problem on a cluster. I have several files:
folder/file001.csv
folder/file002.csv
:
folder/file100.csv
They are disjoint with respect to the key I want to use for the groupby; that is, if a set of keys is in file1.csv, none of those keys has an item in any other file.
On the one hand, I can just run:
df = dd.read_csv("folder/*")
df.groupby("key").apply(f, meta=meta).compute(scheduler='processes')
But I'm wondering if there is a better/smarter way to do so, in a sort of delayed-groupby way.
Every filexxx.csv fits in memory on a node. Given that every node has n cores, it would be ideal to use all of them. For every single file I can use this hacky way:
import numpy as np
import pandas as pd
import multiprocessing as mp

cores = mp.cpu_count()  # number of CPU cores on your system
partitions = cores      # define as many partitions as you want

def parallelize(data, func):
    data_split = np.array_split(data, partitions)
    pool = mp.Pool(cores)
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

data = parallelize(data, f)
And, again, I'm not sure if there is an efficient dask way to do so.
You could use a Client (which runs in multi-process mode by default) and read your data with a certain blocksize. You can get the number of workers (and the number of cores per worker) with the ncores method and then calculate an optimal blocksize.
However, according to the documentation, blocksize is by default "computed based on available physical memory and the number of cores."
So I think the best way to do it is simply:
import dask.dataframe as dd
from distributed import Client

# if you run on a single machine just do: client = Client()
client = Client('cluster_scheduler_path')

ddf = dd.read_csv("folder/*")
EDIT: after that, use map_partitions and do the groupby for each partition:
# Note ddf is a dask dataframe and df is a pandas dataframe
new_ddf = ddf.map_partitions(lambda df: df.groupby("key").apply(f), meta=meta)
Don't use compute, because it will result in a single pandas DataFrame; instead use a dask output method to keep the entire process parallel and larger-than-RAM compatible.
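For example, a minimal sketch of such an output method, assuming an arbitrary output directory (to_parquet executes the graph in parallel and writes one file per partition):
new_ddf.to_parquet("folder/output/")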

Computation on sample of dask dataframe takes much longer than on all the data

I have a dask dataframe, backed by parquet. It has 131 million rows; when I do some basic operations on the whole frame they take a couple of minutes.
df = dd.read_parquet('data_*.pqt')
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)
n_https = https.sum().compute()
n_http = http.sum().compute()
Time:
CPU times: user 2min 49s, sys: 23.9 s, total: 3min 13s
Wall time: 1min 53s
I naively thought that if I took a sample of the data I could bring this time down, and did:
df = dd.read_parquet('data_*.pqt')
df = df.sample(frac=0.05)
unique_locations = df.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
total_locations = len(unique_locations)
n_https = https.sum().compute()
n_http = http.sum().compute()
Time:
Unknown, I stopped it after 45minutes.
I'm guessing that my sample can't be accessed efficiently for all my follow-on computations, but I don't know how to fix it.
I'm interested in the best way to sample data from a dask dataframe and then work with that sample.
I don't have a definitive / simple answer, but I do have a number of things that together solve my problem.
1) My code is inefficient; selecting only the specific columns I need makes everything work. My new code:
import dask.dataframe as dd
from dask.distributed import Client, progress

client = Client()  # Took me a little while to get the settings correct

def get_df(*columns):
    files = '../cache_new/sample_*.pqt'
    df = dd.read_parquet(files, columns=columns, engine='pyarrow')
    return df

# All data - Takes 31s
df_all = get_df('location')
unique_locations = df_all.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
_total_locations = unique_locations.size.persist()
_n_https = https.sum().persist()
_n_http = http.sum().persist()
progress(_total_locations, _n_https, _n_http)

# 1% sample data - Takes 21s
df_sample = get_df('location').sample(frac=0.01)
unique_locations = df_sample.location.unique()
https = unique_locations.str.startswith('https:')
http = unique_locations.str.startswith('http:')
_total_locations = unique_locations.size.persist()
_n_https = https.sum().persist()
_n_http = http.sum().persist()
progress(_total_locations, _n_https, _n_http)
This turns out not to be a big speed-up. The time taken for the whole computation is dominated by reading in the data. If the computation were very expensive I imagine I would see more of a speed-up.
2) I switched to using the distributed scheduler locally so I could see what was happening. But this was not without problems:
I was experiencing some kind of bug with fastparquet that caused my processes to die, and I needed to use pyarrow (this was not a problem when not using the distributed client);
I had to manually set the number of threads and memory_limit.
3) I discovered a bug in reading the same data in multiple times in a notebook - https://github.com/dask/dask/issues/3268
4) I am also being hit by a memory leak bug in pandas https://github.com/pandas-dev/pandas/issues/19941#issuecomment-371960712
With (3) and (4), and the fact that my original code inefficiently read in all the columns, I see a number of reasons why my sample never worked, although I never found a definitive answer.
What's happening here is that by adding sample you're stopping an optimization from happening. When you do the following:
df = dd.read_parquet('data_*.pqt')
df.x.sum()
Dask cleverly rearranges this to actually be the following:
df = dd.read_parquet('data_*.pqt', columns=['x'])
df.x.sum()
Dask.dataframe only reads in the one column that you need. This is one of the few optimizations that dask.dataframe provides (it doesn't do much high-level optimization).
However, when you throw a sample in there (or any other operation):
df = dd.read_parquet('data_*.pqt')
df.sample(...).x.sum()
Then you don't get the optimization, and so everything is slow.
So here it's not that sample is slow, it's that reading the entire dataset from parquet is slow, and that having sample in between the read_parquet and column-access steps blocks the optimization from happening.
Always specify columns in read_parquet
To avoid this, you should always specify the columns you need explicitly in dd.read_parquet.
Eventually it would be nice to see some high-level framework provide query optimization that is more intelligent than what Dask dataframe has today. If you felt like pushing this forward, you would probably raise an issue on Ibis.
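A minimal sketch of that recommendation applied to the question's sample workflow; the column name comes from the question and the sample fraction is arbitrary:
import dask.dataframe as dd

df = dd.read_parquet('data_*.pqt', columns=['location'])  # prune columns at read time
sample = df.sample(frac=0.05)                             # sample no longer blocks the pruning
n_https = sample.location.str.startswith('https:').sum().compute()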
