Performance and data manipulation on Dask

I have imported a Parquet file of approx. 800 MB with ~50 million rows into a Dask dataframe.
There are 5 columns: DATE, TICKER, COUNTRY, RETURN, GICS
Questions:
How can I specify data types in read_parquet, or do I have to do it with astype?
Can I parse dates within read_parquet?
I simply tried the following:
import dask.dataframe as dd
df = dd.read_parquet('.\abc.gzip')
df['INDUSTRY'] = df.GICS.str[0:4]
n = df.INDUSTRY.unique().compute()
and it takes forever to return. Am I doing anything wrong here? Partitions are automatically set to 1.
I also tried something like df[df.INDUSTRY == '4010'].compute(); it likewise takes forever to return, or crashes.

To answer your questions:
A Parquet file has its types stored, as noted in the Apache docs here, so you won't be able to change the data type when you read the file in, meaning you'll have to use astype.
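For instance, a minimal sketch of casting after the read (the dtypes chosen here are illustrative assumptions, not requirements):
import dask.dataframe as dd
df = dd.read_parquet(your_file)
# cast lazily after the read; pick whichever dtypes actually match your data
df = df.astype({'TICKER': 'category', 'COUNTRY': 'category', 'RETURN': 'float64'})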
You can't convert a string to a date within the read, but if you use the map_partitions function, documented here, you can convert the column to a date, as in this example:
import pandas as pd
import dask.dataframe as dd
df = dd.read_parquet(your_file)
meta = ('date', 'datetime64[ns]')
# you can add your own date format, or just let pandas guess
to_date_time = lambda x: pd.to_datetime(x, format='%Y-%m-%d')
df['date_clean'] = df.date.map_partitions(to_date_time, meta=meta)
The map_partitions function will convert the dates on each chunk of the Parquet file when it is computed, making it functionally the same as converting the dates when the file is read in.
Here again I think you would benefit from the map_partitions function, so you might try something like this:
import dask.dataframe as dd
df = dd.read_parquet('.\abc.gzip')
df['INDUSTRY'] = df.GICS.map_partitions(lambda x: x.str[0:4], meta=('INDUSTRY', 'str'))
df[df.INDUSTRY == '4010']
Note that if you run compute, the object is converted to pandas. If the file is too large then Dask won't be able to compute it, and thus nothing will be returned. Without seeing the data it's hard to say more, but do check out these tools to profile your computations and see whether you are leveraging all your CPUs.
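For instance, a minimal sketch with the local diagnostics tools (this assumes the default single-machine scheduler and the df from above; visualize needs bokeh installed):
from dask.diagnostics import ProgressBar, Profiler, ResourceProfiler, visualize
# profile the unique() computation to see task timings and CPU/memory usage
with ProgressBar(), Profiler() as prof, ResourceProfiler() as rprof:
    n = df.INDUSTRY.unique().compute()
visualize([prof, rprof])  # renders a bokeh plot of the profile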

Related

How can I sort a big text file with Dask?

I have a text file which is way bigger than my memory. I want to sort the lines of that file lexicographically. I know how to do it manually:
Split into chunks which fit into memory
Sort the chunks
Merge the chunks
I wanted to do it with Dask. I thought dealing with large amounts of data would be one use case of Dask. How can I sort the whole data with Dask?
My attempt
You can execute generate_numbers.py -n 550_000_000, which will take about 30 minutes and generate a 20 GB file.
import dask.dataframe as dd
filename = "numbers-large.txt"
print("Create ddf")
ddf = dd.read_csv(filename, sep = ',', header = None).set_index(0)
print("Compute ddf and sort")
df = ddf.compute().sort_values(0)
print("Write")
with open("numbers-large-sorted-dask.txt", "w") as fp:
    for number in df.index.to_list():
        fp.write(f"{number}\n")
when I execute this, I get
Create ddf
Compute ddf and sort
[2] 2437 killed python dask-sort.py
I guess the process is killed because it consumes too much memory?
Try the following code:
import dask
import dask.dataframe as dd
inpFn = "numbers-large.txt"
outFn = "numbers-large-sorted-dask.txt"
blkSize = 500 # For test on a small file - increase it
print("Create ddf")
ddf = dd.read_csv(inpFn, header = None, blocksize=blkSize)
print("Sort")
ddf_sorted = ddf.set_index(0)
print("Write")
fut = ddf_sorted.to_csv(outFn, compute=False, single_file=True, header=None)
dask.compute(fut)
print("Stop")
Note that I set such a low blkSize value just for test purposes. In the target version, either increase it or drop blocksize=blkSize altogether to accept the default.
Since set_index already sorts the data, there is no need to call sort_values(); another detail is that Dask does not support that method anyway.
As far as writing is concerned, I noticed that you want to generate a single output file instead of a sequence of files (one per partition), so I passed single_file=True. I also added header=None to suppress writing the column name, which in this case is the (not very meaningful) 0.
The last detail to mention is compute=False, so that Dask generates a sequence of future objects without executing them (computing them) - for now. All operations so far have only constructed the computation tree, without executing it. Only now does compute(...) run the whole computation tree.
Edit
Your code probably failed because of:
df = ddf.compute().sort_values(0)
Note that you first call compute() to generate the whole pandasonic DataFrame, and only after that, at the Pandas level, do you attempt to sort it.
The problem is probably that the memory in your computer is not big enough to hold the whole result of compute(), so most likely your code failed at exactly this point, without any chance to sort the DataFrame.

dask.distributed not utilising the cluster

I'm not able to process this block using the distributed cluster.
import pandas as pd
from dask import dataframe as dd
import dask
df = pd.DataFrame({'reid_encod': [[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10],[1,2,3,4,5,6,7,8,9,10]]})
dask_df = dd.from_pandas(df, npartitions=3)
save_val = []
def add(dask_df):
    for _, outer_row in dask_df.iterrows():
        for _, inner_row in dask_df.iterrows():
            for base_encod in outer_row['reid_encod']:
                for compare_encod in inner_row['reid_encod']:
                    val = base_encod + compare_encod
                    save_val.append(val)
    return save_val
from dask.distributed import Client
client = Client(...)
dask_compute = dask.delayed(add)(dask_df)
dask_compute.compute()
I also have a few queries:
Does dask.delayed use the available cluster to do the computation?
Can I parallelize the for-loop iteration of this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Does dask.distributed work on pandas dataframes?
Can we use dask.delayed in dask.distributed?
If the above programming approach is wrong, can you guide me on whether to choose delayed or a Dask DF for the above scenario?
For the record, some answers, although I wish to note my earlier general points about this question.
Does dask.delayed use the available cluster to do the computation?
If you have created a client to a distributed cluster, dask will use it for computation unless you specify otherwise.
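For example, a minimal sketch (the scheduler address is a placeholder):
import dask
from dask.distributed import Client
client = Client('scheduler-address:8786')  # placeholder address
# with the client created, delayed computations run on the cluster by default
total = dask.delayed(sum)([1, 2, 3]).compute()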
Can I parallelize the for-loop iteration of this pandas DF using delayed, and use multiple computers present in the cluster to do the computations?
Yes, you can in general use delayed with pandas dataframes for parallelism if you wish. However, your dataframe only has one row, so it is not obvious in this case how - it depends on what you really want to achieve.
Does dask.distributed work on pandas dataframes?
Yes, you can do anything that Python can do with distributed, since it is just Python processes executing code. Whether it brings you the performance you are after is a separate question.
Can we use dask.delayed in dask.distributed?
Yes, distributed can execute anything that dask in general can, including delayed functions/objects
If the above programming approach is wrong, can you guide me on whether to choose delayed or a Dask DF for the above scenario?
Not easily, it is not clear to me that this is a dataframe operation at all. It seems more like an array - but, again, I note that your function does not actually return anything useful at all.
In the tutorial: passing pandas dataframes to delayed; the same applies with the dataframe API.
The main problem with your code is sketched in this section of the best practices: don't pass Dask collections to delayed functions. This means you should use either the delayed API or the dataframe API. While you can convert between dataframes and delayed objects, simply passing one in like this is not recommended.
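As a minimal sketch of the recommended pattern (the per-row summing below is a hypothetical stand-in for your own logic):
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'reid_encod': [[1, 2, 3], [4, 5, 6]]})
dask_df = dd.from_pandas(df, npartitions=2)
# option 1: dataframe API - Dask hands each real pandas partition to the function
totals = dask_df['reid_encod'].map_partitions(
    lambda s: s.apply(sum), meta=('reid_encod', 'int64')
).compute()
# option 2: delayed API - operate on the delayed pandas pieces explicitly
parts = [dask.delayed(lambda pdf: pdf['reid_encod'].apply(sum))(p)
         for p in dask_df.to_delayed()]
totals2 = dd.from_delayed(parts, meta=('reid_encod', 'int64')).compute()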
Furthermore,
you only have one row in your dataframe, so you only get one partition and no parallelism whatsoever. You can only slow things down like this.
this appears to be an everything-to-everything (N^2) operation, so if you had many rows (the normal case for Dask), it would presumably take extremely long, no matter how many cores you used
passing lists in a pandas row is not a great idea; perhaps you wanted to use an array instead? (see the sketch after this list)
the function doesn't return anything useful, so it's not at all clear what you are trying to achieve. Under the description of MVCE, you will see references to "expected outcome" and "what went wrong". To get more help, please be more precise.
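As a minimal sketch of the array suggestion above (purely illustrative: it reuses the example values from the question and assumes each row's list becomes one row of a 2-D array), the all-pairs sums can be written with broadcasting instead of nested Python loops:
import numpy as np
import dask.array as da
# one encoding per row, using the example values from the question
encods = da.from_array(np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]] * 6), chunks=(3, 10))
# all-pairs element-wise sums via broadcasting: shape (n_rows, n_rows, n_values)
pairwise = encods[:, None, :] + encods[None, :, :]
result = pairwise.compute()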

Forcing Locality on Dask Dataframe Subsets

I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see populate the initial data load from a network resource (HDFS, S3, etc.) and do not appear to extend the DAG optimization to the load portion (they seem to assume that a network load is a necessary evil and just eat the initial cost). This is underscored in the answer to another question: Does Dask communicate with HDFS to optimize for data locality?
However, I can see cases where we would want this. For example, if we have a sharded database plus dask workers co-located on the nodes of this DB, we would want to force records from only the local shard to be populated into the local dask workers. From the documentation/examples, network criss-cross seems like a necessarily assumed cost. Is it possible to force parts of a single dataframe to be obtained from specific workers?
The alternative, which I've tried, is to try and force each worker to run a function (iteratively submitted to each worker) where the function loads only the data local to that machine/shard. This works, and I have a bunch of optimally local dataframes with the same column schema -- however -- now I don't have a single dataframe but n dataframes. Is it possible to merge/fuse dataframes across multiple machines so there is a single dataframe reference, but portions have affinity (within reason, as decided by the task DAG) to specific machines?
You can produce dask "collections" such as a dataframe from futures and delayed objects, which inter-operate nicely with each other.
For each partition, where you know which machine should load it, you can produce a future as follows:
f = c.submit(make_part_function, args, workers={'my.worker.ip'})
where c is the dask client and the address is the machine you'd want to see it happen on. You can also give allow_other_workers=True if this is a preference rather than a requirement.
To make a dataframe, from a list of such futures, you could do
df = dd.from_delayed([dask.delayed(f) for f in futures])
and ideally provide a meta=, giving a description of the expected dataframe. Now, further operations on a given partition will prefer to be scheduled on the same worker which already holds the data.
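Putting the two pieces together, a minimal sketch (make_part_function, the worker addresses and the meta below are hypothetical placeholders):
import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client
c = Client('scheduler-address:8786')  # placeholder scheduler address
# hypothetical loader that reads only the shard local to a given worker
def make_part_function(shard_id):
    return pd.DataFrame({'value': range(shard_id * 10, shard_id * 10 + 10)})
# pin each partition's load to the machine that holds its shard
futures = [
    c.submit(make_part_function, shard_id, workers={addr})
    for shard_id, addr in enumerate(['10.0.0.1', '10.0.0.2'])
]
# assemble a single logical dataframe; meta describes the expected columns
meta = pd.DataFrame({'value': pd.Series(dtype='int64')})
df = dd.from_delayed([dask.delayed(f) for f in futures], meta=meta)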
I am also interested in having the capability to restrict computation to a specific node (and data localized to that node). I have tried to implement the above with a simple script (see below), but looking at the resulting dataframe results in this error (from dask/dataframe/utils.py::check_meta()):
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
Example:
from dask.distributed import Client
import dask.dataframe as dd
import dask
client = Client(address='<scheduler_ip>:8786')
client.restart()
filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
client.has_what()
# Returns: {'tcp://<w1_ip>:41942': ('read_csv-c08b231bb22718946756cf46b2e0f5a1',),
# 'tcp://<w2_ip>:41942': ('read_csv-e27881faa0f641e3550a8d28f8d0e11d',)}
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])
type(df)
# Returns: dask.dataframe.core.DataFrame
df.head()
# Returns:
# ValueError: Metadata mismatch found in `from_delayed`.
# Expected partition of type `DataFrame` but got `DataFrame`
Note: the Dask environment has two worker nodes (aliased to w1 and w2) and a scheduler node, and the script is running on an external host.
dask==1.2.2, distributed==1.28.1
It is odd to call many dask dataframe functions in parallel. Perhaps you meant to call many Pandas read_csv calls in parallel instead?
# future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
# future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
import pandas
future_1 = client.submit(pandas.read_csv, filename_1, workers='w1')
future_2 = client.submit(pandas.read_csv, filename_2, workers='w2')
See https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more information

Dask DataFrame .head() very slow after indexing

Not reproducible, but can someone fill in why a .head() call is greatly slowed after indexing?
import dask.dataframe as dd
df = dd.read_parquet("Filepath")
df.head() # takes 10 seconds
df = df.set_index('id')
df.head() # takes 10 minutes +
As stated in the docs, set_index sorts your data according to the new index, such that the divisions along that index split the data into its logical partitions. The sorting is the thing that requires the extra time, but it will make operations working on that index much faster once performed. head() on the raw file will fetch from the first data chunk on disk without regard for any ordering.
You are able to set the index without this ordering, either with the index= keyword to read_parquet (maybe the data was inherently ordered already?) or with .map_partitions(lambda df: df.set_index(..)), but this raises the obvious question: why would you bother, what are you trying to achieve? If the data were already sorted, then you could also have used set_index(.., sorted=True) and maybe even the divisions keyword, if you happen to have that information - this would not need the sort, and would be correspondingly faster.
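For illustration, a minimal sketch of those faster alternatives; it assumes the data really is already ordered by id, and the divisions values are made-up placeholders:
import dask.dataframe as dd
# option 1: set the index at read time, without the shuffle/sort
df = dd.read_parquet("Filepath", index='id')
# option 2: the data is already sorted by id, so tell Dask and skip the sort
df = dd.read_parquet("Filepath").set_index('id', sorted=True)
# option 3: as option 2, but with known partition boundaries (npartitions + 1 values)
df = dd.read_parquet("Filepath").set_index('id', sorted=True, divisions=[0, 1_000_000, 2_000_000])
df.head()  # fast again, since only the first partition is touched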

Batch results of intermediate dask computation

I have a large (10s of GB) CSV file that I want to load into dask, and for each row, perform some computation. I also want to write the results of the manipulated CSV into BigQuery, but it'd be better to batch network requests to BigQuery in groups of say, 10,000 rows each, so I don't incur network overhead per row.
I've been looking at dask delayed and see that you can create an arbitrary computation graph, but I'm not sure if this is the right approach: how do I collect and fire off intermediate computations based on some group size (or perhaps time elapsed)? Can someone provide a simple example of that? Say for simplicity we have these functions:
def change_row(r):
    # Takes 10ms
    r = some_computation(r)
    return r

def send_to_bigquery(rows):
    # Ideally, in large-ish groups, say 10,000 rows at a time
    make_network_request(rows)
# And here's how I'd use it
import dask.dataframe as dd
df = dd.read_csv('my_large_dataset.csv') # 20 GB
# run change_row(r) for each r in df
# run send_to_bigquery(rows) for each appropriately sized group based on change_row(r)
Thanks!
The easiest thing you can do is provide a blocksize parameter to read_csv, which will get you approximately the right number of rows per block. You may need to measure some of your data, or experiment, to get this right.
The rest of your task will work the same way as any other "do this generic thing to blocks of a dataframe" job: the map_partitions method (docs).
def alter_and_send(df):
    rows = [change_row(r) for _, r in df.iterrows()]
    send_to_bigquery(rows)
    return df

df.map_partitions(alter_and_send)
Basically, you are running the function on each piece of the logical dask dataframe, which are real pandas dataframes.
You may actually want map, apply or other dataframe methods in the function.
This is one way to do it - you don't really need the "output" of the map, and you could have used to_delayed() instead.
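Putting it together, a minimal sketch (the blocksize is an illustrative guess; change_row and send_to_bigquery are the functions from the question):
import dask.dataframe as dd
# ~256 MB of CSV per partition; tune so each block holds roughly the batch size you want
df = dd.read_csv('my_large_dataset.csv', blocksize=256_000_000)
def alter_and_send(pdf):
    # pdf is a plain pandas DataFrame holding one block of the CSV
    rows = [change_row(r) for _, r in pdf.iterrows()]
    send_to_bigquery(rows)
    return pdf
# meta=df tells Dask the output has the same schema as the input; compute() triggers the work
df.map_partitions(alter_and_send, meta=df).compute()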
