Safe & performant way to modify dask dataframe - dask

As part of my data workflow I need to modify values in a subset of dask dataframe columns and pass the results on for further computation. In particular, I'm interested in 2 cases: mapping columns and mapping partitions. What is the recommended safe & performant way to act on the data? I'm running in a distributed setup on a cluster with multiple worker processes on each host.
Case 1.
I want to run:
res = dataframe.column.map(func, ...)
This returns a Series, so I assume the original dataframe is not modified. Is it safe to assign the result back to the dataframe, e.g. dataframe['column'] = res? Probably not. Should I make a copy with .copy() and then assign the result to it, like:
dataframe2 = dataframe.copy()
dataframe2['column'] = dataframe.column.map(func, ...)
Any other recommended way to do it?
Case 2.
I need to map partitions of the dataframe:
df.map_partitions(mapping_func, meta=df)
Inside mapping_func() I want to modify values in chosen columns, either by using partition[column].map or simply by building a list comprehension. Again, how do I modify the partition safely and return it from the mapping function?
The partition received by the mapping function is a pandas dataframe (a copy of the original data?), but while modifying data in place I'm seeing some crashes (with no exception/error messages though). The same goes for calling partition.copy(deep=False); it doesn't work. Should the partition be deep-copied and then modified in place? Or should I always construct a new dataframe out of the new/mapped column data and the original/unmodified series/columns?

You can safely modify a dask.dataframe
Operations like the following are supported and safe
df['col'] = df['col'].map(func)
This modifies the task graph in place but does not modify the data in place (assuming that the function func creates a new series).
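For example, a minimal sketch of what this looks like in practice (dd.from_pandas and the toy lambda are just for illustration):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"col": [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=1)

# Reassigning the column rewrites the task graph; the underlying pandas
# data is left untouched because map() builds a new series.
df["col"] = df["col"].map(lambda x: x * 10)

print(df["col"].compute().tolist())  # [10, 20, 30]
print(pdf["col"].tolist())           # [1, 2, 3] -- original data unchanged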
You can not safely modify a partition
Your second case, where you map_partitions a function that modifies a pandas dataframe in place, is not safe. Dask expects to be able to reuse data, call functions twice if necessary, etc. If you have such a function then you should create a copy of the pandas dataframe first within that function.
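A rough sketch of what such a mapping function could look like, copying the partition before modifying it (the column name and transformation are placeholders, not taken from the original question):
import pandas as pd
import dask.dataframe as dd

def mapping_func(partition: pd.DataFrame) -> pd.DataFrame:
    # Copy first: Dask may reuse the same partition object and may call
    # this function more than once, so mutating the input is not safe.
    partition = partition.copy()
    partition["column"] = partition["column"].map(lambda x: x * 2)  # placeholder transform
    return partition

df = dd.from_pandas(pd.DataFrame({"column": [1, 2, 3]}), npartitions=2)
result = df.map_partitions(mapping_func, meta=df)
print(result.compute())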

Related

Get individual dask dataframe partition status

I'd like to run an asynchronous dask dataframe computation with dd.persist() and then be able to track the status of an individual partition. The goal is to get access to partial results in a non-blocking way.
Here the desired pseudo code:
dd = dd.persist()
if dd.partitions[0].__dask_status__ == 'finished':
    # Partial non-blocking result access
    df = dd.partitions[0].compute()
Using dask futures works well, but submitting many individual partitions is very slow compared to a single dd.persist() and having one future per partition breaks the dashboard "groups" tab by showing too many blocks.
futures = list(map(client.compute, dd.partitions))
Broken dask dashboard "groups" tab
The function you probably want is distributed.futures_of, which lists the running futures of a collection. You can either examine this list yourself, looking at the status of the futures, or use distributed.as_completed with a for-loop to process the partitions as they become available. The keys of the futures are like (collection-name, partition-index), so you know which partition each belongs to.
The reason dd.partitions[i] (or looping over these with list) doesn't work is that it creates a new graph for each partition, so you end up submitting much more to the scheduler than the single call to .persist().
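A rough sketch of that pattern (assuming ddf is an existing dask DataFrame and an active distributed client):
from dask.distributed import futures_of, as_completed

ddf = ddf.persist()          # one persist call, one graph submitted
futures = futures_of(ddf)    # one future per partition, no new graphs created

for future in as_completed(futures):
    name, partition_index = future.key  # keys look like (collection-name, index)
    partial_df = future.result()        # the finished pandas partition
    # ... process the partial result without waiting for the rest ...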

Copy variable between SPSS Datasets using Syntax

EDITED based on feedback...
Is there a way to copy a variable from one open dataset to another in SPSS? What I have tried is to create a scratch variable that captures the value of the variable, and use that scratch variable in a COMPUTE command in the next dataset:
DATASET ACTIVATE DataSet1.
COMPUTE #IDScratch = ID.
DATASET ACTIVATE DataSet2.
COMPUTE ID = #IDScratch.
This fails because activating Dataset2 causes the scratch variable to be dropped from memory.
MATCH FILES and/or STAR JOIN syntax will work for most scenarios, but in my case, because Dataset1 has many more records than Dataset2 AND there are no matching keys in both datasets, this yields extra records.
My original question was "Is there a simple, direct way of copying a variable between datasets?" and the answer still appears to be that merging the files is the best/only method when using syntax.
Since SPSS version 21.0, the STAR JOIN command (see documentation here) allows you to use SQL syntax to join datasets. So basically, you could get only the variables you want from each dataset.
Assume your first dataset is called data_1 and has id and var_1a. Your second dataset is called data_2 and has the same id plus var_2a, and you just want to pull var_2a into the first dataset. If both datasets are open, you can run:
dataset activate data_1.
STAR JOIN
/SELECT t0.var_1a, t1.var_2a
/FROM * AS t0
/JOIN 'data_2' AS t1
ON t0.id=t1.id
/OUTFILE FILE=*.
The link I provided above has plenty of examples on how to join variables of files that are saved in your computer.

Checking if two Dasks are the same

What is the right way to determine if two Dask objects refer to the same result? Is it as simple as comparing the name attributes of both or are there other checks that need to be run?
In the case of any of the dask collections in the main library (array, bag, delayed, dataframe) yes, equal names should imply equal values.
However, the converse is not always true. We don't use deterministic hashing everywhere; sometimes we use UUIDs instead. For example, random arrays always get random UUIDs in their keys, so two random arrays could happen to hold equal values by chance while still having different names.
No guarantees are given for collections made outside of the Dask library. No enforcement is made at the scheduler level.
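For example, with dask arrays (a small sketch; the same idea applies to the other collections):
import dask.array as da

x = da.ones((1000,), chunks=100)
y = da.ones((1000,), chunks=100)
print(x.name == y.name)  # True: deterministic hashing gives identical names

a = da.random.random((1000,), chunks=100)
b = da.random.random((1000,), chunks=100)
print(a.name == b.name)  # False: random arrays get UUID-based names,
                         # even if their values happened to coincide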

How to go from DataFrame to multiple Series efficiently in Dask?

I'm trying to find an efficient way to transform a DataFrame into a bunch of persisted Series (columns) in Dask.
Consider a scenario where the data size is much larger than the sum of worker memory and most operations will be wrapped by read-from-disk / spill-to-disk. For algorithms which operate only on individual columns (or pairs of columns), reading in the entire DataFrame from disk for every column operation is inefficient. In such a case, it would be nice to locally switch from a (possibly persisted) DataFrame to persisted columns instead. Implemented naively:
persisted_columns = {}
for column in subset_of_columns_to_persist:
    persisted_columns[column] = df[column].persist()
This works, but it is very inefficient because df[column] will re-read the entire DataFrame N = len(subset_of_columns_to_persist) times from disk. Is it possible to extract and persist multiple columns individually based on a single read-from-disk deserialization operation?
Note: len(subset_of_columns_to_persist) is >> 1, i.e., simply projecting the DataFrame to df[subset_of_columns_to_persist] is not the solution I'm looking for, because it still has a significant I/O overhead over persisting individual columns.
You can persist many collections at the same time with the dask.persist function. This will share intermediates.
import dask

# Persisting all columns in one call shares the read-from-disk intermediates.
columns = [df[column] for column in df.columns]
persisted_columns = dask.persist(*columns)
d = dict(zip(df.columns, persisted_columns))

How to do lookups (or joins) with map-reduce?

How can I take the input set
{worker-id:1 name:john supervisor-id:3}
{worker-id:2 name:jane supervisor-id:3}
{worker-id:3 name:bob}
and produce the output set
{worker-id:1 name:john supervisor-name:bob}
{worker-id:2 name:jane supervisor-name:bob}
using a "pure" map-reduce framework, i.e. one with only a map phase and a reduce phase but without any extra feature such as CouchDB's lookup?
Exact details will depend on your map-reduce framework, but the idea is this. In your map phase you emit two types of key/value pairs: each record keyed by its own worker-id as a potential boss, e.g. (1, {name:john type:boss}), and each record keyed by its supervisor-id as a worker, e.g. (3, {worker-id:1 name:john type:worker}). In your reduce phase you get all of the values for a key grouped together. If there is a record of type boss in there, then you remove that record and populate the supervisor-name of the other records. If there isn't, then you drop those records on the floor.
Basically you use the fact that data gets grouped by key then processed together in the reduce to do the join.
(In some map-reduce implementations you incrementally get key/value pairs put together in the reduce. In those implementations you can't throw away records that don't have a boss already, so you wind up needing to map-reduce-reduce for that final filtering step.)
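An illustrative sketch in Python, not tied to any particular framework (the record layout and the in-memory shuffle are made up for the example):
from collections import defaultdict

def map_phase(record):
    # Every record is a potential boss: emit it under its own id.
    yield record["worker-id"], {"name": record["name"], "type": "boss"}
    # Records with a supervisor are also workers: emit them under the supervisor's id.
    if "supervisor-id" in record:
        yield record["supervisor-id"], {"worker-id": record["worker-id"],
                                        "name": record["name"], "type": "worker"}

def reduce_phase(values):
    boss = next((v for v in values if v["type"] == "boss"), None)
    if boss is None:
        return []  # no supervisor record in this group: drop the workers
    return [{"worker-id": v["worker-id"], "name": v["name"],
             "supervisor-name": boss["name"]}
            for v in values if v["type"] == "worker"]

records = [
    {"worker-id": 1, "name": "john", "supervisor-id": 3},
    {"worker-id": 2, "name": "jane", "supervisor-id": 3},
    {"worker-id": 3, "name": "bob"},
]

# Simulate the shuffle: group mapped values by key, then reduce each group.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

output = [row for values in groups.values() for row in reduce_phase(values)]
# [{'worker-id': 1, 'name': 'john', 'supervisor-name': 'bob'},
#  {'worker-id': 2, 'name': 'jane', 'supervisor-name': 'bob'}]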
Is there only one input file, or can there be more? I mean, is it possible to have a case where a file contains a worker-id whose supervisor-id is described (i.e. the name of that supervisor) in another file?

Resources