Dask - custom aggregation

I just learned about Dask yesterday and am upgrading my work from pandas, but I'm stuck trying to translate simple custom aggregations.
I don't fully (or probably even at all) understand how the Series are represented inside those internal lambda functions; normally I would step in with breakpoint() to inspect them, but that isn't an option here. When I try to get an element of x by index, I get a "Meta" error.
Any help/pointers would be appreciated.
import dask.dataframe as dd
import pandas as pd
# toy df
df = dd.from_pandas(
    pd.DataFrame(dict(a=[1, 1, 2, 2], b=[100, 100, 200, 250])),
    npartitions=2)
df.compute()
   a    b
0  1  100
1  1  100
2  2  200
3  2  250
# PART 1 - for conceptual understanding - replicating trivial list
# intended result
df.groupby('a').agg(list).compute()
            b
a
1  [100, 100]
2  [200, 250]
# replicate manually with custom aggregation
list_manual = dd.Aggregation('list_manual',
                             lambda x: list(x),
                             lambda x1: list(x1))
res = df.groupby('a').agg(list_manual).compute()
res
b
0 (0, [(1, 0 100\n1 100\nName: b, dtype: i...
res.b[0]
(0,
0 (1, [100, 100])
0 (2, [200, 250])
Name: list_manual-b-6785578d38a71d6dbe0d1ac6515538f7, dtype: object)
# looks like the grouping tuple wasn't even unpacked (1, 2 groups are there)
# ... instead it all got collapsed into one thing
# PART 2 - custom function
# with pandas - intended behavior
collect_uniq = lambda x: list(set(x))
dfp = df.compute()
dfp.groupby('a').agg(collect_uniq)
            b
a
1       [100]
2  [200, 250]
# now trying the same in Dask
collect_uniq_dask = dd.Aggregation('collect_uniq_dask',
                                   lambda x: list(set(x)),
                                   lambda x1: list(x1))
res = df.groupby('a').agg(collect_uniq_dask).compute()
# gives TypeError("unhashable type: 'Series'")

Related

How to take a derivative of one of the outputs of a neural network (involving batched inputs) with respect to inputs?

I am solving a PDE using a neural network. My neural network is as follows:
def f(params, inputs):
    for w, b in params:
        outputs = jnp.dot(inputs, w) + b
        inputs = jnn.swish(outputs)
    return outputs
The layer architecture of the network is [1, 5, 2]. Hence, I have one input neuron and two output neurons. Therefore, if I pass a batch of 10 inputs, I am supposed to get a (10, 2) array as output. Now let the output neurons be termed 'p' and 'q' respectively. How do I find dp/dx, dq/dx? I don't want to pick values out of jacobians and hessians; I want a more explicit functionality. What I mean is, I want something like this:
p = lambda inputs: f(params, inputs)[:,0].reshape(-1,1)
q = lambda inputs: f(params, inputs)[:,1].reshape(-1,1)
p_x = lambda inputs: vmap(jacfwd(p,argnums=0))(inputs)
q_x = lambda inputs: vmap(jacfwd(q,argnums=0))(inputs)
k_p_x = lambda inputs: kappa(inputs).reshape(-1,1) * p_x(inputs)
##And other calculations proceed..
When I execute p(inputs) it's working as expected (as it should), but as soon as I execute p_x(inputs) I am getting an error: IndexError: Too many indices for array: 2 non-None/Ellipsis indices for dim 1.
How do I get around this?
The reason you are seeing an index error is that your p function expects a two-dimensional input, but when you wrap it in vmap you are effectively passing the function a single one-dimensional row at a time.
You can fix this by changing your function so that it accepts a one-dimensional input, and then use vmap as appropriate to compute the batched result.
Here is a complete example with the modified versions of your functions:
import jax
import jax.numpy as jnp
from jax import nn as jnn
from jax import vmap, jacfwd
def f(params, inputs):
    for w, b in params:
        outputs = jnp.dot(inputs, w) + b
        inputs = jnn.swish(outputs)
    return outputs
# Some example parameters and inputs
params = [
    (jnp.ones((1, 5)), 1),
    (jnp.ones((5, 2)), 1),
]
inputs = jnp.arange(10.0).reshape(10, 1)
# p and q map a length-1 input to a length-1 output
p = lambda inputs: f(params, inputs)[0].reshape(1)
q = lambda inputs: f(params, inputs)[1].reshape(1)
p_batched = vmap(p)
q_batched = vmap(q)
p_x = lambda inputs: vmap(jacfwd(p,argnums=0))(inputs)
q_x = lambda inputs: vmap(jacfwd(q,argnums=0))(inputs)
print(p_batched(inputs).shape)
# (10, 1)
print(q_batched(inputs).shape)
# (10, 1)
# Note: since p and q map a size 1 input to size-1 output,
# p_x and q_x compute a sequence of 10 1x1 jacobians.
print(p_x(inputs).shape)
# (10, 1, 1)
print(q_x(inputs).shape)
# (10, 1, 1)

How to add an additional label to a huggingface model?

I'm following the multiple choice QA tutorial and trying to modify it slightly to fit my data. My data is exactly the same, except that I have 5 labels instead of 4:
# original data:
from datasets import load_dataset
swag = load_dataset("swag", "regular")
set(swag["train"]['label'])
>>> {0, 1, 2, 3}
# my data:
set(train_dataset["train"]['label'])
>>>
{0, 1, 2, 3, 4}
I'm running the code in the tutorial and getting the error:
nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [9,0,0] Assertion `t >= 0 && t < n_classes` failed.
I found from here and here that this can be caused when the target values are out of bounds, which can happen when using nn.CrossEntropyLoss, which expects a torch.LongTensor with values in the range [0, nb_classes-1].
I will not copy the entire script from the tutorial since it's in the link above, but I found that the error can be replicated by modifying the DataCollatorForMultipleChoice class by adding an extra label as follows:
import random

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        labels = [random.choice(range(5)) for _ in range(16)]  # <<<--- ADDING EXTRA LABEL HERE: LABELS NOW RANGE OVER 0-4 INSTEAD OF 0-3
        print(len(labels))
        print(labels)
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
Then when I run the trainer I get:
16 # batch
[0, 0, 2, 1, 1, 1, 0, 4, 0, 4, 3, 0, 0, 0, 1, 1] # labels
... nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [7,0,0] Assertion `t >= 0 && t < n_classes` failed.
I tried changing the number of labels in the model:
# original:
# model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
# my modification:
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased", num_labels=5)
but I got the same error.
The script runs just fine with my data if I modify the added line from above to
labels = [random.choice(range(4)) for _ in range(16)]  # note: labels now range over 0-3, matching the 4 choices
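What seems to be going on (a hedged sketch, independent of the tutorial's model): for multiple choice, the model produces one logit per answer candidate, so the logits have shape (batch_size, num_choices), where num_choices is taken from the shape of the input batch rather than from num_labels. Every label must therefore be strictly less than the number of candidate sequences supplied per example, which plain PyTorch demonstrates:

```python
import torch
import torch.nn as nn

# Multiple-choice classification scores one logit per answer candidate:
# logits have shape (batch_size, num_choices), and num_choices comes
# from the input batch, not from `num_labels`.
logits = torch.randn(16, 4)                 # 4 candidates per example
loss_fn = nn.CrossEntropyLoss()

ok_labels = torch.randint(0, 4, (16,))      # valid targets: 0..3
loss = loss_fn(logits, ok_labels)           # computes fine

bad_labels = ok_labels.clone()
bad_labels[0] = 4                           # out of range for 4 candidates
try:
    loss_fn(logits, bad_labels)
except (IndexError, RuntimeError) as err:   # raised on CPU; on GPU this is
    print("out-of-range target:", err)      # the device-side assert above
```

So with 5 answer choices per question, the fix would be to supply 5 candidate sequences per example (making num_choices equal 5), rather than passing num_labels=5.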

How do I operate on groups returned by Dask's group by?

I have the following table.
   value category
0      2        A
1     20        B
2      4        A
3     40        B
I want to add a mean column that contains the mean of the values for each category.
   value category  mean
0      2        A   3.0
1     20        B  30.0
2      4        A   3.0
3     40        B  30.0
I can do this in pandas like so
p = pd.DataFrame({"value": [2, 20, 4, 40], "category": ["A", "B", "A", "B"]})
groups = []
for _, group in p.groupby("category"):
    group.loc[:, "mean"] = group.loc[:, "value"].mean()
    groups.append(group)
pd.concat(groups).sort_index()
How do I do the same thing in Dask?
I can't use the pandas functions as-is because you can't enumerate over a groupby object in Dask. This
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
list(d.groupby("category"))
raises KeyError: 'Column not found: 0'.
I can use an apply function to calculate the mean in Dask.
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
q = d.groupby(["category"]).apply(lambda group: group["value"].mean(), meta="object")
q.compute()
returns
category
A 3.0
B 30.0
dtype: float64
But I can't figure out how to fold these back into the rows of the original table.
I would use a merge to achieve this operation:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({
    'value': [2, 20, 4, 40],
    'category': ['A', 'B', 'A', 'B'],
})
ddf = dd.from_pandas(df, npartitions=1)
# Lazy-compute mean per category
mean_by_category = (
    ddf
    .groupby('category')
    .agg({'value': 'mean'})
    .rename(columns={'value': 'mean'})
).persist()
mean_by_category.head()
# Assign 'mean' value to each corresponding category
ddf = ddf.merge(mean_by_category, left_on='category', right_index=True)
ddf.head()
Which should then output:
  category  value  mean
0        A      2   3.0
2        A      4   3.0
1        B     20  30.0
3        B     40  30.0

Accessing ghosted chunks with dask

Using dask, I would like to break up an image array into overlapping tiles, perform a computation (on all the tiles simultaneously), and then stitch the results back into an image.
The following works, but feels clumsy:
from dask import array as da
from dask.array import ghost
import numpy as np
test_data = np.random.random((50, 50))
x = da.from_array(test_data, chunks=(10, 10))
depth = {0: 1, 1: 1}
g = ghost.ghost(x, depth=depth, boundary='reflect')
# Calculate the shape of the array in terms of chunks
chunk_shape = [len(c) for c in g.chunks]
chunk_nr = np.prod(chunk_shape)
# Allocate a list for results (as many entries as there are chunks)
blocks = [None,] * chunk_nr
def pack_block(block, block_id):
    """Store `block` at the correct position in `blocks`,
    according to its `block_id`.

    E.g., with ``block_id == (0, 3)``, the block will be stored at
    ``blocks[3]``.
    """
    idx = np.ravel_multi_index(block_id, chunk_shape)
    blocks[idx] = block
    # We don't really need to return anything, but this will do
    return block
g.map_blocks(pack_block).compute()
# Do some operation on the blocks; this is an over-simplified example.
# Typically, I want to do an operation that considers *all*
# blocks simultaneously, hence the need to first unpack into a list.
blocks = [b**2 for b in blocks]
def retrieve_block(_, block_id):
    """Fetch the correct block from the results set, `blocks`."""
    idx = np.ravel_multi_index(block_id, chunk_shape)
    return blocks[idx]
result = g.map_blocks(retrieve_block)
# Slice off excess from each computed chunk
result = ghost.trim_internal(result, depth)
result = result.compute()
Is there a cleaner way to achieve the same end result?
The user-facing API for this is the map_overlap method:
>>> x = np.array([1, 1, 2, 3, 3, 3, 2, 1, 1])
>>> x = da.from_array(x, chunks=5)
>>> def derivative(x):
... return x - np.roll(x, 1)
>>> y = x.map_overlap(derivative, depth=1, boundary=0)
>>> y.compute()
array([ 1, 0, 1, 1, 0, 0, -1, -1, 0])
Two additional notes for your use case
Avoid hashing costs by supplying name=False to from_array. This saves you about 400MB/s assuming you don't have any fancy hashing libraries around.
x = da.from_array(x, name=False)
Be careful of computing inplace. Dask doesn't guarantee correct behavior if user functions mutate data inplace. In this particular case it's probably fine, since we're copying for ghosting anyway, but it's something to be aware of.
Second answer
Given the comment by @stefan-van-der-walt we'll try another solution.
Consider using the .to_delayed() method to get an array of chunks as dask.delayed objects
depth = {0: 1, 1: 1}
g = ghost.ghost(x, depth=depth, boundary='reflect')
blocks = g.to_delayed()
This gives you a numpy array of dask.delayed objects, each of which point to a block. You can now perform arbitrary parallel computations on these blocks. If I wanted them all to arrive at the same function then I might call the following:
result = dask.delayed(f)(blocks.tolist())
The function f will then get a list of lists of numpy arrays, each of which corresponds to one block in the dask.array g.
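For completeness, a rough sketch of that pattern with current names (hedged: in modern Dask the ghost module has become dask.array.overlap, with overlap and trim_internal as the public functions; f here is a stand-in that simply sums every block):

```python
import numpy as np
import dask
import dask.array as da

test_data = np.random.random((50, 50))
x = da.from_array(test_data, chunks=(10, 10))

# Modern spelling of ghost.ghost
g = da.overlap.overlap(x, depth={0: 1, 1: 1}, boundary='reflect')

# A numpy object array of dask.delayed objects, one per overlapped chunk
blocks = g.to_delayed()

def f(list_of_lists):
    # Receives a list of lists of numpy arrays, covering all blocks at once
    return sum(block.sum() for row in list_of_lists for block in row)

result = dask.delayed(f)(blocks.tolist())
total = result.compute()
```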

Hashfunction to map combinations of 5 to 7 cards

Referring to the original problem: Optimizing hand-evaluation algorithm for Poker-Monte-Carlo-Simulation
I have a list of 5 to 7 cards and want to store their value in a hashtable, which should be an array of 32-bit-integers and directly accessed by the hashfunctions value as index.
Regarding the large amount of possible combinations in a 52-card-deck, I don't want to waste too much memory.
Numbers:
7-card-combinations: 133784560
6-card-combinations: 20358520
5-card-combinations: 2598960
Total: 156,742,040 possible combinations
Storing 157 million 32-bit integer values takes about 600 MB. So I would like to avoid inflating this number by reserving slots in the array for values that aren't needed.
So the question is: what could a hash function look like that maps each possible, non-duplicated combination of cards to a consecutive value between 0 and 156,742,040, or at least comes close to it?
Paul Senzee has a great post on this for 7 cards (deleted link as it is broken and now points to a NSFW site).
His code is basically a bunch of pre-computed tables and then one function to look up the array index for a given 7-card hand (represented as a 64-bit number with the lowest 52 bits signifying cards):
inline unsigned index52c7(unsigned __int64 x)
{
    const unsigned short *a = (const unsigned short *)&x;
    unsigned A = a[3], B = a[2], C = a[1], D = a[0],
             bcA = _bitcount[A], bcB = _bitcount[B], bcC = _bitcount[C], bcD = _bitcount[D],
             mulA = _choose48x[7 - bcA], mulB = _choose32x[7 - (bcA + bcB)], mulC = _choose16x[bcD];
    return _offsets52c[bcA] + _table4[A] * mulA +
           _offsets48c[ (bcA << 4) + bcB] + _table [B] * mulB +
           _offsets32c[((bcA + bcB) << 4) + bcC] + _table [C] * mulC +
           _table [D];
}
In short, it's a bunch of lookups and bitwise operations powered by pre-computed lookup tables based on perfect hashing.
If you go back and look at this website, you can get the perfect hash code that Senzee used to create the 7-card hash and repeat the process for 5- and 6-card tables (essentially creating a new index52c7.h for each). You might be able to smash all 3 into one table, but I haven't tried that.
All told that should be ~628 MB (4 bytes * 157 M entries). Or, if you want to split it up, you can map it to 16-bit numbers (since I believe most poker hand evaluators only need 7,462 unique hand scores) and then have a separate map from those 7,462 hand scores to whatever hand categories you want. That would be 314 MB.
Here's a different answer based on the colex function concept. It works with bitsets sorted in descending order. Below is a Python implementation (both recursive, so you can see the logic, and iterative). The main concept is that, given a bitset, you can always calculate how many bitsets there are with the same number of set bits but less than (in either the lexicographical or mathematical sense) your given bitset. I got the idea from this paper on hand isomorphisms.
from math import factorial

def n_choose_k(n, k):
    return 0 if n < k else factorial(n) // (factorial(k) * factorial(n - k))

def indexset_recursive(bitset, lowest_bit=0):
    """Return number of bitsets with same number of set bits but less than
    given bitset.

    Args:
        bitset (sequence) - Sequence of set bits in descending order.
        lowest_bit (int) - Name of the lowest bit. Default = 0.

    >>> indexset_recursive([51, 50, 49, 48, 47, 46, 45])
    133784559
    >>> indexset_recursive([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
    133784559
    >>> indexset_recursive([6, 5, 4, 3, 2, 1, 0])
    0
    >>> indexset_recursive([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
    0
    """
    m = len(bitset)
    first = bitset[0] - lowest_bit
    if m == 1:
        return first
    else:
        t = n_choose_k(first, m)
        return t + indexset_recursive(bitset[1:], lowest_bit)

def indexset(bitset, lowest_bit=0):
    """Return number of bitsets with same number of set bits but less than
    given bitset.

    Args:
        bitset (sequence) - Sequence of set bits in descending order.
        lowest_bit (int) - Name of the lowest bit. Default = 0.

    >>> indexset([51, 50, 49, 48, 47, 46, 45])
    133784559
    >>> indexset([52, 51, 50, 49, 48, 47, 46], lowest_bit=1)
    133784559
    >>> indexset([6, 5, 4, 3, 2, 1, 0])
    0
    >>> indexset([7, 6, 5, 4, 3, 2, 1], lowest_bit=1)
    0
    """
    m = len(bitset)
    g = enumerate(bitset)
    return sum(n_choose_k(bit - lowest_bit, m - i) for i, bit in g)
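Building on indexset, a consecutive index across all 5-, 6- and 7-card hands can be had by stacking the three colex ranges back to back (a sketch; hand_index and the offset layout are my own naming, not from the paper): 7-card hands occupy [0, C(52,7)), 6-card hands the next C(52,6) slots, and 5-card hands the final C(52,5).

```python
from math import comb

def hand_index(bitset):
    """Map a 5-, 6- or 7-card hand to a consecutive index in
    [0, 156742040).  `bitset` is the sequence of set bits (cards 0-51)
    in descending order, as in `indexset` above.
    """
    m = len(bitset)
    # Offsets stack the three colex ranges back to back.
    offsets = {7: 0, 6: comb(52, 7), 5: comb(52, 7) + comb(52, 6)}
    colex = sum(comb(bit, m - i) for i, bit in enumerate(bitset))
    return offsets[m] + colex
```

The highest 7-card hand, [51, 50, 49, 48, 47, 46, 45], maps to 133,784,559 and the highest 5-card hand, [51, 50, 49, 48, 47], to 156,742,039, so every hand size lands in its own gap-free range and the total fits the 156,742,040-entry table exactly.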
