scikit-learn: remove less frequent categorical classes - machine-learning

I'm doing a classification task where the number of distinct classes is 1500. From these, I would like to remove the classes (and their corresponding records) whose frequency is less than 10.
I can write a function something like this:
code_freq_hash = {}
for code in y:
    code_freq_hash.setdefault(code, 0)
    code_freq_hash[code] += 1
to get the frequency of each class and then remove the corresponding records.
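For reference, here is a minimal sketch of the removal step that would follow (X is assumed to hold the features and y the labels, both as NumPy arrays; the names are placeholders, not from the original post):
import numpy as np
# count how often each class occurs (same idea as the loop above)
code_freq_hash = {}
for code in y:
    code_freq_hash.setdefault(code, 0)
    code_freq_hash[code] += 1
# keep only the rows whose class occurs at least 10 times
mask = np.array([code_freq_hash[code] >= 10 for code in y])
X, y = X[mask], y[mask]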
However, I'm wondering whether there is a built-in function to do this in scikit-learn or Keras.

Here's a sample solution using NumPy and pandas.
Creating a data set with two features and one class column:
import numpy as np
import pandas as pd
data = np.hstack((np.random.randn(20, 2), np.random.choice(np.arange(20), (20, 1))))
Numpy
val, count = np.unique(data[:, -1], return_counts=True)
keep = val[count > 2]  # classes that occur more than twice; use 10 for your problem
out = data[np.isin(data[:, -1], keep)]
Pandas
Converting the dataset (a NumPy array) to a pandas DataFrame:
df = pd.DataFrame(data)
# renaming the last column to "class"
df.rename(columns={df.columns[-1]: "class"}, inplace=True)
0 1 class
0 0.542154 -0.434981 3.0
1 1.513857 -0.606722 17.0
2 0.372834 -0.120914 0.0
3 -1.357369 1.575805 5.0
4 0.547217 0.719883 4.0
5 0.818016 -0.243919 9.0
6 -0.400552 0.066519 19.0
7 0.463596 1.020041 6.0
8 0.850465 -0.814260 14.0
9 1.693060 0.186741 17.0
10 -0.287775 -0.190247 3.0
11 -0.390932 -0.418964 6.0
12 0.209542 0.797151 5.0
13 0.126585 -0.345196 5.0
14 -0.151729 -1.260708 4.0
15 -1.042408 1.050194 6.0
16 -0.221668 1.763742 5.0
17 -0.045617 1.159383 5.0
18 1.452508 -0.785115 5.0
19 2.125601 1.745009 2.0
Counting the occurrences and keeping only the classes that occur more than twice (set 2 to 10 in your case):
d = df.loc[df['class'].isin(df['class'].value_counts().index[df['class'].value_counts() > 2])]
You can obtain the numpy array as d.values
array([[-1.35736852, 1.57580524, 5. ],
[ 0.46359614, 1.02004142, 6. ],
[-0.39093188, -0.41896435, 6. ],
[ 0.20954221, 0.79715056, 5. ],
[ 0.12658469, -0.34519613, 5. ],
[-1.04240815, 1.05019427, 6. ],
[-0.2216682 , 1.76374209, 5. ],
[-0.0456175 , 1.15938322, 5. ],
[ 1.45250806, -0.78511526, 5. ]])
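If you prefer to stay in pandas, the same filtering can also be written with groupby().filter (a sketch under the same assumptions; again, replace 2 with 10 for the actual problem):
d = df.groupby("class").filter(lambda g: len(g) > 2)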

There is no direct solution for this in scikit-learn, but as you mentioned it can be achieved with custom functions.
import pandas as pd
import numpy as np
df = pd.DataFrame({'labels': np.random.randint(0, 10, size=50000),
                   'input': np.random.choice(['sample text 1', 'sample text 1'], size=50000)})
threshold = 5000
labels_df = df.labels.value_counts()
filtered_labels = labels_df[labels_df > threshold].index
new_df = df.loc[df['labels'].isin(filtered_labels), :]
new_df.shape
# (25290, 2)

One solution could be the following code snippet:
import numpy as np
unique, appearances = np.unique(y, return_counts=True)  # y holds the class labels
code_freq_hash = [(unique[i], appearances[i]) for i in range(len(unique)) if appearances[i] >= 10]
Or, more elegantly: relevant_labels = unique[appearances >= 10]
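To then drop the corresponding records, a small follow-up sketch (again assuming X holds the features and y the labels as NumPy arrays):
mask = np.isin(y, relevant_labels)
X_filtered, y_filtered = X[mask], y[mask]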

Related

Why isn't RandomCrop inserting the padding in pytorch?

RandomCrop doesn't seem to be applying the padding when I create my images. Why is that?
Reproducible script 1
todo with cifar...
Reproducible script 2:
code:
def check_size_of_mini_imagenet_original_img():
    import random
    import numpy as np
    import torch
    import os
    seed = 0
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    random.seed(seed)
    import learn2learn
    batch_size = 5
    kwargs: dict = dict(name='mini-imagenet', train_ways=2, train_samples=2, test_ways=2, test_samples=2)
    kwargs['data_augmentation'] = 'lee2019'
    benchmark: learn2learn.BenchmarkTasksets = learn2learn.vision.benchmarks.get_tasksets(**kwargs)
    splits = ['train', 'validation', 'test']  # assumption: `splits` was not defined in the original snippet
    tasksets = [(split, getattr(benchmark, split)) for split in splits]
    for i, (split, taskset) in enumerate(tasksets):
        print(f'{taskset=}')
        print(f'{taskset.dataset.dataset.transform=}')
        for task_num in range(batch_size):
            X, y = taskset.sample()
            print(f'{X.size()=}')
            assert X.size(2) == 84
            print(f'{y.size()=}')
            print(f'{y=}')
            for img_idx in range(X.size(0)):
                visualize_pytorch_tensor_img(X[img_idx], show_img_now=True)
                if img_idx >= 5:  # print 5 images only
                    break
            # visualize_pytorch_batch_of_imgs(X, show_img_now=True)
            print()
            if task_num >= 4:  # so to get a MI image finally (note omniglot does not have padding at train...oops!)
                break
            break
        break
and
def visualize_pytorch_tensor_img(tensor_image: torch.Tensor, show_img_now: bool = False):
    """
    Display a CHW tensor image with matplotlib.
    Because the channel order in PyTorch (CHW) and matplotlib (HWC) does not agree,
    use .permute() to put the channels as the last dimension.
    ref: https://stackoverflow.com/questions/53623472/how-do-i-display-a-single-image-in-pytorch
    """
    from matplotlib import pyplot as plt
    assert len(tensor_image.size()) == 3, f'Err your tensor is the wrong shape {tensor_image.size()=} ' \
                                          f'likely it should have been a single tensor with 3 channels ' \
                                          f'i.e. CHW.'
    if tensor_image.size(0) == 3:  # three channels
        plt.imshow(tensor_image.permute(1, 2, 0))
    else:
        plt.imshow(tensor_image)
    if show_img_now:
        plt.tight_layout()
        plt.show()
images here: https://github.com/learnables/learn2learn/issues/376#issuecomment-1319368831
first one:
I am getting odd images even though I print the transform the data is using:
-- splits[i]='train'
taskset=<learn2learn.data.task_dataset.TaskDataset object at 0x7fbc38345880>
taskset.dataset.dataset.datasets[0].dataset.transform=Compose(
ToPILImage()
RandomCrop(size=(84, 84), padding=8)
ColorJitter(brightness=[0.6, 1.4], contrast=[0.6, 1.4], saturation=[0.6, 1.4], hue=None)
RandomHorizontalFlip(p=0.5)
ToTensor()
Normalize(mean=[0.47214064400000005, 0.45330829125490196, 0.4099612805098039], std=[0.2771838538039216, 0.26775040952941176, 0.28449041290196075])
)
but the padding is missing:
but when I use this instead:
train_data_transform = Compose([
RandomResizedCrop((size - padding*2, size - padding*2), scale=scale, ratio=ratio),
Pad(padding=padding),
ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
RandomHorizontalFlip(),
ToTensor(),
Normalize(mean=mean, std=std),
])
it seems to work:
Why don't both show the 8-pixel padding on each side that I expect?
I also tried viewing the mini-imagenet images from torchmeta, and the padding didn't seem to be there either:
task_num=0
Compose(
RandomCrop(size=(84, 84), padding=8)
RandomHorizontalFlip(p=0.5)
ColorJitter(brightness=[0.6, 1.4], contrast=[0.6, 1.4], saturation=[0.6, 1.4], hue=[-0.2, 0.2])
ToTensor()
Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
)
X.size()=torch.Size([25, 3, 84, 84])
The code is much harder to make compact and reproducible, but you can see torchmeta_plot_images_is_the_padding_there in my ultimate-utils library.
For now, since two data sets show that padding is not being inserted even though the transform says it should be, I am concluding that there is a bug in PyTorch, or in my PyTorch version, or that I simply don't understand RandomCrop. But the description is clear to me:
padding (int or sequence, optional) –
Optional padding on each border of the image. Default is None. If a single int is provided this is used to pad all borders.
and the normal padding Pad(...) says something very similar:
padding (int or sequence) –
Padding on each border. If a single int is provided this is used to pad all borders.
So what else could go wrong? The bottom image I provided with a pad was produced with the Pad() function above, not with RandomCrop.
cross:
gitissues: https://github.com/learnables/learn2learn/issues/376
pytorch forum: https://discuss.pytorch.org/t/why-isnt-randomcrop-inserting-the-padding-in-pytorch/166244
They are padded to 84 + 2*8 = 100 and then randomly cropped back to 84: you can see the black padding on each image (e.g., on the left of the 2nd image).
I discovered and confirmed that by doing it on CIFAR. But note this is NOT what the docs say for RandomCrop:
Optional padding on each border of the image. Default is None. If a single int is provided this is used to pad all borders.
it says something very similar to pad:
Padding on each border. If a single int is provided this is used to pad all borders.
See: https://github.com/learnables/learn2learn/issues/376#issuecomment-1319405466
I am going to report this to pytorch as a bug https://github.com/pytorch/pytorch/issues/89253. Reproducible code in cifar:
def check_padding_random_crop_cifar_pure_torch():
    # -
    import sys
    print(f'python version: {sys.version=}')
    import torch
    print(f'{torch.__version__=}')
    # -
    from uutils.plot.image_visualization import visualize_pytorch_tensor_img
    from torchvision.transforms import RandomCrop
    # - for determinism
    import random
    random.seed(0)
    import torch
    torch.manual_seed(0)
    import numpy as np
    np.random.seed(0)
    # -
    from pathlib import Path
    root = Path('~/data/').expanduser()
    import torch
    import torchvision
    # - test tensor imgs
    from torchvision.transforms import Resize
    from torchvision.transforms import Pad
    from torchvision.transforms import ToTensor
    from torchvision.transforms import Compose
    # -- see if pad doubles length
    print(f'--- test padding doubles length with Pad(...)')
    transform = Compose([Resize((32, 32)), Pad(padding=4), ToTensor()])
    train = torchvision.datasets.CIFAR100(root=root, train=True, download=True,
                                          transform=transform,
                                          target_transform=lambda data: torch.tensor(data, dtype=torch.long))
    transform = Compose([Resize((32, 32)), Pad(padding=8), ToTensor()])
    test = torchvision.datasets.CIFAR100(root=root, train=True, download=True,
                                         transform=transform,
                                         target_transform=lambda data: torch.tensor(data, dtype=torch.long))
    # - test padding doubles length
    from torch.utils.data import DataLoader
    loader = DataLoader(train)
    x, y = next(iter(loader))
    print(f'{x[0].size()=}')
    assert x[0].size(2) == 32 + 4 * 2
    assert x[0].size(2) == 32 + 8
    visualize_pytorch_tensor_img(x[0], show_img_now=True)
    #
    loader = DataLoader(test)
    x, y = next(iter(loader))
    print(f'{x[0].size()=}')
    assert x.size(2) == 32 + 8 * 2
    assert x.size(2) == 32 + 16
    visualize_pytorch_tensor_img(x[0], show_img_now=True)
    # -- see if RandomCrop also puts the pad
    print(f'--- test RandomCrop indeed puts padding')
    transform = Compose([Resize((32, 32)), RandomCrop(28, padding=8), ToTensor()])
    train = torchvision.datasets.CIFAR100(root=root, train=True, download=True,
                                          transform=transform,
                                          target_transform=lambda data: torch.tensor(data, dtype=torch.long))
    transform = Compose([Resize((32, 32)), RandomCrop(28), ToTensor()])
    test = torchvision.datasets.CIFAR100(root=root, train=True, download=True,
                                         transform=transform,
                                         target_transform=lambda data: torch.tensor(data, dtype=torch.long))
    # - test that the padding is there visually
    from torch.utils.data import DataLoader
    loader = DataLoader(train)
    x, y = next(iter(loader))
    print(f'{x[0].size()=}')
    assert x[0].size(2) == 28
    visualize_pytorch_tensor_img(x[0], show_img_now=True)
    #
    loader = DataLoader(test)
    x, y = next(iter(loader))
    print(f'{x[0].size()=}')
    assert x.size(2) == 28
    visualize_pytorch_tensor_img(x[0], show_img_now=True)
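To make the point concrete, here is a minimal standalone sketch (an assumed example, not from the original post) showing why the padded border usually isn't visible: RandomCrop(size, padding=p) first pads the image and then crops back to size, so the output shape is unchanged and the padding only shows when the random crop lands on the border.
import torch
from torchvision.transforms import RandomCrop, Pad
img = torch.rand(3, 84, 84)  # a fake CHW image
# RandomCrop with padding: pads to 84 + 2*8 = 100, then crops an 84x84 window back out
out1 = RandomCrop(84, padding=8)(img)
print(out1.shape)  # torch.Size([3, 84, 84])
# Pad alone keeps the extra border, so the size grows
out2 = Pad(8)(img)
print(out2.shape)  # torch.Size([3, 100, 100])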

RandomForestRegressor in Julia

I'm trying to train a RandomForestRegressor using DecisionTree.jl
and RandomizedSearchCV (contained in ScikitLearn.jl) in Julia. Primary datasets like x_train and y_train etc. are provided in my google drive as well, So you can test it on your machine. The code is as follows:
using CSV
using DataFrames
using ScikitLearn: fit!, predict
using ScikitLearn.GridSearch: RandomizedSearchCV
using DecisionTree
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor()
param_dist = Dict("n_trees"=>[50 , 100, 200, 300],
"max_depth"=> [3, 5, 6 ,8 , 9 ,10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(x_test)
This throws a MethodError like this:
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Float64}, ::Matrix{Float64})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
[1] _fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64}, parameter_iterable::Vector{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
[2] fit!(self::RandomizedSearchCV, X::Matrix{Float64}, y::Matrix{Float64})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
[3] top-level scope
# c:\Users\Shayan\Desktop\AUT\Thesis\test.jl:17
If you're curious about the shape of the data:
julia> size(x)
(1550, 71)
julia> size(y_train)
(1550, 10)
How can I solve this problem?
PS: Also I tried:
julia> fit!(model, Matrix{Any}(x), Matrix{Any}(DataFrames.dropmissing(y_train)))
ERROR: MethodError: no method matching fit!(::RandomForestRegressor, ::Matrix{Any}, ::Matrix{Any})
Closest candidates are:
fit!(::ScikitLearn.Models.FixedConstant, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:26
fit!(::ScikitLearn.Models.ConstantRegressor, ::Any, ::Any) at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\constant_model.jl:10
fit!(::ScikitLearn.Models.LinearRegression, ::AbstractArray{XT}, ::AbstractArray{yT}) where {XT, yT} at C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\models\linear_regression.jl:27
...
Stacktrace:
[1] _fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any}, parameter_iterable::Vector{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:332
[2] fit!(self::RandomizedSearchCV, X::Matrix{Any}, y::Matrix{Any})
# ScikitLearn.Skcore C:\Users\Shayan\.julia\packages\ScikitLearn\ssekP\src\grid_search.jl:748
[3] top-level scope
# c:\Users\Shayan\Desktop\AUT\Thesis\MyWork\Thesis.jl:327
Looking at the Random Forest Regression example in the DecisionTree.jl docs, the example doesn't follow the fit!()/predict() design pattern, and the error confirms that fit!() doesn't support RandomForestRegressor. Alternatively, you might look at the RandomForest.jl package, which does follow the fit!()/predict() pattern.
As stated here, DecisionTree.jl doesn't support multi-output random forests yet. So I gave up on using DecisionTree.jl, and ScikitLearn.jl is adequate in my case:
using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using CSV
using DataFrames
x = CSV.read("x.csv", DataFrames.DataFrame)
x_test = CSV.read("x_test.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
x_test = reshape(x_test, 1,length(x_test))
mod = RandomForestRegressor()
param_dist = Dict("n_estimators" => [50, 100, 200, 300],
                  "max_depth" => [3, 5, 6, 8, 9, 10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5)
fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
predict(model, x_test)
This works fine for me, but it's super slow! Much slower than Python. I'll add a benchmark with the same data sets across the two languages.
Benchmarking
Here I report the result of benchmarking with the same action, the same values, and the same data. All the data and code files are available in my Google Drive. So feel free to test it by yourself. First, I start with Julia.
Julia
using CSV
using DataFrames
using ScikitLearn: @sk_import, fit!, predict
@sk_import ensemble: RandomForestRegressor
using ScikitLearn.GridSearch: RandomizedSearchCV
using BenchmarkTools
x = CSV.read("x.csv", DataFrames.DataFrame)
y_train = CSV.read("y_train.csv", DataFrames.DataFrame)
mod = RandomForestRegressor(max_leaf_nodes=2)
param_dist = Dict("n_estimators" => [50, 100, 200, 300],
                  "max_depth" => [3, 5, 6, 8, 9, 10])
model = RandomizedSearchCV(mod, param_dist, n_iter=10, cv=5, n_jobs=1)
@btime fit!(model, Matrix(x), Matrix(DataFrames.dropmissing(y_train)))
# 52.123 s (6965 allocations: 44.34 MiB)
Python
>>> import cProfile, pstats
>>> import pandas as pd
>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.model_selection import RandomizedSearchCV
>>> x = pd.read_csv("x.csv")
>>> y_train = pd.read_csv("y_train.csv")
>>> mod = RandomForestRegressor(max_leaf_nodes=2)
>>> parameters = {
'n_estimators': [50 , 100, 200, 300],
'max_depth': [3, 5, 6 ,8 , 9 ,10]}
>>> model = RandomizedSearchCV(mod, param_distributions=parameters, cv=5, n_iter=10, n_jobs=1)
>>> pr = cProfile.Profile()
>>> pr.enable()
>>> model.fit(x , y_train)
>>> pr.disable()
>>> stats = pstats.Stats(pr).strip_dirs().sort_stats("cumtime")
>>> stats.print_stats(5)
12097437 function calls (11936452 primitive calls) in 73.452 seconds
Ordered by: cumulative time
List reduced from 736 to 5 due to restriction <5>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 73.445 73.445 _search.py:738(fit)
102/2 0.027 0.000 73.370 36.685 parallel.py:960(__call__)
12252/152 0.171 0.000 73.364 0.483 parallel.py:798(dispatch_one_batch)
12150/150 0.058 0.000 73.324 0.489 parallel.py:761(_dispatch)
12150/150 0.025 0.000 73.323 0.489 _parallel_backends.py:206(apply_async)
So I conclude that Julia performs better than Python on this specific problem in terms of speed.

Repartition dask dataframe while maintaining its original sequence

I am trying to repartition multiple .parquet files in order to save a specific number of parquet files. I have a time-series data that is dependent on the number of observations (instead of timestamps) for each client, so I need to ensure that the partitioning will not split a series over two files. In addition, I want to preserve the order, since I have the labels stored elsewhere. Here is an example of what I am trying to do:
import pandas as pd
import dask.dataframe as dd
ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})
# indexing on "customer_id" here
df = df.set_index("customer_id")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")
last_idx = [ids[-1]]
my_divisions = ids + last_idx
read_ddf.divisions = my_divisions
# Split into two equal partitions with three customers each
new_divisions = [my_divisions[0], my_divisions[3], my_divisions[5]]
new_ddf = read_ddf.repartition(divisions=new_divisions)
which raises an error:
ValueError: New division must be sorted
I have tried an alternative approach, which involves setting the "partition" column as the index and modifying the index to "ids" later, but this sorts my entire dataframe, which is undesired because the new sequence no longer matches the labels stored. This is shown here:
import pandas as pd
import dask.dataframe as dd
ids = [9635, 1536, 8477, 1088, 6411, 2251]
df = pd.DataFrame({
    "partition": [0]*3 + [1]*3 + [2]*3 + [3]*3 + [4]*3 + [5]*3,
    "customer_id": [ids[0]]*3 + [ids[1]]*3 + [ids[2]]*3 + [ids[3]]*3 + [ids[4]]*3 + [ids[5]]*3,
    "x": range(18)})
# indexing on the defined "partition" instead
df = df.set_index("partition")
ddf = dd.from_pandas(df, npartitions=6)
ddf.to_parquet("my_parquets")
read_ddf = dd.read_parquet("my_parquets/*.parquet")
# my_range is equivalent to the list of partitions
my_range = [i for i in range(0,6)]
last_idx = [my_range[-1]]
my_divisions = my_range + last_idx
read_ddf.divisions = my_divisions
new_divisions = [0, 2, 4, 5]
new_ddf = read_ddf.repartition(divisions=new_divisions)
# Need the "customer_id" as index
new_ddf = new_ddf.set_index("customer_id", drop = True)
But this sorts the dataframe by the index and messes up the structure, while I would like to keep the original order.
print("Partition 0")
print(new_ddf.get_partition(0).compute())
print("-------------------")
print("Partition 1")
print(new_ddf.get_partition(1).compute())
print("-------------------")
print("Partition 2")
print(new_ddf.get_partition(2).compute())
Partition 0
Empty DataFrame
Columns: [x]
Index: []
-------------------
Partition 1
x
customer_id
1088 9
1088 10
1088 11
1536 3
1536 4
1536 5
-------------------
Partition 2
x
customer_id
2251 15
2251 16
2251 17
6411 12
6411 13
6411 14
8477 6
8477 7
8477 8
9635 0
9635 1
9635 2
Are there any workarounds for this issue? I am aware that set_index in dask is quite expensive, but none of the approaches are currently working. Also, in my case I already have the .parquet files with the preprocessed data, so I only created the initial dataframe using pandas for demonstration purposes (it would have been much easier to specify the number of partitions in the first step if I had all the data in pandas).

How do I operate on groups returned by Dask's group by?

I have the following table.
value category
0 2 A
1 20 B
2 4 A
3 40 B
I want to add a mean column that contains the mean of the values for each category.
value category mean
0 2 A 3.0
1 20 B 30.0
2 4 A 3.0
3 40 B 30.0
I can do this in pandas like so
p = pd.DataFrame({"value":[2, 20, 4, 40], "category": ["A", "B", "A", "B"]})
groups = []
for _, group in p.groupby("category"):
group.loc[:,"mean"] = group.loc[:,"value"].mean()
groups.append(group)
pd.concat(groups).sort_index()
How do I do the same thing in Dask?
I can't use the pandas functions as-is because you can't enumerate over a groupby object in Dask. This
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
list(d.groupby("category"))
raises KeyError: 'Column not found: 0'.
I can use an apply function to calculate the mean in Dask.
import dask.dataframe as dd
d = dd.from_pandas(p, chunksize=100)
q = d.groupby(["category"]).apply(lambda group: group["value"].mean(), meta="object")
q.compute()
returns
category
A 3.0
B 30.0
dtype: float64
But I can't figure out how to fold these back into the rows of the original table.
I would use a merge to achieve this operation:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({
    'value': [2, 20, 4, 40],
    'category': ['A', 'B', 'A', 'B']
})
ddf = dd.from_pandas(df, npartitions=1)
# Lazy-compute mean per category
mean_by_category = (ddf
                    .groupby('category')
                    .agg({'value': 'mean'})
                    .rename(columns={'value': 'mean'})
                    ).persist()
mean_by_category.head()
# Assign 'mean' value to each corresponding category
ddf = ddf.merge(mean_by_category, left_on='category', right_index=True)
ddf.head()
Which should then output:
category value mean
0 A 2 3.0
2 A 4 3.0
1 B 20 30.0
3 B 40 30.0
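Note that the merge reorders the rows by the merge key. If you need them back in their original 0..3 order (to line up with labels stored elsewhere, as in the pandas snippet), one option is to sort by the index after computing; a small sketch:
result = ddf.compute().sort_index()
print(result)  # rows 0, 1, 2, 3 with their per-category means attached
In plain pandas, by the way, the whole operation is simply p['mean'] = p.groupby('category')['value'].transform('mean').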

Torch - repeat tensor like numpy repeat

I am trying to repeat a tensor in torch in two ways. For example, repeating the tensor {1,2,3,4} 3 times each way should yield:
{1,2,3,4,1,2,3,4,1,2,3,4}
{1,1,1,2,2,2,3,3,3,4,4,4}
There is a built in torch:repeatTensor function which will generate the first of the two (like numpy.tile()) but I can't find one for the latter (like numpy.repeat()). I'm sure that I could call sort on the first to give the second but I think this might be computationally expensive for larger arrays?
Thanks.
Try torch.repeat_interleave() method: https://pytorch.org/docs/stable/torch.html#torch.repeat_interleave
>>> x = torch.tensor([1, 2, 3])
>>> x.repeat_interleave(2)
tensor([1, 1, 2, 2, 3, 3])
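repeat_interleave also takes a dim argument for multi-dimensional tensors, which covers the per-dimension case handled by the generic function further down; a small sketch:
>>> y = torch.tensor([[1, 2], [3, 4]])
>>> y.repeat_interleave(2, dim=0)
tensor([[1, 2],
        [1, 2],
        [3, 4],
        [3, 4]])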
Quoting https://discuss.pytorch.org/t/how-to-tile-a-tensor/13853 -
z = torch.FloatTensor([[1,2,3],[4,5,6],[7,8,9]])
1 2 3
4 5 6
7 8 9
z.transpose(0,1).repeat(1,3).view(-1, 3).transpose(0,1)
1 1 1 2 2 2 3 3 3
4 4 4 5 5 5 6 6 6
7 7 7 8 8 8 9 9 9
This will give you an intuitive feel for how it works.
a = torch.Tensor([1,2,3,4])
To get [1., 2., 3., 4., 1., 2., 3., 4., 1., 2., 3., 4.] we repeat the tensor thrice in the 1st dimension:
a.repeat(3)
To get [1,1,1,2,2,2,3,3,3,4,4,4] we add a dimension to the tensor and repeat it thrice in the 2nd dimension to get a 4 x 3 tensor, which we can flatten.
b = a.reshape(4,1).repeat(1,3).flatten()
or
b = a.reshape(4,1).repeat(1,3).view(-1)
Here's a generic function that repeats elements in tensors.
def repeat(tensor, dims):
    if len(dims) != len(tensor.shape):
        raise ValueError("The length of the second argument must equal the number of dimensions of the first.")
    for index, dim in enumerate(dims):
        repetition_vector = [1] * (len(dims) + 1)
        repetition_vector[index + 1] = dim
        new_tensor_shape = list(tensor.shape)
        new_tensor_shape[index] *= dim
        tensor = tensor.unsqueeze(index + 1).repeat(repetition_vector).reshape(new_tensor_shape)
    return tensor
If you have
foo = tensor([[1, 2],
[3, 4]])
By calling repeat(foo, [2,1]) you get
tensor([[1, 2],
[1, 2],
[3, 4],
[3, 4]])
So you duplicated every element along dimension 0 and left elements as they are on dimension 1.
Use einops:
from einops import repeat
repeat(x, 'i -> (repeat i)', repeat=3)
# like {1,2,3,4,1,2,3,4,1,2,3,4}
repeat(x, 'i -> (i repeat)', repeat=3)
# like {1,1,1,2,2,2,3,3,3,4,4,4}
This code works identically for any framework (numpy, torch, tf, etc.)
Can you try something like:
import torch as pt
# 1: works like numpy tile
b = pt.arange(10)
print(b.repeat(3))
# 2: also works like numpy tile
b = pt.tensor(1).repeat(10).reshape(2, -1)
print(b)
# 3: works like numpy repeat
t = pt.tensor([1, 2, 3])
t.repeat(2).reshape(2, -1).transpose(1, 0).reshape(-1)
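For reference, the two requested behaviours map onto NumPy and modern PyTorch like this (a small comparison sketch):
import numpy as np
import torch
a = np.array([1, 2, 3, 4])
t = torch.tensor([1, 2, 3, 4])
print(np.tile(a, 3))           # [1 2 3 4 1 2 3 4 1 2 3 4]
print(t.repeat(3))             # tensor([1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4])
print(np.repeat(a, 3))         # [1 1 1 2 2 2 3 3 3 4 4 4]
print(t.repeat_interleave(3))  # tensor([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])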
