Why isn't my custom transformer transforming the test set? - machine-learning

I am trying to build a custom transformer for standardizing the data. If I use fit_transform on the training set, it works correctly, but applying the transform function to the test set returns only NaNs. The code is below. The input is a pandas DataFrame, say a random 3x3 DataFrame with integer values in the range (0, 4). The array my transform returns has rows = rows of the test data passed and columns = double the number of columns of the test data, with NaNs everywhere, like this:
array([[nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan],
[nan, nan, nan, nan, nan, nan]])
This is my custom transformer code:
import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator

class smooth_score(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.mean = np.mean(X)
        self.std = np.std(X)
        return self

    def transform(self, X):
        X = (X - self.mean) / self.std
        return np.array(X)

Here's a modified version of your code:
import pandas as pd
import numpy as np
from sklearn.base import TransformerMixin

class smooth_score(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        self.mean = np.mean(X, axis=(0, 1))
        self.std = np.std(X, axis=(0, 1))
        return self

    def transform(self, X):
        X = (X - self.mean) / self.std
        return X

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
tf = smooth_score()
tf.fit(df.values)
new = tf.transform(df.values)
where new is:
array([[-1.54919334, -1.161895 , -0.77459667],
[-0.38729833, 0. , 0.38729833],
[ 0.77459667, 1.161895 , 1.54919334]])
np.std() and np.mean() work per axis, so if you want to calculate them across all axes (i.e. get a single number, not a 1D vector), you need to specify all axes, hence the axis=(0, 1) parameter. This should fix your dimension issues.
np.std() and np.mean() do not give you a single number on a pandas DataFrame; you get per-column statistics instead, hence df.values, which gets the underlying numpy array. (The doubled NaN columns in your output most likely come from pandas aligning the test frame's columns against that per-column result.) Alternatively, you can use X.values.mean() and X.values.std() where X is the DataFrame. Note that chaining X.mean().mean() does give the grand mean, but X.std().std() would be the std of the per-column stds, not the overall std.
I would also check for self.std == 0, since dividing by zero will likewise give you NaNs.
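Putting those pieces together, here is a minimal sketch of the transformer with such a guard; the epsilon value is an arbitrary choice of mine, not something prescribed by sklearn:
import numpy as np
from sklearn.base import TransformerMixin

class smooth_score(TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X)        # works for DataFrames and arrays alike
        self.mean = X.mean()     # scalar over all elements
        self.std = X.std()
        return self

    def transform(self, X):
        X = np.asarray(X)
        eps = 1e-12              # arbitrary guard against division by zero
        return (X - self.mean) / max(self.std, eps)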

Related

PyTorch - scaling data for training and then rescaling results back

I am working on an autoencoder network using PyTorch. I have a dataset of rows with 10 columns each, containing values roughly in [-0.2, 0.2].
Since all the built-in functions for automated data preparation I know about work on images and other data types, I assume I have to rescale these into the [0, 1] range myself, train the network, and then scale every result back to the original dataset's scale.
The scaling algorithm I used was (input is the scaled data for training, output is the result of the network):
input -= min(data)
input /= max(input)
output *= (abs(min(data)) + max(data))  # the last division was by the "shifted" max
output += min(data)
Here is the actual code:
class AirfoilDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.airfoils = np.copy(data)
        self.airfoils -= self.airfoils.min()
        self.airfoils /= self.airfoils.max()

    def __len__(self):
        return len(self.airfoils)

    def __getitem__(self, idx):
        return torch.from_numpy(self.airfoils[idx]), idx

class Autoencoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(10, 5),
            torch.nn.Sigmoid()
        )
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(5, 10),
            torch.nn.Sigmoid()
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x
The results I get from this are really bad and somehow deformed (I don't know the proper terminology): the output visibly follows the shape of the original dataset, but only very roughly.
On the other hand, if I don't scale the data used for training, the positive range of the original dataset is represented perfectly by the autoencoder, without distortions; obviously, the negative part is reduced to zero.
How to preserve "shape" of input dataset through training?
You can use sklearn for that:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
input = scaler.fit_transform(input)  # fit on the training data and scale it
# ... train your model here ...
output = scaler.inverse_transform(output)
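For completeness, here is a minimal sketch of the full round trip, assuming data is a 2D numpy float array of shape (n_rows, 10) and model is the trained autoencoder (both names are placeholders, not from the original code):
import numpy as np
import torch
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                                  # scales each column into [0, 1]
scaled = scaler.fit_transform(data).astype(np.float32)

with torch.no_grad():
    recon = model(torch.from_numpy(scaled)).numpy()      # network output in [0, 1]

restored = scaler.inverse_transform(recon)               # back to the original range
Note that MinMaxScaler scales each column independently, which is usually what you want when the columns are separate features.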

How to use grad convolution in google-jax?

Thanks for reading my question!
I was just learning about custom grad functions in JAX, and I found the approach JAX takes to defining custom functions quite elegant.
One thing troubles me though.
I created a wrapper to make lax convolution look like PyTorch conv2d.
from jax import numpy as jnp
from jax.random import PRNGKey
from jax import lax
from torch.nn.modules.utils import _ntuple
import jax
from jax.nn.initializers import normal
from jax import grad

torch_dims = {
    0: ('NC', 'OI', 'NC'),
    1: ('NCH', 'OIH', 'NCH'),
    2: ('NCHW', 'OIHW', 'NCHW'),
    3: ('NCHWD', 'OIHWD', 'NCHWD'),
}

def conv(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    n = len(input.shape) - 2
    if type(stride) == int:
        stride = _ntuple(n)(stride)
    if type(padding) == int:
        padding = [(i, i) for i in _ntuple(n)(padding)]
    if type(dilation) == int:
        dilation = _ntuple(n)(dilation)
    return lax.conv_general_dilated(
        lhs=input, rhs=weight, window_strides=stride, padding=padding,
        lhs_dilation=dilation, rhs_dilation=None,
        dimension_numbers=torch_dims[n], feature_group_count=1,
        batch_group_count=1, precision=None, preferred_element_type=None)
The problem is that I could not find a way to use its grad function:
init = normal()
rng = PRNGKey(42)
x = init(rng, [128, 3, 224, 224])
k = init(rng, [64, 3, 3, 3])
y = conv(x, k)
grad(conv)(y, k)
This is what I got.
ValueError: conv_general_dilated lhs feature dimension size divided by feature_group_count must equal the rhs input feature dimension size, but 64 // 1 != 3.
Please help!
When I run your code with the most recent releases of jax and jaxlib (jax==0.2.22; jaxlib==0.1.72), I see the following error:
TypeError: Gradient only defined for scalar-output functions. Output had shape: (128, 64, 222, 222).
If I create a scalar-output function that uses conv, the gradient seems to work:
result = grad(lambda x, k: conv(x, k).sum())(x, k)
print(result.shape)
# (128, 3, 224, 224)
If you are using an older version of JAX, you might try updating to a more recent version; perhaps the error you're seeing is due to a bug that has already been fixed.
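As a small extension (my addition, not part of the original answer), grad's argnums parameter lets you differentiate the same scalar-valued wrapper with respect to both the input and the kernel at once:
loss = lambda x, k: conv(x, k).sum()       # scalar-valued, so grad is defined
dx, dk = grad(loss, argnums=(0, 1))(x, k)  # gradients w.r.t. x and k
print(dx.shape, dk.shape)                  # (128, 3, 224, 224) (64, 3, 3, 3)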

Map Dask bincount over 2d array columns

I am trying to use bincount over a 2D array. Specifically I have this code:
import numpy as np
import dask.array as da

def dask_bincount(weights, x):
    da.bincount(x, weights)

idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((1000, 2))
bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
The idea is that bincount can be computed with the same idx array on each of the weight columns. That would return an array of size (np.amax(x) + 1, 2), if I am correct.
However when doing this I get this error message:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-5b8eed89ad32> in <module>
----> 1 bin_count = da.apply_along_axis(dask_bincount, 1, weight, idx)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in apply_along_axis(func1d, axis, arr, dtype, shape, *args, **kwargs)
454 if shape is None or dtype is None:
455 test_data = np.ones((1,), dtype=arr.dtype)
--> 456 test_result = np.array(func1d(test_data, *args, **kwargs))
457 if shape is None:
458 shape = test_result.shape
<ipython-input-14-34fd0eb9b775> in dask_bincount(weights, x)
1 def dask_bincount(weights, x):
----> 2 da.bincount(x, weights)
~/.local/lib/python3.9/site-packages/dask/array/routines.py in bincount(x, weights, minlength, split_every)
670 raise ValueError("Input array must be one dimensional. Try using x.ravel()")
671 if weights is not None:
--> 672 if weights.chunks != x.chunks:
673 raise ValueError("Chunks of input array x and weights must match.")
674
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
I thought that when dask arrays are created, the library automatically assigns them chunks, so the error does not tell me much. How can I fix this?
I made a script that does it in numpy with map:
idx_np = np.random.randint(0, 1024, 1000)
weight_np = np.random.random((1000,2))
f = lambda y: np.bincount(idx_np, weight_np[:,y])
result = map(f, [i for i in range(2)])
np.array(list(result))
array([[0.9885341 , 0.9977873 , 0.24937023, ..., 0.31024526, 1.40754883,
0.87609759],
[1.77406303, 0.84787723, 0.14591474, ..., 0.54584068, 0.38357015,
0.85202672]])
I would like to do the same but with Dask.
There are multiple problems at play.
Weights should be (2, 1000)
You discover this by trying to write the same function in numpy using apply_along_axis.
idx_np = np.random.random_integers(0, 1024, 1000)
weight_np = np.random.random((2, 1000)) # <- transposed
# This gives the same result as the code you provided
np.apply_along_axis(lambda weight, idx: np.bincount(idx, weight), 1, weight_np, idx_np)
da.apply_along_axis applies the function to numpy arrays
You're getting the error
AttributeError: 'numpy.ndarray' object has no attribute 'chunks'
This suggests that what makes it into the da.bincount method is actually a numpy array. In fact, da.apply_along_axis takes each row of weight and sends it to the function as a numpy array.
Your function should therefore actually be a numpy function:
def bincount(weights, x):
return np.bincount(x, weights)
However, if you try this, you will still get the same error. I believe that happens for an entirely different reason though:
Dask doesn't know what the output shape will be and tries to infer it
In the code and/or documentation for apply_along_axis, we can see that Dask tries to infer the output shape and dtype by passing in the array [1] (related question). This is a problem, since bincount cannot accept such an argument.
What we can do instead is provide shape and dtype to the method so that Dask doesn't have to infer it.
The problem here is that bincount's output shape depends on the maximum value of the input array. Unless you know it beforehand, you will sadly need to compute it. The whole operation therefore won't be fully lazy.
This is the full answer:
import numpy as np
import dask.array as da

idx = da.random.random_integers(0, 1024, 1000)
weight = da.random.random((2, 1000))

def bincount(weights, x):
    return np.bincount(x, weights)

m = idx.max().compute()
# bincount's output has length max + 1, so declare shape=(m + 1,)
da.apply_along_axis(bincount, 1, weight, idx, shape=(m + 1,), dtype=weight.dtype)
Appendix: randint vs random_integers
Be careful, because these are subtly different
randint takes integers from low (inclusive) to high (exclusive)
random_integers takes integers from low (inclusive) to high (inclusive)
Thus you have to call randint with high + 1 to get the same value.
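For example (a quick sanity check of my own, not from the original answer):
import numpy as np

# randint's high is exclusive, so high + 1 covers the same range
# that random_integers(0, 1024, ...) did: 0..1024 inclusive
a = np.random.randint(0, 1024 + 1, 1000)
print(a.min() >= 0, a.max() <= 1024)  # True True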

How to get prediction and confidence of that prediction using resnet

I have a binary classifier that predicts whether an image is positive or negative. I am using model.predict to get the detections. What I want is the class index and the confidence with which the image belongs to that class. I can get the detections and show them on the image, but the model also makes false predictions on background images, so I would like to remove those by setting a threshold on the confidence. For information about the training and testing files, please refer to my StackOverflow question "Resnet is showing wrong predictions even without any object".
My Resnet code:
# import the necessary packages
from keras.layers.normalization import BatchNormalization
from keras.layers.convolutional import Conv2D
from keras.layers.convolutional import AveragePooling2D
from keras.layers.convolutional import MaxPooling2D
from keras.layers.convolutional import ZeroPadding2D
from keras.layers.core import Activation
from keras.layers.core import Dense
from keras.layers import Flatten
from keras.layers import Input
from keras.models import Model
from keras.layers import add
from keras.regularizers import l2
from keras import backend as K

class ResNet:
    @staticmethod
    def residual_module(data, K, stride, chanDim, red=False,
                        reg=0.0001, bnEps=2e-5, bnMom=0.9):
        # the shortcut branch of the ResNet module should be
        # initialized as the input (identity) data
        shortcut = data

        # the first block of the ResNet module is the 1x1 CONVs
        bn1 = BatchNormalization(axis=chanDim, epsilon=bnEps,
                                 momentum=bnMom)(data)
        act1 = Activation("relu")(bn1)
        conv1 = Conv2D(int(K * 0.25), (1, 1), use_bias=False,
                       kernel_regularizer=l2(reg))(act1)

        # the second block of the ResNet module is the 3x3 CONVs
        bn2 = BatchNormalization(axis=chanDim, epsilon=bnEps,
                                 momentum=bnMom)(conv1)
        act2 = Activation("relu")(bn2)
        conv2 = Conv2D(int(K * 0.25), (3, 3), strides=stride,
                       padding="same", use_bias=False,
                       kernel_regularizer=l2(reg))(act2)

        # the third block of the ResNet module is another set of 1x1 CONVs
        bn3 = BatchNormalization(axis=chanDim, epsilon=bnEps,
                                 momentum=bnMom)(conv2)
        act3 = Activation("relu")(bn3)
        conv3 = Conv2D(K, (1, 1), use_bias=False,
                       kernel_regularizer=l2(reg))(act3)

        # if we are to reduce the spatial size, apply a CONV layer to
        # the shortcut
        if red:
            shortcut = Conv2D(K, (1, 1), strides=stride,
                              use_bias=False, kernel_regularizer=l2(reg))(act1)

        # add together the shortcut and the final CONV
        x = add([conv3, shortcut])

        # return the addition as the output of the ResNet module
        return x

    @staticmethod
    def build(width, height, depth, classes, stages, filters,
              reg=0.0001, bnEps=2e-5, bnMom=0.9):
        # initialize the input shape to be "channels last" and the
        # channels dimension itself
        inputShape = (height, width, depth)
        chanDim = -1

        # if we are using "channels first", update the input shape
        # and channels dimension
        if K.image_data_format() == "channels_first":
            inputShape = (depth, height, width)
            chanDim = 1

        # set the input and apply BN
        inputs = Input(shape=inputShape)
        x = BatchNormalization(axis=chanDim, epsilon=bnEps,
                               momentum=bnMom)(inputs)

        # apply CONV => BN => ACT => POOL to reduce spatial size
        x = Conv2D(filters[0], (5, 5), use_bias=False,
                   padding="same", kernel_regularizer=l2(reg))(x)
        x = BatchNormalization(axis=chanDim, epsilon=bnEps,
                               momentum=bnMom)(x)
        x = Activation("relu")(x)
        x = ZeroPadding2D((1, 1))(x)
        x = MaxPooling2D((3, 3), strides=(2, 2))(x)

        # loop over the number of stages
        for i in range(0, len(stages)):
            # initialize the stride, then apply a residual module
            # used to reduce the spatial size of the input volume
            stride = (1, 1) if i == 0 else (2, 2)
            x = ResNet.residual_module(x, filters[i + 1], stride,
                                       chanDim, red=True, bnEps=bnEps, bnMom=bnMom)

            # loop over the number of layers in the stage
            for j in range(0, stages[i] - 1):
                # apply a ResNet module
                x = ResNet.residual_module(x, filters[i + 1],
                                           (1, 1), chanDim, bnEps=bnEps, bnMom=bnMom)

        # apply BN => ACT => POOL
        x = BatchNormalization(axis=chanDim, epsilon=bnEps,
                               momentum=bnMom)(x)
        x = Activation("relu")(x)
        x = AveragePooling2D((8, 8))(x)

        # softmax classifier
        x = Flatten()(x)
        x = Dense(classes, kernel_regularizer=l2(reg))(x)
        x = Activation("softmax")(x)

        # create the model
        model = Model(inputs, x, name="resnet")

        # return the constructed network architecture
        return model
Any suggestion to get rid of this problem would be really helpful.
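One common approach, sketched here under the assumption that model is the compiled softmax classifier above and images is a batch of preprocessed inputs (both names are placeholders, not from the question): model.predict returns one probability per class, so the class index is the argmax and the confidence is the corresponding probability, which you can threshold.
import numpy as np

probs = model.predict(images)          # shape: (n_images, n_classes)
class_idx = np.argmax(probs, axis=1)   # predicted class per image
confidence = np.max(probs, axis=1)     # softmax probability of that class

THRESHOLD = 0.9                        # arbitrary cutoff; tune it on validation data
for idx, conf in zip(class_idx, confidence):
    if conf >= THRESHOLD:
        print("class %d with confidence %.3f" % (idx, conf))
    # otherwise treat the detection as background and discard it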

DecisionTreeClassifier: Input contains NaN, infinity or a value too large for dtype('float32')

clf = DecisionTreeClassifier()
scoring = 'accuracy'
score = cross_val_score(clf, train_data, target, cv=k_fold, n_jobs=1, error_score='raise')
print(score)
After running this code I get this error:
ValueError: Input contains NaN, infinity or a value too large for
dtype('float32').
So how can I fix it?
Decision trees don't accept NaN / infinity values.
Try doing (assuming that train_data is a Pandas DataFrame):
train_data.fillna(0, inplace = True)
This will replace all NaN values by 0.
If you don't want this, the only thing to do is to delete entries with NaN data:
train_data.dropna(inplace = True)
If this is not a DataFrame, try adding this line before the fillna method:
train_data = pd.DataFrame(train_data)
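Alternatively (my addition, not part of the original answer), scikit-learn's SimpleImputer can do the NaN replacement inside a Pipeline, which keeps the imputation inside each cross-validation fold. This assumes train_data is a numeric numpy array; infinities must be mapped to NaN first, since SimpleImputer only handles NaN:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# convert +/- infinity to NaN so the imputer can handle them
train_data = np.where(np.isfinite(train_data), train_data, np.nan)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # or "median", "most_frequent", ...
    ("tree", DecisionTreeClassifier()),
])
score = cross_val_score(pipe, train_data, target, cv=k_fold, n_jobs=1)
print(score)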
