SMOTENC oversampling without one-hot encoding - machine-learning

I'm using SMOTENC to oversample an imbalanced dataset https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html
I thought the point of SMOTENC was to make it possible to oversample categorical features without one-hot encoding them. I don't want to one-hot encode because I want to avoid the curse of dimensionality and let CatBoost handle the categorical features, which I declare through the Pool class https://catboost.ai/en/docs/concepts/python-reference_pool.
However, when trying to oversample with SMOTENC I still get the error:
could not convert string to float
First, I perform some preprocessing on my numerical and categorical features.
Preprocessing
numerical_transformer = Pipeline(
    steps=[
        ("transformer", FunctionTransformer(lambda d: d.astype(np.float32))),
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", MinMaxScaler()),
    ],
    verbose=True,
)
categorical_transformer = Pipeline(
    steps=[
        ("transformer", FunctionTransformer(lambda d: d.astype(str))),
        ("imputer", SimpleImputer(strategy="most_frequent")),
        # ("oh_encoder", OneHotEncoder(handle_unknown="ignore")),
    ],
    verbose=True,
)
Second, my resampling transformer consists of an undersampler followed by an oversampler (cat_col_indices are the indices of my categorical features, all of which have dtype "object"):
Resampling
resampling_coefficient = 0.6
resampling_transformer = Pipeline(
    steps=[
        (
            "undersampler",
            RandomUnderSampler(
                sampling_strategy=resampling_coefficient
            ),
        ),
        (
            "oversampler",
            SMOTENC(
                categorical_features=cat_col_indices,
                sampling_strategy="not majority",
                k_neighbors=3,
                n_jobs=16
            ),
        ),
    ],
    verbose=True,
)
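One thing worth checking here is that cat_col_indices refers to column positions in the transformed array x_t, not in the original data. The preprocessor itself is not shown, so the sketch below assumes it is a scikit-learn ColumnTransformer that lists the numerical pipeline first and the categorical pipeline second, and that every step returns one output column per input column. The column names and the helper function are hypothetical.

```python
# Hypothetical column lists; substitute the real ones.
numerical_columns = ["age", "income"]
categorical_columns = ["city", "device"]


def categorical_indices_after_transform(numerical_columns, categorical_columns):
    """Positions of the categorical columns in the transformed array.

    Assumes the transformed array holds the numerical columns first and the
    categorical columns second, with no columns added or dropped.
    """
    start = len(numerical_columns)
    return list(range(start, start + len(categorical_columns)))


cat_col_indices = categorical_indices_after_transform(
    numerical_columns, categorical_columns
)
print(cat_col_indices)
```

If the preprocessor orders its transformers differently, the offsets have to be adjusted accordingly.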
I preprocess my data and resample:
x_t = preprocessor.fit_transform(x)
x_t, y_t = resampling_transformer.fit_resample(x_t, y)
The fit_resample call on resampling_transformer gives me:
ValueError: could not convert string to float: '(str)'
Do I still need to one-hot encode when using SMOTENC? Or am I doing something wrong?


Understand the output of LSTM autoencoder and use it to detect outliers in a sequence

I am trying to build an LSTM model that receives a sequence of integers as input and outputs the probability of each integer appearing. If this probability is low, the integer should be considered an anomaly. I tried to follow this tutorial - https://towardsdatascience.com/lstm-autoencoder-for-extreme-rare-event-classification-in-keras-ce209a224cfb; in particular, this is where my model comes from. My input looks like this:
[[[3]
[1]
[2]
[0]]
[[3]
[1]
[2]
[0]]
[[3]
[1]
[2]
[0]]
However I can't understand what I gain as an output.
[[[ 2.7052343 ]
[ 1.0618575 ]
[ 1.8257084 ]
[-0.54579014]]
[[ 2.9069736 ]
[ 1.0850943 ]
[ 1.9787762 ]
[ 0.01915958]]
[[ 2.9069736 ]
[ 1.0850943 ]
[ 1.9787762 ]
[ 0.01915958]]
Is it reconstruction error? Or the probabilities for each integer? And if so, why they're not in the range of 0-1? I would be grateful if someone could explain this.
The model:
time_steps = 4
features = 1
train_keys_reshaped = train_integer_encoded.reshape(91, time_steps, features)
test_keys_reshaped = test_integer_encoded.reshape(25, time_steps, features)
model = Sequential()
model.add(LSTM(32, activation='relu', input_shape=(time_steps, features), return_sequences=True))
model.add(LSTM(16, activation='relu', return_sequences=False))
model.add(RepeatVector(time_steps)) # to convert 2D output into expected by decoder 3D
model.add(LSTM(16, activation='relu', return_sequences=True))
model.add(LSTM(32, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(features)))
adam = optimizers.Adam(0.0001)
model.compile(loss='mse', optimizer=adam)
model_history = model.fit(train_keys_reshaped, train_keys_reshaped,
epochs=700,
validation_split=0.1)
predicted_probs = model.predict(test_keys_reshaped)
As you said, it's an autoencoder: it tries to reconstruct its input. The output is therefore the reconstructed sequence itself, neither a reconstruction error nor a set of probabilities, which is why the values are not confined to the range 0-1.
As you can see, the output values are very close to the input values, so the error is small and the autoencoder is well trained.
If you want to detect outliers in your data, you can compute the reconstruction error (for example, the mean squared error between input and output) and set a threshold.
If the reconstruction error is greater than the threshold, the sample is treated as an outlier, since the autoencoder was not trained to reconstruct outlier data.
This diagram represents the idea better:
I hope this helps ;)
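A minimal sketch of that thresholding step, using NumPy and the first two sequences from the question. The helper names and the threshold of 0.05 are assumptions for illustration; in practice the threshold would more likely be derived from the errors observed on normal training data.

```python
import numpy as np


def reconstruction_errors(inputs, reconstructions):
    """Mean squared error per sequence, averaged over time steps and features."""
    inputs = np.asarray(inputs, dtype=float)
    reconstructions = np.asarray(reconstructions, dtype=float)
    return np.mean((inputs - reconstructions) ** 2, axis=(1, 2))


def flag_outliers(errors, threshold):
    """Boolean mask: True where the reconstruction error exceeds the threshold."""
    return errors > threshold


# Inputs and model outputs copied from the question, shape (2, 4, 1).
x = np.array([[[3], [1], [2], [0]],
              [[3], [1], [2], [0]]])
x_hat = np.array([[[2.7052343], [1.0618575], [1.8257084], [-0.54579014]],
                  [[2.9069736], [1.0850943], [1.9787762], [0.01915958]]])

errors = reconstruction_errors(x, x_hat)
threshold = 0.05  # hypothetical value, chosen only for this example
flags = flag_outliers(errors, threshold)
print(errors, flags)
```

With this particular threshold the first sequence is flagged and the second is not, which mainly shows how sensitive the result is to the threshold choice.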

Seq2Seq for string reversal

If I have a string, say "abc" and target of that string in reverse, say "cba".
Can a neural network, in particular an encoder-decoder model, learn this mapping? If so, what is the best model to accomplish this.
I ask, as this is a structural translation rather than a simple character mapping as in normal machine translation
If your network is an old-fashioned encoder-decoder model (without attention), then, as @Prune said, it has a memory bottleneck (the encoder dimensionality). Thus, such a network cannot learn to reverse strings of arbitrary size. However, you can train such an RNN to reverse strings of limited size. For example, the following toy seq2seq LSTM is able to reverse sequences of digits with length up to 10. Here is how you train it:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding
import numpy as np
emb_dim = 20
latent_dim = 100 # Latent dimensionality of the encoding space.
vocab_size = 12 # digits 0-9, 10 is for start token, 11 for end token
encoder_inputs = Input(shape=(None, ), name='enc_inp')
common_emb = Embedding(input_dim=vocab_size, output_dim=emb_dim)
encoder_emb = common_emb(encoder_inputs)
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_emb)
encoder_states = [state_h, state_c]
decoder_inputs = Input(shape=(None,), name='dec_inp')
decoder_emb = common_emb(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_emb, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
def generate_batch(length=4, batch_size=64):
    x = np.random.randint(low=0, high=10, size=(batch_size, length))
    y = x[:, ::-1]
    start = np.ones((batch_size, 1), dtype=int) * 10
    end = np.ones((batch_size, 1), dtype=int) * 11
    enc_x = np.concatenate([start, x], axis=1)
    dec_x = np.concatenate([start, y], axis=1)
    dec_y = np.concatenate([y, end], axis=1)
    dec_y_onehot = np.zeros(shape=(batch_size, length+1, vocab_size), dtype=int)
    for row in range(batch_size):
        for col in range(length+1):
            dec_y_onehot[row, col, dec_y[row, col]] = 1
    return [enc_x, dec_x], dec_y_onehot
def generate_batches(batch_size=64, max_length=10):
    while True:
        length = np.random.randint(low=1, high=max_length)
        yield generate_batch(length=length, batch_size=batch_size)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['categorical_accuracy'])
model.fit_generator(generate_batches(), steps_per_epoch=1000, epochs=20)
Now you can apply it to reverse a sequence (my decoder is very inefficient, but it does illustrate the principle)
input_seq = np.array([[10, 2, 1, 2, 8, 5, 0, 6]])
result = np.array([[10]])
next_digit = -1
for i in range(100):
    next_digit = model.predict([input_seq, result])[0][-1].argmax()
    if next_digit == 11:
        break
    result = np.concatenate([result, [[next_digit]]], axis=1)
print(result[0][1:])
Hooray, it prints [6 0 5 8 2 1 2]!
Generally, you can think of such a model as a weird autoencoder (with a reversal side-effect), and choose architecture and training procedure suitable for autoencoders. And there is quite a vast literature about text autoencoders.
Moreover, if you make an encoder-decoder model with attention, then, it will have no memory bottleneck, so, in principle, it is possible to reverse a sequence of any length with a neural network. However, attention requires quadratic computational time, so in practice even neural networks with attention will be very inefficient for long sequences.
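To make the quadratic cost concrete, here is a small NumPy sketch of scaled dot-product attention weights. It is not part of the Keras model above; the function name and the state dimension of 16 are assumptions. For a reversal task the source and target have the same length n, so the weight matrix has n * n entries.

```python
import numpy as np


def attention_weights(queries, keys):
    """Softmax-normalised scaled dot-product scores, shape (n_queries, n_keys)."""
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
sizes = {}
for n in (10, 100, 1000):
    states = rng.normal(size=(n, 16))  # one hypothetical state vector per position
    weights = attention_weights(states, states)
    sizes[n] = weights.size
    print(n, weights.shape, weights.size)
```

Multiplying the sequence length by 10 multiplies the number of attention weights by 100, which is where the inefficiency for long sequences comes from.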
I doubt that a NN will learn the abstract structural transformation. Since the string is of unbounded input length, the finite NN won't have the info necessary. NLP processes generally work with identifying small blocks and simple context-sensitive shifts. I don't think they'd identify the end-to-end swaps needed.
However, I expect that an image processor, adapted to a single dimension, would learn this quite quickly. Some can learn how to rotate a sub-image.

How to scale / impute a tensor in a `sklearn` pipeline for input to a Keras LSTM

How can I use sklearn scaler / imputer to impute a tensor? I want to scale / impute within a pipeline. My input is a 3-d numpy array.
I have a tensor of shape (n_samples, n_timesteps, n_feat), a la Keras. This is a sequence that can be learned by an LSTM. However, I want to scale / impute first. In particular, I want to scale on the fly inside a scikit-learn pipeline, since scaling the full dataset, which would be easy, leads to leakage. Keras already integrates with sklearn (see here), but there does not appear to be an easy way to scale and impute the tensors that Keras time series models process.
Unfortunately, the following gives an error
import numpy as np
X = np.array([[[3,5],[6,2]],[[8.,23.],[7.,23]],[[3, 4],[2, 55]]])
print(X)
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X = s.fit_transform(X)
print(X)
The error is to the effect that "the scaler only works on 2-d numpy arrays".
My solution was to add a decorator to the sklearn preprocessing data.py file
def flat(func):
    def wrapper(*args, **kwargs):
        self, X = args
        a, b, c = X.shape
        X = X.reshape(a, b*c)
        r = func(self, X, **kwargs)
        if hasattr(r,'ndim'):
            X = r.reshape(a, b, c)
            return X
        else:
            return r
    return wrapper
Then use it on the functions, e.g. fit:
@flat
def fit(self, X, y=None):
    """Compute the mean and std to be used for later scaling.
    Parameters
    ----------
    X : {array-like, sparse matrix}, shape [n_samples, n_features]
        The data used to compute the mean and standard deviation
        used for later scaling along the features axis.
    y : Passthrough for ``Pipeline`` compatibility.
    """
    # Reset internal state before fitting
    self._reset()
    return self.partial_fit(X, y)
This works well; with the same script as above, I get
[[[ 3. 5.]
[ 6. 2.]]
[[ 8. 23.]
[ 7. 23.]]
[[ 3. 4.]
[ 2. 55.]]]
[[[-0.70710678 -0.64906302]
[ 0.46291005 -1.13191668]]
[[ 1.41421356 1.41266656]
[ 0.9258201 -0.16825789]]
[[-0.70710678 -0.76360355]
[-1.38873015 1.30017457]]]
But beware, it doesn't check for 2d arrays, which it can't process. So, use the normal preprocessing module for 2d arrays!
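The same flattening idea can probably be applied without editing the scikit-learn source, by wrapping the scaler or imputer in a small custom transformer. The sketch below is an assumption-laden illustration: the class name FlattenedTransformer is hypothetical, and it relies on the wrapped transformer returning the same number of columns it received (which a StandardScaler does, while an imputer may drop columns that are entirely missing).

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.preprocessing import StandardScaler


class FlattenedTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: applies a 2-D transformer to 3-D input.

    Input of shape (n_samples, n_timesteps, n_feat) is reshaped to
    (n_samples, n_timesteps * n_feat), transformed, and reshaped back.
    """

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        X = np.asarray(X)
        # Fit a fresh copy so the wrapped estimator itself is left untouched.
        self.transformer_ = clone(self.transformer).fit(X.reshape(len(X), -1), y)
        return self

    def transform(self, X):
        X = np.asarray(X)
        flat = self.transformer_.transform(X.reshape(len(X), -1))
        return flat.reshape(X.shape)


X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
scaled = FlattenedTransformer(StandardScaler()).fit_transform(X)
print(scaled)
```

Because it follows the usual fit / transform interface, an object like this can be placed as a step in a Pipeline, so the statistics are learned only from the training folds.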

Logistic Regression on MNIST dataset

In this post you can find a very good tutorial on how to apply an SVM classifier to the MNIST dataset. I was wondering if I could use logistic regression instead of the SVM classifier. So I searched for logistic regression in OpenCV, and I found that the syntax for both classifiers is almost identical. So I guessed that I could just comment out these parts:
cv::Ptr<cv::ml::SVM> svm = cv::ml::SVM::create();
svm->setType(cv::ml::SVM::C_SVC);
svm->setKernel(cv::ml::SVM::POLY);//LINEAR, RBF, SIGMOID, POLY
svm->setTermCriteria(cv::TermCriteria(cv::TermCriteria::MAX_ITER, 100, 1e-6));
svm->setGamma(3);
svm->setDegree(3);
svm->train( trainingMat , cv::ml::ROW_SAMPLE , labelsMat );
and replace it with:
cv::Ptr<cv::ml::LogisticRegression> lr1 = cv::ml::LogisticRegression::create();
lr1->setLearningRate(0.001);
lr1->setIterations(10);
lr1->setRegularization(cv::ml::LogisticRegression::REG_L2);
lr1->setTrainMethod(cv::ml::LogisticRegression::BATCH);
lr1->setMiniBatchSize(1);
lr1->train( trainingMat, cv::ml::ROW_SAMPLE, labelsMat);
But first I got this error:
OpenCV Error: Bad argument(data and labels must be a floating point matrix)
Then I changed
cv::Mat labelsMat(labels.size(), 1, CV_32S, labelsArray);
to:
cv::Mat labelsMat(labels.size(), 1, CV_32F, labelsArray);
And now I get this error: OpenCV Error: bad argument(data should have atleast two classes)
I have 10 classes (0,1,...,9), so I don't know why I get this error. My code is almost identical to the code in the tutorial mentioned above.
In Python you could do something like this:
import matplotlib.pyplot as plt
# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics
from sklearn.linear_model import LogisticRegression
# The digits dataset
digits = datasets.load_digits()
# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 3 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# pylab.imread. Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)
# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))
Choose which one of the classifiers you like below
# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)
# create a Logistic Regression Classifier
classifier = LogisticRegression(C=1.0)
# We learn the digits on the first half of the digits
# Integer division (//) keeps the slice indices integers under Python 3
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])
# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])
print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))
images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)
plt.show()
You can see the whole code here

How to stack Autoencoder/ Create Deep Autoencoder with Theano class

I understand the concept behind Stacked/ Deep Autoencoders and therefore want to implement it with the following code of a single-layer de-noising Autoencoder. Theano also provides a tutorial for a Stacked Autoencoder but this is trained in a supervised fashion - I need to stack it to establish unsupervised (hierarchical) feature learning.
Any idea how to get this working with the following code?
import os
import sys
import timeit
import numpy
import theano
import theano.tensor as T
from theano.tensor.shared_randomstreams import RandomStreams
from logistic_sgd import load_data
from utils import tile_raster_images
try:
    import PIL.Image as Image
except ImportError:
    import Image
class dA(object):
    """Denoising Auto-Encoder class (dA)

    A denoising autoencoder tries to reconstruct the input from a corrupted
    version of it by projecting it first in a latent space and reprojecting
    it afterwards back in the input space. Please refer to Vincent et al.,2008
    for more details. If x is the input then equation (1) computes a partially
    destroyed version of x by means of a stochastic mapping q_D. Equation (2)
    computes the projection of the input into the latent space. Equation (3)
    computes the reconstruction of the input, while equation (4) computes the
    reconstruction error.

    .. math::

        \tilde{x} ~ q_D(\tilde{x}|x)                                     (1)
        y = s(W \tilde{x} + b)                                           (2)
        z = s(W' y + b')                                                 (3)
        L(x,z) = -sum_{k=1}^d [x_k \log z_k + (1-x_k) \log( 1-z_k)]      (4)
    """
    def __init__(
        self,
        numpy_rng,
        theano_rng=None,
        input=None,
        n_visible=784,
        n_hidden=500,
        W=None,
        bhid=None,
        bvis=None
    ):
        """
        Initialize the dA class by specifying the number of visible units (the
        dimension d of the input ), the number of hidden units ( the dimension
        d' of the latent or hidden space ) and the corruption level. The
        constructor also receives symbolic variables for the input, weights and
        bias. Such symbolic variables are useful when, for example the input
        is the result of some computations, or when weights are shared between
        the dA and an MLP layer. When dealing with SdAs this always happens,
        the dA on layer 2 gets as input the output of the dA on layer 1,
        and the weights of the dA are used in the second stage of training
        to construct an MLP.

        :type numpy_rng: numpy.random.RandomState
        :param numpy_rng: number random generator used to generate weights

        :type theano_rng: theano.tensor.shared_randomstreams.RandomStreams
        :param theano_rng: Theano random generator; if None is given one is
                     generated based on a seed drawn from `rng`

        :type input: theano.tensor.TensorType
        :param input: a symbolic description of the input or None for
                      standalone dA

        :type n_visible: int
        :param n_visible: number of visible units

        :type n_hidden: int
        :param n_hidden: number of hidden units

        :type W: theano.tensor.TensorType
        :param W: Theano variable pointing to a set of weights that should be
                  shared between the dA and another architecture; if dA should
                  be standalone set this to None

        :type bhid: theano.tensor.TensorType
        :param bhid: Theano variable pointing to a set of biases values (for
                     hidden units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None

        :type bvis: theano.tensor.TensorType
        :param bvis: Theano variable pointing to a set of biases values (for
                     visible units) that should be shared between dA and another
                     architecture; if dA should be standalone set this to None
        """
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # create a Theano random generator that gives symbolic random values
        if not theano_rng:
            theano_rng = RandomStreams(numpy_rng.randint(2 ** 30))
        # note : W' was written as `W_prime` and b' as `b_prime`
        if not W:
            # W is initialized with `initial_W` which is uniformly sampled
            # from -4*sqrt(6./(n_visible+n_hidden)) and
            # 4*sqrt(6./(n_hidden+n_visible)); the output of uniform is
            # converted using asarray to dtype
            # theano.config.floatX so that the code is runnable on GPU
            initial_W = numpy.asarray(
                numpy_rng.uniform(
                    low=-4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    high=4 * numpy.sqrt(6. / (n_hidden + n_visible)),
                    size=(n_visible, n_hidden)
                ),
                dtype=theano.config.floatX
            )
            W = theano.shared(value=initial_W, name='W', borrow=True)
        if not bvis:
            bvis = theano.shared(
                value=numpy.zeros(
                    n_visible,
                    dtype=theano.config.floatX
                ),
                borrow=True
            )
        if not bhid:
            bhid = theano.shared(
                value=numpy.zeros(
                    n_hidden,
                    dtype=theano.config.floatX
                ),
                name='b',
                borrow=True
            )
        self.W = W
        # b corresponds to the bias of the hidden
        self.b = bhid
        # b_prime corresponds to the bias of the visible
        self.b_prime = bvis
        # tied weights, therefore W_prime is W transpose
        self.W_prime = self.W.T
        self.theano_rng = theano_rng
        # if no input is given, generate a variable representing the input
        if input is None:
            # we use a matrix because we expect a minibatch of several
            # examples, each example being a row
            self.x = T.dmatrix(name='input')
        else:
            self.x = input
        self.params = [self.W, self.b, self.b_prime]
    def get_corrupted_input(self, input, corruption_level):
        """This function keeps ``1-corruption_level`` entries of the inputs the
        same and zero-out randomly selected subset of size ``corruption_level``
        Note : first argument of theano.rng.binomial is the shape(size) of
               random numbers that it should produce
               second argument is the number of trials
               third argument is the probability of success of any trial

                this will produce an array of 0s and 1s where 1 has a
                probability of 1 - ``corruption_level`` and 0 with
                ``corruption_level``

                The binomial function return int64 data type by
                default. int64 multiplicated by the input
                type(floatX) always return float64. To keep all data
                in floatX when floatX is float32, we set the dtype of
                the binomial to floatX. As in our case the value of
                the binomial is always 0 or 1, this don't change the
                result. This is needed to allow the gpu to work
                correctly as it only support float32 for now.
        """
        return self.theano_rng.binomial(size=input.shape, n=1,
                                        p=1 - corruption_level,
                                        dtype=theano.config.floatX) * input
    def get_hidden_values(self, input):
        """ Computes the values of the hidden layer """
        return T.nnet.sigmoid(T.dot(input, self.W) + self.b)

    def get_reconstructed_input(self, hidden):
        """Computes the reconstructed input given the values of the
        hidden layer
        """
        return T.nnet.sigmoid(T.dot(hidden, self.W_prime) + self.b_prime)
    def get_cost_updates(self, corruption_level, learning_rate):
        """ This function computes the cost and the updates for one training
        step of the dA """
        tilde_x = self.get_corrupted_input(self.x, corruption_level)
        y = self.get_hidden_values(tilde_x)
        z = self.get_reconstructed_input(y)
        # note : we sum over the size of a datapoint; if we are using
        #        minibatches, L will be a vector, with one entry per
        #        example in minibatch
        L = - T.sum(self.x * T.log(z) + (1 - self.x) * T.log(1 - z), axis=1)
        # note : L is now a vector, where each element is the
        #        cross-entropy cost of the reconstruction of the
        #        corresponding example of the minibatch. We need to
        #        compute the average of all these to get the cost of
        #        the minibatch
        cost = T.mean(L)
        # compute the gradients of the cost of the `dA` with respect
        # to its parameters
        gparams = T.grad(cost, self.params)
        # generate the list of updates
        updates = [
            (param, param - learning_rate * gparam)
            for param, gparam in zip(self.params, gparams)
        ]
        return (cost, updates)
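
As the constructor docstring notes, in a stack the dA on layer 2 gets as input the output of the dA on layer 1. One way this greedy, purely unsupervised layer-wise training might look is sketched below in plain NumPy rather than Theano, so that it runs on its own. NumpyDA and pretrain_stack are hypothetical names, the gradients are written out by hand for tied weights and the cross-entropy cost, and the data is random. With the Theano class the same loop would presumably pass the previous layer's hidden values as `input`.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


class NumpyDA:
    """Hypothetical NumPy counterpart of the dA class (tied weights)."""

    def __init__(self, n_visible, n_hidden, rng):
        bound = 4 * np.sqrt(6.0 / (n_visible + n_hidden))
        self.W = rng.uniform(-bound, bound, size=(n_visible, n_hidden))
        self.b = np.zeros(n_hidden)         # hidden bias
        self.b_prime = np.zeros(n_visible)  # visible bias
        self.rng = rng

    def encode(self, x):
        return sigmoid(x @ self.W + self.b)

    def decode(self, y):
        return sigmoid(y @ self.W.T + self.b_prime)

    def train_step(self, x, corruption_level, learning_rate):
        # Zero out a random subset of the inputs, then reconstruct the clean x.
        mask = self.rng.binomial(1, 1 - corruption_level, size=x.shape)
        tilde_x = mask * x
        y = self.encode(tilde_x)
        z = self.decode(y)
        # Gradient of the mean cross-entropy w.r.t. the decoder pre-activation.
        delta_out = (z - x) / len(x)
        delta_hid = (delta_out @ self.W) * y * (1 - y)
        # W appears in both encoder and decoder, so both paths contribute.
        grad_W = tilde_x.T @ delta_hid + delta_out.T @ y
        self.W -= learning_rate * grad_W
        self.b -= learning_rate * delta_hid.sum(axis=0)
        self.b_prime -= learning_rate * delta_out.sum(axis=0)


def pretrain_stack(x, layer_sizes, rng, epochs=50,
                   corruption_level=0.3, learning_rate=0.1):
    """Train one dA per layer; each layer sees the previous layer's features."""
    layers, layer_input = [], x
    for n_hidden in layer_sizes:
        da = NumpyDA(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            da.train_step(layer_input, corruption_level, learning_rate)
        layers.append(da)
        # Uncorrupted hidden values become the next layer's training data.
        layer_input = da.encode(layer_input)
    return layers, layer_input


rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(64, 20)).astype(float)  # random binary toy data
layers, features = pretrain_stack(data, layer_sizes=[12, 6], rng=rng)
print(features.shape)
```

No labels are used at any point; the top-layer features are simply the encoded output of the last dA.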
