How to normalise all training samples at once using MinMaxScaler

I have 1320 training samples (sea surface temperature), each a 2D array of shape (160, 320), so the full array has shape (1320, 160, 320). I would like to normalize them to values between 0 and 1 using MinMaxScaler(). I get the error "Found array with dim 3. MinMaxScaler expected <= 2.". My code is as follows. I could loop through all 1320 samples, normalising them one by one, but I would like to know whether there is a way to normalize all of them at once, because the max and min of each sample are not the same.
scaler = prep.MinMaxScaler()
sst = scaler.fit_transform(sst)

As far as I know, you can't do this with MinMaxScaler() alone. np.apply_along_axis won't help either, since you want to apply a min-max scaler over 2D slices. One solution could be something like this:
import numpy as np
a = np.random.random((2, 3, 3))
def customMinMaxScaler(X):
    return (X - X.min()) / (X.max() - X.min())
np.array([customMinMaxScaler(x) for x in a])
But I guess it wouldn't be much faster than iterating over the samples.
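For completeness, here is a fully vectorized sketch (assuming no sample is constant, since a flat 2D slice would make max - min zero and divide by zero): per-sample minima and maxima are computed over the two spatial axes with keepdims=True so that broadcasting lines up.
import numpy as np
sst = np.random.random((1320, 160, 320))  # stand-in for the real data
mins = sst.min(axis=(1, 2), keepdims=True)  # shape (1320, 1, 1)
maxs = sst.max(axis=(1, 2), keepdims=True)
sst_scaled = (sst - mins) / (maxs - mins)  # each sample now spans [0, 1]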

Related

How to handle class imbalance of multiple columns?

The first seven columns of my dataset are input metrics, and the last five columns are the outputs. The output is an array of 5 numbers, each zero or one. I am using the Keras functional API for this. Whenever I try to resample my data by individual columns, I get shape issues when merging, even if I try to slice the rows.
Basically there's no "easy" approach to doing this. The only logical way is to use a Label Powerset over your design matrix and resample based on the column created from it, though in that scenario it might be easier to "handcraft" such a transformation.
Here is one approach:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
X0, y = make_classification()
_, X1 = make_multilabel_classification(n_classes=5, random_state=0)
# transform X1 by creating a powerset...
df_x1 = pd.DataFrame(X1, columns=[f'c{x}' for x in range(X1.shape[1])])
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
print(df_x1['dummy'].value_counts()) # shows imbalance
df_x1 = df_x1.reset_index() # so that we know which rows are resampled
df_y1 = df_x1['dummy']
df_x1 = df_x1[[x for x in df_x1.columns if x != 'dummy']]
ros = RandomOverSampler()
X_sample, _ = ros.fit_resample(df_x1, df_y1) # this is the resampled index
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
The secret sauce really is this bit:
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
which re-labels each row with the index of its unique combination of the 5 columns (the label powerset), followed by
df_x1 = df_x1.reset_index()
which is then used in the RandomOverSampler and guarantees the 5 columns will be balanced.
Finally, we can select the indices from the resampling to generate a dataset and labels that have been resampled consistently across X0, X1, and y:
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
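As a quick sanity check (a sketch; DataFrame.value_counts needs pandas >= 1.1), you can recount the label combinations on the resampled rows and confirm each one now appears equally often:
import pandas as pd
# The last 5 columns of X_res are the resampled multi-label outputs.
df_check = pd.DataFrame(X_res[:, -5:].astype(int))
print(df_check.value_counts())  # every combination should have the same count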

How to scale / impute a tensor in a `sklearn` pipeline for input to a Keras LSTM

How can I use sklearn scaler / imputer to impute a tensor? I want to scale / impute within a pipeline. My input is a 3-d numpy array.
I have a tensor of shape (n_samples, n_timesteps, n_feat), à la Keras. This is a sequence that can be learned by an LSTM, but I want to scale / impute it first. In particular, I want to scale on the fly inside a scikit-learn pipeline, since scaling the full dataset, which would be easy, leads to leakage. Keras already integrates with sklearn (see here), but there do not appear to be easy ways to scale and impute the tensors that Keras time series models process.
Unfortunately, the following gives an error
import numpy as np
X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
print(X)
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
X = s.fit_transform(X)
print(X)
The error is to the effect that "the scaler only works on 2-D numpy arrays".
My solution was to add a decorator to the sklearn preprocessing data.py file:
def flat(func):
    def wrapper(*args, **kwargs):
        self, X = args
        a, b, c = X.shape
        X = X.reshape(a, b * c)
        r = func(self, X, **kwargs)
        if hasattr(r, 'ndim'):
            X = r.reshape(a, b, c)
            return X
        else:
            return r
    return wrapper
Then use it on the functions, e.g. fit:
@flat
def fit(self, X, y=None):
    """Compute the mean and std to be used for later scaling.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape [n_samples, n_features]
        The data used to compute the mean and standard deviation
        used for later scaling along the features axis.
    y : Passthrough for ``Pipeline`` compatibility.
    """
    # Reset internal state before fitting
    self._reset()
    return self.partial_fit(X, y)
This works well; with the same script as above, I get
[[[ 3. 5.]
[ 6. 2.]]
[[ 8. 23.]
[ 7. 23.]]
[[ 3. 4.]
[ 2. 55.]]]
[[[-0.70710678 -0.64906302]
[ 0.46291005 -1.13191668]]
[[ 1.41421356 1.41266656]
[ 0.9258201 -0.16825789]]
[[-0.70710678 -0.76360355]
[-1.38873015 1.30017457]]]
But beware: the wrapper doesn't check for 2-D arrays, which it can't process. So use the normal preprocessing module for 2-D arrays!
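If you would rather not patch sklearn itself, a small wrapper transformer gives the same effect and drops straight into a Pipeline. This is a sketch, not the patched-sklearn approach above; the class name TensorScaler is made up for illustration:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
class TensorScaler(BaseEstimator, TransformerMixin):
    """Flattens (n_samples, a, b) to 2-D for an inner scaler, then restores the shape."""
    def __init__(self, scaler=None):
        self.scaler = scaler
    def fit(self, X, y=None):
        a, b, c = X.shape
        self.scaler_ = self.scaler if self.scaler is not None else StandardScaler()
        self.scaler_.fit(X.reshape(a, b * c))
        return self
    def transform(self, X):
        a, b, c = X.shape
        return self.scaler_.transform(X.reshape(a, b * c)).reshape(a, b, c)
X = np.array([[[3, 5], [6, 2]], [[8., 23.], [7., 23]], [[3, 4], [2, 55]]])
print(TensorScaler().fit_transform(X))  # matches the output above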

Kernel Function in Gaussian Processes

Given a kernel in Gaussian Process, is it possible to know the shape of functions being drawn from the prior distribution without sampling at first?
I think the best way to know the shape of the prior functions is to draw them. Here's a 1-dimensional example:
These are samples from the GP prior (mean 0 and covariance matrix induced by the squared exponential kernel). As you can see, they are smooth, and generally this gives a feeling of how "wiggly" they are. Also note that in the multi-dimensional case each one of them will look somewhat like this.
Here's the full code I used; feel free to write your own kernel or tweak the parameters to see how it affects the samples:
import numpy as np
import matplotlib.pyplot as pl
def kernel(a, b, gamma=0.1):
    """GP squared exponential kernel."""
    sq_dist = np.sum(a**2, 1).reshape(-1, 1) + np.sum(b**2, 1) - 2 * np.dot(a, b.T)
    return np.exp(-0.5 * (1 / gamma) * sq_dist)
n = 300 # number of points.
m = 10 # number of functions to draw.
s = 1e-6 # noise variance.
X = np.linspace(-5, 5, n).reshape(-1, 1)
K = kernel(X, X)
L = np.linalg.cholesky(K + s * np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, m)))
pl.figure(1)
pl.clf()
pl.plot(X, f_prior)
pl.title('%d samples from the GP prior' % m)
pl.axis([-5, 5, -3, 3])
pl.show()
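To see how the kernel choice changes the samples, here is a sketch of a periodic kernel you could drop in instead (the period p and lengthscale ell are illustrative; this assumes 1-D inputs of shape (n, 1)):
def periodic_kernel(a, b, p=2.0, ell=1.0):
    """Periodic kernel: k(x, x') = exp(-2 * sin^2(pi * |x - x'| / p) / ell^2)."""
    d = np.abs(a - b.T)  # pairwise distances between the 1-D inputs
    return np.exp(-2 * np.sin(np.pi * d / p) ** 2 / ell ** 2)
K = periodic_kernel(X, X)  # reuse X, s, n, m from the script above
L = np.linalg.cholesky(K + s * np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, m)))  # samples now repeat with period p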

Scikit-learn PCA .fit_transform shape is inconsistent (n_samples << m_attributes)

I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the same dimensions as the docs say?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset which is (150, 4) where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       columns=load_iris().feature_names)
Se_targets = pd.Series(load_iris().target,
                       index=["iris_%d" % i for i in range(load_iris().data.shape[0])],
                       name="Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
                           index=DF_data.index,
                           columns=DF_data.columns)
# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = m
# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)
#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
But when I use the same exact approach on my actual dataset of shape (76, 1989), i.e. 76 samples and 1989 attributes/dimensions, I get a (76, 76) array instead of (76, 1989):
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger than or equal to n?)
PCA (as most typically run) creates a new coordinate system by:
shifting the origin to the centroid of your data,
squeezing and/or stretching the axes to make them equal in length, and
rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by @amoeba in a comment). Here is a data matrix with two points in a three-dimensional space:
X = [ 1 1 1
      2 2 2 ]
Let's view these points in a (pseudo) three dimensional scatterplot:
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data, we can fit (at most) N−1=1 principal components.
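Both effects are easy to verify in sklearn; here is a quick sketch using the two-point example above:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1., 1., 1.],
              [2., 2., 2.]])  # 2 samples in 3 dimensions
pca = PCA()  # n_components defaults to min(n_samples, n_features) = 2
X_new = pca.fit_transform(X)
print(X_new.shape)              # (2, 2): capped by n_samples, not n_features
print(pca.explained_variance_)  # second value is ~0: only N-1 = 1 real component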

How to use SGD for time series analysis

Is it possible to use stochastic gradient descent for time-series analysis?
My initial idea, given a series of (t, v) pairs where I want an SGD regressor to predict the v associated with t+1, would be to convert the date/time into an integer value, and train the regressor on this list using the hinge loss function. Is this feasible?
Edit: This is example code using the SGD implementation in scikit-learn. However, it fails to properly predict a simple linear time series model. All it seems to do is calculate the average of the training Y-values and use that as its prediction of the test Y-values. Is SGD just unsuitable for time series analysis, or am I formulating this incorrectly?
from datetime import date
from sklearn.linear_model import SGDRegressor
# Build data.
s = date(2010, 1, 1)
i = 0
training = []
for _ in range(12):
    i += 1
    training.append([[date(2012, 1, i).toordinal()], i])
testing = []
for _ in range(12):
    i += 1
    testing.append([[date(2012, 1, i).toordinal()], i])
clf = SGDRegressor(loss='huber')
print('Training...')
for _ in range(20):
    try:
        print(_)
        clf.partial_fit(X=[X for X, _ in training], y=[y for _, y in training])
    except ValueError:
        break
print('Testing...')
for X, y in testing:
    p = clf.predict([X])
    print(y, p, abs(p - y))
SGDRegressor in sklearn is numerically unstable for unscaled input parameters. For good results it is highly recommended that you scale the input variables.
from datetime import date
from sklearn.linear_model import SGDRegressor
# Build data.
s = date(2010, 1, 1).toordinal()
i = 0
training = []
for _ in range(1, 13):
    i += 1
    training.append([[s + i], i])
testing = []
for _ in range(13, 25):
    i += 1
    testing.append([[s + i], i])
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform([X for X, _ in training])
After training the SGD regressor, you have to scale the test inputs with the same scaler:
clf = SGDRegressor()
clf.fit(X=X_train, y=[y for _, y in training])
print(clf.intercept_, clf.coef_)
print('Testing...')
for X, y in testing:
    p = clf.predict(scaler.transform([X]))
    print(X[0], y, p[0], abs(p[0] - y))
Here is the result:
[6.31706122] [3.35332573]
Testing...
733786 13 12.631164799851827 0.3688352001481725
733787 14 13.602565350686039 0.39743464931396133
733788 15 14.573965901520248 0.42603409847975193
733789 16 15.545366452354457 0.45463354764554254
733790 17 16.51676700318867 0.48323299681133136
733791 18 17.488167554022876 0.5118324459771237
733792 19 18.459568104857084 0.5404318951429161
733793 20 19.430968655691295 0.569031344308705
733794 21 20.402369206525506 0.5976307934744938
733795 22 21.373769757359714 0.6262302426402861
733796 23 22.34517030819392 0.6548296918060785
733797 24 23.316570859028133 0.6834291409718674
The method of choice for time series prediction depends on what you know about your time series. If you choose a specific method for your task you always make implicit assumptions about the nature of your signal and the kind of system that generated the signal. Any method is always a model of the system. The more you know a priori about your signal and the system the better you are able to model it.
If your signal, for instance, is of a stochastic nature, ARMA processes or Kalman filters are usually a good choice. If those fail, other more deterministic models might help, given, of course, that you have some information about your system.
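For example, fitting an ARMA-type model with statsmodels might look like the following sketch (the order (1, 0, 1) and the toy AR(1) signal are illustrative, not tuned to any real data):
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
# Toy stochastic signal: an AR(1) process driven by Gaussian noise.
rng = np.random.default_rng(0)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.8 * y[t - 1] + rng.normal()
model = ARIMA(y, order=(1, 0, 1))  # ARMA(1, 1); d=0 means no differencing
res = model.fit()
print(res.params)                  # estimated AR/MA coefficients
print(res.forecast(steps=5))       # predict the next 5 values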
