How to train model with features selected by SelectKBest? - machine-learning

I am using SelectKBest() in Sklearn's Pipeline() class to reduce the number of features down from 30 to the 5 best features. When I fit the classifer, I get different test results as expected with feature selection. However I spotted an error in my code which doesn't seem to cause an actual error in runtime.
When I call predict(), I realised that it was still being given all 30 features as input as if feature selection wasn't occurring. Even though I only trained the model on the 5 best features. Surely giving 30 features to an SVM to predict a class will crash if it was only trained on the 5 best features?
In my train_model(df) function, my code looks as follows:
def train_model(df):
x,y = balance_dataset(df)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
feature_selection = SelectKBest()
pipe = Pipeline([('sc', preprocessing.MinMaxScaler()),
('feature_selection', feature_selection),
('SVM', svm.SVC(decision_function_shape = 'ovr', kernel = 'poly'))])
candidate_parameters = [{'SVM__C': [0.01, 0.1, 1], 'SVM__gamma': [0.01, 0.1, 1], 'feature_selection__k': [5]}]
clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
clf.fit(X_train, y_train )
return clf
However this is when what happens when I call trade():
def trade(df):
clf = train_model(df)
for index, row in trading_set.iterrows():
features = row[:-3] #features is now an array of 30 features, even though model is only trained on 5
if trade_balance > 0:
trades[index] = trade_balance
if clf.predict(features) == 1: #So this should crash and give an input Shape error, but it doesn't
#Rest of code unneccesary#
So my question is, how do I know that the model is really being trained on only the 5 best features?

Your code is correct, and there is no reason why it should throw you any error. You are confused between the pipeline object and the model itself, which is only one block of the pipeline.
In your example, the pipeline is taking 30 features, scaling them, selecting the 5 best, then training an SVM on these 5 best features. So your SVM has been trained on 5 best features, but you still need to pass all 30 features to your pipeline, because your pipeline expects data to come in in the same format as during the training.

Related

Where the categorical encoding should be done in a k fold - cv procedure?

I want to apply a cross validation method in my machine learning models. I these models, I want a Feature Selection and a GridSearch to be applied as well. Imagine that I want to estimate the performance of K-Nearest-Neighbor Classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats = n_repeats, random_state=0)
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0
for train_index, test_index in rkf.split(X, y):
# Print a dot for each train / test partition
sys.stdout.write('.')
sys.stdout.flush()
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
# Standardize the data
scaler.fit(X_train, y_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline([ ('knn', KNeighborsClassifier()) ])
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = { 'knn__n_neighbors': N_neighbors }
# Evaluate the performance in a 5-fold cross-validation
skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)
# n_jobs = -1 to use all processors
gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid, \
scoring=make_scorer(accuracy_score))
result = gridcv.fit(X_train, y_train)
###- Results -###
# Mean accuracy and standard deviation
accuracies = gridcv.cv_results_['mean_test_score']
std_accuracies = gridcv.cv_results_['std_test_score']
# Best value for the number of neighbors
# Define KNeighbors Classifier with that best value
# Method fit(X,y) to fit each model according to training data
best_Nneighbors = N_neighbors[np.argmax(accuracies)]
knn = KNeighborsClassifier(n_neighbors = best_Nneighbors)
knn.fit(X_train, y_train)
# Error for the prediction
error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
split += 1
However, my columns are categorical (except binary label) and I need to do a categorical encoding. I can not remove this columns because they are essential.
Where would you perform this encoding and how the problems of categorical encoding of unseen labels in each fold would be solved?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including StandardScaler on the pipeline used for your GridSearchCV:
...
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
[ ('scaler', scaler), ('knn', KNeighborsClassifier()) ]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = { 'knn__n_neighbors': N_neighbors }
...
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
EDIT: Perhaps I was too general when it came to when to perform categorical enconding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical enconding as the first step.
If at training time you do not know how many categorical features you are going to see or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical enconding at each fold.
When using the former your classifiers will all be trained on the same feature space while that's not guaranteed for the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline(
[
('enc', OneHotEncoder()),
('scaler', StandardScaler(with_mean=False)),
('knn', KNeighborsClassifier()),
],
)
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.

PyTorch: Predicting future values with LSTM

I'm currently working on building an LSTM model to forecast time-series data using PyTorch. I used lag features to pass the previous n steps as inputs to train the network. I split the data into three sets, i.e., train-validation-test split, and used the first two to train the model. My validation function takes the data from the validation data set and calculates the predicted valued by passing it to the LSTM model using DataLoaders and TensorDataset classes. Initially, I've got pretty good results with R2 values in the region of 0.85-0.95.
However, I have an uneasy feeling about whether this validation function is also suitable for testing my model's performance. Because the function now takes the actual X values, i.e., time-lag features, from the DataLoader to predict y^ values, i.e., predicted target values, instead of using the predicted y^ values as features in the next prediction. This situation seems far from reality where the model has no clue of the real values of the previous time steps, especially if you forecast time-series data for longer time periods, say 3-6 months.
I'm currently a bit puzzled about tackling this issue and defining a function to predict future values relying on the model's values rather than the actual values in the test set. I have the following function predict, which makes a one-step prediction, but I haven't really figured out how to predict the whole test dataset using DataLoader.
def predict(self, x):
# convert row to data
x = x.to(device)
# make prediction
yhat = self.model(x)
# retrieve numpy array
yhat = yhat.to(device).detach().numpy()
return yhat
You can find how I split and load my datasets, my constructor for the LSTM model, and the validation function below. If you need more information, please do not hesitate to reach out to me.
Splitting and Loading Datasets
def create_tensor_datasets(X_train_arr, X_val_arr, X_test_arr, y_train_arr, y_val_arr, y_test_arr):
train_features = torch.Tensor(X_train_arr)
train_targets = torch.Tensor(y_train_arr)
val_features = torch.Tensor(X_val_arr)
val_targets = torch.Tensor(y_val_arr)
test_features = torch.Tensor(X_test_arr)
test_targets = torch.Tensor(y_test_arr)
train = TensorDataset(train_features, train_targets)
val = TensorDataset(val_features, val_targets)
test = TensorDataset(test_features, test_targets)
return train, val, test
def load_tensor_datasets(train, val, test, batch_size=64, shuffle=False, drop_last=True):
train_loader = DataLoader(train, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
val_loader = DataLoader(val, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=shuffle, drop_last=drop_last)
return train_loader, val_loader, test_loader
Class LSTM
class LSTMModel(nn.Module):
def __init__(self, input_dim, hidden_dim, layer_dim, output_dim, dropout_prob):
super(LSTMModel, self).__init__()
self.hidden_dim = hidden_dim
self.layer_dim = layer_dim
self.lstm = nn.LSTM(
input_dim, hidden_dim, layer_dim, batch_first=True, dropout=dropout_prob
)
self.fc = nn.Linear(hidden_dim, output_dim)
def forward(self, x, future=False):
h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim).requires_grad_()
out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
out = out[:, -1, :]
out = self.fc(out)
return out
Validation (defined within a trainer class)
def validation(self, val_loader, batch_size, n_features):
with torch.no_grad():
predictions = []
values = []
for x_val, y_val in val_loader:
x_val = x_val.view([batch_size, -1, n_features]).to(device)
y_val = y_val.to(device)
self.model.eval()
yhat = self.model(x_val)
predictions.append(yhat.cpu().detach().numpy())
values.append(y_val.cpu().detach().numpy())
return predictions, values
I've finally found a way to forecast values based on predicted values from the earlier observations. As expected, the predictions were rather accurate in the short-term, slightly becoming worse in the long term. It is not so surprising that the future predictions digress over time, as they no longer depend on the actual values. Reflecting on my results and the discussions I had on the topic, here are my take-aways:
In real-life cases, the real values can be retrieved and fed into the model at each step of the prediction -be it weekly, daily, or hourly- so that the next step can be predicted with the actual values from the previous step. So, testing the performance based on the actual values from the test set may somewhat reflect the real performance of the model that is maintained regularly.
However, for predicting future values in the long term, forecasting, if you will, you need to make either multiple one-step predictions or multi-step predictions that span over the time period you wish to forecast.
Making multiple one-step predictions based on the values predicted the model yields plausible results in the short term. As the forecasting period increases, the predictions become less accurate and therefore less fit for the purpose of forecasting.
To make multiple one-step predictions and update the input after each prediction, we have to work our way through the dataset one by one, as if we are going through a for-loop over the test set. Not surprisingly, this makes us lose all the computational advantages that matrix operations and mini-batch training provide us.
An alternative could be predicting sequences of values, instead of predicting the next value only, say using RNNs with multi-dimensional output with many-to-many or seq-to-seq structure. They are likely to be more difficult to train and less flexible to make predictions for different time periods. An encoder-decoder structure may prove useful for solving this, though I have not implemented it by myself.
You can find the code for my function that forecasts the next n_steps based on the last row of the dataset X (time-lag features) and y (target value). To iterate over each row in my dataset, I would set batch_size to 1 and n_features to the number of lagged observations.
def forecast(self, X, y, batch_size=1, n_features=1, n_steps=100):
predictions = []
X = torch.roll(X, shifts=1, dims=2)
X[..., -1, 0] = y.item(0)
with torch.no_grad():
self.model.eval()
for _ in range(n_steps):
X = X.view([batch_size, -1, n_features]).to(device)
yhat = self.model(X)
yhat = yhat.to(device).detach().numpy()
X = torch.roll(X, shifts=1, dims=2)
X[..., -1, 0] = yhat.item(0)
predictions.append(yhat)
return predictions
The following line shifts values in the second dimension of the tensor by one so that a tensor [[[x1, x2, x3, ... , xn ]]] becomes [[[xn, x1, x2, ... , x(n-1)]]].
X = torch.roll(X, shifts=1, dims=2)
And, the line below selects the first element from the last dimension of the 3d tensor and sets that item to the predicted value stored in the NumPy ndarray (yhat), [[xn+1]]. Then, the new input tensor becomes [[[x(n+1), x1, x2, ... , x(n-1)]]]
X[..., -1, 0] = yhat.item(0)
Recently, I've decided to put together the things I had learned and the things I would have liked to know earlier. If you'd like to have a look, you can find the links down below. I hope you'll find it useful. Feel free to comment or reach out to me if you agree or disagree with any of the remarks I made above.
Building RNN, LSTM, and GRU for time series using PyTorch
Predicting future values with RNN, LSTM, and GRU using PyTorch

scikit-learn and imblearn: does GridSearchCV/RandomSearchCV apply preprocessing to the validation set as well?

I'm currently using sklearn for a school project and I have some questions about how GridsearchCV applies preprocessing algorithms such as PCA or Factor Analysis. Let's suppose I perform hold out:
X_tr, X_ts, y_tr, y_ts = train_test_split(X, y, test_size = 0.1, stratify = y)
Then, I declare some hyperparameters and perform a GridSearchCV (it would be the same with RandomSearchCV but whatever):
params = {
'linearsvc__C' : [...],
'linearsvc__tol' : [...],
'linearsvc__degree' : [...]
}
clf = make_pipeline(PCA(), SVC(kernel='linear'))
model = GridSearchCV(clf, params, cv = 5, verbose = 2, n_jobs = -1)
model.fit(X_tr, y_tr)
My issue is: my teacher told me that you should never fit the preprocessing algorithm (here PCA) on the validation set in case of a k fold cv, but only on the train split (here both the train split and validation split are subsets of X_tr, and of course they change at every fold). So if I have PCA() here, it should fit on the part of the fold used for training the model and eventually when I test the resulting model against the validation split, preprocess it using the PCA model obtained fitting it against the training set. This ensures no leaks whatsowever.
Does sklearn account for this?
And if it does: suppose that now I want to use imblearn to perform oversampling on an unbalanced set:
clf = make_pipeline(SMOTE(), SVC(kernel='linear'))
still according to my teacher, you shouldn't perform oversampling on the validation split as well, as this could lead to inaccurate accuracies. So the statement above that held for PCA about transforming the validation set on a second moment does not apply here.
Does sklearn/imblearn account for this as well?
Many thanks in advance

How to conduct dataset balancing whilst using Pipeline in Sklearn? [duplicate]

This question already has answers here:
Balance classes in cross validation
(2 answers)
Process for oversampling data for imbalanced binary classification
(2 answers)
Closed 2 years ago.
I have decided to use Sklearn's Pipeline class to ensure that my model is not prone to data leakage.
However, my multi-class classification dataset is extremely imbalanced (3 classes) and hence need to implement data set balancing. However, I have researched properly but I cannot find an answer as to when and how this dataset rebalancing step should be conducted. Should it be done before scaling or after? Should it be done train/test split or after?
For simplicity's sake, I will not be using SMOTE, but rather random minority upsampling. Any answer would be greatly appreciated.
My code is as follows:
#All necessary packages have already been imported
x = df['MACD', 'MFI', 'ROC', 'RSI', 'Ultimate Oscillator', 'Williams %R', 'Awesome Oscillator', 'KAMA',
'Stochastic Oscillator', 'TSI', 'Volume Accumulator', 'ADI', 'CMF', 'EoM', 'FI', 'VPT','ADX','ADX Negative',
'ADX Positive', 'EMA', 'CRA']
y = df['Label']
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 0)
pipe = Pipeline([('sc', StandardScaler()),
('svc', SVC(decision_function_shape = 'ovr'))])
candidate_parameters = [{'C': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3],
'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 2, 3], 'kernel': ['poly']
}]
clf = GridSearchCV(estimator = pipe, param_grid = candidate_parameters, cv = 5, n_jobs = -1)
clf.fit(X_train, y_train)
You need to do rebalancing after train/test split. In real world, you do not know what will be your test set so it is better to keep original. You can rebalance only train set to learn better model and then test on original test dataset. (you can also keep validation set as original)

Difference between cross_val_score and another way of calculating accuracy

I tried to calculate the accuracy and was puzzled by the fact that cross_val_score gives a rather low result, than by comparing the predicted results with the correct.
First way of counting, that gives
[0.8033333333333333, 0.7908333333333334, 0.8033333333333333, 0.7925,0.8066666666666666]
kf = KFold(shuffle=True, n_splits=5)
scores = []
for train_index, test_index in kf.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
scores.append(np.sum(y_pred == y_test) / len(y_test))
Second way gives array([0.46166667, 0.53583333, 0.40916667, 0.44666667, 0.3775 ]):
model = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)
cross_val_score(model, X, y, cv = 5, scoring='accuracy')
What's my mistake?
cross_val_score will use a StratifiedKFold cv iterator when not specified otherwise. A StratifiedKFold will keep the ratio of classes balanced the same way in train and test split. For more explanation, see my other answer here:-
https://stackoverflow.com/a/48314533/3374996
On the other hand, in your first approach you are using KFold which will not keep the balance of classes. In addition you are doing shuffling of data in that.
So in each fold, there is data difference in your two approaches and hence different results.
The low score in cross_val_score is probably because of the fact that you are providing the complete data to it, instead of breaking it into test and training set. This generally leads to leakage of information which results in your model giving incorrect predictions. See this post for more explanation.
References
Learn the right way to validate models

Resources