SciKit Learn feature selection and cross validation using RFECV - machine-learning

I am still very new to machine learning and trying to figure things out myself. I am using scikit-learn and have a data set of tweets with around 20,000 features (n_features=20,000). So far I have achieved precision, recall and F1 scores of around 79%. I would like to use RFECV for feature selection to improve the performance of my model. I have read the scikit-learn documentation but am still a bit confused about how to use RFECV.
This is the code I have so far:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)
# Create the RFECV object
nb = MultinomialNB(alpha=0.5)
# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')
rfecv.fit(X_tfidf, y_train)
X_rfecv=rfecv.transform(X_tfidf)
print("Optimal number of features : %d" % rfecv.n_features_)
# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)
# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)
y_predicted = clf.predict(X_test_rfecv)
#print the mean accuracy on the given test data and labels
print ("Classifier score is: %s " % rfecv.score(X_test_rfecv,y_test))
Three questions:
1) Is this the correct way to use cross validation and RFECV? I am especially interested to know if I am running any risk of overfitting.
2) The accuracy of my model before and after I implemented RFECV with the above code are almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Anything I might have missed here or could do differently to improve the performance of my model?
3) What other feature selection methods could you recommend me to try? I have tried RFE and SelectKBest so far, but they both haven't given me any improvement in terms of model accuracy.

To answer your questions:
RFECV has cross-validation built into the feature selection (hence the name), so you don't really need additional cross-validation for this single step. However, since I understand you are running several tests, it's good to have an overall cross-validation to ensure you're not overfitting to a specific train-test split. I'd like to mention 2 points here:
I doubt the code behaves exactly like you think it does ;).
# cross validation
sss = StratifiedShuffleSplit(y, 5, test_size=0.2, random_state=42)
for train_index, test_index in sss:
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
# feature extraction
count_vect = CountVectorizer(stop_words='English', min_df=3, max_df=0.90, ngram_range=(1,3))
X_CV = count_vect.fit_transform(docs_train)
Here we first go through the loop, that has 5 iterations (n_iter parameter in StratifiedShuffleSplit). Then we go out of the loop and we just run all your code with the last values of train_index, test_index. So this is equivalent to a single train-test split where you probably meant to have 5. You should move your code back into the loop if you want it to run like a 'proper' cross validation.
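To make the point concrete, here is a minimal sketch of the loop with every step moved inside it. It is written against the current sklearn.model_selection API, where the number of splits is passed as n_splits and the splitter is iterated with .split(X, y). The toy corpus, labels and variable names are made up for illustration, and feature selection is left out to keep it short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the tweets
X = np.array([
    "good movie", "great film", "nice plot", "good acting", "great fun",
    "bad movie", "awful film", "poor plot", "bad acting", "awful bore",
])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores = []
for train_index, test_index in sss.split(X, y):
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Fit the vectorizer on the training documents of this split only
    vect = TfidfVectorizer()
    X_train = vect.fit_transform(docs_train)
    X_test = vect.transform(docs_test)

    # Train and score a fresh classifier for this split
    clf = MultinomialNB(alpha=0.5).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

# One score per split, plus their average
print(len(scores), np.mean(scores))
```

Because everything from vectorization to scoring sits inside the loop, each of the 5 splits produces its own score instead of only the last split being evaluated.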
You are worried about overfitting: indeed when 'looking for the best method' the risk exists that we're going to pick the method that works best... only on the small sample we're testing the method on.
Here the best practice is to have a first train-test split, then to perform cross-validation only using the train set. The test set can be used 'sparingly' when you think you found something, to make sure the scores you get are consistent and you're not overfitting.
It may look like you're throwing away 30% of your data (your test set), but it's absolutely worth it.
It can be puzzling to see feature selection does not have that big an impact. To introspect a bit more you could look into the evolution of the score with the number of selected features (see the example from the docs).
That being said, I don't think this is the right use case for RFE. Basically, with step=1 your code eliminates features one at a time and refits the estimator after every elimination, which probably takes a long time to run and does not make much sense when you have 20,000 features.
Other feature selection methods: here you mention SelectKBest but you don't tell us which method you use to score your features! SelectKBest picks the K best features according to a score function. I'm guessing you were using the default (f_classif, the ANOVA F-value), which is OK, but it's better to have an idea of what the default does ;).
I would try SelectPercentile with chi2 as a score function. SelectPercentile is probably a bit more convenient than SelectKBest because if your dataset grows a percentage probably makes more sense than a hardcoded number of features.
Another example from the docs that does just that (and more).
Additional remarks:
You could use a TfidfVectorizer instead of a CountVectorizer followed by a TfidfTransformer. This is strictly equivalent.
You could use a pipeline object to pack the different steps of your classifier into a single object you can run cross validation on (I encourage you to read the docs, it's pretty useful).
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectPercentile
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
# Note: the built-in stop word list is selected with the lowercase string 'english'
pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1,3))),
    ("selector", SelectPercentile(score_func=chi2, percentile=70)),
    ('NB', MultinomialNB(alpha=0.5))
])
Then you'd be able to run cross validation on the pipeline object to find the best combination of alpha and percentile, which is much harder to do with separate estimators.
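As a rough sketch of what that search could look like, the snippet below runs GridSearchCV over a pipeline of the same shape. The tiny corpus, the grid values and the cv=3 setting are assumptions chosen only so the example runs quickly; they are not recommendations for the tweet data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with your own documents and labels
docs = [
    "good movie", "great film", "nice plot", "good acting", "great fun", "nice cast",
    "bad movie", "awful film", "poor plot", "bad acting", "awful bore", "poor cast",
]
labels = [1] * 6 + [0] * 6

pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer()),
    ("selector", SelectPercentile(score_func=chi2)),
    ("NB", MultinomialNB()),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {
    "selector__percentile": [50, 100],
    "NB__alpha": [0.1, 0.5, 1.0],
}

# Every candidate is evaluated with 3-fold CV; all steps are refit inside each fold
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

Since the vectorizer and the selector are part of the pipeline, they are fitted only on the training portion of each fold, so the selection step never sees the validation documents.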
Hope this helps, happy learning ;).

Related

Correct way to do cross validation in a pipeline with imbalanced data

For the given imbalanced data, I have created separate pipelines for standardization and one-hot encoding
numeric_transformer = Pipeline(steps = [('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('ohe', OneHotCategoricalEncoder())])
After that a column transformer keeping the above pipelines in one
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
The final pipeline is as below
smt = SMOTE(random_state=42)
rf = pl1([('preprocessor', preprocessor),('smote',smt),
('classifier', RandomForestClassifier())])
I am fitting the pipeline on imbalanced data, so I have included SMOTE along with the preprocessing and the classifier. Because the data is imbalanced, I want to check the recall score.
Is the code below the correct way to do it? I am getting a recall of around 0.98, which makes me suspect the model is overfitting. Am I making a mistake anywhere?
scores = cross_val_score(rf, X, y, cv=5,scoring="recall")
The important concern in imbalanced settings is to ensure that enough members of the minority class will be present in each CV fold; thus, it would seem advisable to enforce that using StratifiedKFold, i.e.:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(rf, X, y, cv=skf, scoring="recall")
Nevertheless, it turns out that even when using the cross_val_score as you do (i.e. simply with cv=5), scikit-learn takes care of it and engages a stratified CV indeed; from the docs:
cv : int, cross-validation generator or an iterable, default=None
None, to use the default 5-fold cross validation,
int, to specify the number of folds in a (Stratified)KFold.
For int/None inputs, if the estimator is a classifier and y is either
binary or multiclass, StratifiedKFold is used. In all other cases,
KFold is used.
So, using your code as is:
scores = cross_val_score(rf, X, y, cv=5, scoring="recall")
is absolutely fine indeed.
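If you want to see this for yourself, the following sketch uses check_cv from sklearn.model_selection, which to my knowledge is the helper that resolves an integer cv into a splitter. The 90/10 label vector is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Made-up imbalanced binary labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)

# What an integer cv resolves to for a classifier and for a non-classifier
cv_clf = check_cv(5, y, classifier=True)
cv_reg = check_cv(5, y, classifier=False)
print(type(cv_clf).__name__, type(cv_reg).__name__)

# Count the minority samples that land in each test fold of the stratified splitter
X = np.zeros((len(y), 1))
minority_per_fold = [int(y[test].sum()) for _, test in cv_clf.split(X, y)]
print(minority_per_fold)
```

The stratified splitter spreads the minority class evenly over the folds, which is exactly the property the recall score needs in order to be computed on every fold.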

Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridsearchCV` biased if we include transformers in the pipeline?

Data preprocessors such as StandardScaler should fit_transform the train set and only transform (not fit) the test set. I expect the same fit/transform process to apply to cross-validation when tuning the model. However, I found that cross_val_score and GridSearchCV fit_transform the entire train set with the preprocessor (rather than fit_transform the inner train set and transform the inner validation set). I believe this artificially removes variance from the inner validation set, which biases the CV score (the metric GridSearchCV uses to select the best model). Is this a concern, or did I miss something?
To demonstrate the above issue, I tried the following three simple test cases with the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle.
1) I intentionally fit and transform the entire X with StandardScaler()
X_sc = StandardScaler().fit_transform(X)
lr = LogisticRegression(penalty='l2', random_state=42)
cross_val_score(lr, X_sc, y, cv=5)
2) I include SC and LR in the Pipeline and run cross_val_score
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l2', random_state=42))
])
cross_val_score(pipe, X, y, cv=5)
3) Same as 2 but with GridSearchCV
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])
params = {
    'lr__penalty': ['l2']
}
gs = GridSearchCV(pipe, param_grid=params, cv=5).fit(X, y)
gs.cv_results_
They all produce the same validation scores.
[0.9826087 , 0.97391304, 0.97345133, 0.97345133, 0.99115044]
No, sklearn doesn't call fit_transform on the entire dataset.
To check this, I subclassed StandardScaler to print the size of the dataset sent to it.
class StScaler(StandardScaler):
    def fit_transform(self, X, y=None):
        print(len(X))
        return super().fit_transform(X, y)
If you now use it in place of StandardScaler in your code, you'll see that the dataset passed in the first case is actually bigger than the one passed in each fold of the other cases.
But why does the accuracy remain exactly the same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, such as KNeighborsClassifier, you'll find that the accuracy of the two cases starts to differ.
X,y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)
lr = KNeighborsClassifier(n_neighbors=1)
cross_val_score(lr, X_sc,y, cv=5)
Outputs:
569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]
And the 2nd case,
pipe = Pipeline([
    ('sc', StScaler()),
    ('lr', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))
Outputs:
454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]
Not a big change accuracy-wise, but a change nonetheless.
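Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. When hyperparameters are tuned against that test set, knowledge about it can leak into the model, so yet another part of the data would have to be held out as a validation set, which drastically reduces the number of samples left for training.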
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
A model is trained using k-1 of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
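The procedure described above can be sketched in plain Python without any library. The helper names (k_fold_indices, cross_validate) and the toy "predict the training mean" model are made up for illustration; in practice scikit-learn's KFold and cross_val_score do this for you.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(fit, score, X, y, k):
    """Train on k-1 folds, validate on the remaining fold, and average the k scores."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(score(model, [X[i] for i in test], [y[i] for i in test]))
    return sum(scores) / k, scores


# Toy "model": always predict the mean of the training targets
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


# Negated mean squared error, so that higher is better
def neg_mse(model, X_test, y_test):
    return -sum((t - model) ** 2 for t in y_test) / len(y_test)


X = list(range(10))
y = [2.0 * x for x in X]
mean_score, fold_scores = cross_validate(fit_mean, neg_mse, X, y, k=5)
print(mean_score, fold_scores)
```

Every sample is used for validation exactly once and for training k-1 times, which is why the approach wastes so little data.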
Moreover, if your data is imbalanced to begin with, you have to rebalance it, e.g. with SMOTE, oversampling of the minority target class, or undersampling of the majority target class.

Cross Validation in Keras

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras
from sklearn.model_selection import StratifiedKFold
def load_data():
    # load your data using this function
    ...
def create_model():
    # create your model using this function
    ...
def train_evaluate(model, x_train, y_train, x_test, y_test):
    # fit and evaluate here.
    ...
if __name__ == "__main__":
    X, Y = load_data()
    kFold = StratifiedKFold(n_splits=10)
    for train, test in kFold.split(X, Y):
        model = None
        model = create_model()
        train_evaluate(model, X[train], Y[train], X[test], Y[test])
In my studies on neural networks, I learned that a neural network's knowledge is represented in its synaptic weights, and that during training the weights are updated to reduce the network's error rate and improve its performance. (In my case, I'm using supervised learning.)
To better train and assess a neural network, a commonly used method is cross-validation, which partitions the data set for training and evaluating the model.
My doubt is...
In this code snippet:
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Do we define, train and evaluate a new neural net for each of the generated partitions?
If my goal is to fine-tune the network for the entire dataset, why is it not correct to define a single neural network and train it with the generated partitions?
That is, why is this piece of code like this?
for train, test in kFold.split(X, Y):
    model = None
    model = create_model()
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
and not like this?
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
Is my understanding of how the code works wrong? Or my theory?
"If my goal is to fine-tune the network for the entire dataset"
It is not clear what you mean by "fine-tune", or even what exactly is your purpose for performing cross-validation (CV); in general, CV serves one of the following purposes:
Model selection (choose the values of hyperparameters)
Model assessment
Since you don't define any search grid for hyperparameter selection in your code, it would seem that you are using CV in order to get the expected performance of your model (error, accuracy etc).
Anyway, for whatever reason you are using CV, the first snippet is the correct one; your second snippet
model = None
model = create_model()
for train, test in kFold.split(X, Y):
    train_evaluate(model, X[train], Y[train], X[test], Y[test])
will train your model sequentially over the different partitions (i.e. train on partition #1, then continue training on partition #2 etc), which essentially is just training on your whole data set, and it is certainly not cross-validation...
That said, a final step after the CV which is often only implied (and frequently missed by beginners) is that, after you are satisfied with your chosen hyperparameters and/or model performance as given by your CV procedure, you go back and train again your model, this time with the entire available data.
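A minimal sketch of that two-step workflow follows. To keep it fast and self-contained it swaps the Keras network for a scikit-learn LogisticRegression on synthetic data, so create_model here is only a stand-in for the function in the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the real data set
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def create_model():
    # Stand-in for the Keras model factory from the question
    return LogisticRegression(max_iter=1000)


# Step 1: estimate performance. cross_val_score clones the estimator,
# so every fold is fitted from scratch on its own training part.
cv_scores = cross_val_score(create_model(), X, y, cv=5)
print(cv_scores.mean())

# Step 2: once satisfied with the CV result, train the final model on all the data
final_model = create_model().fit(X, y)
```

The models fitted during step 1 are thrown away; only the score estimate is kept, and the model you actually deploy comes from step 2.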
You can use wrappers of the Scikit-Learn API with Keras models.
Given inputs x and y, here's an example of repeated 5-fold cross-validation:
from sklearn.model_selection import RepeatedKFold, cross_val_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
def buildmodel():
    model = Sequential([
        Dense(10, activation="relu"),
        Dense(5, activation="relu"),
        Dense(1)
    ])
    model.compile(optimizer='adam', loss='mse', metrics=['mse'])
    return model
estimator = KerasRegressor(build_fn=buildmodel, epochs=100, batch_size=10, verbose=0)
kfold = RepeatedKFold(n_splits=5, n_repeats=100)
results = cross_val_score(estimator, x, y, cv=kfold, n_jobs=2)  # 2 cpus
results.mean()  # mean score over all folds; the wrapper's score() returns the negated loss, so this is -MSE
I think many of your questions will be answered if you read about nested cross-validation. This is a good way to "fine tune" the hyper parameters of your model. There's a thread here:
https://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection
The biggest issue to be aware of is "peeking" or circular logic. Essentially, you want to make sure that none of the data used to assess model accuracy is seen during training.
One example where this might be problematic is if you are running something like PCA or ICA for feature extraction. If doing something like this, you must be sure to run PCA on your training set, and then apply the transformation matrix from the training set to the test set.
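As an illustration of the PCA case, here is a small sketch on synthetic data. The data set, the number of components and the classifier are arbitrary choices made only for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data standing in for the real features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learn the projection from the training data only...
pca = PCA(n_components=5)
Z_train = pca.fit_transform(X_train)
# ...and apply the same projection to the test data without refitting
Z_test = pca.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
print(clf.score(Z_test, y_test))

# The same logic packed into a pipeline, which keeps the steps together
pipe = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

The pipeline form is convenient inside cross-validation, because the PCA step is then refitted on the training part of each fold automatically.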
The main idea of testing your model performance is to perform the following steps:
Train a model on a training set.
Evaluate your model on data not used during training, in order to simulate the arrival of new data.
So basically, the data you finally test your model on should mimic the first portion of data you'll get from your client/application to apply your model to.
That's why cross-validation is so powerful: it lets every data point in your whole dataset be used as a simulation of new data.
And now, to answer your question, every cross-validation should follow this pattern:
for train, test in kFold.split(X, Y):
    model = training_procedure(train, ...)
    score = evaluation_procedure(model, test, ...)
because after all, you'll first train your model and then use it on new data. In your second approach you cannot treat it as mimicking the training process because, e.g., in the second fold your model would keep information from the first fold, which is not equivalent to your training procedure.
Of course, you could apply a training procedure that uses 10 folds of consecutive training in order to fine-tune the network. But this is not cross-validation; you would still need to evaluate this procedure using a scheme like the one above.
The commented out functions make this a little less obvious, but the idea is to keep track of your model performance as you iterate through your folds and at the end provide either those lower level performance metrics or an averaged global performance. For example:
The train_evaluate function ideally would output some accuracy score for each split, which could be combined at the end.
import numpy as np
from sklearn.model_selection import StratifiedKFold
def train_evaluate(model, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    return model.score(x_test, y_test)
X, Y = load_data()
kFold = StratifiedKFold(n_splits=10)
scores = np.zeros(10)
# enumerate keeps track of the fold index
for idx, (train, test) in enumerate(kFold.split(X, Y)):
    model = create_model()  # a fresh model for every fold
    scores[idx] = train_evaluate(model, X[train], Y[train], X[test], Y[test])
print(scores)
print(scores.mean())
So yes you do want to create a new model for each fold as the purpose of this exercise is to determine how your model as it is designed performs on all segments of the data, not just one particular segment that may or may not allow the model to perform well.
This type of approach becomes particularly powerful when applied along with a grid search over hyperparameters. In this approach you train a model with varying hyperparameters using the cross validation splits and keep track of the performance on splits and overall. In the end you will be able to get a much better idea of which hyperparameters allow the model to perform best. For a much more in depth explanation see sklearn Model Selection and pay particular attention to the sections of Cross Validation and Grid Search.

Build a Random Forest regressor with Cross Validation from scratch

I know this is a very classical question which might be answered many times in this forum, however I could not find any clear answer which explains this clearly from scratch.
Firstly, imagine that my dataset called my_data has 4 variables such as
my_data = variable1, variable2, variable3, target_variable
So, let's come to my problem. I'll explain all my steps and ask your help for where I've been stuck:
# STEP1 : split my_data into [predictors] and [targets]
predictors = my_data[[
'variable1',
'variable2',
'variable3'
]]
targets = my_data.target_variable
# STEP2 : import the required libraries
from sklearn import cross_validation
from sklearn.ensemble import RandomForestRegressor
#STEP3 : define a simple Random Forest model attributes
model = RandomForestClassifier(n_estimators=100)
#STEP4 : Simple K-Fold cross validation. 3 folds.
cv = cross_validation.KFold(len(my_data), n_folds=3, random_state=30)
# STEP 5
At this step, I want to fit my model based on the training dataset, and then
use that model on test dataset and predict test targets. I also want to calculate the required statistics such as MSE, r2 etc. for understanding the performance of my model.
I'd appreciate it if someone could help me with some basic lines of code for Step 5.
First off, you are using the cross_validation module, which was deprecated in scikit-learn 0.18 and removed in 0.20. Its replacement is named model_selection, so I am using that in this answer.
Second, you are importing RandomForestRegressor, but defining RandomForestClassifier in the code. I am taking RandomForestRegressor here, because the metrics you want (MSE, R2 etc) are only defined for regression problems, not classification.
There are multiple ways to do what you want. I assume that since you are trying to use the KFold cross-validation here, you want to use the left-out data of each fold as test fold. To accomplish this, we can do:
predictors = my_data[[
'variable1',
'variable2',
'variable3'
]]
targets = my_data.target_variable
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics
model = RandomForestRegressor(n_estimators=100)
cv = model_selection.KFold(n_splits=3)
for train_index, test_index in cv.split(predictors):
    print("TRAIN:", train_index, "TEST:", test_index)
    # predictors and targets are pandas objects, so select rows by position with .iloc
    X_train, X_test = predictors.iloc[train_index], predictors.iloc[test_index]
    y_train, y_test = targets.iloc[train_index], targets.iloc[test_index]
    # For training, fit() is used
    model.fit(X_train, y_train)
    # Default metric is R2 for regression, which can be accessed by score()
    print("R2 (score):", model.score(X_test, y_test))
    # For other metrics, we need the predictions of the model
    y_pred = model.predict(X_test)
    print("MSE:", metrics.mean_squared_error(y_test, y_pred))
    print("R2:", metrics.r2_score(y_test, y_pred))
For all this, documentation is your best friend, and the scikit-learn documentation is one of the best I have ever seen. The following links may help you learn more:
http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance
http://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
http://scikit-learn.org/stable/user_guide.html
Also, the for loop should iterate over the splitter that was actually defined (cv, not kf):
model = RandomForestRegressor(n_estimators=100)
for train_index, test_index in cv.split(predictors):

Classification with numerical label?

I know of a couple of classification algorithms such as decision trees, but I can't apply any of them to the problem I have at hand.
I have a dataset in which each row contains information about a purchase. Its columns are:
- customer id
- store id where the purchase took place
- date and time of the event
- amount of money spent
I'm trying to make a prediction that, given the information of who, where and when, predicts how much money is going to be spent.
What are some possible ways of doing this? Are there any well-known algorithms?
Also, I'm currently learning RapidMiner, and I'm experimenting with some of its features. Everything that I've tried there doesn't allow me to have a real number (amount spent) as a label. Maybe I'm doing something wrong?
You could use a Decision Tree Regressor for this. Using a toolkit like scikit-learn, you could use the DecisionTreeRegressor algo where your features would be store id, date and time, and customer id, and your target would be the amount spent.
You could turn this into a supervised learning problem. This is untested code, but it could probably get you started
# Load libraries
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
def fit_predict_model(data_import, sample):
    """Find and tune the optimal model, then make a prediction for `sample`."""
    # Get the features and labels from your data
    X, y = data_import.data, data_import.target
    # Setup a Decision Tree Regressor
    regressor = DecisionTreeRegressor()
    parameters = {'max_depth': (4, 5, 6, 7), 'random_state': [1]}
    scoring_function = metrics.make_scorer(metrics.mean_absolute_error, greater_is_better=False)
    # Fit your data to it, tuning max_depth with 10-fold cross-validation
    reg = GridSearchCV(estimator=regressor, param_grid=parameters, scoring=scoring_function, cv=10, refit=True)
    reg.fit(X, y)
    print("Best Parameters:", reg.best_params_)
    # Use the model to predict the output of a particular sample
    # (sample must be 2-D: one row per sample, one column per feature)
    prediction = reg.predict(sample)
    print("Prediction:", prediction)
# Call it with your own data and your own test sample:
# fit_predict_model(your_data, your_sample)
I took this almost directly from a project where I was predicting housing prices, so some of it may be unnecessary for your case, and without doing validation you have no idea how accurate it would be, but this should get you started.
Check out this link:
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Yes, as comments have pointed out it's regression that you need. Linear regression does sound like a good starting point as you don't have a huge number of variables.
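For readers working in Python rather than RapidMiner, a rough sketch of that starting point could look like the following. The purchase rows are invented, the date and time column is assumed to have been reduced to an hour-of-day number beforehand, and the ids are one-hot encoded because they are labels rather than quantities.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up purchases: customer id, store id, hour of day
X = np.array([
    [1, 10, 9], [1, 11, 18], [2, 10, 12], [2, 11, 20],
    [3, 10, 8], [3, 11, 19], [1, 10, 13], [2, 10, 17],
])
# Made-up amounts spent, the real-valued label to predict
amount = np.array([12.5, 30.0, 8.0, 22.0, 15.0, 41.0, 14.0, 9.5])

# One-hot encode the two id columns and pass the hour through unchanged
preprocess = ColumnTransformer(
    [("ids", OneHotEncoder(handle_unknown="ignore"), [0, 1])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LinearRegression())
model.fit(X, amount)

# Predict the amount for customer 2 at store 11 at 18:00
pred = model.predict(np.array([[2, 11, 18]]))
print(pred)
```

With real data you would want to evaluate this with cross-validation and probably derive more features from the timestamp (weekday, month and so on) before trusting the predictions.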
In RapidMiner, type regression into the Operators menu and you'll see several options under Modelling -> Functions: Linear Regression, Polynomial, Vector, etc. (There are more, but as a beginner let's start here.)
Right click any of these operators and press Show Operator Info and you'll see numerical labels are allowed.
Next scroll through the help documentation of the operator and you'll see a link to a tutorial process. It's really simple to use, but it's good to get you started with an example.
Let me know if you need any help.

Resources