How to get support vector number after cross validation - machine-learning

Here is my code for digit classification using a non-linear SVM. I apply a cross-validation scheme to select the hyperparameters C and gamma, but the model returned by GridSearchCV has no n_support_ attribute, so I cannot get the number of support vectors.
from sklearn import datasets
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
# Initialize an SVM estimator
estimator = SVC(kernel='rbf', C=1, gamma=1)
#Choose cross validation iterator.
cv = ShuffleSplit(X_train.shape[0], n_iter=5, test_size=0.2, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4, 1, 2, 10],
                     'C': [1, 10, 50, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
clf=GridSearchCV(estimator=estimator, cv=cv, param_grid=tuned_parameters)
#begin the cross-validation task to get the best model with best parameters.
#After this task, we get a clf as a best model with best parameters C and gamma.
clf.fit(X_train,y_train)
print()
print("Best parameters:")
print(clf.best_params_)
print("test score with clf", clf.score(X_test, y_test))
print("training score with clf", clf.score(X_train, y_train))
# It does not work. So, how can I recover the number of support vectors?
print("Number of support vectors by class", clf.n_support_)
**Here is my method: I train a new SVM object with the best parameters, and I note that it gets the same test and training error as clf.**
clf2 = SVC(C=10, gamma=0.001)
clf2.fit(X_train, y_train)
print("test score with clf2", clf2.score(X_test, y_test))
print("training score with clf2", clf2.score(X_train, y_train))
print(clf2.n_support_)
Is my proposed method correct? Any comments?

GridSearchCV will fit a number of models. You can get the best one with clf.best_estimator_. Its n_support_ attribute holds the number of support vectors per class, so clf.best_estimator_.n_support_.sum() gives the total number of support vectors, and clf.best_estimator_.support_ gives the indices of the support vectors in your training set.
You can also get the parameters and the score of the best model with clf.best_params_ and clf.best_score_ respectively.
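For reference, here is a minimal sketch of how those attributes fit together, assuming clf is the fitted GridSearchCV from the question and refit=True (the default), so clf.best_estimator_ is an SVC refit on the full training set:
best_svc = clf.best_estimator_
print(clf.best_params_)           # best hyperparameter combination found by the search
print(clf.best_score_)            # mean cross-validated score of that combination
print(best_svc.n_support_)        # number of support vectors per class
print(best_svc.n_support_.sum())  # total number of support vectors
print(best_svc.support_)          # indices of the support vectors in X_train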

Related

Where should the categorical encoding be done in a k-fold CV procedure?

I want to apply a cross-validation method to my machine learning models. In these models, I want feature selection and a grid search to be applied as well. Imagine that I want to estimate the performance of a K-Nearest-Neighbors classifier by applying a feature selection technique based on an F-score (ANOVA) that chooses the 10 most relevant features. The code would be as follows:
# 10-times 10-fold cross validation
n_repeats = 10
rkf = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
# Data standardization
scaler = StandardScaler()
# Variable to contain error measures and counter for the splits
error_knn = []
split = 0
for train_index, test_index in rkf.split(X, y):
    # Print a dot for each train / test partition
    sys.stdout.write('.')
    sys.stdout.flush()
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Standardize the data
    scaler.fit(X_train, y_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)
    ###- In order to select the best number of neighbors -###
    # Pipeline for training the classifier from previous notebooks
    pipeline = Pipeline([('knn', KNeighborsClassifier())])
    N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
    param_grid = {'knn__n_neighbors': N_neighbors}
    # Evaluate the performance in a 5-fold cross-validation
    skfold = RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=split)
    # n_jobs = -1 to use all processors
    gridcv = GridSearchCV(pipeline, cv=skfold, n_jobs=-1, param_grid=param_grid,
                          scoring=make_scorer(accuracy_score))
    result = gridcv.fit(X_train, y_train)
    ###- Results -###
    # Mean accuracy and standard deviation
    accuracies = gridcv.cv_results_['mean_test_score']
    std_accuracies = gridcv.cv_results_['std_test_score']
    # Best value for the number of neighbors
    # Define KNeighbors Classifier with that best value
    # Method fit(X,y) to fit each model according to training data
    best_Nneighbors = N_neighbors[np.argmax(accuracies)]
    knn = KNeighborsClassifier(n_neighbors=best_Nneighbors)
    knn.fit(X_train, y_train)
    # Error for the prediction
    error_knn.append(1.0 - np.mean(knn.predict(X_test) == y_test))
    split += 1
However, my columns are categorical (except the binary label) and I need to do a categorical encoding. I cannot remove these columns because they are essential.
Where would you perform this encoding, and how would you handle the problem of categories that are unseen in some folds?
Categorical encoding should be performed as the first step, precisely to avoid the problem you mentioned regarding unseen labels in each fold.
Additionally, your current implementation suffers from data leakage.
You're performing feature scaling on the full X_train dataset before performing your inner cross-validation.
This can be solved by including StandardScaler in the pipeline used for your GridSearchCV:
...
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
###- In order to select the best number of neighbors -###
# Pipeline for training the classifier from previous notebooks
pipeline = Pipeline(
    [('scaler', scaler), ('knn', KNeighborsClassifier())]
)
N_neighbors = [1, 3, 5, 7, 11, 15, 20, 25, 30]
param_grid = {'knn__n_neighbors': N_neighbors}
...
Another couple of tips:
GridSearchCV has a best_estimator_ attribute that can be used to extract the estimator with the best set of hyperparameters found.
When using GridSearchCV with refit=True (the default), you can use the object directly to perform predictions, e.g. gridcv.predict(X_test).
EDIT: Perhaps I was too general about when to perform categorical encoding. Your approach should depend on your problem/dataset.
If you know beforehand how many categorical features exist and you want to train your inner CV classifiers with this knowledge, you should perform categorical encoding as the first step.
If at training time you do not know how many categorical features you are going to see, or you want to train your CV classifiers without knowledge of the full range of categorical features, you should perform categorical encoding at each fold.
With the former, your classifiers will all be trained on the same feature space; that is not guaranteed with the latter.
If using the latter, the above pipeline can be extended to incorporate categorical encoding:
pipeline = Pipeline(
    [
        ('enc', OneHotEncoder()),
        ('scaler', StandardScaler(with_mean=False)),
        ('knn', KNeighborsClassifier()),
    ],
)
I suggest you read the Encoding categorical features section of scikit-learn's User Guide carefully.
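To make the per-fold variant concrete, here is a minimal sketch under the assumptions of the question (all feature columns categorical, binary label); OneHotEncoder(handle_unknown='ignore') encodes categories unseen during that fold's fit as all zeros instead of raising an error, and the grid values are illustrative only:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

pipeline = Pipeline([
    # unseen categories in a fold's test split become all-zero columns
    # instead of raising an error at transform time
    ('enc', OneHotEncoder(handle_unknown='ignore')),
    ('scaler', StandardScaler(with_mean=False)),
    ('knn', KNeighborsClassifier()),
])
param_grid = {'knn__n_neighbors': [1, 3, 5, 7, 11, 15, 20, 25, 30]}
gridcv = GridSearchCV(pipeline, param_grid=param_grid, n_jobs=-1,
                      cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=1, random_state=0))
# gridcv.fit(X_train, y_train)   # called inside each outer fold, exactly as in the question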

How to compare baseline and GridSearchCV results fairly?

I am a bit confused about comparing the best GridSearchCV model and a baseline.
For example, suppose we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and a model without any cross-validation?
For the baseline, isn't it better to use the validation sample too (instead of the whole train sample)?
No, they aren't comparable.
Your baseline model used X_train to fit the model, and then you scored the fitted model on that same X_train sample. This is like cheating: the model will look artificially good because it is being evaluated on data it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the models tried when searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid-search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
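Putting it all together, a minimal sketch of the fair comparison, assuming X and y are your full feature matrix and labels (the parameter grid here is a placeholder, not the one from the question):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

baseline = LogisticRegression()

cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = {'C': [0.01, 0.1, 1, 10]}   # placeholder grid
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)

# both models are fit on exactly the same training data
baseline.fit(X_train, y_train)
best_model.fit(X_train, y_train)   # refit=True (default): predict() uses the refit best estimator

# ...and evaluated on the same held-out test set
print(accuracy_score(y_test, baseline.predict(X_test)))
print(accuracy_score(y_test, best_model.predict(X_test)))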

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interested in testing is ensembling models trained on 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit a different VotingClassifier on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or something similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how do I stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually, method #4 from the article is implemented with the same type of classifier; it looks like you want to try a VotingClassifier on each sampled dataset.
There is an implementation of this methodology already in imblearn.ensemble.BalancedBaggingClassifier, which is an extension of the sklearn bagging approach.
You can pass your VotingClassifier as the base estimator and set the number of estimators to the number of times you want the dataset sampling to be carried out. Use the sampling_strategy parameter to control the proportion of downsampling applied to the majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
import xgboost as xgb
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. Maybe you can reuse the _predict_proba() and _collect_probas() functions after fitting your estimators manually.
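Alternatively, if you want to stay with the manual averaging approach from the question, a minimal sketch could look like this (it assumes voting_clf_1/2/3 have already been fitted on their respective resampled sets with voting='soft', and that every class is present in each resampled training set so the classes_ attributes line up):
import numpy as np

def average_predict(classifiers, X):
    # stack the soft-vote probabilities: shape (n_classifiers, n_samples, n_classes)
    probas = np.stack([clf.predict_proba(X) for clf in classifiers])
    avg = probas.mean(axis=0)
    # map the winning column back to the original class labels
    return classifiers[0].classes_[np.argmax(avg, axis=1)]

# y_pred = average_predict([voting_clf_1, voting_clf_2, voting_clf_3], X_test)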

Can we fit a model using the same dataset applied during cross validation process?

I have the following method that performs cross-validation on a dataset, followed by a final model fit:
import numpy as np
import utilities.utils as utils
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.utils import shuffle
def CV(args, path):
    df = pd.read_csv(path + 'HIGGS.csv', sep=',')
    df = shuffle(df)
    df_labels = df[df.columns[0]]
    df_features = df.drop(df.columns[0], axis=1)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64),
                        activation='logistic',
                        solver='adam',
                        learning_rate_init=1e-3,
                        max_iter=1000,
                        batch_size=1000,
                        learning_rate='adaptive',
                        early_stopping=True
                        )
    print('\t >>> Start Cross Validation ... ')
    scores = cross_val_score(estimator=clf, X=df_features, y=df_labels, cv=5, n_jobs=-1)
    print("CV Accuracy: %0.2f (+/- %0.2f)\n\n" % (scores.mean(), scores.std() * 2))
    # Final Fit
    print('\t >>> Start Final Fit ... ')
    num_to_read = (int(10999999) * (args.stages * np.dtype(np.float64).itemsize))
    C1 = utils.read_from_disk(path + 'HIGGS.dat', 0, num_to_read, args.stages)
    print(C1)
    print(C1.shape)
    r = C1[:, :1]
    C = np.delete(C1, 0, axis=1)
    tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
    clf.fit(tr_C, tr_r)
    prd_r = clf.predict(ts_C)
    test_acc = accuracy_score(ts_r, prd_r) * 100.
    return test_acc
I understand that cross-validation is about evaluating how well your model performs on a given dataset. My questions are:
Is it logically correct to fit the model again on the same dataset I used during the cross-validation process?
During each CV fold, are the model parameters carried over to the next fold? For instance, in the case of a neural network, is the fitted model from fold=1 carried over to fold=2?
Does this process (I mean fitting the entire dataset as I did above) produce a model accuracy near the average accuracy we get after cross-validation?
Thank you
R1. When you perform CV you split your dataset into k sets; each time you train on k-1 of them and test/validate on the remaining 1/k of the data (a different part every time). Refitting a final model on the full dataset afterwards is standard practice: the CV scores are your estimate of how well that final model will generalize.
R2. Every time the MLP is trained on a set (the k-1 subsets), learning starts again from scratch, so the parameters are not carried over between folds. At the end, the reported error measure is the mean of the errors over the k different scenarios.
R3. If the class distribution in the data is balanced, the results of k-fold CV and a traditional 70/30 split will give similar generalization estimates. On the other hand, if the dataset is highly unbalanced, k-fold CV (e.g. k=10) will tend to give better generalization estimates than a single traditional split, since the data will represent all or most of the problem classes more effectively.
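To make R1 and R2 concrete, here is a minimal sketch of the usual pattern (CV for the performance estimate, then one final fit on all the data), using a synthetic dataset instead of the HIGGS files from the question:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=28, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64), max_iter=1000, early_stopping=True)

# cross_val_score clones clf for every fold, so each fold trains from scratch;
# nothing learned in fold 1 is carried over to fold 2
scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# final model: a single fit on all the data; the CV scores above are the
# estimate of how well this refit model can be expected to generalize
clf.fit(X, y)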

scikit-learn cross validation score in regression

I'm trying to build a regression model, validate and test it, and make sure it doesn't overfit the data. This is my code so far:
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt
data = np.array(read_csv('timeseries_8_2.csv', index_col=0))
inputs = data[:, :8]
targets = data[:, 8:]
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, test_size=0.1, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
# trained = mlpr.fit(x_train, y_train) # should I fit before cross val?
# predicted = mlpr.predict(x_test)
scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)
Printing scores gives an array of 5 numbers, where the first number is usually around 0.91 and is always the largest number in the array.
I'm having a little trouble figuring out what to do with these numbers. So if the first number is the largest number, then does this mean that on the first cross validation attempt, the model scored the highest, and then the scores decreased as it kept trying to cross validate?
Also, should I fit the model on the training data before I call the cross-validation function? I tried commenting that out and I get more or less the same results.
The cross validation function performs the model fitting as part of the operation, so you gain nothing from doing that by hand:
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
And yes, the returned numbers reflect multiple runs:
Returns: Array of scores of the estimator for each run of the cross validation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
Finally, there is no reason to expect that the first result is the largest:
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
boston = datasets.load_boston()
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)
# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
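As a side note on interpreting the numbers: with no scoring argument, cross_val_score falls back to the estimator's own score method, which for MLPRegressor is R^2 (1.0 is perfect, and values can be negative for models worse than always predicting the mean). The folds are evaluated independently, so a simple summary, continuing from the snippet above, is:
scores = cross_val_score(est, boston.data, boston.target, cv=5)   # one R^2 per fold
print("R^2: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std() * 2))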
