scikit-learn cross validation score in regression - machine-learning

I'm trying to build a regression model, validate and test it and make sure it doesn't overfit the data. This is my code thus far:
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt
data = np.array(read_csv('timeseries_8_2.csv', index_col=0))
inputs = data[:, :8]
targets = data[:, 8:]
x_train, x_test, y_train, y_test = train_test_split(
inputs, targets, test_size=0.1, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
# trained = mlpr.fit(x_train, y_train) # should I fit before cross val?
# predicted = mlpr.predict(x_test)
scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)
scores prints an array of 5 numbers, where the first number is usually around 0.91 and is always the largest number in the array.
I'm having a little trouble figuring out what to do with these numbers. If the first number is always the largest, does that mean the model scored the highest on the first cross-validation attempt, and the scores then decreased as it kept cross-validating?
Also, should I fit the training data before I call the cross-validation function? I tried commenting it out and it gives me more or less the same results.

The cross validation function performs the model fitting as part of the operation, so you gain nothing from doing that by hand:
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
And yes, the returned numbers reflect multiple runs:
Returns: Array of scores of the estimator for each run of the cross validation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
Finally, there is no reason to expect that the first result is the largest:
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
boston = datasets.load_boston()
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)
# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
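One detail worth noting: when you pass cv=5 as an integer, cross_val_score builds the 5 folds in row order for a regressor (KFold with shuffle=False by default), so for a dataset with any ordering, such as a time-series CSV, the first fold can systematically differ from the others. A minimal sketch of passing an explicitly shuffled splitter instead, reusing the mlpr, inputs and targets from the question:
from sklearn.model_selection import KFold, cross_val_score
# Shuffled 5-fold CV. Note that shuffling may not be appropriate for genuinely
# temporal data; TimeSeriesSplit is the order-preserving alternative.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(mlpr, inputs, targets, cv=cv)
print(scores)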

Related

Get 100% accuracy score on Decision tree model

I got 100% accuracy on my decision tree model but only got 75% accuracy with a random forest.
Is there something wrong with my model, or is a decision tree simply best suited for the dataset provided?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
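# X (features) and y (labels) are assumed to be defined earlier from the dataset (not shown in the question)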
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At first it may look like your model is overfitted, but that is not the case since you have put the test set aside.
The reason is a data leak. A random forest randomly excludes some features for every tree. Now suppose you have the label as one of the features: in some trees the label gets excluded and the accuracy is reduced, while in the decision tree the label is always among the features and predicts the result perfectly.
How can you find out whether this is the case?
Use visualization for the decision tree; if my guess is true you will find that there are only a few decision nodes. You can also visualize the correlation between the label and every feature and check whether there is any perfect correlation.
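A minimal sketch of those two checks, assuming the classifier from the question is available; df and 'target_column' are placeholder names for your full DataFrame and label column:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# 1) Inspect the tree: a leak often shows up as a tiny tree that splits almost
#    perfectly on a single feature.
plt.figure(figsize=(12, 6))
plot_tree(classifier, filled=True, max_depth=3)  # only the top levels for readability
plt.show()
# 2) Scan feature/label correlations: a value near 1.0 or -1.0 for some column
#    is a strong hint that the label leaked into the features.
print(df.corr(numeric_only=True)['target_column'].sort_values())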

Why is Scikit-learn RFECV returning very different features for the training dataset?

I have been experimenting with RFECV on the Boston dataset.
My understanding thus far is that, to prevent data leakage, it is important to perform activities such as this only on the training data and not on the whole dataset.
I performed RFECV on just the training data, and it indicated that 13 of the 14 features are optimal. However, I then ran the same process on the whole dataset, and this time around, it indicated that only 6 of the features are optimal - which seems more likely.
To illustrate:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
### CONSTANTS
TARGET_COLUMN = 'Price'
TEST_SIZE = 0.1
RANDOM_STATE = 0
### LOAD THE DATA AND ASSIGN TO X and y
data_dict = load_boston()
data = data_dict.data
features = list(data_dict.feature_names)
target = data_dict.target
df = pd.DataFrame(data=data, columns=features)
df[TARGET_COLUMN] = target
X = df[features]
y = df[TARGET_COLUMN]
### PERFORM TRAIN TEST SPLIT
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE,
random_state=RANDOM_STATE)
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
X_input = X_train
y_input = y_train
## All the data
# X_input = X
# y_input = y
### IMPLEMENT RFECV AND GET RESULTS
rfecv = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv.fit(X_input, y_input)
rfecv.transform(X_input)
print(f'Optimal number of features: {rfecv.n_features_}')
imp_feats = X.drop(X.columns[np.where(rfecv.support_ == False)[0]], axis=1)
print('Important features:', list(imp_feats.columns))
Running the above will result in:
Optimal number of features: 13
Important features: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
Now, if I change the code so that RFECV fits on all the data:
#### DETERMINE THE DATA THAT IS PASSED TO RFECV
## Just the Training data
# X_input = X_train # NOW COMMENTED OUT
# y_input = y_train # NOW COMMENTED OUT
## All the data
X_input = X # NOW UN-COMMENTED
y_input = y # NOW UN-COMMENTED
and run it, I get the following result:
Optimal number of features: 6
Important features: ['CHAS', 'NOX', 'RM', 'DIS', 'PTRATIO', 'LSTAT']
I don't understand why the results are so markedly different (and seemingly more accurate) for the whole dataset as opposed to just the training set.
I have tried making the training set close to the size of the whole data, by making the test_size extremely small (via my TEST_SIZE constant), but I still get this seemingly unlikely difference.
What am I missing?
It certainly seems like unexpected behavior, especially when, as you say, you can reduce the test size to 10% or even 5% and find a similar disparity, which seems very counter-intuitive.
The key to understanding what's going on here is to realize that for this particular dataset the values in each column are not randomly distributed across the rows (for example, try running X['CRIM'].plot()). The train_test_split function you're using to split the data has a parameter shuffle which defaults to True. So if you look at the X_train dataset you'll see that the index is jumbled up, whereas in X it is sequential. This means that when the cross-validation is performed under the hood by the RFECV class, it is getting a biased subset of data in each split of X, but a more representative/random subset of data in each split of X_train.
If you pass shuffle=False to train_test_split you'll see that the two results are much closer (or alternatively, and probably better, try shuffling the index of X).
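A minimal sketch of the second suggestion (shuffling the rows of the full dataset before fitting RFECV), assuming the X, y, RANDOM_STATE and RFECV setup from the question:
from sklearn.utils import shuffle
# Shuffle the rows so every internal CV fold sees a representative mix of rows
# rather than a contiguous, ordered block.
X_shuffled, y_shuffled = shuffle(X, y, random_state=RANDOM_STATE)
rfecv_shuffled = RFECV(estimator=LinearRegression(), step=1, scoring='neg_mean_squared_error')
rfecv_shuffled.fit(X_shuffled, y_shuffled)
print(f'Optimal number of features (shuffled full data): {rfecv_shuffled.n_features_}')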

reduction of model accuracy while using PCA for a regression problem

I am trying to build a prediction model to predict the fare of flights. My data set has several categorical variables like class, hour, day of week, day of month, month of year etc. I am using multiple algorithms like xgboost and an ANN to fit the model.
Initially I one-hot encoded these variables, which led to a total of 90 variables. When I tried to fit a model to this data, the training r2_score was high, around 0.90, but the test score was relatively very low (0.6).
I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this the training accuracy dropped to 0.83 but the test score increased to 0.70.
I was thinking that my variables are sparse and tried doing a PCA, but this drastically reduced the performance on both the train set and the test set.
So I have a few questions regarding the same.
Why is PCA not helping and in turn reducing the performance of my model so badly?
Any suggestions on how to improve my model performance?
code
from xgboost import XGBRegressor
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('Airline Dataset1.xlsx',sheet_name='Airline Dataset1')
dataset = dataset.drop(columns = ['SL. No.'])
dataset['time'] = dataset['time'] - 24
import numpy as np
dataset['time'] = np.where(dataset['time']==24,0,dataset['time'])
cat_cols = ['demand', 'from_ind', 'to_ind']
cyc_cols = ['time','weekday','month','monthday']
def cyclic_encode(data, col, col_max):
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / col_max)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / col_max)
    return data
cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)
dataset = dataset.drop(columns=cyc_cols)
ohe_dataset = pd.get_dummies(dataset,columns = cat_cols , drop_first=True)
X = ohe_dataset.iloc[:,:-1]
y = ohe_dataset.iloc[:,27:28]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
#Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train,y_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
regressor = XGBRegressor()
model = regressor.fit(X_train,y_train)
# Predicting the Test & Train set with regressor built
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))  # reshape to 2D: the scaler's inverse_transform expects a column
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)
Thanks
You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.
Here are a few things you can try:
Pass a watchlist and train only until you start overfitting on the validation set (a sketch follows this list). https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
Try the sine/cosine transformations and the one-hot encodings together and build a model from the combined features (again with a watchlist).
Look for more causal data. Seasonal patterns alone do not cause air fare fluctuations. For a start you can add flags for festivals, holidays and important dates, and do feature engineering for proximity to these days. Weather data is also easy to find and add.
PCA usually helps in cases where you have extreme dimensionality, like genome data, or where the algorithm involved does not do well with high-dimensional data, like kNN.
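A minimal sketch of the watchlist idea with the scikit-learn wrapper; note that where early_stopping_rounds is passed differs between xgboost versions (constructor in 2.x, fit() in earlier releases), so treat this as an assumption to adapt. X_train and y_train are the scaled arrays from the question:
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
# Carve a validation set out of the training data to act as the watchlist.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)
regressor = XGBRegressor(n_estimators=1000)
# Training stops once the validation error has not improved for 20 rounds.
regressor.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], early_stopping_rounds=20, verbose=False)
print('Best iteration:', regressor.best_iteration)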

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interested in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit 3 different VotingClassifiers on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually the #4 method shown in the article is implemented with the same type of classifier, but it looks like you want to try a VotingClassifier on each sampled dataset.
There is already an implementation of this methodology in imblearn.ensemble.BalancedBaggingClassifier, which is an extension of the sklearn bagging approach.
You can feed it the VotingClassifier as the base estimator and set the number of estimators to the number of times you want to sample the dataset. Use the sampling_strategy param to set the proportion of downsampling you want on the majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.ensemble import BalancedBaggingClassifier
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
    estimators=[
        ('rf', rnd_clf_1),
        ('xgb', xgb_clf_1),
    ],
    voting='soft'
)
bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train)
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. Maybe you can reuse the _predict_proba() and _collect_probas() functions after fitting your estimators manually.
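For the manual route mentioned in the question, a minimal sketch of averaging the soft-vote probabilities of the three separately fitted classifiers; voting_clf_1/2/3 and X_test are assumed from the question:
import numpy as np
def ensemble_predict_proba(classifiers, X):
    # Average predict_proba across classifiers fitted on different resampled datasets.
    probas = [clf.predict_proba(X) for clf in classifiers]
    return np.mean(probas, axis=0)
avg_proba = ensemble_predict_proba([voting_clf_1, voting_clf_2, voting_clf_3], X_test)
y_pred = np.argmax(avg_proba, axis=1)  # index of the class with the highest averaged probability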

Can we fit a model with the same dataset used during the cross-validation process?

I have the following method that performs Cross Validation on a dataset followed by a final model fit:
import numpy as np
import utilities.utils as utils
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.utils import shuffle
def CV(args, path):
    df = pd.read_csv(path + 'HIGGS.csv', sep=',')
    df = shuffle(df)
    df_labels = df[df.columns[0]]
    df_features = df.drop(df.columns[0], axis=1)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64, 64),
                        activation='logistic',
                        solver='adam',
                        learning_rate_init=1e-3,
                        max_iter=1000,
                        batch_size=1000,
                        learning_rate='adaptive',
                        early_stopping=True
                        )
    print('\t >>> Start Cross Validation ... ')
    scores = cross_val_score(estimator=clf, X=df_features, y=df_labels, cv=5, n_jobs=-1)
    print("CV Accuracy: %0.2f (+/- %0.2f)\n\n" % (scores.mean(), scores.std() * 2))
    # Final Fit
    print('\t >>> Start Final Fit ... ')
    num_to_read = (int(10999999) * (args.stages * np.dtype(np.float64).itemsize))
    C1 = utils.read_from_disk(path + 'HIGGS.dat', 0, num_to_read, args.stages)
    print(C1)
    print(C1.shape)
    r = C1[:, :1]
    C = np.delete(C1, 0, axis=1)
    tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
    clf.fit(tr_C, tr_r)
    prd_r = clf.predict(ts_C)
    test_acc = accuracy_score(ts_r, prd_r) * 100.
    return test_acc
I understand that cross validation is about evaluating how well your model performs on a given dataset. My questions are:
Is it logically correct to fit the model again with the same dataset I used during the cross-validation process?
During each CV fold, are the model parameters carried over to the next fold? For instance, in the case of a neural network, is the fitted model from fold=1 carried over to fold=2?
Does this process (I mean fitting the entire dataset as I did above) produce a model accuracy close to the average accuracy we get from cross validation?
Thank you
R1. When you perform CV you split your dataset into k sets, and each time you train on k-1 of the sets and test/validate on the remaining 1/k of the data (a different portion every time). CV is only an evaluation procedure, so refitting the model afterwards on the data you cross-validated, as you do, is the usual practice.
R2. Every time the MLP learns on a set (the k-1 smaller sets) the learning task starts again from scratch, so the fitted parameters are not carried over to the next fold. At the end, the reported error measure (MSE or whichever metric) is the mean of the errors over the k different scenarios.
R3. If the class distribution in the data is balanced, the results of k-fold CV and a traditional 70/30 split will give approximately the same generalization estimate. On the other hand, if the dataset is highly unbalanced, k-fold CV (k=10) will tend to give a better estimate of generalization than a traditional split (since the folds will represent all or most of the problem classes more effectively).
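A minimal sketch of that overall pattern (CV for the estimate, then one final fit, then a single evaluation on held-out data), reusing the clf, df_features and df_labels from the question:
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score
# Hold out a test set first so the final accuracy is measured on unseen data.
X_tr, X_te, y_tr, y_te = train_test_split(df_features, df_labels, test_size=0.2, random_state=0)
# 1) Cross-validation on the training portion only: each fold retrains from scratch
#    and the scores estimate generalization.
scores = cross_val_score(clf, X_tr, y_tr, cv=5, n_jobs=-1)
print("CV Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
# 2) One final fit on all of the training data, evaluated once on the test set.
clf.fit(X_tr, y_tr)
print("Test Accuracy:", accuracy_score(y_te, clf.predict(X_te)))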
