Prediction in machine learning - machine-learning

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
clf.fit(X_train, y_train)
y_true, y_pred = y_test, clf.predict(X_test)
acc = accuracy_score(y_pred, y_test)
The above 4 lines of code are from a model I came across. Can somebody tell me the difference between y_test and y_pred?

y_test holds the actual labels of the held-out test set; it is what you compare against to evaluate the final model's performance. y_pred holds the labels that the fitted model predicts for X_test. The more closely y_pred matches y_test, the better the model performs on unseen data.
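To make the distinction concrete, here is a minimal sketch. The question does not say what X, y, or clf are, so the synthetic dataset from make_classification and the LogisticRegression classifier are assumptions chosen only so that the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data and LogisticRegression are stand-ins (assumptions);
# the original question does not say what X, y or clf are.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
clf.fit(X_train, y_train)

# y_test: the true labels that were held out and never seen during fitting.
# y_pred: the labels the fitted model predicts for the same rows (X_test).
y_pred = clf.predict(X_test)

# Accuracy is the fraction of positions where the two arrays agree.
acc = accuracy_score(y_test, y_pred)
manual_acc = (y_test == y_pred).mean()
print(y_test[:10])
print(y_pred[:10])
print(acc, manual_acc)
```

The two arrays have the same length, one entry per row of X_test, and the accuracy is simply the share of entries on which they agree.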

Related

Troubles with Cross-Validation

I am having trouble implementing cross-validation. I understand that after cross-validation I have to re-train the model, but I have the following doubts:
1. Do the train/test split before cross-validation, use X_train and y_train for the cross-validation process, and then re-train the model with X_train and y_train.
2. Split the data into features (X) and labels (y), use these variables in the cross-validation process, and then do the train/test split and train the model with X_train and y_train.
If I use the features and labels variables, what is the next step after cross-validation?
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()
# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']
from sklearn.model_selection import train_test_split
test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size,
random_state=seed)
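from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
# First approach: cross-validate on the training split, then fit on it
# (shuffle=True is needed when random_state is set on KFold)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, Y_train, cv=kfold)
model.fit(X_train, Y_train)
# Second approach: cross-validate on all the data, then split and fit
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, features, labels, cv=kfold)
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=0.2,
random_state=42)
model.fit(X_train, Y_train)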
Which of the two code blocks is correct or is there another way to implement cross-validation correctly?
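One common arrangement, offered as a hedged sketch rather than a definitive answer, is close to the first block: hold out a test set first, cross-validate on the training portion only, refit on that whole portion, and score on the test set once at the end. The synthetic dataset below is an assumption that stands in for the diabetes CSV so the snippet runs anywhere, and max_iter=1000 is likewise an illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the diabetes CSV (assumption, so the sketch runs anywhere).
features, labels = make_classification(n_samples=300, n_features=8, random_state=12)

# 1. Hold out a test set before any model selection happens.
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=0.33, random_state=12
)

# 2. Cross-validate on the training portion only.
#    shuffle=True is required when random_state is set on KFold.
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, Y_train, cv=kfold)
print("CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# 3. cross_val_score fits clones, so `model` itself is still unfitted here:
#    refit it on the whole training portion.
model.fit(X_train, Y_train)

# 4. Use the test set once, at the end.
test_acc = model.score(X_test, Y_test)
print("Test accuracy: %.3f" % test_acc)
```

Keeping the test rows out of the cross-validation step means the final test score is not influenced by any choices made while comparing models.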

How to do model evaluation with train, validation, test?

Generally if one dataset is given we use
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test,y_pred))
print(accuracy_score(y_test,y_pred))
print(classification_report(y_test,y_pred))
If we are doing validation on the training dataset
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=0)
If both Train and Test datasets are given in separate datasets, where do I use Test dataset in the code?
Normally I will split the available data 2 times like this:
#first split: 80% (green+blue) and 20% (orange)
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
#second split: 75% (green) and 25% (blue)
X_train, X_val, y_train, y_val = train_test_split(X_model, y_model, test_size=0.25, random_state=0)
Note that the first split uses an 80:20 ratio and the second a 75:25 ratio, which gives a 60:20:20 ratio over the whole dataset. If you are already given separate datasets for model training/validation and for model testing, you can skip the first split.
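The ratio arithmetic can be checked directly: 25% of the remaining 80% is 20% of the whole, which leaves 60% for training. The sketch below counts the rows in each part; the 1000 dummy rows are an assumption made only for this check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 dummy rows (assumption) just to count how many land in each part.
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)

# First split: 80% for modelling, 20% held out for the final test.
X_model, X_test, y_model, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Second split: 25% of the remaining 80% is 20% of the whole dataset.
X_train, X_val, y_train, y_val = train_test_split(
    X_model, y_model, test_size=0.25, random_state=0
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)
```

With 1000 rows this gives 600 training, 200 validation, and 200 test rows.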
After this you can proceed to train and evaluate the model (using _train and _val):
lr = some_model()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_val)
print(confusion_matrix(y_val, y_pred))
print(accuracy_score(y_val, y_pred)) #accuracy score (VALIDATION)
print(classification_report(y_val, y_pred))
This is repeated with various models and hyperparameter settings until you find the best-performing combination. It is highly recommended to use cross-validation here to identify a better and more robust model.
After the winning model is found, test it on the holdout data (using _test):
y_pred = lr.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred)) #accuracy score (TEST)
print(classification_report(y_test, y_pred))
Now you can compare the two accuracy scores (VALIDATION and TEST) to see whether the model is overfitting or underfitting.

How to do Multi-step forecasting using XGBoost?

I am currently using XGBoost to predict future sales. My time series data is recorded at weekly intervals, but I am not sure how to do multi-step forecasting with XGBoost. I split my dataset into train and test sets, and after training the model I use the test set to predict the sales. However, I only get predictions for the weeks I already have actual values for, not for the future weeks beyond the test set. Here is some code for clarification:
# train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3,
random_state=0,
shuffle=False)
reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, nthread=24)
reg.fit(X_train, y_train)
# predicting
predictions_xgb = reg.predict(X_test)
Can I get some help on this?

sklearn cross valid / cross predict

I understand that cross_val_predict / cross_val trains n out-of-fold models and then aggregates them to produce the final prediction. This is done in the training phase. Now I want to use the fitted models to predict the test data. I can use a for loop to collect predictions on the test data and aggregate them, but first I want to ask whether there is a built-in sklearn method for this.
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, train_test_split
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lasso = linear_model.Lasso()
y_train_hat = cross_val_predict(lasso, X_train, y_train, cv=3)
y_test_hat = do_something(lasso, X_test)  # placeholder: this is the step I am asking about
Thanks
The 3 models from your cross_val_predict are not saved anywhere, so you can't make predictions with them. Instead, you can use cross_validate with return_estimator=True. You will still be left with three models that you have to use manually to make and aggregate predictions. (You could in principle put those models into an ensemble model like VotingClassifier, but at least for now there is no prefit argument to prevent refitting your estimators. There is some discussion in Issue 7382 and links from there.)
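The manual route described in the answer above could look roughly like this. It is a sketch under stated assumptions: averaging the three fold models' predictions is one possible aggregation, not something sklearn prescribes, and random_state=0 is added only to make the run repeatable.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate, train_test_split

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

lasso = linear_model.Lasso()
# return_estimator=True keeps the model that was fitted on each fold.
cv_results = cross_validate(lasso, X_train, y_train, cv=3, return_estimator=True)
fold_models = cv_results["estimator"]

# Predict the test set with every fold model, then average the predictions.
# Plain averaging is one possible aggregation (assumption), not the only one.
fold_preds = np.stack([m.predict(X_test) for m in fold_models])
y_test_hat = fold_preds.mean(axis=0)
print(fold_preds.shape, y_test_hat.shape)
```

Each row of fold_preds holds one fold model's predictions for X_test, so other aggregations (a median, or a weighted mean based on cv_results["test_score"]) can be swapped in at the last step.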

MLPRegressor gives very negative scores

I'm kind of new to machine learning and I am using MLPRegressor. I split my data with
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Then I create and fit the model, using 10-fold cross-validation on the test set.
nn = MLPRegressor(hidden_layer_sizes=(100, 100), activation='relu',
solver='lbfgs', max_iter=500)
nn.fit(X_train, y_train)
TrainScore = nn.score(X_train, y_train)
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
print("Cross-validation scores:\t{} ".format(cross_val_score(nn, X_test, y_test, cv=kfold)))
av_corss_val_score = np.mean(cross_val_score(nn, X_test, y_test, cv=kfold))
print("The average cross validation score is: {}".format(av_corss_val_score))
The problem is that the test scores I receive are very negative (-4256). What could possibly be wrong?
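To keep the interface consistent, sklearn scorers are defined so that a higher value is always better, whether the metric is classification accuracy or a regression error such as MSE (which is exposed in negated form as neg_mean_squared_error). A more positive number is therefore good and a more negative number is bad, so a less negative score is preferred. Note that when no scoring argument is given, as here, cross_val_score uses the estimator's own score method, which for a regressor is R^2 rather than MSE; R^2 is negative whenever the model does worse than always predicting the mean.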
Moving on to why it may be so negative in your case, it could be broadly due to two things: overfitting or underfitting. There are tonnes of resources out there to help you from this point forward.
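To illustrate the sign convention, the sketch below shows that R^2 is 0 for a predictor that always outputs the mean, goes strongly negative for predictions far worse than that, and that the neg_mean_squared_error scorer never returns a positive value. The toy arrays, the synthetic regression data, and the use of LinearRegression in place of MLPRegressor are all assumptions made to keep the example small and fast.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# R^2 compares the model with always predicting the mean of y_true.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_r2 = r2_score(y_true, np.full(4, y_true.mean()))          # mean predictor
bad_r2 = r2_score(y_true, np.array([40.0, 30.0, 20.0, 10.0]))  # far worse than the mean
print(mean_r2, bad_r2)

# Synthetic regression data (assumption) to show the sign convention of scorers.
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
# With no scoring argument, a regressor is scored with R^2.
r2_scores = cross_val_score(LinearRegression(), X, y, cv=kfold)
# MSE is exposed negated so that "higher is better" still holds.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=kfold, scoring="neg_mean_squared_error"
)
print(r2_scores.mean(), neg_mse.mean())
```

A very negative default score for a regressor therefore means the predictions are much worse than a constant baseline, which often points to unscaled inputs or too little data per fold rather than to the metric itself.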

Resources