Python TimeSeriesSplit - time-series

I have the following question:
I have a time series. I have done my preprocessing, and now I have x, which contains multiple features, and y, which contains my output. I have split them into train and test sets: x_train, x_test, y_train, y_test.
I now want to do a regression and a grid search.
Since I have a time series, I can't use ordinary k-fold cross-validation, so I wanted to use TimeSeriesSplit.
But what exactly am I splitting? I thought I would split the training set into train and test/validate to train my model, validate/select my hyperparameter and then forecast using the test. Is this correct?
And how do I choose n_splits?
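I now have the following code:
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
pipe=Pipeline....
pipe.fit(x_train, y_train)
tss = TimeSeriesSplit(n_splits=5)
# TimeSeriesSplit is not callable; .split() yields the index arrays for each fold
for train_index, test_index in tss.split(x_train):
    print('train:', train_index, 'test:', test_index)
clf = GridSearchCV(pipe, param_grid, cv=tss)
clf.fit(x_train, y_train)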

According to the sklearn documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html
"Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate."
If you want to validate a time series model, the way to go is nested cross-validation. Some information about it is in the link below:
https://mlfromscratch.com/nested-cross-validation-python-code/
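To make the splitting concrete, here is a minimal sketch of the workflow described in the question: TimeSeriesSplit is applied only to the training portion for hyperparameter selection, and the chronologically later test portion is used once at the end. The synthetic data, the Ridge model, the pipeline step names and the alpha grid are assumptions made purely for illustration; they are not from the original post.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Chronological hold-out: the last 20% of the samples is the final test set.
split = int(len(x) * 0.8)
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# Hypothetical pipeline and grid; substitute your own.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

tss = TimeSeriesSplit(n_splits=5)
# Each fold validates on indices that come strictly after its training indices.
folds = list(tss.split(x_train))
for train_index, val_index in folds:
    print("train:", train_index[0], "-", train_index[-1],
          "validate:", val_index[0], "-", val_index[-1])

# GridSearchCV does the train/validate splitting internally via cv=tss.
clf = GridSearchCV(pipe, param_grid, cv=tss)
clf.fit(x_train, y_train)

# The held-out test set is touched only once, after model selection.
test_score = clf.score(x_test, y_test)
print(clf.best_params_, test_score)
```

As for n_splits, there does not seem to be a single correct value. More splits give more validation folds, but the earliest folds then train on very little data, so it is a trade-off that depends on the length of the series.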

Related

How to do Multi-step forecasting using XGBoost?

I am currently using XGBoost to predict future sales. My time series data is given at weekly intervals, but I am not sure how to do multi-step forecasting with XGBoost. I split my data set into train and test, and after training the model I use my test set to predict the sales. However, I only get predictions for the weeks where I already have actual values, not for the future weeks beyond the test set. Here is some code for clarification:
import xgboost as xgb
from sklearn.model_selection import train_test_split
# train-test split (shuffle=False keeps the weeks in chronological order)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    shuffle=False)
reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, nthread=24)
reg.fit(X_train, y_train)
# predicting
predictions_xgb = reg.predict(X_test)
Can I get some help on this?

How to compare baseline and GridSearchCV results fairly?

I am a bit confused about how to compare the best GridSearchCV model with a baseline.
For example, say we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
# the estimator, the parameter grid and the CV options are separate arguments of GridSearchCV
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have accuracy based on the validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare a GridSearchCV model with a model fitted without any cross-validation?
For the baseline, isn't it better to use the validation sample too (instead of the whole train sample)?
No, they aren't comparable.
Your baseline model was fitted on X_train, and you then use the fitted model to score that same X_train sample. This is like cheating: the model will look its best because it is evaluated on data it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data, since you have split the X_train sample.
On top of that, it is trained on even less data because of the 5 folds (it trains on only 4/5 of X_val per fold).
So your score for the grid search will most likely be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the models tried while searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
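Put together, the steps above might look like the following self-contained sketch. The synthetic dataset from make_classification, the max_iter setting and the small grid over C are assumptions for illustration only, since the original parameter grid is elided.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic data standing in for the real X, y (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One split, shared by both models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Baseline: default settings, fitted on the training set only.
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

# Grid search: cross-validation happens inside X_train.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
parameters = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid
best_model = GridSearchCV(LogisticRegression(max_iter=1000), parameters,
                          scoring="accuracy", cv=cv)
best_model.fit(X_train, y_train)

# Both models are scored on the same unseen test set.
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
grid_acc = accuracy_score(y_test, best_model.predict(X_test))
print("baseline:", baseline_acc, "grid search:", grid_acc)
```

By default GridSearchCV refits the best configuration on all of X_train, so both models end up trained on the same amount of data before they are compared.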

How to Build a Decision tree Regressor model

I am learning ML and was doing a simple hands-on exercise, as below:
//
Split boston.data into two sets named x_train and x_test. Also, split boston.target into two sets, y_train and y_test.
Build a Decision tree Regressor model from the x_train set, with default parameters.
//
I wrote the following code for this:
from sklearn import datasets, model_selection, tree
boston = datasets.load_boston()
x_train, x_test, y_train, y_test = model_selection.train_test_split(boston.data,boston.target, random_state=30)
dt = tree.DecisionTreeRegressor()
dt_reg = dt.fit(x_train)
When I am doing above, it's giving:
TypeError: fit() missing 1 required positional argument: 'y'
Can I fit a model for one training dataset?
What should I give here as 'y'?
As the error states, the fit() method takes two arguments for a regression problem, the predictors and the outcome:
dt_reg = dt.fit(x_train, y_train)
Supervised learning models such as the regression tree you are using require a set of observations composed of features (each row of x_train can be understood as a vector containing the features for one observation) and a target outcome (each element of the vector y_train).
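For a version that can be run end to end, here is a small sketch. Note that load_boston was removed in scikit-learn 1.2, so this example swaps in synthetic data from make_regression; that substitution, and the fixed random_state on the tree, are assumptions for illustration and not part of the original exercise.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for boston.data / boston.target (illustrative only).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=30)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=30)

dt = DecisionTreeRegressor(random_state=30)
# fit() needs both the features and the target; it returns the fitted estimator.
dt_reg = dt.fit(x_train, y_train)

train_r2 = dt_reg.score(x_train, y_train)
sample_pred = dt_reg.predict(x_test[:3])
print(train_r2, sample_pred)
```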

sklearn multiclass svm function

I have multi class labels and want to compute the accuracy of my model.
I am kind of confused on which sklearn function I need to use.
As far as I understood the below code is only used for the binary classification.
from sklearn.model_selection import train_test_split
# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear = SVC(kernel='linear', C=1).fit(X_train, y_train)
svm_predictions = svm_model_linear.predict(X_test)
# model accuracy for X_test
accuracy = svm_model_linear.score(X_test, y_test)
print(accuracy)
and as I understood from the link:
Which decision_function_shape for sklearn.svm.SVC when using OneVsRestClassifier?
for multiclass classification I should use OneVsRestClassifier with decision_function_shape (with ovr or ovo, and check which one works better):
from sklearn.multiclass import OneVsRestClassifier
svm_model_linear = OneVsRestClassifier(SVC(kernel='linear', C=1, decision_function_shape='ovr')).fit(X_train, y_train)
The main problem is that prediction time matters to me, but it takes about 1 minute to run the classifier and predict the data (and this comes on top of the feature reduction, such as PCA, which also takes some time). Any suggestions for reducing the time of a multi-class SVM classifier?
There are multiple things to consider here:
1) OneVsRestClassifier separates the labels and trains multiple SVM objects (one for each label) on the given data. So each individual SVM object only ever receives a binary problem.
2) SVC internally uses libsvm, which applies a one-vs-one ('ovo') strategy to multi-class problems. But this point is of no use here because of point 1: libsvm will only get binary data.
Even if it did receive multi-class data, the training would not take decision_function_shape into account; as far as I know, that parameter only changes the shape of the decision function output. So it does not matter whether you provide decision_function_shape = 'ovr' or decision_function_shape = 'ovo'.
So it seems that you are looking at the problem the wrong way. decision_function_shape should not affect the speed. Try standardizing your data before fitting; SVMs work well with standardized data.
When wrapping models with the ovr or ovo classifiers, you could set the n_jobs parameter to make them run faster, e.g. sklearn.multiclass.OneVsOneClassifier(estimator, n_jobs=-1) or sklearn.multiclass.OneVsRestClassifier(estimator, n_jobs=-1).
Although each single SVM classifier in sklearn can only use one CPU core at a time, the multi-class wrapper can fit several models at the same time when n_jobs is set.
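The suggestions above can be sketched as follows. The iris dataset and the use of make_pipeline are assumptions chosen only to give a small runnable multi-class example; timings on real data will differ.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # three classes (illustrative dataset)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# SVC accepts multi-class labels directly; no wrapper is required.
svm_plain = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))
svm_plain.fit(X_train, y_train)
plain_acc = svm_plain.score(X_test, y_test)

# One-vs-rest wrapper: one binary SVC per class, fitted in parallel via n_jobs.
svm_ovr = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(SVC(kernel="linear", C=1), n_jobs=-1),
)
svm_ovr.fit(X_train, y_train)
ovr_acc = svm_ovr.score(X_test, y_test)

print("plain SVC:", plain_acc, "one-vs-rest:", ovr_acc)
```

Both variants use score(), which reports plain accuracy and works the same way for binary and multi-class labels.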

How to predict on a single data sample when preprocessing is needed

When I read scikit-learn examples, a typical machine learning flow is preprocessing --> learning --> predicting, as in the code snippet shown below:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
steps = [('scalar', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Here, both the training and the testing datasets are scaled before being passed to the classifier. But in my task, I am going to predict on a single data sample. After training my model, I will get data from a stream. So each time a single new sample is received, I need to use the classifier to predict on it and proceed with my task using the predicted value.
So with only one example available each time, how do I preprocess it before predicting? Scaling this single example on its own seems to make no sense. How should I deal with this issue?
Just as you train your classifier and use the fitted model to predict individual records, the preprocessing step produces a fitted preprocessing model as well. Let's say your input is Xi and you have already fitted the preprocessing and classifier models (scaler and clf respectively):
Xi_new=scaler.transform(Xi)
print(clf.predict(Xi_new))
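A runnable sketch of this idea is below. The iris data and the step names "scaler" and "knn" are assumptions for illustration. One detail worth noting: scikit-learn estimators expect a 2-D array, so a single sample has to be reshaped to one row before transform or predict.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # illustrative dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])
pipeline.fit(X_train, y_train)

# A single incoming sample: 1-D, shape (n_features,).
Xi = X_test[0]
Xi_2d = Xi.reshape(1, -1)  # estimators expect shape (n_samples, n_features)

# Option 1: the pipeline reuses the scaling statistics learned from X_train.
pred_pipeline = pipeline.predict(Xi_2d)

# Option 2: the same thing done by hand with the fitted steps.
scaler = pipeline.named_steps["scaler"]
clf = pipeline.named_steps["knn"]
pred_manual = clf.predict(scaler.transform(Xi_2d))

print(pred_pipeline, pred_manual)
```

The scaler is never refitted on the new sample; it only applies the mean and standard deviation it learned from the training data.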
