How to use SlidingWindowSplitter in sktime - time-series

I need to fit an ARIMA model from the sktime package. I want to use SlidingWindowSplitter from sktime.forecasting.model_selection, but I don't really understand how it works.
If I wanted to fit a simple ARIMA I would do this:
...
model = ARIMA(order = (p, d, q)).fit(y_train)
y_pred, y_conf = model.predict(fh, return_pred_int=True)
But how does that work with the SlidingWindowSplitter?

This should work:
from sktime.forecasting.all import *
from sktime.forecasting.model_evaluation import evaluate
y = load_airline()
forecaster = ARIMA()
cv = SlidingWindowSplitter()
out = evaluate(forecaster, cv, y)
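If it helps to see the splitter's knobs spelled out, here is a sketch with explicit (made-up, not default) parameter values, assuming the import paths from your question:

import numpy as np
from sktime.datasets import load_airline
from sktime.forecasting.arima import ARIMA
from sktime.forecasting.model_selection import SlidingWindowSplitter
from sktime.forecasting.model_evaluation import evaluate

y = load_airline()

# Train on a 36-observation window, forecast the next 12 points,
# then slide the window forward by 12 observations and repeat.
cv = SlidingWindowSplitter(fh=np.arange(1, 13), window_length=36, step_length=12)

forecaster = ARIMA(order=(1, 1, 1))

# evaluate() refits the forecaster on every training window and scores it
# on the corresponding test window; one row of the result per window.
results = evaluate(forecaster, cv, y, strategy="refit", return_data=True)
print(results.head())

If you prefer to call fit/predict yourself as in your ARIMA snippet, you can also loop over cv.split(y), which yields the train and test index arrays for each window.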

When predicting on a new dataset, should I use scaler.fit_transform(new_dataset) or scaler.transform(new_dataset)?

from sklearn.preprocessing import PolynomialFeatures, StandardScaler

final_poly_converter = PolynomialFeatures(degree=3, include_bias=False)
final_poly_features = final_poly_converter.fit_transform(X)
final_scaler = StandardScaler()
scaled_X = final_scaler.fit_transform(final_poly_features)

from sklearn.linear_model import Lasso
final_model = Lasso(alpha=0.004943070909225827, max_iter=1000000)
final_model.fit(scaled_X, y)

from joblib import dump, load
dump(final_model, 'lasso_model.joblib')
dump(final_poly_converter, 'lasso_poly_converter.joblib')
dump(final_scaler, 'scaler.joblib')

loaded_converter = load('lasso_poly_converter.joblib')
loaded_model = load('lasso_model.joblib')
loaded_scaler = load('scaler.joblib')

campaign = [[149, 22, 12]]
transformed_data = loaded_converter.fit_transform(campaign)
scaled_data = loaded_scaler.transform(transformed_data)  # fit_transform or only transform?
loaded_model.predict(scaled_data)
The output values change depending on whether I use fit_transform() or transform().
You should always use fit_transform on your training data and transform on test data and any further predictions. If you refit the scaler on the test set, the features in your test set would follow a different distribution than in your train set, which is something you don't want. Think of the scaler parameters learned during fit as part of the model parameters: you fit all parameters on the training set and then leave them unchanged for test evaluation and prediction.
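Applied to your snippet, that means only the call on the new campaign data changes; the loaded objects are reused with transform() so the parameters learned on the training data stay fixed:

campaign = [[149, 22, 12]]
# New, unseen data: reuse the fitted converter and scaler, never refit them here
transformed_data = loaded_converter.transform(campaign)
scaled_data = loaded_scaler.transform(transformed_data)
prediction = loaded_model.predict(scaled_data)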

Getting the column names chosen after a feature selection method

Given a simple feature selection code below, I want to know the selected columns after the feature selection (The dataset includes a header V1 ... V20)
import pandas as pd
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression

def feature_selection(data):
    y = data['Class']
    X = data.drop(['Class'], axis=1)
    fs = SelectKBest(score_func=f_regression, k=10)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # TODO: determine the columns being selected
    return X_selected

data = pd.read_csv("../dataset.csv")
new_data = feature_selection(data)
I appreciate any help.
I have used the iris dataset for my example, but you can probably easily modify the code to match your use case.
The SelectKBest method has a scores_ attribute, which I used to sort the features.
Feel free to ask for any clarifications.
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_regression
from sklearn.datasets import load_iris

def feature_selection(data):
    y = data[1]
    X = data[0]
    column_names = ["A", "B", "C", "D"]  # Here you should use your dataframe's column names
    k = 2
    fs = SelectKBest(score_func=f_regression, k=k)
    # Applying feature selection
    X_selected = fs.fit_transform(X, y)
    # Find top features
    # I create a list like [[ColumnName1, Score1], [ColumnName2, Score2], ...]
    # Then I sort in descending order on the score
    top_features = sorted(zip(column_names, fs.scores_), key=lambda x: x[1], reverse=True)
    print(top_features[:k])
    return X_selected

data = load_iris(return_X_y=True)
new_data = feature_selection(data)
I don't know of a built-in method, but it can easily be coded.
n_columns_selected = X_new.shape[1]
new_columns = [col for score, col in sorted(zip(fs.scores_, X.columns))[-n_columns_selected:]]
# new_columns order is perturbed, so we restore it using the column order of X as a reference
new_columns = sorted(new_columns, key=lambda x: list(X.columns).index(x))
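For completeness, SelectKBest also exposes get_support(), a boolean mask over the input features; when X is a DataFrame the selected names can be read off directly (a small sketch reusing fs and X from the snippets above):

# Boolean mask marking the k columns kept by SelectKBest
mask = fs.get_support()
selected_columns = X.columns[mask]  # keeps the original column order
print(list(selected_columns))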

How to use plot_importance function with MultiOutputRegressor?

I use the code below to get multi-output predictions, but I get an error when I try to plot the feature importance:
"ValueError: tree must be Booster, XGBModel or dict instance"
How can I fix this problem? Or is there any other way to get the feature importance?
import numpy as np
import xgboost as xgb
from xgboost import plot_importance
from sklearn.multioutput import MultiOutputRegressor
import matplotlib.pyplot as plt

X = np.array([[0, 1, 2, 3, 4], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]])
y = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6]])

model_ = MultiOutputRegressor(xgb.XGBRegressor(objective='reg:linear', n_jobs=-1))
model_.fit(X, y)
pred = model_.predict(X)

fig, ax = plt.subplots(figsize=(15, 15))
plot_importance(model_, height=0.5, ax=ax, max_num_features=3)  # raises the ValueError
plt.show()
I found the solution. MultiOutputRegressor fits one separate XGBRegressor per target, so plot_importance has to be called on each fitted estimator in estimators_ rather than on the wrapper itself:
fig, ax = plt.subplots(ncols=3, figsize=(15, 6))
plot_importance(model_.estimators_[0], height=0.5, ax=ax[0], max_num_features=20)
plot_importance(model_.estimators_[1], height=0.5, ax=ax[1], max_num_features=20)
plot_importance(model_.estimators_[2], height=0.5, ax=ax[2], max_num_features=20)
plt.show()
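If you only need the numbers rather than the plots, each per-target XGBRegressor inside the wrapper also exposes feature_importances_ (a small sketch reusing model_ from the question):

# One importance array per target column of y
for i, est in enumerate(model_.estimators_):
    print("target %d:" % i, est.feature_importances_)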

Find the best pipeline model using CrossValidator and ParamGridBuilder

I have an acceptable model, but I would like to improve it by tuning its parameters in a Spark ML Pipeline with CrossValidator and ParamGridBuilder.
As the Estimator I will use the existing pipeline.
I do not know what to put in the estimatorParamMaps; I do not understand that part.
As the Evaluator I will use the RegressionEvaluator created previously.
I am going to use 5 folds, with a list of 10 different depth values for the tree.
How can I select and show the best model for the lowest RMSE?
Current code:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

dt = DecisionTreeRegressor()
dt.setPredictionCol("Predicted_PE")
dt.setMaxBins(100)
dt.setFeaturesCol("features")
dt.setLabelCol("PE")
dt.setMaxDepth(8)

pipeline = Pipeline(stages=[vectorizer, dt])
model = pipeline.fit(trainingSetDF)

predictions = model.transform(testSetDF)  # testSetDF: held-out test DataFrame (not shown in the question)
regEval = RegressionEvaluator(predictionCol="Predicted_PE", labelCol="PE", metricName="rmse")
rmse = regEval.evaluate(predictions)
print("Root Mean Squared Error: %.2f" % rmse)
Output: Root Mean Squared Error: 3.60
What I need:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dt2 = DecisionTreeRegressor()
dt2.setPredictionCol("Predicted_PE")
dt2.setMaxBins(100)
dt2.setFeaturesCol("features")
dt2.setLabelCol("PE")
dt2.setMaxDepth(10)
pipeline2 = Pipeline(stages=[vectorizer, dt2])
model2 = pipeline2.fit(trainingSetDF)
regEval2 = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")
paramGrid = ParamGridBuilder().build() # ??????
crossval = CrossValidator(estimator = pipeline2, estimatorParamMaps = paramGrid, evaluator=regEval2, numFolds = 5) # ?????
rmse2 = regEval2.evaluate(predictions)
#bestPipeline = ????
#bestLRModel = ????
#bestParams = ????
print("Root Mean Squared Error: %.2f" % rmse2)
Output: Root Mean Squared Error: 3.60  # the same value?
You need to call .fit() on the crossval object with your training data to create the CV model; that runs the cross-validation. Then you get the best model (according to your evaluator metric) from it. Also note that in your snippet rmse2 is computed from the same predictions DataFrame as before, not from the cross-validated model, which is why you see the identical RMSE. E.g.
cvModel = crossval.fit(trainingData)
myBestModel = cvModel.bestModel
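For the estimatorParamMaps part, here is a sketch of what the grid and the best-model extraction could look like, reusing pipeline2, dt2 and regEval2 from your snippet (testSetDF stands in for your held-out test DataFrame):

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Ten candidate tree depths; CrossValidator tries each one with 5-fold CV
paramGrid = (ParamGridBuilder()
             .addGrid(dt2.maxDepth, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
             .build())

crossval = CrossValidator(estimator=pipeline2,
                          estimatorParamMaps=paramGrid,
                          evaluator=regEval2,
                          numFolds=5)

cvModel = crossval.fit(trainingSetDF)

# Average RMSE per candidate depth, in the same order as paramGrid
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)

bestPipeline = cvModel.bestModel          # PipelineModel refit with the winning params
bestTreeModel = bestPipeline.stages[-1]   # the fitted DecisionTreeRegressionModel

rmse_best = regEval2.evaluate(bestPipeline.transform(testSetDF))
print("Best Root Mean Squared Error: %.2f" % rmse_best)

Because RegressionEvaluator with metricName="rmse" is a lower-is-better metric, CrossValidator automatically selects the parameter map with the lowest average RMSE across the folds.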

What's the difference between optimizer.compute_gradients() and tf.gradients() in TensorFlow?

The following code I've written fails at self.optimizer.compute_gradients(self.output, all_variables):
import tensorflow as tf
import tensorlayer as tl
from tensorflow.python.framework import ops
import numpy as np

class Network1():
    def __init__(self):
        ops.reset_default_graph()
        tl.layers.clear_layers_name()
        self.sess = tf.Session()
        self.optimizer = tf.train.AdamOptimizer(learning_rate=0.1)

        self.input_x = tf.placeholder(tf.float32, shape=[None, 784], name="input")
        input_layer = tl.layers.InputLayer(self.input_x)
        relu1 = tl.layers.DenseLayer(input_layer, n_units=800, act=tf.nn.relu, name="relu1")
        relu2 = tl.layers.DenseLayer(relu1, n_units=500, act=tf.nn.relu, name="relu2")
        self.output = relu2.all_layers[-1]
        all_variables = relu2.all_layers

        self.gradient = self.optimizer.compute_gradients(self.output, all_variables)

        init_op = tf.initialize_all_variables()
        self.sess.run(init_op)
with the error:
TypeError: Argument is not a tf.Variable: Tensor("relu1/Relu:0",
shape=(?, 800), dtype=float32)
However, when I change that line to tf.gradients(self.output, all_variables), the code works fine, or at least no error is reported. What did I miss? I thought these two methods were doing the same thing, that is, returning a list of (gradient, variable) pairs.
optimizer.compute_gradients() wraps tf.gradients(), as you can see here. It performs additional assertions (which explains your error).
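Concretely, the assert trips because relu2.all_layers holds the layers' output tensors (the error even names relu1/Relu:0), not tf.Variable objects. A hypothetical fix for the snippet above, assuming TensorLayer's usual all_params attribute exposes the network's weight variables:

# Differentiate w.r.t. the network parameters (tf.Variables),
# not the layer output tensors stored in all_layers
all_variables = relu2.all_params
self.gradient = self.optimizer.compute_gradients(self.output, var_list=all_variables)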
I would like to add a simple point to the above answer. optimizer.compute_gradients returns a list of (gradient, variable) tuples. The variables are always there, but a gradient can be None; that makes sense, because the gradient of the loss with respect to a variable in var_list is None when the loss does not depend on it.
On the other hand, tf.gradients only returns a list of sum(dy/dx) values, one per variable; you must pair it with the variable list yourself before applying the gradient update.
Hence, the following two approaches can be used interchangeably:
### Approach 1 ###
variable_list = desired_list_of_variables
gradients = optimizer.compute_gradients(loss, var_list=variable_list)
optimizer.apply_gradients(gradients)

### Approach 2 ###
variable_list = desired_list_of_variables
gradients = tf.gradients(loss, variable_list)  # tf.gradients takes the variables as the second positional argument
optimizer.apply_gradients(zip(gradients, variable_list))
