How to extend and generate future forecast using SARIMA - machine-learning

I am fairly new to this domain. I need to predict the next 3 months forecast.
The current code is -
train = daily_country_cases[:int(0.75*(len(daily_country_cases)))]
test = daily_country_cases[int(0.75*(len(daily_country_cases))):]
#plotting the data
train['Total_cases'].plot()
test['Total_cases'].plot()
plt.show()
[enter image description here][1]
model=sm.tsa.statespace.SARIMAX(daily_country_cases['Total_cases'],order=(7, 1,
5),seasonal_order=(7,1,5,30))
sarima_model=model.fit()
sarima_model.summary()
daily_country_cases['forecast']=sarima_model.predict(start=start_index, end=end_index
,dynamic=True)
daily_country_cases[['Total_cases','forecast']].plot(figsize=(10,6))
plt.show()
How can I predict the next 2 months?

Related

Straight line forecasat with Statsmodels ARIMA

I am new to statsmodels ARIMA
What is the best approach to do ARIMA on such a dataset?
The goal is to forecast the Value of the different types of gas.
I have run Augmented Dickey-Fuller test and have concluded that data is stationary.
How do I get a more accurate forecast?
Date
T
RH
Gas
Value
6/2/2017
6.62
51.73
CO
845.23
6/2/2017
6.62
51.73
HC
626.34
#Initialising ARIMA model
from statsmodels.tsa.arima_model import ARIMA
arima_model = ARIMA(scaled_df.Value, order=(2,0,1)).fit()
arima_model.summary()
start = len(df)
end = len(df) + len(test) -1
test['Date'] = pd.to_datetime(test['Date'],format='%d/%m/%Y')
test.set_index('Date', inplace=True)
pred = arima_model.predict(start=start, end=end,typ='levels')
i think it might be due to your training data being too large, try splitting it into smaller chunks

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and have originally 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = _train) %>%
step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
step_impute_knn(Gr_Liv_Area) %>%
step_integer(Overall_Qual) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(Neighborhood, threshold = 0.01, other = "other") %>%
step_dummy(all_nominal_predictors(), one_hot = FALSE)
prepare <- prep(blueprint, data = ames_train)
baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
splitrule = "variance",
min.node.size = 2)
rf_cv <- train(blueprint,
data = ames_train,
method = "ranger",
trControl = cv_specs,
tuneGrid = param_grid_rf,
metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says on section 3.8.3,
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error,
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid'. Although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

LSTM sequence prediction overfits on one specific value only

hello guys i am new in machine learning. I am implementing federated learning on with LSTM to predict the next label in a sequence. my sequence looks like this [2,3,5,1,4,2,5,7]. for example, the intention is predict the 7 in this sequence. So I tried a simple federated learning with keras. I used this approach for another model(Not LSTM) and it worked for me, but here it always overfits on 2. it always predict 2 for any input. I made the input data so balance, means there are almost equal number for each label in last index (here is 7).I tested this data on simple deep learning and greatly works. so it seems to me this data mybe is not suitable for LSTM or any other issue. Please help me. This is my Code for my federated learning. Please let me know if more information is needed, I really need it. Thanks
def get_lstm(units):
"""LSTM(Long Short-Term Memory)
Build LSTM Model.
# Arguments
units: List(int), number of input, output and hidden units.
# Returns
model: Model, nn model.
"""
model = Sequential()
inp = layers.Input((units[0],1))
x = layers.LSTM(units[1], return_sequences=True)(inp)
x = layers.LSTM(units[2])(x)
x = layers.Dropout(0.2)(x)
out = layers.Dense(units[3], activation='softmax')(x)
model = Model(inp, out)
optimizer = keras.optimizers.Adam(lr=0.01)
seqLen=8 -1;
global_model = Mymodel.get_lstm([seqLen, 64, 64, 15]) # 14 categories we have , array start from 0 but never can predict zero class
global_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
def main(argv):
for comm_round in range(comms_round):
print("round_%d" %( comm_round))
scaled_local_weight_list = list()
global_weights = global_model.get_weights()
np.random.shuffle(train)
temp_data = train[:]
# data divided among ten users and shuffled
for user in range(10):
user_data = temp_data[user * userDataSize: (user+1)*userDataSize]
X_train = user_data[:, 0:seqLen]
X_train = np.asarray(X_train).astype(np.float32)
Y_train = user_data[:, seqLen]
Y_train = np.asarray(Y_train).astype(np.float32)
local_model = Mymodel.get_lstm([seqLen, 64, 64, 15])
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
local_model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=tf.keras.metrics.SparseTopKCategoricalAccuracy(k=1))
local_model.set_weights(global_weights)
local_model.fit(X_train, Y_train)
scaling_factor = 1 / 10 # 10 is number of users
scaled_weights = scale_model_weights(local_model.get_weights(), scaling_factor)
scaled_local_weight_list.append(scaled_weights)
K.clear_session()
average_weights = sum_scaled_weights(scaled_local_weight_list)
global_model.set_weights(average_weights)
predictions=global_model.predict(X_test)
for i in range(len(X_test)):
print('%d,%d' % ((np.argmax(predictions[i])), Y_test[i]),file=f2 )
I could find some reasons for my problem, so I thought I can share it with you:
1- the proportion of different items in sequences are not balanced. I mean for example I have 1000 of "2" and 100 of other numbers, so after a few rounds the model fitted on 2 because there are much more data for specific numbers.
2- I changed my sequences as there are not any two items in a sequence while both have same value. so I could remove some repetitive data from the sequences and make them more balance. maybe it is not the whole presentation of activities but in my case it makes sense.

LSTM for time series

I'm trying to build a predictive model using an LSTM; briefly to predict "target", using the series "data". After a few epochs, it correctly fits the data. As a beginner, what kind of dataset(dimensions) do I have to feed the network with, in order to give me further predictions?
My aim is to predict +1 step, given the whole past time series, how can I implement it?
Ultimately, what does the second 1 stand for, when I reshape "data", and "target"?
print(df.head)
0 7 20
1 0 0
2 0 0
3 0 0
4 0 0
...
data = df[:,0]
target = df[:, 1 ]
data = data.reshape((1, 1, int(len(df))))
target = target.reshape((1 ,1, int(len(df))))
#We Define The Model
import random
random.seed(1)
model = Sequential()
model.add(LSTM(int(len(df)), input_shape=(1, int(len(df))), return_sequences=True))
model.add(Dense(int(len(df))))
model.add(LSTM(int(len(df)), input_shape=(1, int(len(df))), return_sequences=True))
model.add(Dense(int(len(df))))
model.compile(loss='mean_absolute_error', optimizer='Adam',metrics=['accuracy'])
model.fit(data, target, nb_epoch=100, batch_size=1, verbose=2, validation_data = (data, target))
predict = model.predict(data)
I think you are using Keras to develop your LSTM model. I will suggest you look at the following link (below) to understand more carefully why one need to reshape the data. In the example, the dimensions of your dataframe are same as the tutorial is using.
https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

How to create feature columns for TensorFlow classifier

I have a very simple dataset for binary classification in csv file which looks like this:
"feature1","feature2","label"
1,0,1
0,1,0
...
where the "label" column indicates class (1 is positive, 0 is negative). The number of features is actually pretty big but it doesn't matter for that question.
Here is how I read the data:
train = pandas.read_csv(TRAINING_FILE)
y_train, X_train = train['label'], train[['feature1', 'feature2']].fillna(0)
test = pandas.read_csv(TEST_FILE)
y_test, X_test = test['label'], test[['feature1', 'feature2']].fillna(0)
I want to run tensorflow.contrib.learn.LinearClassifier and tensorflow.contrib.learn.DNNClassifier on that data. For instance, I initialize DNN like this:
classifier = DNNClassifier(hidden_units=[3, 5, 3],
n_classes=2,
feature_columns=feature_columns, # ???
activation_fn=nn.relu,
enable_centered_bias=False,
model_dir=MODEL_DIR_DNN)
So how exactly should I create the feature_columns when all the features are also binary (0 or 1 are the only possible values)?
Here is the model training:
classifier.fit(X_train.values,
y_train.values,
batch_size=dnn_batch_size,
steps=dnn_steps)
The solution with replacing fit() parameters with the input function would also be great.
Thanks!
P.S. I'm using TensorFlow version 1.0.1
You can directly use tf.feature_column.numeric_column :
feature_columns = [tf.feature_column.numeric_column(key = key) for key in X_train.columns]
I've just found the solution and it's pretty simple:
feature_columns = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
Apparently infer_real_valued_columns_from_input() works well with categorical variables.

Resources