Straight line forecast with Statsmodels ARIMA - machine-learning

I am new to statsmodels ARIMA.
What is the best approach to do ARIMA on such a dataset?
The goal is to forecast the Value of the different types of gas.
I have run the Augmented Dickey-Fuller test and concluded that the data is stationary.
How do I get a more accurate forecast?
Date      T     RH     Gas  Value
6/2/2017  6.62  51.73  CO   845.23
6/2/2017  6.62  51.73  HC   626.34
# Initialising the ARIMA model
# (note: statsmodels.tsa.arima_model is deprecated in recent statsmodels;
#  the replacement is statsmodels.tsa.arima.model.ARIMA)
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA

arima_model = ARIMA(scaled_df.Value, order=(2,0,1)).fit()
arima_model.summary()

# forecast index range covering the test period
start = len(df)
end = len(df) + len(test) - 1
test['Date'] = pd.to_datetime(test['Date'], format='%d/%m/%Y')
test.set_index('Date', inplace=True)
pred = arima_model.predict(start=start, end=end, typ='levels')
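For reference, the stationarity check mentioned above can be reproduced along these lines (a sketch, assuming the same scaled_df.Value series):

# Sketch of the Augmented Dickey-Fuller test from statsmodels.
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value = adfuller(scaled_df.Value.dropna())[:2]
print('ADF statistic:', adf_stat)
print('p-value:', p_value)  # a small p-value (< 0.05) suggests stationarity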

I think it might be due to your training data being too large; try splitting it into smaller chunks.
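One thing worth checking, given the sample rows above, is that the CO and HC measurements are not interleaved into a single series. A minimal sketch of fitting one model per gas (assuming a DataFrame with the Date, Gas and Value columns shown; the file name is hypothetical):

# Fit a separate ARIMA per gas type so the series are not mixed.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA  # newer statsmodels API

df = pd.read_csv('gas_readings.csv')  # hypothetical file name
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')

forecasts = {}
for gas, group in df.groupby('Gas'):
    series = group.set_index('Date')['Value']
    model = ARIMA(series, order=(2, 0, 1)).fit()
    forecasts[gas] = model.forecast(steps=10)  # 10 steps ahead per gas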

Related

Evaluate the output of AutoML results

How do I interpret the following results? What is the best possible algorithm to train, based on the AutoGluon summary?
*** Summary of fit() ***
Estimated performance of each model:
    model                                   score_val  fit_time     pred_time_val  stack_level
19  weighted_ensemble_k0_l2                 -0.035874  1.848907     0.002517       2
18  weighted_ensemble_k0_l1                 -0.040987  1.837416     0.002259       1
16  CatboostClassifier_STACKER_l1           -0.042901  1559.653612  0.083949       1
11  ExtraTreesClassifierGini_STACKER_l1     -0.047882  7.307266     1.057873       1
...
0   RandomForestClassifierGini_STACKER_l0   -0.291987  9.871649     1.054538       0
The code to generate the above results:
import pandas as pd
from autogluon import TabularPrediction as task
from sklearn.datasets import load_digits

digits = load_digits()
savedir = "otto_models/"  # where to save trained models
train_data = pd.DataFrame(digits.data)
train_target = pd.DataFrame(digits.target)
train_data = pd.merge(train_data, train_target, left_index=True, right_index=True)
label_column = "0_y"
predictor = task.fit(
    train_data=train_data,
    label=label_column,
    output_directory=savedir,
    eval_metric="log_loss",
    auto_stack=True,
    verbosity=2,
    visualizer="tensorboard",
)
results = predictor.fit_summary()  # display detailed summary of fit() process
Which algorithm seems to work in this case?
weighted_ensemble_k0_l2 is the best result in terms of validation score (score_val) because it has the highest value. You may wish to run predictor.leaderboard(test_data) to get the test scores for each of the models.
Note that the scores are negative because AutoGluon always treats higher as better. If a particular metric such as log loss prefers lower values, AutoGluon flips the sign of the metric. I would guess a val_score of 0 would be a perfect score in your case.
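For example, a sketch of that leaderboard call (illustrative only; exact column names such as score_test may vary between AutoGluon versions):

# Score every trained model on held-out data via the leaderboard.
# Ideally test_data holds rows that were not seen during fit();
# sampling from train_data here is purely for illustration.
test_data = train_data.sample(frac=0.2, random_state=0)
lb = predictor.leaderboard(test_data)
print(lb.head())  # per-model test and validation scores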

Issue with multilabel classification

I followed this tutorial: https://medium.com/@vijayabhaskar96/multi-label-image-classification-tutorial-with-keras-imagedatagenerator-cd541f8eaf24
and wrote some of my code for multi-label classification. I had it working with one-hot encoding on a small scale, but I had to move to option 2 mentioned in the article because I have 6000 classes, so one-hot encoding was not viable. I managed to train the network, and it reported 99% accuracy and an 83% F1 score. However, when I test the network, for every image it outputs some combination of only 3 labels out of the 6000 possible. I wondered if the code to test the model was incorrect. I tried the code mentioned in the post and it doesn't work:
test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1)
pred_bool = (pred > 0.5)
which raises:
unorderable types: list() > float()
I've tried hard to fix this but haven't figured it out, and I can't find any examples online of anyone doing something similar. Does anyone have an idea of how to get this prediction part working using this code block (I had it with another 2 options and was getting that issue of printing one or several labels), or why the model might be failing in training with this behavior?
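A minimal sketch of a thresholding step that avoids the comparison error, assuming pred just needs to be a NumPy array and the label order comes from the training generator's class_indices:

import numpy as np

test_generator.reset()
pred = model.predict_generator(test_generator, steps=STEP_SIZE_TEST, verbose=1)
pred = np.asarray(pred)   # guard against a plain-list result
pred_bool = pred > 0.5    # element-wise threshold now works

# Map multi-hot rows back to label names.
labels = [k for k, _ in sorted(train_generator.class_indices.items(), key=lambda kv: kv[1])]
predicted = [[labels[i] for i in np.where(row)[0]] for row in pred_bool]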
EDIT: for more context on the training issue, here is all the training code:
import json
import pandas as pd
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from PIL import ImageFile

# build_model, ComputeF1 and parser (an argparse.ArgumentParser) are
# defined elsewhere in my project.

input_file = open('class_names_6000.json')
json_array = json.load(input_file)
#print(str(json_array))
args = parser.parse_args()

gpu_options = tf.GPUOptions(allow_growth=True)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))

print('Loading Data...')
df = pd.read_csv('dataset_train.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
datagen = ImageDataGenerator(rescale=1./255.)
test_datagen = ImageDataGenerator(rescale=1./255.)
train_generator = datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))

df = pd.read_csv('dataset_test.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
test_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))

df = pd.read_csv('dataset_validation.csv')
df["labels"] = df["labels"].apply(lambda x: x.split(","))
valid_generator = test_datagen.flow_from_dataframe(
    dataframe=df,
    directory="",
    x_col="Filepaths",
    y_col="labels",
    batch_size=128,
    seed=42,
    shuffle=True,
    class_mode="categorical",
    classes=json_array,
    target_size=(100, 100))
print('Data Loaded.')

f1_score_callback = ComputeF1()
model = build_model('train', numclasses=len(json_array), model_name=args.model)
ImageFile.LOAD_TRUNCATED_IMAGES = True
Also, an important detail: during training, it reports 99% accuracy and an F1 score of 84%, with a validation F1 score of 84% as well.

Random Forest Regression does not give 0 or 1

I'm currently using RandomForestRegressor for the Titanic competition (Kaggle).
%%timeit
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

model = RandomForestRegressor(n_estimators=200, oob_score=False, n_jobs=1, random_state=42)
model.fit(X, y)
#y_oob = model.oob_prediction_
#print("c-stat:", roc_auc_score(y, model.oob_prediction_))
prediction_regression = model.predict(X_test)
# dataframe with predictions
kaggle = pd.DataFrame({'PassengerId': passengerId, 'Survived': prediction_regression})
# save to csv
kaggle.to_csv('./csvToday/prediction_regression.csv', index=False)
but it doesn't return 0 or 1; it gives decimal values:
892: 0.3163
893: 0.07
and so on. How do I make RandomForestRegressor return 0 or 1?
Regression is the machine learning problem of predicting a quantity/amount/price (such as stock market prediction, home price prediction, etc.). As far as I remember, the goal of the Titanic competition is to predict whether a passenger survived, which sounds like a binary classification problem. If it's a classification problem you should use RandomForestClassifier (docs).
So your code would look like:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    # some parameters
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

submit_df = pd.DataFrame({'PassengerId': passengerId, 'Survived': y_pred})
submit_df.to_csv('./csvToday/submission.csv', index=False)
This kernel can provide you with some more insights.
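Alternatively, if you want to keep the regressor, you can threshold its continuous output at 0.5 (a quick sketch based on the question's variables):

# Round the regressor's continuous scores to hard 0/1 labels.
prediction_binary = (prediction_regression > 0.5).astype(int)
kaggle = pd.DataFrame({'PassengerId': passengerId, 'Survived': prediction_binary})
kaggle.to_csv('./csvToday/prediction_binary.csv', index=False)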

Increase accuracy of a model in sklearn

The decision tree classifier gives an accuracy of 0.52, but I want to increase the accuracy. How can I do that using any of the classification models available in sklearn?
I have used kNN, a decision tree, and cross-validation, but all of them give low accuracy.
Thanks
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
#read from the csv file and return a Pandas DataFrame.
nba = pd.read_csv('wine.csv')
# print the column names
original_headers = list(nba.columns.values)
print(original_headers)
#print the first three rows.
print(nba[0:3])
# "Position (pos)" is the class attribute we are predicting.
class_column = 'quality'
#The dataset contains attributes such as player name and team name.
#We know that they are not useful for classification and thus do not
#include them as features.
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH','sulphates', 'alcohol']
#Pandas DataFrame allows you to select columns.
#We use column selection to split the data into features and class.
nba_feature = nba[feature_columns]
nba_class = nba[class_column]
print(nba_feature[0:3])
print(list(nba_class[0:3]))
train_feature, test_feature, train_class, test_class = \
train_test_split(nba_feature, nba_class, stratify=nba_class, \
train_size=0.75, test_size=0.25)
training_accuracy = []
test_accuracy = []
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))
train_class_df = pd.DataFrame(train_class,columns=[class_column])
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)
temp_df = pd.DataFrame(test_class,columns=[class_column])
temp_df['Predicted Pos']=pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))
prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Usually the next step after a DT is RF (and its neighbours) or XGBoost (but that's not sklearn). Try them. Also, DTs are very easy to overfit.
Remove outliers. Check the classes in your dataset: if they are unbalanced, most of the errors may come from there. In that case you need to use weights while fitting or in the metric function (or use F1); see the sketch after this answer.
You could attach your confusion matrix here - it would be great to see.
Also a NN (even from sklearn) may show better results.
Improve your preprocessing.
Methods such as DT and kNN may be sensitive to how you preprocess your columns. For example, a DT can benefit a lot from well-chosen thresholds on the continuous variables.
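A quick sketch of the weighting and F1 suggestions above, reusing the question's train split (RandomForestClassifier stands in here for a stronger model):

# Handle class imbalance with built-in weights and score with macro-F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

forest = RandomForestClassifier(n_estimators=200, class_weight='balanced', random_state=0)
scores = cross_val_score(forest, train_feature, train_class, cv=10, scoring='f1_macro')
print("Average cross-validation macro-F1: {:.2f}".format(scores.mean()))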

How to use SGD for time series analysis

Is it possible to use stochastic gradient descent for time-series analysis?
My initial idea, given a series of (t, v) pairs where I want an SGD regressor to predict the v associated with t+1, would be to convert the date/time into an integer value and train the regressor on this list using the hinge loss function. Is this feasible?
Edit: This is example code using the SGD implementation in scikit-learn. However, it fails to properly predict a simple linear time series. All it seems to do is compute the average of the training y-values and use that as its prediction of the test y-values. Is SGD just unsuitable for time-series analysis, or am I formulating this incorrectly?
from datetime import date
from sklearn.linear_model import SGDRegressor

# Build data.
s = date(2010,1,1)
i = 0
training = []
for _ in xrange(12):
    i += 1
    training.append([[date(2012,1,i).toordinal()], i])
testing = []
for _ in xrange(12):
    i += 1
    testing.append([[date(2012,1,i).toordinal()], i])

clf = SGDRegressor(loss='huber')
print 'Training...'
for _ in xrange(20):
    try:
        print _
        clf.partial_fit(X=[X for X,_ in training], y=[y for _,y in training])
    except ValueError:
        break

print 'Testing...'
for X,y in testing:
    p = clf.predict(X)
    print y, p, abs(p-y)
SGDRegressor in sklearn is numerically unstable for unscaled input parameters. For good results it is highly recommended that you scale the input variables.
from datetime import date
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Build data.
s = date(2010,1,1).toordinal()
i = 0
training = []
for _ in range(1,13):
    i += 1
    training.append([[s+i], i])
testing = []
for _ in range(13,25):
    i += 1
    testing.append([[s+i], i])

scaler = StandardScaler()
X_train = scaler.fit_transform([X for X,_ in training])
After training the SGD regressor, you have to scale the test input variable with the same scaler:
clf = SGDRegressor()
clf.fit(X=X_train, y=[y for _,y in training])
print(clf.intercept_, clf.coef_)
print('Testing...')
for X,y in testing:
    p = clf.predict(scaler.transform([X]))
    print(X[0], y, p[0], abs(p[0]-y))
Here is the result:
[6.31706122] [3.35332573]
Testing...
733786 13 12.631164799851827 0.3688352001481725
733787 14 13.602565350686039 0.39743464931396133
733788 15 14.573965901520248 0.42603409847975193
733789 16 15.545366452354457 0.45463354764554254
733790 17 16.51676700318867 0.48323299681133136
733791 18 17.488167554022876 0.5118324459771237
733792 19 18.459568104857084 0.5404318951429161
733793 20 19.430968655691295 0.569031344308705
733794 21 20.402369206525506 0.5976307934744938
733795 22 21.373769757359714 0.6262302426402861
733796 23 22.34517030819392 0.6548296918060785
733797 24 23.316570859028133 0.6834291409718674
The method of choice for time series prediction depends on what you know about your time series. If you choose a specific method for your task you always make implicit assumptions about the nature of your signal and the kind of system that generated the signal. Any method is always a model of the system. The more you know a priori about your signal and the system the better you are able to model it.
If your signal, for instance, is of a stochastic nature, usually ARMA processes or Kalman filters are a good choice. If those fail, other more deterministic models might help, given, of course, you have some information about your system.
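For instance, a minimal sketch of fitting an ARMA model with statsmodels (the toy AR(1) series below is purely illustrative):

# Fit an ARMA(1,1) model to a toy AR(1) series; ARMA(p,q) == ARIMA(p,0,q).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.zeros(200)
for t in range(1, 200):
    series[t] = 0.7 * series[t - 1] + rng.normal()  # stationary AR(1) signal

model = ARIMA(series, order=(1, 0, 1)).fit()
print(model.forecast(steps=5))  # next five predicted values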
