GluonTS example airpassengers dataset not found - time-series

I am trying to run the GluonTS example code, going through some struggle to install the libraries, now I get the following error:
FileNotFoundError: C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\test
The C:\Users\abcde\.mxnet\gluon-ts\datasets\airpassengers\ does exist but contains only train folder. Have tried reinstalling but to no avail. Any ideas how to fix this and run the example, even if finding the dataset in correct format elsewhere?
EDIT: To clarify, I was referring to an example on https://ts.gluon.ai/stable/
import matplotlib.pyplot as plt
from gluonts.dataset.util import to_pandas
from gluonts.dataset.pandas import PandasDataset
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.mx import DeepAREstimator, Trainer
dataset = get_dataset("airpassengers")
deepar = DeepAREstimator(prediction_length=12, freq="M", trainer=Trainer(epochs=5))
model = deepar.train(dataset.train)
# Make predictions
true_values = to_pandas(list(dataset.test)[0])
true_values.to_timestamp().plot(color="k")
prediction_input = PandasDataset([true_values[:-36], true_values[:-24], true_values[:-12]])
predictions = model.predict(prediction_input)
for color, prediction in zip(["green", "blue", "purple"], predictions):
prediction.plot(color=f"tab:{color}")
plt.legend(["True values"], loc="upper left", fontsize="xx-large")

There was an incorrect import on the earlier version of the example, which was since corrected, also I needed to specify regenerate=True while getting the dataset, so:
dataset = get_dataset("airpassengers", regenerate=True).

Related

Detectron2 prediction problem on my local machine with custom model

I trained a custom model with detectron2 on google colab, and ok, it's working correctly. The model was trained, the predictions were ok, this on google colab. But when I made predictions on my local machine did'nt work. Here a similar example on google colab: https://colab.research.google.com/drive/1bSlH5Am_zFEWbJ9zTRu2wFEDKDvn0LUv?usp=sharing
I exported de final model and ran with this code:
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor
from detectron2.data import MetadataCatalog
from detectron2.utils.visualizer import Visualizer, ColorMode
import matplotlib.pyplot as plt
import cv2.cv2 as cv2
cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.WEIGHTS = "model_final.pth" # path for final model
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.8
predictor = DefaultPredictor(cfg)
im = cv2.imread('0.jpg')
outputs = predictor(im)
v = Visualizer(im[:, :, ::-1],
metadata=MetadataCatalog.get(cfg.DATASETS.TRAIN[0]),
scale=0.5,
instance_mode=ColorMode.IMAGE_BW)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
img = cv2.cvtColor(out.get_image()[:, :, ::-1], cv2.COLOR_RGBA2RGB)
cv2.imwrite('img.jpg',img)
I supose that the cfg.merge_from_file is the problem. Is there other file? Where I find on colab?
I tested the standard models and worked well on my local machine, the problem is with the custom model.
I saved the configs with this comand and then I downloaded.
f = open('config.yml','w')
f.write(cfg.dump())
f.close()
and replaced:
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
by
cfg.merge_from_file("config.yml")
and worked.

Keras: Model Compilation Giving "Index 200005 is out of bounds for axis 0 with size 200000" Error

I'm using Jena Climate Data that my book gives a link to. I have it below;
https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
I tried messing with it but I have no clue why the index is surpassing 200000. I'm not sure why it gets to 200005 since my training data is 200001 observations long.
I've also gotten an error that said, " Index 200000 is out of bounds for axis 0 with size 200000."
The data is 420551x14 of weather data. My code is as follows:
import pandas as pd
import numpy as np
import keras
data = pd.read_csv("D:\\School\\Spring_2019\\GraduateProject\\jena_climate_2009_2016_Data\\jena_climate_2009_2016.csv")
data = data.iloc[:,data.columns!='Date Time']
data
# Standardize the Data
from sklearn import preprocessing
data = preprocessing.scale(data[:200000])
# Build Generators
from keras.preprocessing.sequence import TimeseriesGenerator
target = data[:,1] # Should target be scaled?
# ? Do I need to remove targets from the data variable?
trainGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=0,
end_index=200000)
valGen = TimeseriesGenerator(data,targets=target,length=1440,
sampling_rate=6,
batch_size=190,
start_index=199999,
end_index=300000)
testGen = TimeseriesGenerator(data,targets=target,length=6,
batch_size=128,
start_index=300000,
end_index=420550)
from keras.models import Sequential
from keras import layers
from keras.optimizers import RMSprop
from keras.layers import LSTM
#Flatten part is: 240 = lookback//step. This is 1440/6 because we are looking at
model = Sequential()
model.add(layers.Flatten(input_shape=(240,data.shape[-1])))
model.add(layers.Dense(32,activation='relu'))
model.add(layers.Dense(1))
val_steps = 300000-200001-1440
model.compile(optimizer=RMSprop(),loss='mae')
history = model.fit_generator(trainGen,
steps_per_epoch=250,
epochs=20,
validation_data=valGen,
validation_steps=val_steps)
Let me know if you need anything else and thank you greatly in advance.
Well, you've only selected first 200000 rows for your data (data = preprocessing.scale(data[:200000]), so validation and test generators are out of bounds (index > 200000)

Considering one column more important than others

In case of 3 columns data, (In my test case) I can see that all the columns are valued as equal.
random_forest.feature_importances_
array([0.3131602 , 0.31915436, 0.36768544])
Is there any way to add waitage to one of the columns?
Update:
I guess xgboost can be used in this case.
I tried, but getting this error:
import xgboost as xgb
param = {}
num_round = 2
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(x_test_split)
dtrain_split = xgb.DMatrix(X_train, label=y_train)
dtest_split = xgb.DMatrix(X_test)
gbdt = xgb.train(param, dtrain_split, num_round)
y_predicted = gbdt.predict(dtest_split)
rmse_pred_vs_actual = xgb.rmse(y_predicted, y_test)
AttributeError: module 'xgboost' has no attribute 'rmse'
Error is by assuming xgb has method "rmse":
rmse_pred_vs_actual = xgb.rmse(y_predicted, y_test)
It is literally written: AttributeError: module 'xgboost' has no attribute 'rmse'
Use sklearn.metrics.mean_squared_error
By:
from sklearn.metrics import mean_squared_error
# Your code
rmse_pred_vs_actual = mean_squared_error(y_test, y_predicted)
It'll fix your error but it still doesn't control a feature importance.
Now, if you really want to change the importance of a feature, you need to be creative about how to make a change like this. There is no text book solution that I know of and no method in xgboost that I know of. You can follow the link Stev posted in a comment to your question and maybe get some ideas (including changing your ML algorithm).

Probable issue with LSTM in lasagne

With a simple constructor for the LSTM, as given in the tutorial, and an input of dimension [,,1] one would expect to see an output of shape [,,num_units].
But regardless of the num_units passed during construction, the output has the same shape as the input.
Following is the min code to replicate this issue...
import lasagne
import theano
import theano.tensor as T
import numpy as np
num_batches= 20
sequence_length= 100
data_dim= 1
train_data_3= np.random.rand(num_batches,sequence_length,data_dim).astype(theano.config.floatX)
#As in the tutorial
forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
l_lstm = lasagne.layers.LSTMLayer(
(num_batches,sequence_length, data_dim),
num_units=8,
forgetgate=forget_gate
)
lstm_in= T.tensor3(name='x', dtype=theano.config.floatX)
lstm_out = lasagne.layers.get_output(l_lstm, {l_lstm:lstm_in})
f = theano.function([lstm_in], lstm_out)
lstm_output_np= f(train_data_3)
lstm_output_np.shape
#= (20, 100, 1)
An unqualified LSTM (I mean in its default mode) should produce one output for each unit right?
The code was run on kaixhin's cuda lasagne docker image docker image
What gives?
Thanks !
You can fix that by using a lasagne.layers.InputLayer
import lasagne
import theano
import theano.tensor as T
import numpy as np
num_batches= 20
sequence_length= 100
data_dim= 1
train_data_3= np.random.rand(num_batches,sequence_length,data_dim).astype(theano.config.floatX)
#As in the tutorial
forget_gate = lasagne.layers.Gate(b=lasagne.init.Constant(5.0))
input_layer = lasagne.layers.InputLayer(shape=(num_batches, # <-- change
sequence_length, data_dim),) # <-- change
l_lstm = lasagne.layers.LSTMLayer(input_layer, # <-- change
num_units=8,
forgetgate=forget_gate
)
lstm_in= T.tensor3(name='x', dtype=theano.config.floatX)
lstm_out = lasagne.layers.get_output(l_lstm, lstm_in) # <-- change
f = theano.function([lstm_in], lstm_out)
lstm_output_np= f(train_data_3)
print lstm_output_np.shape
If you feed your input into the input_layer, it is not ambiguous anymore, so you do not even need to specify where the input is supposed to go. Directly specifying a shape and adding the tensor3 into the LSTM does not work.

Combing the NLTK text features with sklearn Vectorized features

I am trying to combine the dict type features used in NLTK along with the SKLEARN tfidf feature for each instance.
Sample Input:
instances=[["I am working with text data"],["This is my second sentence"]]
instance = "I am working with text data "
def generate_features(instance):
featureset["suffix"]=tokenize(instance)[-1]
featureset["tfidf"]=self.tfidf.transform(instance)
return features
from sklearn.linear_model import LogisticRegressionCV
from nltk.classify.scikitlearn import SklearnClasskifier
self.classifier = SklearnClassifier(LogisticRegressionCV())
self.classifier.train(feature_sets)
This tfidf is trained on all the instances. But when I train the nltk classifier using this featureset it throws the following error.
self.classifier.train(feature_sets)
File "/Library/Python/2.7/site-packages/nltk/classify/scikitlearn.py", line 115, in train
X = self._vectorizer.fit_transform(X)
File "/Library/Python/2.7/site
packages/sklearn/feature_extraction/dict_vectorizer.py", line 226, in fit_transform
return self._transform(X, fitting=True)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/dict_vectorizer.py", line 174, in _transform
values.append(dtype(v))
TypeError: float() argument must be a string or a number
I understand the issue here, that it cannot vectorize the already vectorized features. But is there a way to fix this ?
For those who might visit this question in future, I did the following thing which solved the issue.
from sklearn.linear_model import LogisticRegressionCV
from scipy.sparse import hstack
def generate_features(instance):
featureset["suffix"]=tokenize(instance)[-1]
return features
feature_sets=[(generate_features(instance),label) for instance in instances]
X = self.vec.fit_transform([item[0] for item in feature_sets]).toarray()
Y = [item[1] for item in feature_sets]
tfidf=TfidfVectorizer.fit_transform(instances)
X=hstack((X,tfidf))
classifier=LogisticRegressionCV()
classifier.fit(X,Y)
i don't know if it's help or not. In my case, featureset["suffix"]'s values must be string or number. For example :
featureset["suffix"] = "some value"

Resources