Zero predictions on my LSTM model though accuracy is 30% - machine-learning

I am building an LSTM model, with the input shape of:
x_train_text: (3500,80) " I have 3500 examples, and 80 features extracted from WordEmbedding"
y_train_text: (3500,6) "I have 6 classes, unbalanced"
x_validate_text: (1000,80)
y_validate_text: (1000,6)
Now, I trained the model and the overall accuracy was 30%. I am fine with that as I am building a simple LSTM. The result is as follows:
model.fit(x_train_text, y_train_text,
          validation_data=(x_validate_text, y_validate_text),
          epochs=10)
{'model': [['loss', 1.7275227308273315],
['accuracy', 0.24708323180675507],
['val_loss', 1.7259385585784912],
['val_accuracy', 0.2551288902759552]]}
Now, I am trying to do error analysis to see which classes are underfitting. Whenever I run model.predict(x_train_text) I get only zeros, although it is the same training dataset!
Shouldn't this be at least the same as training accuracy overall?
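One thing that may be worth checking: the accuracy reported during training is normally computed from the most probable class per example, so raw predict output has to be turned into class indices before it can be compared with the one-hot labels. Below is a minimal sketch of a per-class check, assuming predict returns one row of class probabilities per example. The helper names and the tiny 3-class arrays are made up for illustration and are not from the question.

```python
from collections import Counter

def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=lambda i: row[i])

def per_class_recall(probabilities, one_hot_labels):
    """Fraction of examples of each true class that were predicted correctly."""
    totals, hits = Counter(), Counter()
    for probs, label in zip(probabilities, one_hot_labels):
        true_class = argmax(label)
        totals[true_class] += 1
        if argmax(probs) == true_class:
            hits[true_class] += 1
    return {c: hits[c] / totals[c] for c in sorted(totals)}

# Made-up probabilities for 4 examples and 3 classes (illustration only).
probabilities = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.1, 0.7],
]
one_hot_labels = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
recall = per_class_recall(probabilities, one_hot_labels)
print(recall)
```

A class whose recall is 0 while another class is near 1 would suggest that the model mostly predicts the majority class, which is plausible with unbalanced labels.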

Related

Linear Regression test data violating training data. Please explain where I went wrong

This is a part of a dataset containing 1000 entries of pricing of rents of houses at different locations.
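After training the model, if I send the same training data as test data, I get incorrect results. How is this even possible?
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select columns with a list; a set has no defined order and is not a valid indexer
X_loc = df[['area', 'rooms', 'location']]
y_loc = df['price']
X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predict the first training example again
y_pred = regressor.predict(X_train[0:1])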
DATASET:
        price  rooms  area  location
0  0    22000      3  1339       140
1  1    45000      3  1580        72
3  3    72000      3  2310        72
4  4    40000      3  1800        41
5  5    35000      3  2100        57
The expected output (y_pred) should be 22000, but it is showing 29000. How can the model get an input wrong that it was already trained on?
What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit, which minimizes the total error over all data points rather than the error at each individual point.
22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data into one straight line.
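To make the idea of training error concrete, here is a minimal sketch that fits an ordinary least-squares line to the five rows shown in the question and then predicts those same rows again. It is an assumption-laden simplification: it uses only the area column and only these five rows, whereas the original model uses three features and a split of the full dataset, so the numbers will differ from the 29000 in the question.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The five rows visible in the question (area, price).
areas = [1339, 1580, 2310, 1800, 2100]
prices = [22000, 45000, 72000, 40000, 35000]

slope, intercept = fit_line(areas, prices)
predictions = [intercept + slope * a for a in areas]
residuals = [p - y for p, y in zip(predictions, prices)]

for area, price, pred in zip(areas, prices, predictions):
    print(f"area={area} price={price} predicted={pred:.0f}")
```

The residuals sum to roughly zero, yet none of the individual predictions matches its training price exactly. The fit is the best compromise across all points.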
Possibly the underlying relationship is nonlinear, so applying linear regression yields poor results. There are other reasons why linear regression may fail, cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results
Nonlinear data often appears when there are (statistical) interactions between features.
A generalization of linear regression is the Generalized Linear Model (GLM), which can handle nonlinearities through its nonlinear link functions: https://en.wikipedia.org/wiki/Generalized_linear_model
In scikit-learn you can use Support Vector Regression with a polynomial or RBF kernel for a nonlinear model: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
An alternative approach is to analyze the data for interactions and apply the methods described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data, although this is complex. Under this assumption you could try Ridge Regression, because it can handle multicollinearity, which is one form of statistical interaction: https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

How many hours of training does it take to get decent error in House Prices Dataset using Neural Network

I'm new to Machine Learning and I'm trying to implement linear regression using keras on this dataset https://www.kaggle.com/harlfoxem/housesalesprediction . Although I think classical machine learning would be better suited to this problem, I want to use a Neural Network to learn about it. I have done feature selection and removed some features that were highly correlated with each other, and now have 8 features left. I have normalized my features, but not the labels. I have read and know that Neural Networks generally take time to train; I am asking this question so that I don't invest further time in a model that might not work. Right now, I am training a model with this design:
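from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

model = Sequential()
# input_shape must equal the number of input features per example
model.add(Dense(10, input_shape=(10,), activation=LeakyReLU()))
model.add(Dense(7, activation=LeakyReLU()))
model.add(Dense(1))  # single linear output for the price
model.compile(optimizer="adam", loss="mean_squared_error", metrics=["mean_squared_error"])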
and right now, it's been 13,000 epochs and 8 hours, and I'm still getting:
loss: 66127403415.9417 - mean_squared_error: 66127421440.0000 - val_loss: 75086529026.4872 - val_mean_squared_error: 75086495744.0000
Although I can see that the loss has been slowly improving (It started at about 300 billion) . So how many hours of training does it take to get decent error on this dataset? Am I on the right track?
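Since the labels are not normalized, the reported loss is in squared price units, which makes it hard to judge. A small sketch of the arithmetic, using only the two loss values quoted above, converts them to a root-mean-squared error in the same units as the price column:

```python
import math

# Loss values copied from the training log in the question.
train_mse = 66127403415.9417
val_mse = 75086529026.4872

# RMSE is the square root of MSE and has the same units as the label.
train_rmse = math.sqrt(train_mse)
val_rmse = math.sqrt(val_mse)

print(f"train RMSE ~ {train_rmse:,.0f}, validation RMSE ~ {val_rmse:,.0f}")
```

Read this way, the typical error is a few hundred thousand price units, which can be compared directly with the range of prices in the dataset.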

Non-linear multivariate time-series response prediction using RNN

I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response for the full length of the input features time-series at each time-step, rather than the subsequent values of a time-series (eg financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
1. Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
2. Split into training and test sets. The training set contains the first 8 years of data, the test set contains the remaining 2 years.
3. Normalise the training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise the test set analogously using the mean and variance from the training set. This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
4. As these are long time-series, I use a stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
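The shapes after cutting can be checked with a small sketch. This is not the stateful_cut() function referred to above (its source is not shown here). It is a hypothetical helper, cut_into_windows, that splits one long series into consecutive, non-overlapping windows of T_after_cut steps, which is the single-sample case with batch_size 1.

```python
def cut_into_windows(series, window):
    """Split a list of time steps into consecutive, non-overlapping windows."""
    if len(series) % window != 0:
        raise ValueError("series length must be a multiple of the window length")
    return [series[i:i + window] for i in range(0, len(series), window)]

# Stand-ins for the time axis only; each entry would really hold 12 features.
train_steps = list(range(61320))
test_steps = list(range(17520))

train_windows = cut_into_windows(train_steps, 24)
test_windows = cut_into_windows(test_steps, 24)

print(len(train_windows), len(test_windows))  # 2555 730
```

With a stateful layer, the windows have to be fed in this order and without shuffling, so that the state at the end of one window carries over to the next.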
Next, I build and train the LSTM model as follows:
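from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True))
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
# shuffle=False keeps the cuts in temporal order, which stateful training relies on
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)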
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differentiating the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
1. Use more samples to train instead of only 1 (I used 18 samples to train and 6 to test).
2. Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn.
3. Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed.
4. Use a stateful LSTM as described here, but add reset states after each epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
5. Use CuDNNLSTM instead of LSTM; this sped up training by a factor of 4!
6. I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that the GRU is as accurate as the LSTM, though it converged faster for the same number of units.
7. Monitor validation loss during training and use early stopping.
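The standardisation step can be sketched in a few lines. This is a plain-Python illustration with made-up numbers, not the author's code. StandardScaler from Sklearn does the same thing per feature column. The key point is that the mean and standard deviation come from the training data only, and that predictions on standardised targets have to be transformed back to physical units afterwards.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Return (mean, population standard deviation) of the training values."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Scale values to zero mean and unit variance using training statistics."""
    return [(v - mu) / sigma for v in values]

def inverse_transform(values, mu, sigma):
    """Map standardised values back to the original units."""
    return [v * sigma + mu for v in values]

# Made-up temperatures for one target (illustration only).
train_target = [18.0, 20.0, 22.0, 24.0]
test_target = [19.0, 25.0]

mu, sigma = fit_scaler(train_target)
train_scaled = transform(train_target, mu, sigma)
test_scaled = transform(test_target, mu, sigma)      # reuses training statistics
restored = inverse_transform(test_scaled, mu, sigma)

print(train_scaled, test_scaled, restored)
```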
The LSTM model is built and trained as follows:
from keras import layers
from keras.callbacks import Callback, EarlyStopping
from keras.models import Sequential
from keras.optimizers import RMSprop

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            super().__init__()
            self.counter = 0

        def on_batch_begin(self, batch, logs=None):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs=None):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback

model = Sequential()
# CuDNNLSTM is the GPU-only variant of LSTM
model.add(layers.CuDNNLSTM(256, batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
This gave me very satisfying accuracy (R2 over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.
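To see when the callback above actually resets the LSTM states, the schedule can be simulated without Keras. The sketch below is only an illustration under stated assumptions: each series is taken to be 10 years of 8760 hourly values (leap days ignored), and reset_batches is a hypothetical helper, not part of the answer's code.

```python
def reset_batches(n_samples, batch_size, nb_cuts):
    """Batch indices within one epoch at which the callback resets states."""
    batches_per_epoch = (n_samples // batch_size) * nb_cuts
    return [b for b in range(batches_per_epoch) if b % nb_cuts == 0]

series_length = 10 * 8760                  # assumed: 10 years of hourly values
T_after_cut = 1460
nb_cuts = series_length // T_after_cut     # cuts per full-length series

schedule = reset_batches(n_samples=18, batch_size=6, nb_cuts=nb_cuts)
print(nb_cuts, schedule)  # 60 [0, 60, 120]
```

Under these assumptions each group of 6 samples is fed as 60 consecutive cuts, and the states are cleared only when the next group of 6 samples starts, so information carries across the whole 10-year series but not between unrelated samples.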

Training accuracy on SGD

How do you compute the training accuracy for SGD? Do you compute it using the batch data you trained your network with, or using the entire dataset (for each batch optimization iteration)?
I tried computing the training accuracy for each iteration using the batch data I trained my network with, and it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration gave me 100%). Is this because I am computing the accuracy on the same batch data I trained it with for that iteration? Or is my model overfitting so strongly that it gave me 100% instantly, while the validation accuracy is low? (This is the main question here: is this acceptable, or is there something wrong with the model?)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during the training process is correct. If the accuracy is always a multiple of 10%, then most likely it is because your batch size is 10. For example, if 8 of the training outputs match the labels, then your training accuracy will be 80%. If the training accuracy goes up and down, there are two main possibilities:
1. If you print out the accuracy multiple times over one epoch, this is normal, especially at the early stage of training, because the model is predicting over different data samples;
2. If you print out the accuracy once per epoch, and you see the training accuracy going up and down during the later stage of training, that means your learning rate is too big. You need to decrease it over time during training.
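The link between batch size and the possible accuracy values can be sketched as follows. The labels and predictions are made up for illustration, and the two helper functions are hypothetical names.

```python
def batch_accuracy(predicted, actual):
    """Fraction of predictions in one batch that match the labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def accuracy_step(batch_size):
    """Smallest possible change in batch accuracy, in percent."""
    return 100.0 / batch_size

# A made-up batch of 10 labels in which 8 predictions are correct.
actual = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
predicted = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

acc = batch_accuracy(predicted, actual)
print(acc, accuracy_step(10), accuracy_step(64))
```

With 10 examples per evaluation the accuracy can only move in steps of 10%, whereas with 64 examples the step would be about 1.56%. Accuracies that are always multiples of 10% therefore suggest that 10 examples are being scored at a time.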
If these do not answer your question, please provide more details so that we can help.

LSTM training pattern

I'm fairly new to NNs and I'm doing my own "Hello World" with LSTMs instead of copying something. I have chosen a simple logic as follows:
Input with 3 timesteps. First one is either 1 or 0, the other 2 are random numbers. Expected output is same as the first timestep of input. The data feed looks like:
_X0=[1,5,9] _Y0=[1] _X1=[0,5,9] _Y1=[0] ... 200 more records like this.
This simple(?) logic can be trained for 100% accuracy. I ran many tests and the most efficient model I found was 3 LSTM layers, each of them with 15 hidden units. This returned 100% accuracy after 22 epochs.
However, I noticed something that I struggle to understand: in the first 12 epochs the model makes no progress at all as measured by accuracy (acc. stays at 0.5) and only marginal progress as measured by categorical crossentropy (which goes from 0.69 to 0.65). Then from epoch 12 through epoch 22 it trains very fast to accuracy 1.0. The question is: why does training happen like this? Why do the first 12 epochs make no progress, and why are epochs 12-22 so much more efficient?
Here is my entire code:
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, LSTM
from keras.models import Model
import helper
from keras.utils import to_categorical

# Load the CSV as sequences of 3 timesteps plus the target column
x_, y_ = helper.rnn_csv_toXY("LSTM_hello.csv", 3, "target")
y_binary = to_categorical(y_)

model = Sequential()
model.add(LSTM(15, input_shape=(3, 1), return_sequences=True))
model.add(LSTM(15, return_sequences=True))
model.add(LSTM(15, return_sequences=False))
model.add(Dense(2, activation='softmax', kernel_initializer='RandomUniform'))
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(x_, y_binary, epochs=100)
It is hard to give a specific answer to this as it depends on many factors. One major factor that comes into play when training neural networks is the learning rate of the optimizer you choose.
In your code you have no specific learning rate set. The default learning rate of Adam in Keras 2.0.3 is 0.001. Adam uses a dynamic learning rate lr_t based on the initial learning rate (0.001) and the current time step, defined as
lr_t = lr * (sqrt(1. - beta_2**t) / (1. - beta_1**t)) .
The values of beta_2 and beta_1 are commonly left at their default values of 0.999 and 0.9 respectively. If you plot this learning rate you get a picture of something like this:
It might just be that this is the sweet spot for updating your weights to find a local (possibly a global) minimum. A learning rate that is too high often makes no progress, as it just 'skips' over the regions that would lower your error, whereas lower learning rates take smaller steps in your error landscape and let you find regions where the error is lower.
I suggest that you use an optimizer that makes fewer assumptions, such as stochastic gradient descent (SGD), and that you test this hypothesis by using a lower learning rate.
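The step-size formula quoted above can be tabulated directly. This is a minimal sketch using only the formula and the default values mentioned in this answer. The function name adam_step_size is made up, and t is assumed to count parameter updates (batches), not epochs.

```python
import math

def adam_step_size(t, lr=0.001, beta_1=0.9, beta_2=0.999):
    """Effective step size lr_t at update step t (t starts at 1)."""
    return lr * math.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)

# Print the effective step size at a few update counts.
for t in (1, 10, 100, 1000, 10000):
    print(t, f"{adam_step_size(t):.6f}")
```

With these defaults the effective step size starts at roughly a third of the nominal 0.001 and only approaches it after several thousand updates, so on a small dataset the early epochs run with a noticeably smaller step than the later ones.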
