When and where should we use these keras LSTM models - machine-learning

I know how a RNN, LSTM, neural nets,activation function works but from various available LSTM models I dont know what should I use for which data and when. I created these 5 models as a sample of different varites of LSTM models I have seen but I dont know which optimal sequence dataset should use. I have most of my confussion in the second/third lines of these models. Are model1 and model4 are same? Why is model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False)) different from model4.add(Embedding(X_train.shape[1], 128, input_length=max_len)) . I would much appreciate If some one can explain these five models in simple english.
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.models import Sequential
from keras.layers.wrappers import TimeDistributed
#model1
model1 = Sequential()
model1.add(LSTM(10, input_shape=(max_len, 1), return_sequences=False))
model1.add(Dense(1, activation='sigmoid'))
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print model1.summary()
#model2
model2 = Sequential()
model2.add(LSTM(10, batch_input_shape=(1, 1, 1), return_sequences=False, stateful=True))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print model2.summary()
#model3
model3 = Sequential()
model3.add(TimeDistributed(Dense(X_train.shape[1]), input_shape=(X_train.shape[1],1)))
model3.add(LSTM(10, return_sequences=False))
model3.add(Dense(1, activation='sigmoid'))
model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print model3.summary()
#model4
model4 = Sequential()
model4.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model4.add(LSTM(10))
model4.add(Dense(1, activation='sigmoid'))
model4.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print model4.summary()
#model5
model5 = Sequential()
model5.add(Embedding(X_train.shape[1], 128, input_length=max_len))
model5.add(Bidirectional(LSTM(10)))
model5.add(Dense(1, activation='sigmoid'))
model5.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print model5.summary()

So:
First network is the best one for classification. It's simply analysing the whole sequence - and once all input steps are fed to a model - it's able to perform a decision. There are other variants of this architecture (using e.g. GlobalAveragePooling1D or max one) which are pretty similiar from a conceptual point of view.
Second network - from a design point of view is quite similar to a first architecture. What differs them is the fact that in a first approach two consequent fit and predict calls are totally independent, whereas here - the starting state for second call is the same to the last one in a first. This enables a lot of cool scenarios like e.g. varying length sequences analysis or e.g. decision making processes thanks to the fact that you could effecitively stop inference / training process - affect network or input and come back to it with actualized state.
Is the best one when you don't want to use recurrent network at all stages of your computations. Especially - when your network is big - introducing a recurrent layers is quite costly from a parameter number point of view (introducing a recurrent connection usually increases the number of parameter by a factor of at least 2). So you could apply a static network as a preprocessing stage - and then you feed results to a recurrent part. This makes training easier.
Model is a special case of case 3. Here - you have a sequence of tokens which are coded by a one-hot encoding and then transformed using Embedding. This makes the process less memory consuming.
Bidrectional network provides you an advantage of knowing at each step not only a sequence previous history - but also further steps. This is at computational cost and also you are losing the possibilty of a sequential data feed - as you need to have a full sequence when analysis is performed.

Related

Keras model accuracy not improving

I'm trying to train a neural network to predict the ratings for players in FIFA 18 by easports (ratings are between 64-99). I'm using their players database (https://easports.com/fifa/ultimate-team/api/fut/item?page=1) and I've processed the data into training_x, testing_x, training_y, testing_y. Each of the training samples is a numpy array containing 7 values...the first 6 are the different stats of the player (shooting, passing, dribbling, etc) and the last value is the position of the player (which I mapped between 1-8, depending on the position), and each of the testing values is a single integer between 64-99, representing the rating of that player.
I've tried many different hyperparameters, including changing the activation functions to tanh and relu, and I've tried adding a batch normalization layer after the first dense layer (I thought that it might be useful since one of my features is very small and the other features are between 50-99), I've played around with the SGD optimizer (changed the learning rate, momentum, even tried changing the optimizer to Adam), tried different loss functions, added/removed dropout layers, and tried different regularizers for the weights of the model.
model = Sequential()
model.add(Dense(64, input_shape=(7,),
kernel_regularizer=regularizers.l2(0.01)))
//batch normalization?
model.add(Activation('sigmoid'))
model.add(Dense(64, kernel_regularizer=regularizers.l2(0.01),
activation='sigmoid'))
model.add(Dropout(0.3))
model.add(Dense(32, kernel_regularizer=regularizers.l2(0.01),
activation='sigmoid'))
model.add(Dense(1, activation='linear'))
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_absolute_error', metrics=['accuracy'],
optimizer=sgd)
model.fit(training_x, training_y, epochs=50, batch_size=128, shuffle=True)
When I train the model, the loss is always nan and the accuracy is always 0, even though I've tried adjusting a lot of different parameters. However, if I remove the last feature from my data, the position of the players, and update the input shape of the first dense layer, the model actually "trains" and ends up with around 6% accuracy no matter what parameters I change. In that case, I've found that the model only predicts 79 to be the player's rating. What am I doing inherently wrong?
You can try the following steps :
Use mean squared error loss function.
Use Adam which will help you converge faster with low learning rate like 0.0001 or 0.001. Otherwise, try using the RMSprop optimizer.
Use the default regularizers. That is none actually.
Since this is a regression task, use activation function like ReLU in all the layers except the output layer ( including the input layer ). Use linear activation in output layer.
As mentioned in the comments by #pooyan , normalize the features. See here. Even try standardizing the features. Use whichever suites the best.

Binary Classification vs. Multi Class Classification

I have a machine learning classification problem with 3 possible classes (Class A, Class b and Class C). Please let me know which one would be better approach?
- Split the problem into 2 binary classification: First Identify whether it is Class A or Class 'Not A'. Then if it is Class 'Not A', then another binary classification to classify into Class B or Class C
Binary classification may at the end use sigmoid function (goes smooth from 0 to 1). This is how we will know how to classify two values.
from keras.layers import Dense
model.add(Dense(1, input_dim=8, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
For multi class classification you would typically use softmax at the very last layer, and the number of neurons in the next example will be 10, means 10 choices.
from keras.layers import Dropout
model.add(Dense(512,activation='relu',input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(10, activation='softmax'))
However, you can also use softmax with 2 neurons in the last layer for the binary classification as well:
model.add(Dense(2, activation='softmax'))
Hope this provides little intuition on classifiers.
What you describe is one method used for Multi Class Classification.
It is called One vs. All / One vs. Rest.
The best way is to chose a good classifier framework with both options and choose the better one using Cross Validation process.

Is it okay to use STATEFUL Recurrent NN (LSTM) for classification

I have a dataset C of 50,000 (binary) samples each of 128 features. The class label is also binary either 1 or -1. For instance, a sample would look like this [1,0,0,0,1,0, .... , 0,1] [-1]. My goal is to classify the samples based on the binary classes( i.e., 1 or -1). I thought to try using Recurrent LSTM to generate a good model for classification. To do so, I have written the following code using Keras library:
tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
batch_size = 200
print('>>> Build STATEFUL model...')
model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, C.shape[1], C.shape[2]), return_sequences=False, stateful=True))
model.add(Dense(1, activation='softmax'))
print('>>> Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(tr_C, tr_r,
batch_size=batch_size, epochs=1, shuffle=True,
validation_data=(ts_C, ts_r))
However, I am getting bad accuracy, not more than 55%. I tried to change the activation function along with the loss function hoping to improve the accuracy but nothing works. Surprisingly, when I use Multilayer Perceptron, I get very good accuracy around 97%. Thus, I start questioning if LSTM can be used for classification or maybe my code here has something missing or it is wrong. Kindly, I want to know if the code has something missing or wrong to improve the accuracy. Any help or suggestion is appreciated.
You cannot use softmax as an output when you have only a single output unit as it will always output you a constant value of 1. You need to either change output activation to sigmoid or set output units number to 2 and loss to categorical_crossentropy. I would advise the first option.

How to calculate prediction uncertainty using Keras?

I would like to calculate NN model certainty/confidence (see What my deep model doesn't know) - when NN tells me an image represents "8", I would like to know how certain it is. Is my model 99% certain it is "8" or is it 51% it is "8", but it could also be "6"? Some digits are quite ambiguous and I would like to know for which images the model is just "flipping a coin".
I have found some theoretical writings about this but I have trouble putting this in code. If I understand correctly, I should evaluate a testing image multiple times while "killing off" different neurons (using dropout) and then...?
Working on MNIST dataset, I am running the following model:
from keras.models import Sequential
from keras.layers import Dense, Activation, Conv2D, Flatten, Dropout
model = Sequential()
model.add(Conv2D(128, kernel_size=(7, 7),
activation='relu',
input_shape=(28, 28, 1,)))
model.add(Dropout(0.20))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(Dropout(0.20))
model.add(Flatten())
model.add(Dense(units=64, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy',
optimizer='sgd',
metrics=['accuracy'])
model.fit(train_data, train_labels, batch_size=100, epochs=30, validation_data=(test_data, test_labels,))
How should I predict with this model so that I get its certainty about predictions too? I would appreciate some practical examples (preferably in Keras, but any will do).
To clarify, I am looking for an example of how to get certainty using the method outlined by Yurin Gal (or an explanation of why some other method yields better results).
If you want to implement dropout approach to measure uncertainty you should do the following:
Implement function which applies dropout also during the test time:
import keras.backend as K
f = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[-1].output])
Use this function as uncertainty predictor e.g. in a following manner:
def predict_with_uncertainty(f, x, n_iter=10):
result = numpy.zeros((n_iter,) + x.shape)
for iter in range(n_iter):
result[iter] = f(x, 1)
prediction = result.mean(axis=0)
uncertainty = result.var(axis=0)
return prediction, uncertainty
Of course you may use any different function to compute uncertainty.
Made a few changes to the top voted answer. Now it works for me.
It's a way to estimate model uncertainty. For other source of uncertainty, I found https://eng.uber.com/neural-networks-uncertainty-estimation/ helpful.
f = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[-1].output])
def predict_with_uncertainty(f, x, n_iter=10):
result = []
for i in range(n_iter):
result.append(f([x, 1]))
result = np.array(result)
prediction = result.mean(axis=0)
uncertainty = result.var(axis=0)
return prediction, uncertainty
Your model uses a softmax activation, so the simplest way to obtain some kind of uncertainty measure is to look at the output softmax probabilities:
probs = model.predict(some input data)[0]
The probs array will then be a 10-element vector of numbers in the [0, 1] range that sum to 1.0, so they can be interpreted as probabilities. For example the probability for digit 7 is just probs[7].
Then with this information you can do some post-processing, typically the predicted class is the one with highest probability, but you can also look at the class with second highest probability, etc.
A simpler way is to set training=True on any dropout layers you want to run during inference as well (essentially tells the layer to operate as if it's always in training mode - so it is always present for both training and inference).
import keras
inputs = keras.Input(shape=(10,))
x = keras.layers.Dense(3)(inputs)
outputs = keras.layers.Dropout(0.5)(x, training=True)
model = keras.Model(inputs, outputs)
Code above is from this issue.

Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?

I'm learning how to create convolutional neural networks using Keras. I'm trying to get a high accuracy for the MNIST dataset.
Apparently categorical_crossentropy is for more than 2 classes and binary_crossentropy is for 2 classes. Since there are 10 digits, I should be using categorical_crossentropy. However, after training and testing dozens of models, binary_crossentropy consistently outperforms categorical_crossentropy significantly.
On Kaggle, I got 99+% accuracy using binary_crossentropy and 10 epochs. Meanwhile, I can't get above 97% using categorical_crossentropy, even using 30 epochs (which isn't much, but I don't have a GPU, so training takes forever).
Here's what my model looks like now:
model = Sequential()
model.add(Convolution2D(100, 5, 5, border_mode='valid', input_shape=(28, 28, 1), init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Convolution2D(100, 3, 3, init='glorot_uniform', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(100, init='glorot_uniform', activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(10, init='glorot_uniform', activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=['accuracy'])
Short answer: it is not.
To see that, simply try to calculate the accuracy "by hand", and you will see that it is different from the one reported by Keras with the model.evaluate method:
# Keras reported accuracy:
score = model.evaluate(x_test, y_test, verbose=0)
score[1]
# 0.99794011611938471
# Actual accuracy calculated manually:
import numpy as np
y_pred = model.predict(x_test)
acc = sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000
acc
# 0.98999999999999999
The reason it seems to be so is a rather subtle issue at how Keras actually guesses which accuracy to use, depending on the loss function you have selected, when you include simply metrics=['accuracy'] in your model compilation.
If you check the source code, Keras does not define a single accuracy metric, but several different ones, among them binary_accuracy and categorical_accuracy. What happens under the hood is that, since you have selected binary cross entropy as your loss function and have not specified a particular accuracy metric, Keras (wrongly...) infers that you are interested in the binary_accuracy, and this is what it returns.
To avoid that, i.e. to use indeed binary cross entropy as your loss function (nothing wrong with this, in principle) while still getting the categorical accuracy required by the problem at hand (i.e. MNIST classification), you should ask explicitly for categorical_accuracy in the model compilation as follows:
from keras.metrics import categorical_accuracy
model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[categorical_accuracy])
And after training, scoring, and predicting the test set as I show above, the two metrics now are the same, as they should be:
sum([np.argmax(y_test[i])==np.argmax(y_pred[i]) for i in range(10000)])/10000 == score[1]
# True
(HT to this great answer to a similar problem, which helped me understand the issue...)
UPDATE: After my post, I discovered that this issue had already been identified in this answer.
First of all, binary_crossentropy is not when there are two classes.
The "binary" name is because it is adapted for binary output, and each number of the softmax is aimed at being 0 or 1.
Here, it checks for each number of the output.
It doesn't explain your result, since categorical_entropy exploits the fact that it is a classification problem.
Are you sure that when you read your data there is one and only one class per sample? It's the only one explanation I can give.

Resources