Keras: how to use dropout at train and test phase? - machine-learning

Is it possible to use dropout at train and test phase in Keras?
Like described here:
https://github.com/soumith/ganhacks#17-use-dropouts-in-g-in-both-train-and-test-phase

Sure, you can set the training argument to True when calling the Dropout layer. That way, dropout is applied in both the training and test phases:
drp_output = Dropout(rate)(inputs, training=True) # dropout would be active in train and test phases

Both answers leave me slightly confused. More simply, you may find yourself doing something like this:
model = Sequential()
...
model.add(Dropout(0.5))
...
model.fit(...) # invokes Dropout(training=True)
...
model.evaluate(...) # invokes Dropout(training=False)
That is, when you define your model, you add Dropout layers with the dropout rate you want during training. The rate is not visibly varied between test and training; rather, it is declared once as a fixed value, then (invisibly) switched on/off according to the training parameter the layer is invoked with. See keras.Model.
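To make the on/off switch concrete, here is a minimal framework-free sketch of a dropout function with a training flag. It is not the Keras implementation; the function name dropout and its rng parameter are assumptions made for illustration. It uses the common "inverted dropout" convention, where surviving units are rescaled during training so that nothing needs to change at inference time.

```python
import random

def dropout(values, rate, training, rng=random):
    """Inverted dropout on a list of floats (illustrative, not Keras code)."""
    if not training:
        # Inference mode: the layer passes its input through unchanged.
        return list(values)
    keep = 1.0 - rate
    # Training mode: zero each unit with probability `rate` and rescale the
    # survivors so the expected value of each output matches its input.
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)  # seeded only to make the run repeatable
x = [1.0, 2.0, 3.0, 4.0]

eval_out = dropout(x, rate=0.5, training=False, rng=rng)
train_out = dropout(x, rate=0.5, training=True, rng=rng)

print("training=False:", eval_out)   # identical to x
print("training=True: ", train_out)  # each entry is either 0.0 or 2 * x[i]
```

Calling the function with training=True at prediction time is the plain-Python analogue of passing training=True when calling the Keras layer, as in the first answer.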

Related

What is the difference between these backward training methods in Pytorch?

I have been learning DL for about three months and am doing small NLP projects with PyTorch.
Recently I have been trying to reproduce a GAN introduced in a paper, using my own text data, to generate some specific kinds of question sentences.
Here is some background. If you are short on time, feel free to skip straight to the question below.
As the paper describes, the generator is first trained normally on ordinary question data so that its output at least looks like a real question. Then, using an auxiliary classifier's result (from classifying the outputs), the generator is trained again to generate only questions from several specific categories.
However, as the paper does not release its code, I have to write it all myself. I have three candidate training schemes, but I do not know how they differ. Could you kindly explain the differences?
If they have almost the same effect, could you tell me which is more idiomatic in PyTorch? Thank you very much!
Suppose the discriminator's loss for the generator is loss_G_D, the classifier's loss for the generator is loss_G_C, and loss_G_D and loss_G_C have the same shape, i.e. [batch_size, loss value]. What is the difference between the following?
1.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_C = loss_func2(classifier(generated_data))
loss = loss_G_D + loss_G_C
loss.backward()
optimizer.step()
2.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
3.
optimizer.zero_grad()
loss_G_D = loss_func1(discriminator(generated_data))
loss_G_D.backward()
optimizer.step()
optimizer.zero_grad()
loss_G_C = loss_func2(classifier(generated_data))
loss_G_C.backward()
optimizer.step()
Additional info: I observed that the classifier's classification loss is always very big compared with generator's loss, like -300 vs 3. So maybe the third one is better?
First of all:
loss.backward() backpropagates the error and assigns a gradient for every parameter along the way that has requires_grad=True.
optimizer.step() updates the model parameters using their stored gradients
optimizer.zero_grad() sets the gradients to 0, so that you can backpropagate your loss and update your model parameters for each batch without interfering with other batches.
1 and 2 are quite similar, but if your model uses batch statistics or you have an adaptive optimizer they will probably perform differently. However, for instance, if your model doesn't use batch statistics and you have a plain old SGD optimizer, they will produce the same result, even though 1 would be faster since you do the backprop only once.
3 is a completely different case, since you update your model parameters with loss_G_D.backward() and optimizer.step() before processing and backpropagating loss_G_C.
Given all of these, it's up to you which one to choose depending on your application.
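The difference between the three options can be illustrated without any framework. The sketch below assumes a single scalar parameter, plain SGD, and two made-up quadratic losses whose gradients are written by hand; none of these names or values come from the question. It mimics the fact that successive backward() calls add into the same gradient buffer until zero_grad() is called.

```python
def grad_d(w):
    # Gradient of the hypothetical loss L_D(w) = (w - 1)^2
    return 2.0 * (w - 1.0)

def grad_c(w):
    # Gradient of the hypothetical loss L_C(w) = 3 * (w + 2)^2
    return 6.0 * (w + 2.0)

def grad_sum(w):
    # Gradient of L_D + L_C, as one backward pass on the summed loss would give
    return 2.0 * (w - 1.0) + 6.0 * (w + 2.0)

LR = 0.1   # learning rate for plain SGD
W0 = 0.5   # initial parameter value

# Option 1: one backward pass on the summed loss, then one step.
w1 = W0 - LR * grad_sum(W0)

# Option 2: two backward passes accumulate into the same buffer, then one step.
buf = 0.0
buf += grad_d(W0)
buf += grad_c(W0)
w2 = W0 - LR * buf

# Option 3: step after each loss, so the second gradient is evaluated
# at the already-updated parameter.
w3 = W0 - LR * grad_d(W0)
w3 = w3 - LR * grad_c(w3)

print(w1, w2, w3)
```

Under these assumptions options 1 and 2 land on the same parameter value, while option 3 does not. In actual PyTorch code, option 2 would typically also need retain_graph=True on the first backward() call if both losses share the part of the graph that produced generated_data.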

Keras Conv1D on ECG Signal

I am trying to classify different ECG signals. I am using Keras' Conv1D, but am not getting any good results.
I have tried changing the number of layers, window size, etc, but every time I run this I get predictions all of the same class (the classes are 0,1,2, so I get a prediction output of something like [1,1,1,1,1,1,1,1,1,1,1,1,1,1], but the class changes each time I run the script).
The ECG signals are in 1000 point numpy arrays.
Are there any glaringly obvious things I am doing wrong here? I was thinking it would've worked great to use a few layers to just classify into 3 different ECG signals.
#arrange and randomize data
y1=[[0]]*len(lead1)
y2=[[1]]*len(lead2)
y3=[[2]]*len(lead3)
y=np.concatenate((y1,y2,y3))
data=np.concatenate((lead1,lead2,lead3))
data = keras.utils.normalize(data)
data=np.concatenate((data,y),axis=1)
data=np.random.permutation((data))
print(data)
#separate data and create categories
Xtrain=data[0:130,0:-1]
Xtrain=np.reshape(Xtrain,(len(Xtrain),1000,1))
Xpred=data[130:,0:-1]
Xpred=np.reshape(Xpred,(len(Xpred),1000,1))
Ytrain=data[0:130,-1]
Yt=to_categorical(Ytrain)
Ypred=data[130:,-1]
Yp=to_categorical(Ypred)
#create CNN model
model = Sequential()
model.add(Conv1D(20,20,activation='relu',input_shape=(1000,1)))
model.add(MaxPooling1D(3))
model.add(Conv1D(20,10,activation='relu'))
model.add(MaxPooling1D(3))
model.add(Conv1D(20,10,activation='relu'))
model.add(GlobalAveragePooling1D())
model.add(Dense(3,activation='relu',use_bias=False))
model.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(Xtrain,Yt)
#test model
print(model.evaluate(Xpred,Yp))
print(model.predict_classes(Xpred,verbose=1))
Are there any glaringly obvious things I am doing wrong here?
Indeed there is: the output you report is not surprising, given that you are currently using the ReLU as activation for your last layer, which does not make any sense.
In multi-class settings, such as yours, the activation of the last layer must be the softmax, and certainly not the ReLU; change your last layer to:
model.add(Dense(3, activation='softmax'))
Not quite sure why you ask for use_bias=False, but you can try both with and without it and experiment...
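To see why ReLU is a poor fit for the last layer, here is a small plain-Python comparison of the two activations on the same pre-activation values. The logits are made up for illustration, and this is only a sketch of one plausible failure mode, not a diagnosis of the exact run in the question.

```python
import math

def relu(logits):
    # Element-wise max(0, z); the outputs are not normalized in any way.
    return [max(0.0, z) for z in logits]

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, -0.3, -2.5]  # hypothetical pre-activation outputs for 3 classes

relu_out = relu(logits)
soft_out = softmax(logits)

print("relu:   ", relu_out)   # all zeros, so every class ties
print("softmax:", soft_out)   # positive values that sum to 1
print("argmax under softmax:", soft_out.index(max(soft_out)))
```

When every pre-activation is negative, ReLU outputs all zeros, the scores carry no ranking information, and an argmax over them returns the same index for every sample. Softmax always yields a proper probability distribution, which is what categorical_crossentropy expects.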

Difference between doing cross-validation and validation_data/validation_split in Keras

First, I split the dataset into train and test, for example:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=999)
I then use GridSearchCV with cross-validation to find the best performing model:
validator = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="accuracy", cv=cv)
And by doing this, I have:
A model is trained using k-1 of the folds as training data; the resulting
model is validated on the remaining part of the data (scikit-learn.org)
But then, when reading about the Keras fit function, the documentation introduces 2 more terms:
validation_split: Float between 0 and 1. Fraction of the training data
to be used as validation data. The model will set apart this fraction
of the training data, will not train on it, and will evaluate the loss
and any model metrics on this data at the end of each epoch. The
validation data is selected from the last samples in the x and y data
provided, before shuffling.
validation_data: tuple (x_val, y_val) or tuple (x_val, y_val,
val_sample_weights) on which to evaluate the loss and any model
metrics at the end of each epoch. The model will not be trained on
this data. validation_data will override validation_split.
From what I understand, validation_split (which is overridden by validation_data) is used as a fixed validation dataset, whereas the hold-out set in cross-validation changes at each cross-validation step.
First question: is it necessary to use validation_split or validation_data since I already do cross validation?
Second question: if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
grid_result = validator.fit(train_images, train_labels, validation_data=None, validation_split=0)
Question 3: If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Question 4: Does the validation_split belong to k-1 folds or the hold-out fold, or will it be considered as "test set" (like in the case of cross validation) which will never be used to train the model.
Validation is performed to check that the model is not overfitting the training data and will generalize to new data. Since the grid search over parameters already performs validation, there is no need for the Keras model to perform its own validation step during training. Therefore, to answer your questions:
is it necessary to use validation_split or validation_data since I already do cross validation?
No, as I mentioned above.
if it is not necessary, then should I set validation_split and validation_data to 0 and None, respectively?
No, since by default no validation is done in Keras (i.e. by default we have validation_split=0.0, validation_data=None in fit() method).
If I do so, what will happen during the training, would Keras just simply ignore the validation step?
Yes, Keras won't perform the validation when training the model. However note that, as I mentioned above, the grid search procedure would perform validation to better estimate the performance of the model with a specific set of parameters.
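The contrast between a fixed validation split and a rotating cross-validation fold can be sketched with index lists. The code below is neither Keras nor scikit-learn code; holdout_split and kfold_splits are hypothetical helpers, and the k-fold helper assumes the sample count is divisible by k and does no shuffling.

```python
import math

def holdout_split(n_samples, validation_split):
    """Fixed hold-out taken from the end of the data, before any shuffling."""
    split_at = int(math.floor(n_samples * (1.0 - validation_split)))
    idx = list(range(n_samples))
    return idx[:split_at], idx[split_at:]

def kfold_splits(n_samples, k):
    """Yield (train, validation) index lists; the validation fold rotates."""
    idx = list(range(n_samples))
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    for i in range(k):
        val = idx[i * fold_size:(i + 1) * fold_size]
        train = idx[:i * fold_size] + idx[(i + 1) * fold_size:]
        yield train, val

train_idx, val_idx = holdout_split(10, 0.2)
print("hold-out validation indices:", val_idx)

folds = list(kfold_splits(10, 5))
for train, val in folds:
    print("fold validation indices:", val)
```

The hold-out indices stay the same for the whole training run, whereas each cross-validation round validates on a different slice. If validation_split were passed through a grid search, it would presumably be carved out of whatever training folds are handed to fit() in each round, not out of that round's held-out fold.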

Why is ReLU used in regression with Neural Networks?

I am following the official TensorFlow with Keras tutorial and I got stuck here: Predict house prices: regression - Create the model
Why is an activation function used for a task where a continuous value is predicted?
The code is:
def build_model():
    model = keras.Sequential([
        keras.layers.Dense(64, activation=tf.nn.relu,
                           input_shape=(train_data.shape[1],)),
        keras.layers.Dense(64, activation=tf.nn.relu),
        keras.layers.Dense(1)
    ])
    optimizer = tf.train.RMSPropOptimizer(0.001)
    model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
    return model
The general reason for using non-linear activation functions in hidden layers is that, without them, no matter how many layers or how many units per layer, the network would behave just like a simple linear unit. This is nicely explained in this short video by Andrew Ng: Why do you need non-linear activation functions?
In your case, looking more closely, you'll see that the activation function of your final layer is not the relu as in your hidden layers, but the linear one (which is the default activation when you don't specify anything, like here):
keras.layers.Dense(1)
From the Keras docs:
Dense
[...]
Arguments
[...]
activation: Activation function to use (see activations). If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).
which is indeed what is expected for a regression network with a single continuous output.
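The claim that stacked layers without non-linear activations collapse into a single linear unit can be checked numerically. The sketch below uses made-up weights, omits biases for brevity, and relies on small hand-written helpers in place of any library.

```python
def matvec(matrix, vector):
    # Matrix-vector product for lists of lists.
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def matmul(a, b):
    # Matrix-matrix product for lists of lists.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def relu(vector):
    return [max(0.0, v) for v in vector]

W1 = [[1.0, -2.0], [0.5, 1.0]]  # hypothetical hidden-layer weights
W2 = [[2.0, 1.0]]               # hypothetical output-layer weights
x = [1.0, 3.0]

# Two linear layers applied one after the other...
two_linear_layers = matvec(W2, matvec(W1, x))
# ...equal one linear layer whose weight matrix is the product W2 @ W1.
one_linear_layer = matvec(matmul(W2, W1), x)
# With a ReLU in between, the network is no longer a single linear map.
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers, one_linear_layer, with_relu)
```

The first two results coincide for any input, because composing linear maps gives another linear map. The ReLU version differs, which is why the hidden layers keep their activation while the final Dense(1) layer stays linear for regression.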

Is it okay to use STATEFUL Recurrent NN (LSTM) for classification

I have a dataset C of 50,000 (binary) samples each of 128 features. The class label is also binary either 1 or -1. For instance, a sample would look like this [1,0,0,0,1,0, .... , 0,1] [-1]. My goal is to classify the samples based on the binary classes( i.e., 1 or -1). I thought to try using Recurrent LSTM to generate a good model for classification. To do so, I have written the following code using Keras library:
tr_C, ts_C, tr_r, ts_r = train_test_split(C, r, train_size=.8)
batch_size = 200
print('>>> Build STATEFUL model...')
model = Sequential()
model.add(LSTM(128, batch_input_shape=(batch_size, C.shape[1], C.shape[2]), return_sequences=False, stateful=True))
model.add(Dense(1, activation='softmax'))
print('>>> Training...')
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(tr_C, tr_r,
          batch_size=batch_size, epochs=1, shuffle=True,
          validation_data=(ts_C, ts_r))
However, I am getting bad accuracy, not more than 55%. I tried to change the activation function along with the loss function hoping to improve the accuracy but nothing works. Surprisingly, when I use Multilayer Perceptron, I get very good accuracy around 97%. Thus, I start questioning if LSTM can be used for classification or maybe my code here has something missing or it is wrong. Kindly, I want to know if the code has something missing or wrong to improve the accuracy. Any help or suggestion is appreciated.
You cannot use softmax as the output activation when you have only a single output unit, because it will always output a constant value of 1. You need to either change the output activation to sigmoid, or set the number of output units to 2 and the loss to categorical_crossentropy. I would advise the first option.
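A short plain-Python check shows why softmax over a single unit is constant. The logit values are arbitrary, and the functions are hand-written stand-ins for the corresponding activations, not Keras internals.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

single_logits = [-3.0, 0.0, 4.2]  # arbitrary pre-activation values

# Softmax over one unit divides exp(z) by itself, so the result is always 1.
softmax_outputs = [softmax([z])[0] for z in single_logits]
# Sigmoid maps the same values to distinct probabilities in (0, 1).
sigmoid_outputs = [sigmoid(z) for z in single_logits]

print("softmax:", softmax_outputs)
print("sigmoid:", sigmoid_outputs)
```

With a constant output the network cannot separate the two classes, which is consistent with accuracy near chance. Note also that binary_crossentropy with a sigmoid output expects targets in {0, 1}, so labels of -1 would presumably need to be remapped to 0.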
