low loss with low accuracy in deep neural network - machine-learning

I recently built a model for POS tagging. I tried an LSTM model and it works well, but I also want to add a CNN layer that re-encodes each word's original vector. The main problem is the variable sequence length: with an RNN it can be handled by a masking layer, but the CNN does not support masking. I still zero-pad each original sequence to MAXLEN and use that as the CNN input, on the assumption that the CNN outputs for the padded positions are still mostly zero and can be handled by the masking layer afterwards.
However, the result is much worse, with low loss and low accuracy (0.342, 0.298) compared with the plain LSTM (0.478, 0.871). What is the main reason for this? How can I solve the variable-length problem?
input_seq = Input(shape=(None, input_dim))
# conv, ReLU
conv_out = Conv1D(
    filters=200,
    kernel_size=3,
    padding='same',
    activation='relu',
    use_bias=True)(input_seq)
# zero pad 2 at head
pad_out = ZeroPadding1D(padding=(2, 0))(conv_out)
# max_pool
pool_out = MaxPool1D(pool_size=3, strides=1, padding='valid')(pad_out)
# masking
mask_out = Masking(mask_value=0.0)(pool_out)
# LSTM
lstm_out = LSTM(units=hidden_unit, return_sequences=True)(mask_out)
# drop_out
drop_out = Dropout(drop_out_rate)(lstm_out)
# softmax
output_seq = TimeDistributed(Dense(output_dim, activation="softmax"))(drop_out)
# compile
model = Model(inputs=input_seq, outputs=output_seq)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
The padded arrays have shapes x: (Samples, MAXLEN, 200) and y: (Samples, MAXLEN, 42); I zero-pad every sequence in both x and y.
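One thing that may be worth checking, assuming the padded timesteps end up unmasked (for example because the convolution output at padded positions is not exactly zero): an all-zero target row contributes zero cross-entropy, but an argmax-based accuracy treats it as class 0. Averaging over such rows can pull both the loss and the accuracy down. The toy numpy sketch below uses made-up numbers and hypothetical helper names; it is not the Keras implementation, only an illustration of the arithmetic.

```python
import numpy as np

def per_step_loss(y_true, y_pred):
    # Categorical cross-entropy per timestep: -sum_i y_i * log(p_i)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

def per_step_hit(y_true, y_pred):
    # Argmax comparison; an all-zero target row has argmax 0
    return y_true.argmax(axis=-1) == y_pred.argmax(axis=-1)

# One sequence with 2 real timesteps followed by 2 zero-padded ones, 3 tags
y_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float)
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.2, 0.1, 0.7],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.3, 0.4]])

loss = per_step_loss(y_true, y_pred)
hit = per_step_hit(y_true, y_pred)
real = y_true.sum(axis=-1) > 0  # timesteps that carry a real label

print("all timesteps : loss", loss.mean(), "acc", hit.mean())
print("real timesteps: loss", loss[real].mean(), "acc", hit[real].mean())
```

If this is what happens in the real model, comparing the metrics on real timesteps only (as in the last line) against the reported ones should show the gap.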

Related

How to improve accuracy with keras multi class classification?

I am trying to do multi-class classification with tf.keras. I have 20 labels in total and 63952 data points, and I have tried the following code:
features = features.astype(float)
labels = df_test["label"].values
encoder = LabelEncoder()
encoder.fit(labels)
encoded_Y = encoder.transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
Then
def baseline_model():
    model = Sequential()
    model.add(Dense(50, input_dim=3, activation='relu'))
    model.add(Dense(40, activation='softmax'))
    model.add(Dense(30, activation='softmax'))
    model.add(Dense(20, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
finally
history = model.fit(data, dummy_y,
                    epochs=5000,
                    batch_size=50,
                    validation_split=0.3,
                    shuffle=True,
                    callbacks=[ch]).history
I get very poor accuracy with this. How can I improve it?
softmax activations in the intermediate layers do not make any sense at all. Change all of them to relu and keep softmax only in the last layer.
Having done that, and should you still be getting unsatisfactory accuracy, experiment with different architectures (different numbers of layers and nodes) with a short number of epochs (say ~ 50), in order to get a feeling of how your model behaves, before going for a full fit with your 5,000 epochs.
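As a minimal sketch of that first suggestion, the question's model could look as follows with relu in the hidden layers and softmax only at the output. The layer sizes are taken from the question; the parameter names n_features and n_classes and the use of keras.Input are my own choices, not the asker's code.

```python
from tensorflow import keras
from tensorflow.keras import layers

def baseline_model(n_features=3, n_classes=20):
    # Hidden layers use relu; softmax appears only in the output layer
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(50, activation='relu'),
        layers.Dense(40, activation='relu'),
        layers.Dense(30, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model

model = baseline_model()
model.summary()
```

A short run (around 50 epochs, as suggested above) is usually enough to see whether the change helps before committing to a long fit.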
You did not give us some vital information, but here are some guidelines:
1. Reduce the number of Dense layers - you have a complicated model for a small amount of data (63k is somewhat small), so you might be overfitting your training data.
2. Did you check that the test set has the same distribution as your training set?
3. Avoid using softmax in the middle Dense layers - softmax should be used only in the final layer; use sigmoid or relu instead.
4. Plot the loss as a function of epoch and check whether it decreases - you can then tell whether your learning rate is too high or too low.

Validation accuracy fluctuating while training accuracy increase?

I have a multiclassification problem that depends on historical data. I am trying LSTM using loss='sparse_categorical_crossentropy'. The train accuracy and loss increase and decrease respectively. But, my test accuracy starts to fluctuate wildly.
What am I doing wrong?
Input data:
X = np.reshape(X, (X.shape[0], X.shape[1], 1))
X.shape
(200146, 13, 1)
My model
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=False, random_state=seed)
cvscores = []
for train, test in kfold.split(X, y):
    regressor = Sequential()
    # units = the number of LSTM units that we want in this first layer -> we want very high dimensionality, so we need a high number
    # return_sequences = True because we are adding another layer after this
    # input shape = the last two dimensions and the indicator
    regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X[train].shape[1], 1)))
    regressor.add(Dropout(0.2))
    # Extra LSTM layer
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 3rd
    regressor.add(LSTM(units=50, return_sequences=True))
    regressor.add(Dropout(0.2))
    # 4th
    regressor.add(LSTM(units=50))
    regressor.add(Dropout(0.2))
    # output layer
    regressor.add(Dense(4, activation='softmax', kernel_regularizer=regularizers.l2(0.001)))
    # Compile the RNN
    regressor.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # Set callback functions to early stop training and save the best model so far
    callbacks = [EarlyStopping(monitor='val_loss', patience=9),
                 ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
    history = regressor.fit(X[train], y[train], epochs=250, callbacks=callbacks,
                            validation_data=(X[test], y[test]))
    # plot train and validation loss
    pyplot.plot(history.history['loss'])
    pyplot.plot(history.history['val_loss'])
    pyplot.title('model train vs validation loss')
    pyplot.ylabel('loss')
    pyplot.xlabel('epoch')
    pyplot.legend(['train', 'validation'], loc='upper right')
    pyplot.show()
    # evaluate the model
    scores = regressor.evaluate(X[test], y[test], verbose=0)
    print("%s: %.2f%%" % (regressor.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
Results:
[Images omitted: "trainingmodel", "Plot"]
What you are describing here is overfitting. This means your model keeps learning about your training data and doesn't generalize; in other words, it is learning the exact features of your training set. This is one of the main problems you have to deal with in deep learning. There is no single solution per se: you have to try out different architectures, different hyperparameters and so on.
You can try with a small model that underfits (that is the train acc and validation are at low percentage) and keep increasing your model until it overfits. Then you can play around with the optimizer and other hyperparameters.
By smaller model I mean one with fewer hidden units or fewer layers.
You seem to have too many LSTM layers stacked on top of each other, which eventually leads to overfitting. You should probably decrease the number of layers.
Your model seems to be overfitting, since the training error keeps on reducing while validation error fails to. Overall, it fails to generalize.
You should try reducing the model complexity by removing some of the LSTM layers. Also, try varying the batch size; a larger batch size usually reduces the fluctuations in the loss.
You can also consider varying the learning rate.
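A minimal sketch of what a reduced model could look like, assuming the input shape (13, 1) and 4 classes from the question. The function name, the number of units, the learning rate and the batch size mentioned in the comment are placeholder values to experiment with, not recommendations.

```python
from tensorflow import keras
from tensorflow.keras import layers

def small_lstm_model(timesteps=13, n_features=1, n_classes=4,
                     units=16, learning_rate=1e-3):
    # A single, narrow LSTM layer instead of four stacked ones
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        layers.LSTM(units),
        layers.Dropout(0.2),
        layers.Dense(n_classes, activation='softmax'),
    ])
    # The learning rate is set explicitly so that it can be varied
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

model = small_lstm_model()
# When fitting, batch_size is the other knob mentioned above, e.g.
# model.fit(X_train, y_train, batch_size=256, validation_data=(X_val, y_val))
# where X_train, y_train, X_val and y_val are hypothetical array names.
```

Starting from a model like this and growing it until validation loss stops improving follows the "start small" advice given earlier in this thread.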

Weight initialization in neural networks

Hi I am developing a neural network model using keras.
code
def base_model():
    # Initialising the ANN
    regressor = Sequential()
    # Adding the input layer and the first hidden layer
    regressor.add(Dense(units=4, kernel_initializer='he_normal', activation='relu', input_dim=7))
    # Adding the second hidden layer
    regressor.add(Dense(units=2, kernel_initializer='he_normal', activation='relu'))
    # Adding the output layer
    regressor.add(Dense(units=1, kernel_initializer='he_normal'))
    # Compiling the ANN
    regressor.compile(optimizer='adam', loss='mse', metrics=['mae'])
    return regressor
I have been reading about which kernel_initializer to use and came across this link: https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
It talks about Glorot and He initializations. I have tried different weight initializations, but all of them give the same results. I want to understand how important it is to do a proper initialization.
Thanks
I'll try to explain how important weight initialisation is.
Let's suppose our NN has an input layer with 1000 neurons, and suppose we initialise the weights as normally distributed with mean 0 and variance 1 ($w \sim \mathcal{N}(0, 1)$).
At the second layer, we assume that only 500 of the first layer's neurons are activated ($x_j = 1$), while the other 500 are not ($x_j = 0$).
The input $z$ of a second-layer neuron will be the sum
$z = \sum_j w_j x_j$,
so it will also be normally distributed, but with variance $\sigma_z^2 = 500$ (standard deviation $\approx 22.4$).
This means that most of the time $z \gg 1$ or $z \ll -1$, so neurons with sigmoid-like activations will saturate. The network will learn very slowly, if at all.
A solution is to initialise the weights as $w \sim \mathcal{N}(0, 1/n_{in})$, where $n_{in}$ is the number of inputs coming from the first layer (1000 here). In this way $z$ will have variance $500/1000 = 1/2$ and so be much less spread out; consequently neurons are less prone to saturate.
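The effect can be checked numerically. The sketch below simulates the setting described above (1000 inputs, 500 of them active, no bias term) and compares the spread of the pre-activation under unit-variance weights and under weights scaled by the number of inputs. The variable and function names are my own, and leaving the bias out is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000       # neurons in the input layer
n_active = 500    # inputs equal to 1; the rest are 0
n_trials = 10_000

def simulate_z_std(weight_std):
    # Only the active inputs (x_j = 1) contribute to z = sum_j w_j * x_j
    w = rng.normal(0.0, weight_std, size=(n_trials, n_active))
    z = w.sum(axis=1)
    return z.std()

std_unit = simulate_z_std(1.0)                     # w ~ N(0, 1)
std_scaled = simulate_z_std(1.0 / np.sqrt(n_in))   # w ~ N(0, 1/n_in)

print("std of z, unit-variance weights:", std_unit)
print("std of z, scaled weights:       ", std_scaled)
```

The estimates should land close to the theoretical values sqrt(500) and sqrt(1/2), up to sampling noise.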
This trick helps as a start, but in deep neural networks with several hidden layers the weight initialisation has to be done properly at each layer. Another option is to use batch normalization.
Besides this, from your code I can see you've chosen MSE as the cost function, which is a quadratic cost. I don't know whether your problem is a classification one, but if it is, I suggest using a cross-entropy cost function instead, which should speed up the learning of your network.

Number of neurons in input layer for Feedforward neural network

I'm trying to classify 1D data with 3-layered feedforward neural network (multilayer perceptron).
Currently I have input samples (time series) consisting of 50 data points each. I've read in many sources that the number of neurons in the input layer should equal the number of data points (50 in my case). However, after experimenting with cross-validation a bit, I've found that I get slightly better average classification performance (with lower variation as well) with 25 neurons in the input layer.
I'm trying to understand the math behind it: does it make any sense to have fewer neurons than data points in the input layer? Or are the results perhaps better just because of some error?
Also, are there any other rules for setting the number of neurons in the input layer?
Update - to clarify what I mean:
I use Keras with the TensorFlow backend for this. My model looks like this:
model = Sequential()
model.add(Dense(25, input_dim=50, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(input_data, output_data, epochs=150, batch_size=10)
predictions = model.predict(X)
rounded = [round(x[0]) for x in predictions]
print(rounded)
input_data and output_data are numpy arrays, with my data points in the former and the corresponding value of 1 or 0 in the latter.
25 is the number of neurons in the first layer and input_dim is the number of my data points, so technically it works, yet I'm not sure whether it makes sense to do so or whether I have misunderstood the concept of neurons in the input layer and what they do.
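For reference, a layer declared as Dense(25, input_dim=50) can be pictured as a 50 x 25 weight matrix plus 25 biases: all 50 data points still enter the network, and 25 is the width of the first hidden layer. A minimal numpy sketch of that forward step, with random placeholder weights rather than the trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 50, 25

# Parameters of a dense layer with 50 inputs and 25 units
W = rng.normal(size=(n_inputs, n_hidden))
b = np.zeros(n_hidden)

x = rng.normal(size=(1, n_inputs))   # one sample with 50 data points
h = np.maximum(0.0, x @ W + b)       # relu(x W + b)

print("weights:", W.shape, "hidden activations:", h.shape)
print("trainable parameters in this layer:", W.size + b.size)
```

Under this reading, changing 25 changes the size of the first hidden layer, not how many data points the model sees.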

Keras: model with one input and two outputs, trained jointly on different data (semi-supervised learning)

I would like to code with Keras a neural network that acts both as an autoencoder AND a classifier for semi-supervised learning. Take for example this dataset where there is a few labeled images and a lot of unlabeled images: https://cs.stanford.edu/~acoates/stl10/
Some papers listed here achieved that, or very similar things, successfully.
To sum up: the model would have a single input data shape and shared "encoding" convolutional layers, but would then split into two heads (fork-style), a classification head and a decoding head, so that the unsupervised autoencoder contributes to good learning for the classification head.
With TensorFlow there would be no problem doing that as we have full control over the computational graph.
But with Keras, things are more high-level and I feel that all the calls to ".fit" must always provide all the data at once (so it would force me to tie together the classification head and the autoencoding head into one time-step).
One way in keras to almost do that would be with something that goes like this:
input = Input(shape=(32, 32, 3))
cnn_feature_map = sequential_cnn_trunk(input)
classification_predictions = Dense(10, activation='sigmoid')(cnn_feature_map)
autoencoded_predictions = decode_cnn_head_sequential(cnn_feature_map)
# both heads must be listed as outputs, since fit() below receives two targets
model = Model(inputs=[input], outputs=[classification_predictions, autoencoded_predictions])
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit([images], [labels, images], epochs=10)
However, I think and I fear that if I just want to fit things in that way it will fail and ask for the missing head:
for epoch in range(10):
    # classifications step
    model.fit([images], [labels, None], epochs=1)
    # "semi-unsupervised" autoencoding step
    model.fit([images], [None, images], epochs=1)
    # note: ".train_on_batch" could probably be used rather than ".fit" to avoid doing a whole epoch each time.
How should one implement that behavior with Keras? And could the training be done jointly without having to split the two calls to the ".fit" function?
When you don't have a label for a sample, you can pass a zero vector instead of a one-hot encoded vector. It should not change your result, because a zero target vector produces no error signal under the categorical cross-entropy loss.
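To see why, here is a small numpy sketch of softmax followed by categorical cross-entropy. It is a hand-written illustration with made-up logits, not Keras's own implementation: for an all-zero target both the loss and its gradient with respect to the logits are zero.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, logits):
    # L = -sum_i y_i * log(p_i)
    return -(y_true * np.log(softmax(logits))).sum(axis=-1)

def grad_wrt_logits(y_true, logits):
    # dL/dlogit_k = p_k * sum_i(y_i) - y_k
    p = softmax(logits)
    return p * y_true.sum(axis=-1, keepdims=True) - y_true

logits = np.array([[2.0, -1.0, 0.5]])
labelled = np.array([[0.0, 1.0, 0.0]])   # one-hot target
unlabelled = np.zeros((1, 3))            # zero vector for a missing label

print("labelled  :", categorical_crossentropy(labelled, logits), grad_wrt_logits(labelled, logits))
print("unlabelled:", categorical_crossentropy(unlabelled, logits), grad_wrt_logits(unlabelled, logits))
```

With a one-hot target the gradient reduces to the familiar p - y; with the zero vector every term vanishes, so the sample does not move the classification head.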
My custom to_categorical function looks like this:
def tricky_to_categorical(y, translator_dict):
    encoded = np.zeros((y.shape[0], len(translator_dict)))
    for i in range(y.shape[0]):
        if y[i] in translator_dict:
            encoded[i][translator_dict[y[i]]] = 1
    return encoded
Here y contains the labels, and translator_dict is a Python dictionary which maps each label to a unique index, like this:
{'unisex':2, 'female': 1, 'male': 0}
If a label (e.g. UNK) can't be found in this dictionary, its encoded label will be a zero vector.
If you use this trick, you also have to modify your accuracy function to see the real accuracy numbers: you have to filter all zero vectors out of the metric.
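# assumed imports (not shown in the original post)
import tensorflow as tf
from keras import backend as K

def tricky_accuracy(y_true, y_pred):
    # True for samples whose target is not the all-zero vector
    mask = K.not_equal(K.sum(y_true, axis=-1), K.constant(0))  # zero vector mask
    y_true = tf.boolean_mask(y_true, mask)
    y_pred = tf.boolean_mask(y_pred, mask)
    return K.cast(K.equal(K.argmax(y_true, axis=-1), K.argmax(y_pred, axis=-1)), K.floatx())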
Note: you have to use larger batches (e.g. 32) so that no batch consists only of zero-vector labels, because such a batch can make your accuracy metric go crazy. I don't know why.
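The same masking idea can be sketched in plain numpy, which makes it easy to test outside of Keras. The function name and the choice to return NaN for a batch without any labelled sample are my own assumptions, not part of the answer's code.

```python
import numpy as np

def masked_accuracy(y_true, y_pred):
    # Keep only rows whose target is not the all-zero vector
    mask = y_true.sum(axis=-1) != 0
    if not mask.any():
        # No labelled sample in this batch: accuracy is undefined
        return float('nan')
    hits = y_true[mask].argmax(axis=-1) == y_pred[mask].argmax(axis=-1)
    return float(hits.mean())

y_true = np.array([[1, 0, 0],    # labelled
                   [0, 0, 0],    # unlabelled
                   [0, 1, 0],    # labelled
                   [0, 0, 0]],   # unlabelled
                  dtype=float)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1],
                   [0.2, 0.2, 0.6]])

print(masked_accuracy(y_true, y_pred))               # labelled rows only
print(masked_accuracy(np.zeros_like(y_true), y_pred))  # batch without labels
```

Only the two labelled rows are scored here; the fully unlabelled batch has nothing to average over, which is one reason to make sure every batch contains at least some labelled samples.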
Alternative solution
Use Pseudo Labeling :)
You can train jointly; you have to pass a list of label arrays (one per output) instead of a single label array.
I used fit_generator, e.g.
def batch_generator():
    batch_x = np.empty((batch_size, img_height, img_width, 3))
    gender_label_batch = np.empty((batch_size, len(gender_dict)))
    category_label_batch = np.empty((batch_size, len(category_dict)))
    while True:
        i = 0
        for idx in np.random.choice(len(dataset), batch_size):
            image_id = dataset[idx][0]
            batch_x[i] = load_and_convert_image(image_id)
            gender_label_batch[i] = gender_labels[idx]
            category_label_batch[i] = category_labels[idx]
            i += 1
        # one input batch and a list with one label batch per output
        yield batch_x, [gender_label_batch, category_label_batch]

model.fit_generator(
    batch_generator(),
    steps_per_epoch=len(dataset) // batch_size,  # integer number of steps
    epochs=epochs)
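For completeness, here is a small self-contained sketch of a model with one input and two heads that is trained jointly in a single fit call by passing a list of targets. The layer sizes, the layer names and the random data are all made up for illustration; a real setup would use the convolutional trunk and the image data from the question.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Shared trunk followed by two heads (toy dense layers instead of a CNN)
inputs = keras.Input(shape=(8,))
shared = layers.Dense(16, activation='relu')(inputs)
class_out = layers.Dense(3, activation='softmax', name='classifier')(shared)
recon_out = layers.Dense(8, name='decoder')(shared)

model = keras.Model(inputs=inputs, outputs=[class_out, recon_out])
# One loss per output, in the same order as the outputs
model.compile(optimizer='rmsprop', loss=['categorical_crossentropy', 'mse'])

# Random placeholder data: 32 samples, one-hot labels for 3 classes
x = np.random.rand(32, 8).astype('float32')
y_class = np.eye(3, dtype='float32')[np.random.randint(0, 3, size=32)]

# Targets are passed as a list: labels for the classifier, inputs for the decoder
history = model.fit(x, [y_class, x], epochs=1, batch_size=8, verbose=0)
print(history.history['loss'])
```

Combined with the zero-vector trick from the earlier answer, unlabelled samples could go through the same call: they would contribute to the reconstruction loss only.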
