How do I get a Keras model with just an FC layer to converge?

I have a Keras model with just one FC (Dense) layer. My training images are 227*227 and there are 100 classes, each with 1 training image. I would like to overfit and reach 100% training accuracy.
I tried tuning the model's hyperparameters, but it does not converge to 100% training accuracy, even though it is just a single FC layer.
Here's my code:
X_train, y_train = ...
# Create a Keras Model
model = Sequential()
model.add(Dense(100, input_dim=input_dim, activation='softmax',
# Callback and training
csv_logger = CSVLogger('training_log_v1.csv')
model.fit(X_train, y_train, epochs=10000, batch_size=100, callbacks=[csv_logger])
Here's the plot for the above code.
I have run experiments with different hyperparameters for 10K to 20K epochs. After some epochs the loss stops decreasing and the training accuracy does not improve.
I tried different optimizers (and their hyperparameters) as well as regularization. There aren't many hyperparameters to play with here apart from the optimizer and the regularizers, right?
If anyone can help me get the model to converge, that would be great. Thank you!

I was able to overfit. These are the hyperparameters I used for this experiment:
Class: 100
Samples_per_class: 1
Op: Adam
lr: 0.00001
Epochs set to: 50000
batch_size: 256
I got 99% training accuracy at around 12K epochs, and the loss continued decreasing until around 25K epochs.
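As a rough illustration of why a single softmax layer can memorize one sample per class, here is a minimal NumPy sketch. It is not the setup above: random vectors stand in for the flattened images, `n_features = 256` is an arbitrary assumption, and plain full-batch gradient descent replaces Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes = 100     # one training sample per class, as in the question
n_features = 256    # assumption: small stand-in for a flattened 227*227 image

X = rng.normal(size=(n_classes, n_features))   # random stand-in "images"
y = np.arange(n_classes)                       # sample i belongs to class i
Y = np.eye(n_classes)[y]                       # one-hot labels

# A single Dense layer with softmax activation: logits = X @ W + b
W = np.zeros((n_features, n_classes))
b = np.zeros(n_classes)
lr = 0.1

for step in range(500):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    grad = (probs - Y) / n_classes                 # d(mean cross-entropy)/d(logits)
    W -= lr * X.T @ grad
    b -= lr * grad.sum(axis=0)

train_acc = float(np.mean((X @ W + b).argmax(axis=1) == y))
print("train accuracy:", train_acc)
```

With more features than samples the toy data is linearly separable, so the layer has enough capacity to fit every sample; how many epochs and which learning rate real image data needs is a separate question.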


How to improve accuracy with keras multi class classification?

I am trying to do multi-class classification with tf.keras. I have 20 labels and 63952 samples in total, and I have tried the following code:
features = features.astype(float)
labels = df_test["label"].values
encoder = LabelEncoder()
# fit the encoder on the labels before transforming them
encoded_Y = encoder.fit_transform(labels)
dummy_y = np_utils.to_categorical(encoded_Y)
def baseline_model():
    model = Sequential()
    model.add(Dense(50, input_dim=3, activation='relu'))
    model.add(Dense(40, activation='softmax'))
    model.add(Dense(30, activation='softmax'))
    model.add(Dense(20, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
history = model.fit(data,dummy_y,
I get very poor accuracy with this. How can I improve it?
Softmax activations in the intermediate layers do not make any sense at all. Change all of them to relu and keep softmax only in the last layer.
Having done that, and should you still be getting unsatisfactory accuracy, experiment with different architectures (different numbers of layers and nodes) for a small number of epochs (say ~ 50), in order to get a feeling for how your model behaves, before going for a full fit with your 5,000 epochs.
You did not give us vital information, but here are some guidelines:
1. Reduce the number of Dense layers - you have a complicated model for a small amount of data (63k is somewhat small). You might be overfitting your training data.
2. Did you check that the test set has the same distribution as your training set?
3. Avoid using softmax in the middle Dense layers - softmax should be used in the final layer; use sigmoid or relu instead.
4. Plot the loss as a function of the epoch and check whether it decreases - you can then tell whether your learning rate is too high or too low.
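To see why softmax is a poor fit for hidden layers (point 3), here is a small NumPy sketch. The random pre-activations and the layer width of 40 are illustrative assumptions, not values taken from a trained model.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
# Hypothetical pre-activations of a 40-unit hidden layer for 4 samples.
pre = rng.normal(size=(4, 40))

h_soft = softmax(pre)
h_relu = relu(pre)

# Softmax couples all units: every row is forced to sum to 1,
# so the 40 activations have to share a total "budget" of 1.
print("softmax row sums:", h_soft.sum(axis=1))
print("largest softmax activation:", h_soft.max())
# ReLU treats each unit independently and keeps positive values unchanged.
print("largest relu activation:", h_relu.max())
```

Softmax is meant to turn the final layer's scores into class probabilities; in a hidden layer it squeezes the signal that is passed on to the next layer.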

Val_loss is very high (over 100)

I'm trying to create a neural network for image classification. This is my model summary. I have normalized and shuffled my dataset.
When I run model.fit, the val_loss is very high, sometimes close to 100, whereas my training loss is less than 0.8.
When you don't normalize the test data, the validation loss will be very high compared to the loss on training data that was normalized. I used a simple MNIST model to demonstrate the importance of normalization.
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# this is to demonstrate the importance of normalizing both training and testing data:
# the training data is scaled to [0, 1], the test data is deliberately left unscaled
x_train, x_test = x_train / 255.0, x_test / 1.
When the test data is not normalized while the training data is,
the training loss is 0.0771 whereas the test loss is 13.1599. Please check the complete code here. Thanks!
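The same effect can be reproduced without MNIST. The sketch below is an assumption-laden toy: synthetic "pixel" features in 0..255, 10% label noise, and a hand-written logistic regression in NumPy instead of a Keras model. It only shows the direction of the effect, not the numbers quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_split(n):
    # Synthetic stand-in for image data: 5 "pixel" features in 0..255.
    x = rng.integers(0, 256, size=(n, 5)).astype(np.float64)
    y = (x[:, 0] > x[:, 1]).astype(np.float64)
    flip = rng.random(n) < 0.1          # 10% label noise, so the task is not separable
    y[flip] = 1.0 - y[flip]
    return x, y

def bce_from_logits(z, y):
    # Numerically stable binary cross-entropy: log(1 + e^z) - y*z
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

x_train, y_train = make_split(2000)
x_test, y_test = make_split(500)

# Train logistic regression on inputs scaled to [0, 1].
xs = x_train / 255.0
w, b = np.zeros(5), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(xs @ w + b)))
    w -= 0.5 * xs.T @ (p - y_train) / len(y_train)
    b -= 0.5 * float(np.mean(p - y_train))

loss_scaled = bce_from_logits((x_test / 255.0) @ w + b, y_test)   # same scaling as training
loss_raw = bce_from_logits(x_test @ w + b, y_test)                # test inputs left unscaled
print("test loss, scaled inputs:  ", loss_scaled)
print("test loss, unscaled inputs:", loss_raw)
```

Unscaled inputs inflate the logits by a factor of roughly 255, so every misclassified sample is penalized with an extremely confident wrong prediction.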

What represents the loss or accuracy in training results in Keras

I have two classes with 3 images each. I tried this code in Keras.
trainingDataGenerator = ImageDataGenerator()
trainGenerator = trainingDataGenerator.flow_from_directory(
target_size=(28, 28),
batch_size = 1,
FilterSize = (3,3)
inputShape = (imageWidth, imageHeight,3)
model = Sequential()
model.add (Conv2D(32, FilterSize, input_shape= inputShape))
model.add (Activation('relu'))
model.add ( MaxPooling2D(pool_size=(2,2)))
optimizer = 'rmsprop',
My Output:
When I train this model, I get this output:
Using TensorFlow backend.
Found 2 images belonging to 2 classes.
Epoch 1/1
3/3 [==============================] - 0s - loss: 5.3142 - acc: 0.6667
My Question:
I wonder how it determines the loss and accuracy, and on what basis (i.e. loss: 5.3142 - acc: 0.6667). I have not given it any validation images to validate the model and find the accuracy and loss. Are this loss and accuracy computed against the input images themselves?
In short, can we say something like this: "This model has accuracy of %, and loss of % without validation images"?
The training loss and accuracy are calculated not by comparing to validation data but by comparing your neural network's prediction for sample x with the label y that you provide for that sample in your training set.
You initialize your neural network and (usually) set all weights to a random value with a certain deviation. After that you feed the features of your training dataset into your network, and let it "guess" the outcome aka the label that you have (if you do supervised learning like in your case).
Then your framework compares that guess with the actual label and calculates the error which it then backpropagates through your network thereby adjusting and improving all weights.
This works perfectly well without any validation data.
Validation data lets you see the quality of your model (loss, accuracy etc.) by having the model predict on unseen data. That gives you the so-called validation loss / accuracy, and with this information you tune your hyperparameters.
In a last step you use your test data to evaluate the final quality of your training.
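As a minimal sketch of that comparison, the snippet below computes a categorical cross-entropy loss and an accuracy from predictions and training labels alone. The probabilities are made up for illustration; they are not the outputs of the model in the question.

```python
import numpy as np

# Hypothetical softmax outputs for 3 training images and 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3]])
labels = np.array([0, 1, 1])   # true class index of each training image

# Categorical cross-entropy: mean of -log(probability assigned to the true class).
loss = float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Accuracy: fraction of samples whose most probable class is the true class.
acc = float(np.mean(probs.argmax(axis=1) == labels))

print(f"loss: {loss:.4f} - acc: {acc:.4f}")   # loss: 0.5108 - acc: 0.6667
```

Two of the three samples are predicted correctly, which gives the 0.6667; no validation images are involved at any point.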

How to avoid overfitting on a simple feed forward network

Using the pima indians diabetes dataset I'm trying to build an accurate model using Keras. I've written the following code:
# Visualize training history
from keras import callbacks
from keras.layers import Dropout
tb = callbacks.TensorBoard(log_dir='/.logs', histogram_freq=10, batch_size=32,
write_graph=True, write_grads=True, write_images=False,
embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None)
# Visualize training history
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy
# fix random seed for reproducibility
seed = 7
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu', name='first_input'))
model.add(Dense(500, activation='tanh', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
# Compile model
# Fit the model
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb])
# list all data in history
# summarize history for accuracy
plt.title('model accuracy')
plt.legend(['train', 'test'], loc='upper left')
# summarize history for loss
plt.title('model loss')
plt.legend(['train', 'test'], loc='upper left')
After several tries, I've added dropout layers in order to avoid overfitting, but with no luck. The following graph shows that the validation loss and the training loss diverge at one point.
What else could I do to optimize this network?
Based on the comments I got, I've tweaked the code like so:
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01), activation='relu',
                name='first_input'))  # added regularizers
model.add(Dense(8, activation='relu', name='first_hidden')) # reduced to 8 neurons
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(5, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
Here are the graphs for 500 epochs
The first example gave a validation accuracy > 75% and the second one gave an accuracy < 65%. If you compare the losses for epochs below 100, the loss is < 0.5 for the first one and > 0.6 for the second one. So how is the second case better?
To me the second one is a case of under-fitting: the model doesn't have enough capacity to learn. The first case has an over-fitting problem because its training was not stopped when overfitting started (early stopping). If the training had been stopped at, say, epoch 100, it would be by far the better of the two models.
The goal should be to obtain small prediction error in unseen data and for that you increase the capacity of the network till a point beyond which overfitting starts to happen.
So how to avoid over-fitting in this particular case? Adopt early stopping.
CODE CHANGES: To include early stopping and input scaling.
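# imports needed for the changes below
from sklearn.preprocessing import StandardScaler
from keras.callbacks import EarlyStopping
# input scaling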
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Early stopping
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=1, mode='auto')
# create model - almost the same code
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu', name='first_input'))
model.add(Dense(500, activation='relu', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb, early_stop])
The Accuracy and loss graphs:
First, try adding some regularization (https://keras.io/regularizers/) like with this code:
model.add(Dense(12, input_dim=12,
Also, make sure to decrease your network size i.e. you don't need a hidden layer of 500 neurons - try just taking that out to decrease the representation power and maybe even another layer if it's still overfitting. Also, only use relu activation. Maybe also try increasing your dropout rate to something like 0.75 (although it's already high). You probably also don't need to run it for so many epochs - it will just begin to overfit after long enough.
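What a weight regularizer adds to the loss can be sketched in a few lines of NumPy. The penalty formulas mirror what Keras' `regularizers.l2(0.01)` and `regularizers.l1(0.01)` compute; the weight matrix and the data loss value are made up for illustration.

```python
import numpy as np

# Hypothetical kernel (weight matrix) of a small Dense layer.
weights = np.array([[0.5, -1.0],
                    [2.0,  0.0]])
data_loss = 0.40    # hypothetical cross-entropy on the training batch
factor = 0.01       # regularization strength, as in regularizers.l2(0.01)

l2_penalty = factor * float(np.sum(weights ** 2))      # 0.01 * 5.25
l1_penalty = factor * float(np.sum(np.abs(weights)))   # 0.01 * 3.5

# The optimizer minimizes the sum, so large weights are discouraged.
total_loss_l2 = data_loss + l2_penalty
print("L2 penalty:", l2_penalty, "L1 penalty:", l1_penalty)
print("loss with L2 regularization:", total_loss_l2)
```

The penalty grows with the size of the weights, which pushes the network towards smaller weights and usually towards a smoother, less overfit function.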
For a dataset like the diabetes one you can use a much simpler network. Try reducing the number of neurons in your second layer. (Is there a specific reason why you chose tanh as the activation there?)
In addition, you can simply add an EarlyStopping callback to your training: https://keras.io/callbacks/
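The logic behind patience-based early stopping can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: `early_stopping_epoch` is a hypothetical helper and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    stop_epoch is None if the patience never runs out.
    """
    best = float("inf")
    best_epoch = None
    wait = 0    # number of epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Made-up validation losses: they improve until epoch 2, then get worse.
val_losses = [0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print("stopped at epoch", stop_epoch, "- best epoch was", best_epoch)
```

Training stops once the validation loss has failed to improve for `patience` consecutive epochs, which avoids the long tail of epochs in which only the training loss keeps improving.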

Why pretraining for DNN is not specified in keras?

This question is more about the training algorithm for DNNs than about Keras itself.
As far as I know, deep neural networks work thanks to improvements in the training algorithm. Since the 1980s, the BP algorithm has been used to train neural networks, but it leads to over-fitting when the network is deep. About 10 years ago, Hinton improved the algorithm by first pre-training the network on unlabeled data and then applying the BP algorithm. The pre-training plays an important role in avoiding over-fitting.
However, as I began to try Keras, the MNIST DNN example (below) uses the SGD algorithm without any mention of a pre-training process and still reaches a very high prediction accuracy. So I began to wonder: where has the pre-training gone? Did I misunderstand the deep learning training algorithm (I think classical BP is almost the same as SGD)? Or has a new training technique replaced the pre-training process?
Very grateful for your help!
'''Trains a simple deep NN on the MNIST dataset.
Gets to 98.40% test accuracy after 20 epochs
(there is *a lot* of margin for parameter tuning).
2 seconds per epoch on a K520 GPU.
'''
from __future__ import print_function
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.optimizers import SGD, Adam, RMSprop
from keras.utils import np_utils
batch_size = 128
nb_classes = 10
nb_epoch = 20
# the data, shuffled and split between train and test sets
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)
Y_test = np_utils.to_categorical(y_test, nb_classes)
model = Sequential()
model.add(Dense(512, input_shape=(784,)))
history = model.fit(X_train, Y_train,
batch_size=batch_size, nb_epoch=nb_epoch,
verbose=1, validation_data=(X_test, Y_test))
score = model.evaluate(X_test, Y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
You are wrong.
Past vs. Today
The difference between Neural Networks in the past and the ones today is not about the training algorithm. Every DNN is trained with Backpropagation based on some SGD-based algorithm, exactly like in the past. (There are some new algorithms trying to reduce parameter-tuning with adaptive learning-rates like Adam, RMSprop and co.; but plain SGD is still the most common algorithm and was used for AlphaGo for example)
The difference is just the size, i.e. the number of layers (depth, which is feasible thanks to GPU-based evaluation), and the choice of activation functions. ReLU simply works better than the classic Sigmoid or Tanh activations (regarding speed and stability).
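One commonly cited reason for the stability difference is how gradients shrink when they are multiplied through many layers. The sketch below is a deliberate simplification: it ignores the weight matrices, looks only at the activation derivatives, and the depth of 10 is an arbitrary choice.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)    # never larger than 0.25 (reached at x = 0)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

depth = 10    # number of stacked layers (arbitrary choice)

# Best case for sigmoid: every unit sits at x = 0, where the derivative is largest.
sigmoid_chain = sigmoid_grad(0.0) ** depth
# Active ReLU units pass the gradient through unchanged.
relu_chain = relu_grad(1.0) ** depth

print("sigmoid gradient factor after", depth, "layers:", sigmoid_chain)
print("relu gradient factor after", depth, "layers:", relu_chain)
```

Even in its best case the sigmoid chain shrinks the gradient by a factor of 0.25 per layer, whereas active ReLU units leave it untouched.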
I also think that pre-training was very popular 5-10 years ago, but nobody is doing it today (if you have enough data)!
Let me quote from here:
It's true that unsupervised pre-training was initially what made it possible to train deeper networks, but the last few years the pre-training approach has been largely obsoleted.
Nowadays, deep neural networks are a lot more similar to their 80's cousins. Instead of pre-training, the difference is now in the activation functions and regularisation methods used (and sometimes in the optimisation algorithm, although much more rarely).
I would say that the "pre-training era", which started around 2006, ended in the early '10s when people started using rectified linear units (ReLUs), and later dropout, and discovered that pre-training was no longer beneficial for this type of networks.
I can recommend these slides as introduction to modern Deep Learning (as starting point).
Pretraining is actually gaining a lot of traction again in the NLP community, see OpenAI's GPT: the idea is that pretraining acts as an unsupervised initialization step before fine-tuning the model on supervised data. This is because unlabeled data is much more abundant than its labeled counterpart, and it can be exploited to derive sensible weights that capture the hidden structure of the dataset.
Hope that the explanation was not too goofy :)
