I have a training dataset with images that looks like this:
x=[image1,image2...imageN]
and an output dataset that looks like this:
y=[output1,output2...]
I don't understand how does the model.fit works in regards to processing the images. Meaning, if I choose shuffle=False will the model take the first image first and go through the whole feedforward, backprop, etc. and compare it to output1, and then the second image, and so on?
Or does the model randomly select images from my dataset?
If you specify shuffle = True the generator will shuffle the dataset before each epoch. It will then go through the shuffled dataset one batch at a time, if it reaches then end before the next epoch it will go back to the start.
If you specify shuffle = False it will go through the dataset in the same order every epoch.
I believe a similar question is asked here.
shuffle in the model.fit of keras
As far as I know, your thought process is correct up to a certain extent. The model takes a random image from the dataset and the associated output for that index and then trains on it. Quite similar to using a random number to select an image from the batch, training it comparing with the output and then marking it as trained to avoid retraining on the same example.
Related
I implemented the proposed GAN Model from the Paper Edge-Connect (https://github.com/knazeri/edge-connect) in Keras and did some trainings on the KITTI dataset. Now I am trying to figure out what's going on inside my model and therefore I have a few questions.
1. Initial Training (100 Epochs, 500 batches/epoch, 10 Samples/Batch)
At first I trained the model as proposed in the paper (incuding style-, perceptual-, L1- and adversarial loss)
At first sight, the model converges to nice results:
This is the output of the generator(left) for the masked input(right)
Most of the graphs from the tensorboard look quite good as well:
(These are all values from the GAN-Model, containing the total loss of the generator(GENERATOR_Loss), different losses based on the generated image (L1, perc, style) as well as the adversarial loss (DISCRIMINATOR_loss)
When closely looking at the discriminator, things look different. The adversarial loss of the discriminiator for the generated images steadly increases.
The loss while training the discriminator (50/50 fake/real examples) doesn't change at all:
![] (https://i.stack.imgur.com/o5jCA.png)
And when looking at the histogram of activations of the output of the discriminator it always outputs values around 0.5.
Coming to my questions/conclusions where I would appreciate your feedback:
So I assume now, that my model learned a lot but nothing from the discriminator, right? The results are all based on the losses other
than the adversarial loss?
It seems that the Discriminator could not keep up with the generator generating better images. I think the discriminators activations should somehow early move to two peaks at around 0 (fake labels) and 1 (real lables) and stay there?
I know that my final goal is that the discriminator outputs 0.5 probability for real as well as fake... but what does it mean when this happens right from the beginning and doesn't change during training?
Did I stop training too early? Could the discriminator catch up (since the output of the generator doesn't change much anymore) and eliminate the last tiny faults of the generator?
2. Thus I started a second training, this time only using the adversarial loss in the generator! (~16 Epochs, 500 batches/epoch, 10 Samples/Batch)
This time the discriminator seems to be able to differentiate between real and fake after a while.
(prob_real is the mean probability assigned to real images and vice versa)
The histogram of activations looks good as well:
But somehow after around 4k Samples things start to change and at around 7k it diverges...
Also all samples from the generator look like this:
Coming to my second part of questions/conclusions:
Should I pretrain the discriminator so it gets a head start? I guess it needs to somehow be able to differentiate between real and fake (outputting large probabilites for real and vice versa) so the generator can learn usefull things from it? Should I train the discriminator multiple times while training the generator one step for the same reason?
What happend in the second training? Was the learn rate for the discriminator too high? (Opt: ADAM, lr=1.0E-3)
Many hints on the internet for training GANs aim for increasing the difficulty of the discriminators job (Label noise/label flipping, instance noise, label smoothing etc). Here I think the discriminator rather needs to be boosted? (-> I also trained the Disc without changing the generator and it converges nicely)
If discriminator outputs 0.5 probability directly in the beginning of the network it means that the weights of the discriminator are not being updated and it plays no role in training, which further indicates it is not able to differentiate between real and fake image coming from the generator. To solve this issue try to add Gaussian noise as an input to the discriminator or do label smoothing which are very simple and effective techniques.
In answer to your this question, that The results are all based on the losses other than the adversarial loss , the trick that can be used is try to train the network first on all the losses except the adversarial loss and then fine tune on the adversarial losses, hope it helps.
For the second part of your questions, the generated images seem to face the problem of mode collpase where they tend to learn color, degradation from 1 image and pass the same to the other images , try to solve it out by either decreasing the batch size or using unrolled gans,
I have a dataset of images for classification purposes. The dataset is very large and most of the images are duplicates of each other. So essentially, the same image occurs multiple times. Moreover, the dataset is unbalanced.
I understand the motivation of cleaning the dataset of duplicates. But it is extensive and very time consuming to do so.
Is there a way to train a net on this dataset, and not overfit the model?
Could enforcing harsher regularization, dropouts, penalize the losses still produce a usable model?
As suggested by Jon.H in comments, instead of training your model on a dataset with duplicates, you could use image hashing to detect and remove them from the dataset. Although the cryptographic hashing (like MD5 and SHA1) will suffice to find exact duplicates, according to your comment you also would like to get rid of similar images, not just exact duplicates (Do you really want to do this? Having a bigger dataset is usually better for training, and keeping similar images with small variations, e.g. in color, is not necessarily a bad thing -- see "data augmentation").
Generating a hash for images is not robust to slight changes in pixel
values, say minor lighting changes which aren't visible to the eye but
the pixel value differs. - Ronica Jethwa
One solution to this is to use perceptual hashing which is quite robust to minor differences in color, rotation, aspect ratio of images etc. In particular I would suggest you to try the pHash algorithm based on Discrete Cosine Transform as described in Looks-Like-It. There is a python library that implements it, called imagehash. Here's how to use it:
from PIL import Image
import imagehash
# Compute the perception-hash values (64 bit) for two images
phash_1 = imagehash.phash(Image.open('image_1')) # e.g. d58e11ce51ee15aa
phash_2 = imagehash.phash(Image.open('image_2')) # e.g. d58e01ae519e559e
# Compare the images using the Hamming distance of their perception hashes
dist = phash_1 - phash_2
Then it's up to you to choose the similarity threshold for the Hamming distance.
Duplicates don't imply over-fitting; they give that image more weight in the training. Yes, you can train on the data set; the results will be valid. For instance, if you have the same quantity of duplicates (say, 10 of everything). then you'll get the same results as if you had just one -- or almost: the shuffling order can slightly affect the balance of training, since a single image can now appear multiple times near the start of epoch 1.
The various counter-measures you list are good tools against over-fitting, but your main danger is merely what you have anyway: the potential of a small set of unique examples.
Adding my cent to this old question.
During training the problem arises only if you have a high chance of having many duplicates in a single batch.
Let's say you choose a batch size of 64; since you will randomly sample the images to compose the batch it could be that on average you have only 2 duplicates. This really depends on how many times (on average) an image is duplicated in proportion to the total number of images.
Anyway the problem is alleviated by using (online) data augmentation which introduces some differences, even between identical images.
The biggest problem is on the test set because the accuracy estimation will be biased towards the images with more duplicates, so I would embrace the effort and deduplicate the test (and validation) sets.
If you have the same images in the validation set as in the train set, but different in the test set, the validation will give a better (accuracy) score than test. In this case, it will be like overfitting. Duplicates occur naturally everywhere, therefore it must be ok.
Train with duplicate data. Use the representation vector i.e output of last convolution. If you using pretrained CNN model use the final out of that. Apply knn or clustering on the representation vectors and identify duplicates. Remove duplicates and retain your model.
I have 6.5 GB worth of training data to be used for my GRU network. I intend to split the training time, i.e. pause and resume training since I am using a laptop computer. I'm assuming that it will take me days to train my neural net using the whole 6.5 GB, so, I'll be pausing the training and then resume again at some other time.
Here's my question. If I will shuffle the batches of training data, will the neural net remember which data has been used already for training or not?
Please note that I'm using the global_step parameter of tf.train.Saver().save.
Thank you very much in advance!
I would advise you to save your model at certain epoch,lets say you have 80 epochs,it would be wise to save your model at each 20epochs(20,40,60)but again this will depend on the capacity of your laptop,the reason is that at one epoch,your network will have seen all the datasets(training set).If your whole dataset can't be processed in a single epoch,i would advise you to randomly sample from your whole dataset what will be the training set.The whole point of shuffling is to let the network do some generalization over the whole dataset and it is usually done on either batch or selecting training dataset,or starting a new training epoch.As for your main question,its definetly ok to shuffle bacthes when training and resuming.Shuffling batches ensures that the gradients are calculated along the batch instead of over one image
I've working on a CNN over several hundred GBs of images. I've created a training function that bites off 4Gb chunks of these images and calls fit over each of these pieces. I'm worried that I'm only training on the last piece on not the entire dataset.
Effectively, my pseudo-code looks like this:
DS = lazy_load_400GB_Dataset()
for section in DS:
X_train = section.images
Y_train = section.classes
model.fit(X_train, Y_train, batch_size=16, nb_epoch=30)
I know that the API and the Keras forums say that this will train over the entire dataset, but I can't intuitively understand why the network wouldn't relearn over just the last training chunk.
Some help understanding this would be much appreciated.
Best,
Joe
This question was raised at the Keras github repository in Issue #4446: Quick Question: can a model be fit for multiple times? It was closed by François Chollet with the following statement:
Yes, successive calls to fit will incrementally train the model.
So, yes, you can call fit multiple times.
For datasets that do not fit into memory, there is an answer in the Keras Documentation FAQ section
You can do batch training using model.train_on_batch(X, y) and
model.test_on_batch(X, y). See the models documentation.
Alternatively, you can write a generator that yields batches of
training data and use the method model.fit_generator(data_generator, samples_per_epoch, nb_epoch).
You can see batch training in action in our CIFAR10 example.
So if you want to iterate your dataset the way you are doing, you should probably use model.train_on_batch and take care of the batch sizes and iteration yourself.
One more thing to note is that you should make sure the order in which the samples you train your model with is shuffled after each epoch. The way you have written the example code seems to not shuffle the dataset. You can read a bit more about shuffling here and here
I have two datasets, one of them is the real dataset and one of them is a randomized dataset
where the class attribute has been randomly shuffled. How can I determine which is
which? Thanks
Train a classifier. The data set where you can get a working classifier is probably the one with the real labels. On the shuffled one, no classifier should work!
There is no guarantee you can detect it. If your data was random before, it doesn't get more random by shuffling; so you cannot decide then. But if the data set had a nice structure before, then shuffling should usually destroy this.