I have images that I want to process. First features are extracted from those images and then those features are fed into a neural network for training. I do not have many images though and would like to generate more data.
1) What yields less overfitting: Should I generate more images from the original images and then feed the entire pipeline with them, or should I bring variation into the extracted features and simply train the neural network with more data this way?
The second approach would be computationally cheaper, but would it also yield better results?
2) What techniques are tried and true for generating more data - either more images or the features?
It is true that when you don't have enough data the performance of your model can be poor, so you have to try a few things:
You can modify the data that you have by applying translations, rotations, etc.; for example, move all the pixels of the image a few pixels to the left. These are operations on images.
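For instance, a minimal sketch of such pixel-level operations using NumPy and PIL (the file names are placeholders, and note that np.roll wraps the shifted pixels around to the other edge):

import numpy as np
from PIL import Image

img = np.array(Image.open('some_image.jpg'))
shifted = np.roll(img, shift=-5, axis=1)                      # 5 pixels to the left
rotated = np.array(Image.open('some_image.jpg').rotate(10))   # 10-degree rotation
Image.fromarray(shifted).save('some_image_shifted.jpg')
Image.fromarray(rotated).save('some_image_rotated.jpg')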
You can also generate more images with generative models: Restricted Boltzmann Machines, Deep Belief Networks, etc.
There is also a way to determine whether you need more training data: plot the score of the training data and of the validation data against the training-set size. On the x-axis goes the size of the set (10% of the whole set, 20% of the whole set, ..., 90% of the whole set) and on the y-axis the score. Then you look at the graph (this is called a learning curve). To understand this properly, I strongly recommend Andrew Ng's Machine Learning videos (https://www.coursera.org/learn/machine-learning), specifically Week 6 (Advice for Applying Machine Learning).
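A sketch of how such a learning curve could be produced with scikit-learn (the MLPClassifier and the synthetic data are stand-ins for your own model and extracted features):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neural_network import MLPClassifier

# Stand-in data: replace with your real feature matrix X and labels y.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Train/validation scores at 10%, 20%, ..., 90% of the training set.
sizes, train_scores, val_scores = learning_curve(
    MLPClassifier(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 0.9, 9), cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label='training score')
plt.plot(sizes, val_scores.mean(axis=1), label='validation score')
plt.xlabel('training set size')
plt.ylabel('score')
plt.legend()
plt.show()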
In the Pytorch transfer learning tutorial, the images in both the training and the test sets are being pre-processed using the following code:
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
My question is - what is the intuition behind this choice of transforms? In particular, what is the intuition behind choosing RandomResizedCrop(224) and RandomHorizontalFlip()? Wouldn't it be better to just let the neural network train on the entire image (or at least augment the dataset using these transformations)? I understand why it is reasonable to feed only the portion of the image that contains the ant/bee to the neural network, but I can't understand why it is reasonable to feed it a random crop...
Hope I managed to make all my questions clear
Thanks!
Regarding RandomResizedCrop
Why ...ResizedCrop? - This answer is straightforward. Resizing crops to the same dimensions allows you to batch your input data. Since the training images in your toy dataset have different dimensions, this is the best way to make your training more efficient.
Why Random...? - Generating different random crops per image every iteration (i.e. random center and random cropping dimensions/ratio before resizing) is a nice way to artificially augment your dataset, i.e. feeding your network different-looking inputs (extracted from the same original images) every iteration. This helps to partially avoid over-fitting for small datasets, and makes your network overall more robust.
You are however right that, since some of your training images are up to 500px wide and the semantic targets (ant/bee) sometimes cover only a small portion of the images, there is a chance that some of these random crops won't contain an insect... But as long as the chances this happens stay relatively low, it won't really impact your training. The advantage of feeding different training crops every iteration (instead of always the same non-augmented images) vastly counterbalances the side-effect of sometimes giving "empty" crops. You could verify this assertion by replacing RandomResizedCrop(224) by Resize(224) (fixed resizing) in your code and compare the final accuracies on the test set.
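A sketch of what that comparison could look like (note that a (224, 224) pair is passed to Resize here, rather than a single int, so that every image ends up square and batchable; the variable name is just illustrative):

# Variant of the 'train' pipeline with a fixed resize instead of random crops,
# to compare final test accuracies against the augmented version.
fixed_train_transform = transforms.Compose([
    transforms.Resize((224, 224)),   # deterministic resize, no augmentation
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])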
Furthermore, I would add that neural networks are smart cookies, and sometimes learn to recognize images through features you wouldn't expect (i.e. they tend to learn recognition shortcuts if your dataset or losses are biased, c.f. over-fitting). I wouldn't be surprised if this toy network is performing so well despite being trained sometimes on "empty" crops just because it learns e.g. to distinguish between usual "ant backgrounds" (ground floor, leaves, etc.) and "bee backgrounds" (flowers).
Regarding RandomHorizontalFlip
Its purpose is also to artificially augment your dataset. For the network, an image and its flipped version are two different inputs, so you are basically artificially doubling the size of your training dataset for "free".
There are plenty more operations one can use to augment training datasets (e.g. RandomAffine, ColorJitter, etc.). One has, however, to be careful to choose transformations which are meaningful for the target use-case, i.e. which do not impact the target semantic information. For ant/bee classification, RandomHorizontalFlip is fine as you will probably get about as many images of insects facing right as facing left; RandomVerticalFlip, on the other hand, doesn't make much sense, as you almost certainly won't get pictures of insects upside-down.
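As an illustration, one possible, fairly conservative augmentation pipeline building on the torchvision transforms shown above (the parameter values are illustrative, not tuned):

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # small rotations/shifts
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # mild photometric changes
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])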
I want to use a convolutional neural network (CNN) to classify between two classes of images. I have built several CNN architectures, but I always get the same result: the network classifies all cases as the second class. Therefore, I always get 50% accuracy in leave-one-out. The data is balanced in terms of the number of samples per class (16 from the 1st, and 16 from the 2nd). Could you please clarify what this means?
With such a small number of training samples, your CNN model is very likely to overfit the data, giving good training accuracy and poor test accuracy.
Otherwise, your model can be skewed towards predicting the same class at all times.
Below are some of the solutions you can try:
1) As you have commented, if you cannot get any more images, then try creating new images by modifying the ones already available. For example, let's say you have 16 images of a cat (cat is the class). You can crop the cat and paste it onto different backgrounds, try varying the brightness, intensity, etc., and try rotation and translation operations.
This will help you create a good training set.
2) Try creating a smaller model (with one or two layers) and check if it improves your accuracy.
3) You can do transfer learning by using a good pre-trained model, as it can learn pretty well compared to training a model from scratch.
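A minimal sketch of point 3, assuming a PyTorch/torchvision setup and a two-class problem (the layer name fc follows torchvision's ResNet):

import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet, freeze its backbone, and replace
# the final layer with a 2-class head that will be trained on your data.
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # only this layer will be trained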
I am working on a face recognition project using a deep learning architecture to classify the images into their respective classes. The output of the network at the softmax layer is the predicted class label, and the output of the last-but-one (dense) layer is a feature representation of the input image. Here the feature vector is a 1-D vector of length 1000 for each image. Predicting classes is a recognition problem, but I'm interested in a verification problem.
So given two sample images, I need to compute a similarity/dissimilarity score between them using their feature representations. If the match score is greater than the threshold it's a hit, else it's not. Please let me know if there are any standard approaches.
Example of similar faces (which should ideally generate matchscore>threshold): https://3c1703fe8d.site.internapcdn.net/newman/gfx/news/hires/2014/yvyughbujh.jpg
Your project has two solutions:
Train your own network (starting from a pretrained one) with an output of 1000 classes. This approach is not the simplest one because it requires a large amount of data for each class, approximately 1000 samples per class.
Another approach is to use distance metric learning. By "distance" we usually mean the Euclidean norm. This approach is much wider and deeper than simply extracting features and matching them to the nearest one; try searching for it.
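As a rough sketch of the verification step itself (the random vectors stand in for your two 1000-dimensional feature representations, and the threshold value is purely illustrative and should be tuned on a validation set):

import numpy as np

def match_score(feat_a, feat_b):
    # Cosine similarity between two feature vectors (1.0 = identical direction).
    return float(np.dot(feat_a, feat_b) /
                 (np.linalg.norm(feat_a) * np.linalg.norm(feat_b)))

feat_1 = np.random.rand(1000)   # stand-in for the embedding of image 1
feat_2 = np.random.rand(1000)   # stand-in for the embedding of image 2
threshold = 0.8                 # illustrative value
is_same_person = match_score(feat_1, feat_2) > threshold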
Good luck!
I have a dataset of images for classification purposes. The dataset is very large and most of the images are duplicates of each other. So essentially, the same image occurs multiple times. Moreover, the dataset is unbalanced.
I understand the motivation of cleaning the dataset of duplicates. But it is extensive and very time consuming to do so.
Is there a way to train a net on this dataset, and not overfit the model?
Could enforcing harsher regularization, dropout, or loss penalties still produce a usable model?
As suggested by Jon.H in comments, instead of training your model on a dataset with duplicates, you could use image hashing to detect and remove them from the dataset. Although the cryptographic hashing (like MD5 and SHA1) will suffice to find exact duplicates, according to your comment you also would like to get rid of similar images, not just exact duplicates (Do you really want to do this? Having a bigger dataset is usually better for training, and keeping similar images with small variations, e.g. in color, is not necessarily a bad thing -- see "data augmentation").
"Generating a hash for images is not robust to slight changes in pixel values, say minor lighting changes which aren't visible to the eye but the pixel value differs." - Ronica Jethwa
One solution to this is to use perceptual hashing, which is quite robust to minor differences in color, rotation, aspect ratio of images, etc. In particular, I would suggest you try the pHash algorithm based on the Discrete Cosine Transform, as described in Looks-Like-It. There is a Python library that implements it, called imagehash. Here's how to use it:
from PIL import Image
import imagehash
# Compute the perception-hash values (64 bit) for two images
phash_1 = imagehash.phash(Image.open('image_1')) # e.g. d58e11ce51ee15aa
phash_2 = imagehash.phash(Image.open('image_2')) # e.g. d58e01ae519e559e
# Compare the images using the Hamming distance of their perception hashes
dist = phash_1 - phash_2
Then it's up to you to choose the similarity threshold for the Hamming distance.
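For instance (the cut-off of 8 differing bits out of 64 is only an illustrative starting point):

# Flag the pair as a near-duplicate when the hashes differ in at most
# `max_bits` of the 64 bit positions.
max_bits = 8
is_duplicate = dist <= max_bits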
Duplicates don't imply over-fitting; they just give that image more weight in the training. Yes, you can train on the data set; the results will be valid. For instance, if you have the same quantity of duplicates (say, 10 of everything), then you'll get the same results as if you had just one of each -- or almost: the shuffling order can slightly affect the balance of training, since a single image can now appear multiple times near the start of epoch 1.
The various counter-measures you list are good tools against over-fitting, but your main danger is simply what you have anyway: the limitations of a small set of unique examples.
Adding my two cents to this old question.
During training the problem arises only if you have a high chance of having many duplicates in a single batch.
Let's say you choose a batch size of 64; since you will randomly sample the images to compose the batch it could be that on average you have only 2 duplicates. This really depends on how many times (on average) an image is duplicated in proportion to the total number of images.
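A rough back-of-the-envelope simulation of that claim (the numbers are assumptions: 1000 unique images, each duplicated 5 times, batch size 64):

import numpy as np

rng = np.random.default_rng(0)
pool = np.repeat(np.arange(1000), 5)            # 5000 images, 1000 unique
dup_counts = [64 - len(np.unique(rng.choice(pool, 64, replace=False)))
              for _ in range(1000)]
print(np.mean(dup_counts))                      # average within-batch duplicates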
Anyway the problem is alleviated by using (online) data augmentation which introduces some differences, even between identical images.
The biggest problem is on the test set because the accuracy estimation will be biased towards the images with more duplicates, so I would embrace the effort and deduplicate the test (and validation) sets.
If you have the same images in the validation set as in the train set, but different ones in the test set, the validation will give a better (accuracy) score than the test. In this case, it will look like overfitting. Duplicates occur naturally everywhere, so it should be OK.
Train with the duplicate data. Use the representation vector, i.e. the output of the last convolution; if you are using a pretrained CNN model, use its final output. Apply kNN or clustering on the representation vectors and identify duplicates. Remove the duplicates and retrain your model.
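A possible sketch of that workflow (the model choice, file names, and distance cut-off are all placeholders; nearest-neighbour search via scikit-learn):

import numpy as np
import torch
from torchvision import models, transforms
from sklearn.neighbors import NearestNeighbors
from PIL import Image

# Use a pre-trained CNN as a fixed feature extractor (here: ResNet-18 without
# its classification head), then flag images whose embeddings are very close.
backbone = torch.nn.Sequential(*list(models.resnet18(pretrained=True).children())[:-1])
backbone.eval()

prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

paths = ['img_0.jpg', 'img_1.jpg']              # placeholder file names
with torch.no_grad():
    feats = np.stack([backbone(prep(Image.open(p).convert('RGB')).unsqueeze(0)).flatten().numpy()
                      for p in paths])

# For every image, find its nearest neighbour; tiny distances suggest duplicates.
nn_index = NearestNeighbors(n_neighbors=2).fit(feats)
dists, idx = nn_index.kneighbors(feats)
print(dists[:, 1])                              # distance to the closest *other* image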
I am working on Soil Spectral Classification using neural networks and I have data from my Professor obtained from his lab which consists of spectral reflectance from wavelength 1200 nm to 2400 nm. He only has 270 samples.
I have been unable to train the network to an accuracy above 74%, since there is very little training data (only 270 samples). I was concerned that my Matlab code was not correct, but when I used the Neural Net Toolbox in Matlab I got the same results... nothing more than 75% accuracy.
When I talked to my Professor about it, he said that he does not have any more data, but asked me to do random perturbation on this data to obtain more data. I have researched random perturbation of data online, but have come up short.
Can someone point me in the right direction for performing random perturbation on 270 samples of data so that I can get more data?
Also, since by doing this I will be constructing 'fake' data, I don't see how the neural network would be any better, because isn't the point of neural nets to use actual, real, valid data to train the network?
Thanks,
Faisal.
I think trying to fabricate more data is a bad idea: you can't create anything with higher information content than you already have, unless you know the true distribution of the data to sample from. If you did, however, you'd be able to classify with the Bayes optimal error rate, which would be impossible to beat.
What I'd be looking at instead is whether you can alter the parameters of your neural net to improve performance. The thing that immediately springs to mind with small amounts of training data is your weight regulariser (are you even using regularised weights?), which can be seen as a prior on the weights if you're that way inclined. I'd also look at altering the activation functions if you're using simple linear activations, and at the number of hidden nodes (with so few examples, I'd use very few, or even bypass the hidden layer entirely, since it's hard to learn nonlinear interactions with limited data).
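The question is about Matlab, but to illustrate the idea in PyTorch terms (layer sizes and the weight_decay value are hypothetical, not tuned), L2 regularisation of the weights can be added directly through the optimiser:

import torch.nn as nn
import torch.optim as optim

# Hypothetical small classifier for the ~270-sample problem.
n_inputs, n_classes = 1201, 4      # placeholders: spectral bands and soil classes
model = nn.Sequential(
    nn.Linear(n_inputs, 8),        # very few hidden units, as argued above
    nn.ReLU(),
    nn.Linear(8, n_classes),
)
# weight_decay adds an L2 penalty on the weights, i.e. the "weight regulariser".
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)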
While I'd not normally recommend it, you should probably use cross-validation to set these hyper-parameters given the limited size, as you're going to get unhelpful insight from a 10-20% test set size. You might hold out 10-20% for final testing, however, so as to not bias the results in your favour.
First, some general advice:
Normalize each input and output variable to [0.0, 1.0] (see the sketch after this list)
When using a feedforward MLP, try to use 2 or more hidden layers
Make sure your number of neurons per hidden layer is big enough, so the network is able to tackle the complexity of your data
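Regarding the first point, a minimal min-max normalisation sketch (the random array stands in for your real measurements):

import numpy as np

# X is a (n_samples, n_features) array, e.g. one spectrum per row.
X = np.random.rand(270, 1201) * 100.0           # stand-in for the real data
X_min, X_max = X.min(axis=0), X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min + 1e-12)  # epsilon avoids division by zero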
It should always be possible to get to 100% accuracy on a training set if the complexity of your model is sufficient. But be careful, 100% training set accuracy does not necessarily mean that your model does perform well on unseen data (generalization performance).
Random perturbation of your data can improve generalization performance if the perturbation you are adding (or at least a similar one) occurs in practice. This works because it teaches your network how the data can look different while still belonging to the given labels.
In the case of image classification, you could rotate, scale, noise, etc. the input image (the output stays the same, naturally). You will need to figure out what kind of perturbation could apply to your data. For some problems this is difficult or does not yield any improvement, so you need to try it out. If this does not work, it does not necessarily mean your implementation or data are broken.
The easiest way to add random noise to your data would be to apply Gaussian noise.
I suppose your measurements have errors associated with them (a measurement without errors has almost no meaning). For each measured value M ± ΔM you can generate a new value drawn from N(M, ΔM), where N is the normal distribution.
This will add new points as experimental noise around the previous ones, and will help take into account the experimental errors in the measurements for the classification. I'm not sure, however, whether it's possible to know in advance how helpful this will be!
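A minimal NumPy sketch of that idea (the data array and the per-value uncertainties are stand-ins; in practice use your 270 measured spectra and their known measurement errors):

import numpy as np

rng = np.random.default_rng(0)
M = rng.random((270, 1201))              # stand-in for the 270 measured spectra
DeltaM = 0.01 * np.abs(M) + 1e-3         # illustrative per-value uncertainties
n_copies = 5                             # synthetic spectra per real one

# Draw each synthetic value from N(M, DeltaM); remember to repeat the
# corresponding class labels n_copies times as well.
augmented = np.concatenate([rng.normal(M, DeltaM) for _ in range(n_copies)])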