In the PyTorch transfer learning tutorial, the images in both the training and the test sets are pre-processed using the following code:
from torchvision import transforms

data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
My question is: what is the intuition behind this choice of transforms? In particular, what is the intuition behind choosing RandomResizedCrop(224) and RandomHorizontalFlip()? Wouldn't it be better to just let the neural network train on the entire image (or at least to augment the dataset using these transformations)? I understand why it is reasonable to feed the network only the portion of the image that contains the ants/bees, but I can't understand why it is reasonable to feed it a random crop...
Hope I managed to make all my questions clear
Thanks!
Regarding RandomResizedCrop
Why ...ResizedCrop? - This answer is straightforward. Resizing crops to the same dimensions allows you to batch your input data. Since the training images in your toy dataset have different dimensions, this is the best way to make your training more efficient.
Why Random...? - Generating different random crops per image every iteration (i.e. random center and random cropping dimensions/ratio before resizing) is a nice way to artificially augment your dataset, i.e. feeding your network different-looking inputs (extracted from the same original images) every iteration. This helps to partially avoid over-fitting for small datasets, and makes your network overall more robust.
You are however right that, since some of your training images are up to 500px wide and the semantic targets (ant/bee) sometimes cover only a small portion of the images, there is a chance that some of these random crops won't contain an insect... But as long as the chance of this happening stays relatively low, it won't really impact your training. The advantage of feeding different training crops every iteration (instead of always the same non-augmented images) vastly outweighs the side effect of sometimes giving "empty" crops. You could verify this assertion by replacing RandomResizedCrop(224) with Resize((224, 224)) (fixed resizing; note that Resize(224) with a single integer only rescales the shorter side, so the images would still have different shapes) in your code and comparing the final accuracies on the test set.
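To get a feel for how often a random crop misses the insect, here is a minimal pure-Python sketch that imitates the sampling step of RandomResizedCrop. It assumes torchvision's documented defaults (area fraction scale=(0.08, 1.0), aspect ratio ratio=(3/4, 4/3)). The image size, the insect bounding box and the helper names are made up for illustration, and the fallback is simplified, so this is not the library's actual implementation.

```python
import math
import random


def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3), rng=random):
    """Sample (left, top, crop_w, crop_h) roughly the way RandomResizedCrop does."""
    area = width * height
    for _ in range(10):
        target_area = area * rng.uniform(*scale)
        # Sample the aspect ratio log-uniformly so 3/4 and 4/3 are equally likely.
        aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    # Simplified fallback (an assumption of this sketch): use the whole image.
    return 0, 0, width, height


def overlaps(crop, box):
    """True if the crop rectangle intersects the object box (both as left, top, w, h)."""
    cl, ct, cw, ch = crop
    bl, bt, bw, bh = box
    return cl < bl + bw and bl < cl + cw and ct < bt + bh and bt < ct + ch


def empty_crop_rate(width, height, box, trials=5_000, seed=0):
    """Estimate the fraction of random crops that miss the object entirely."""
    rng = random.Random(seed)
    misses = sum(
        not overlaps(sample_crop_box(width, height, rng=rng), box)
        for _ in range(trials)
    )
    return misses / trials


# Hypothetical 500x375 photo with a 120x90 insect near the centre.
rate = empty_crop_rate(500, 375, box=(190, 140, 120, 90))
print(f"fraction of crops with no insect: {rate:.3f}")
```

Moving the hypothetical box towards a corner or shrinking it shows how quickly the "empty crop" rate grows for small, off-centre targets.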
Furthermore, I would add that neural networks are smart cookies, and sometimes learn to recognize images through features you wouldn't expect (i.e. they tend to learn recognition shortcuts if your dataset or losses are biased, cf. over-fitting). I wouldn't be surprised if this toy network performs so well despite sometimes being trained on "empty" crops just because it learns e.g. to distinguish between usual "ant backgrounds" (ground, leaves, etc.) and "bee backgrounds" (flowers).
Regarding RandomHorizontalFlip
Its purpose is also to artificially augment your dataset. For the network, an image and its flipped version are two different inputs, so you are basically artificially doubling the size of your training dataset for "free".
There are plenty more operations one can use to augment training datasets (e.g. RandomAffine, ColorJitter, etc.). One has however to be careful to choose transformations which are meaningful for the target use case and which do not alter the target semantic information (e.g. for ant/bee classification, RandomHorizontalFlip is fine, as you will probably get as many images of insects facing right as facing left; RandomVerticalFlip, however, doesn't make much sense, as you will almost certainly not get pictures of upside-down insects).
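As a minimal illustration of what the flip does to the pixel grid, here is a pure-Python sketch on a tiny image stored as a list of rows. The helper names are hypothetical; in the tutorial the same job is done by transforms.RandomHorizontalFlip().

```python
import random


def hflip(image):
    """Mirror an image given as a list of pixel rows (left becomes right)."""
    return [row[::-1] for row in image]


def random_hflip(image, p=0.5, rng=random):
    """Return the mirrored image with probability p, otherwise the image unchanged."""
    return hflip(image) if rng.random() < p else image


image = [
    [1, 2, 3],
    [4, 5, 6],
]
flipped = hflip(image)
print(flipped)  # [[3, 2, 1], [6, 5, 4]]
```

A vertical flip would reverse the order of the rows instead (image[::-1]), which is exactly the "upside-down insect" case that makes little sense here.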
tl;dr - I use an autoencoder to try to reduce input dimensions for a reinforcement-learning (RL) agent to learn how to play Atari-KungFu. But it fails at encoding/decoding thrown knives, because they are only a couple pixels and getting them wrong probably has negligible impact on the autoencoder MSE loss (see green arrows in bottom left of image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this -- preferably with a generalized solution, but I'd be happy for now with something specific to this problem.
Background:
I am working on Week5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to expand performance on the Atari-KungFu assignment using Actor-Critic architecture. This post is not about actor-critic, but more about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a tanh-64-neuron layer, which is used as a common input to the decoder, policy learner (actor), and value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and rewards from the last action. Then images are first used to train the autoencoder, then used with the rewards to train the actor & critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom-left). Arguably this is because missing those couple of pixels only minimally increases the MSE loss of the reconstructed image, so it has little incentive to learn them (and there are also not many frames that contain knives). Yet, seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. Some things which would probably help the autoencoder reconstruct the knives as well:
Find knives in the input image using image processing techniques. Regions where knives are present should get a higher weight in the MSE loss, say 10 times more. One way to find those regions semi-automatically could be a convolution with a big kernel that responds strongly when there are white pixels at the exact center and only zeros around them. Something along these lines should find regions where only knives are located (the knife throwers wouldn't trigger it, as they contain too many white pixels and holes). A threshold on the kernel response, found empirically, should be enough to locate them correctly.
Lower the loss for images in which no knife was found, say by dividing it by two. This would make the autoencoder focus harder on the rare cases where a knife is visible.
On the downside, I suppose this could introduce some artifacts. In that case you may think about using a pretrained encoder (like some version of ResNet) to increase the model's capacity.
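The pixel-weighting idea above can be sketched in a few lines of plain Python. This assumes the knife mask has already been produced by some detection step (not shown); the function name, the toy 3x3 frame and the choice to normalise by the sum of the weights are all illustrative assumptions. In a real training loop the same arithmetic would be written with the tensor operations of your framework.

```python
def weighted_mse(target, recon, mask, knife_weight=10.0):
    """Mean squared error where pixels flagged in `mask` count `knife_weight` times."""
    total = 0.0
    weight_sum = 0.0
    for t_row, r_row, m_row in zip(target, recon, mask):
        for t, r, m in zip(t_row, r_row, m_row):
            w = knife_weight if m else 1.0
            total += w * (t - r) ** 2
            weight_sum += w
    return total / weight_sum


# Toy 3x3 frame with a single bright "knife" pixel in the centre.
target = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
mask = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
blank = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]  # reconstruction that drops the knife

plain = weighted_mse(target, blank, mask, knife_weight=1.0)     # 1/9
boosted = weighted_mse(target, blank, mask, knife_weight=10.0)  # 10/18
print(plain, boosted)
```

With the weighting, dropping the single knife pixel costs about five times more in this toy frame, so the autoencoder has a much stronger incentive to reproduce it.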
I have a dataset of images for classification purposes. The dataset is very large and most of the images are duplicates of each other. So essentially, the same image occurs multiple times. Moreover, the dataset is unbalanced.
I understand the motivation of cleaning the dataset of duplicates. But it is extensive and very time consuming to do so.
Is there a way to train a net on this dataset, and not overfit the model?
Could harsher regularization, dropout, or penalizing the loss still produce a usable model?
As suggested by Jon.H in the comments, instead of training your model on a dataset with duplicates, you could use image hashing to detect and remove them from the dataset. Cryptographic hashing (like MD5 and SHA1) suffices to find exact duplicates, but according to your comment you would also like to get rid of similar images, not just exact duplicates. (Do you really want to do this? Having a bigger dataset is usually better for training, and keeping similar images with small variations, e.g. in color, is not necessarily a bad thing; see "data augmentation".)
Generating a hash for images is not robust to slight changes in pixel
values, say minor lighting changes which aren't visible to the eye but
the pixel value differs. - Ronica Jethwa
One solution to this is to use perceptual hashing, which is fairly robust to minor differences in color, brightness, scale or aspect ratio of images (large rotations or crops will still change the hash). In particular I would suggest trying the pHash algorithm, based on the Discrete Cosine Transform, as described in Looks-Like-It. There is a Python library that implements it, called imagehash. Here's how to use it:
from PIL import Image
import imagehash
# Compute the perception-hash values (64 bit) for two images
phash_1 = imagehash.phash(Image.open('image_1')) # e.g. d58e11ce51ee15aa
phash_2 = imagehash.phash(Image.open('image_2')) # e.g. d58e01ae519e559e
# Compare the images using the Hamming distance of their perception hashes
dist = phash_1 - phash_2
Then it's up to you to choose the similarity threshold for the Hamming distance.
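To make the thresholding step concrete, here is a small sketch that works directly on hex hash strings like the two example values in the comments above, so it runs without any image files. The third hash, the threshold and the function names are made up for illustration, and the greedy pass compares every pair, which is only practical for moderately sized datasets.

```python
def hamming(hex_a, hex_b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def drop_near_duplicates(hashes, threshold):
    """Greedily keep a file only if it is more than `threshold` bits from every kept file."""
    kept = {}
    for name, h in hashes.items():
        if all(hamming(h, other) > threshold for other in kept.values()):
            kept[name] = h
    return list(kept)


hashes = {
    "image_1": "d58e11ce51ee15aa",
    "image_2": "d58e01ae519e559e",
    "image_3": "0000000000000000",  # made-up hash standing in for an unrelated image
}
print(hamming(hashes["image_1"], hashes["image_2"]))  # 10
print(drop_near_duplicates(hashes, threshold=12))     # ['image_1', 'image_3']
```

With a stricter threshold (e.g. 5 bits) the two example images would both be kept, which is why the threshold has to be tuned by looking at a sample of matched pairs.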
Duplicates don't imply over-fitting; they give the duplicated image more weight in the training. Yes, you can train on the data set; the results will be valid. For instance, if every image has the same number of duplicates (say, 10 of everything), then you'll get the same results as if you had just one of each, or almost: the shuffling order can slightly affect the balance of training, since a single image can now appear multiple times near the start of epoch 1.
The various counter-measures you list are good tools against over-fitting, but your main danger is merely what you have anyway: the potential of a small set of unique examples.
Adding my two cents to this old question.
During training the problem arises only if you have a high chance of having many duplicates in a single batch.
Let's say you choose a batch size of 64; since you will randomly sample the images to compose the batch it could be that on average you have only 2 duplicates. This really depends on how many times (on average) an image is duplicated in proportion to the total number of images.
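A quick simulation gives a rough idea of how many repeats land in one batch. The sketch below assumes a simplified dataset in which every unique image has the same number of copies and batches are drawn uniformly without replacement; the function name and the dataset sizes are made up.

```python
import random


def mean_repeats_per_batch(n_unique, copies, batch_size=64, trials=2_000, seed=0):
    """Average number of batch slots holding a repeat of an image already in the batch."""
    rng = random.Random(seed)
    # Each entry is the id of the underlying unique image.
    dataset = [img_id for img_id in range(n_unique) for _ in range(copies)]
    total = 0
    for _ in range(trials):
        batch = rng.sample(dataset, batch_size)
        total += batch_size - len(set(batch))
    return total / trials


for n_unique in (100, 1_000, 10_000):
    print(n_unique, mean_repeats_per_batch(n_unique, copies=10))
```

The number of repeats per batch drops quickly as the number of unique images grows, which matches the point that the ratio of duplicates to total images is what matters.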
Anyway the problem is alleviated by using (online) data augmentation which introduces some differences, even between identical images.
The biggest problem is on the test set because the accuracy estimation will be biased towards the images with more duplicates, so I would embrace the effort and deduplicate the test (and validation) sets.
If the validation set contains the same images as the training set but the test set does not, validation will report a better (accuracy) score than the test set, which will look just like overfitting. Duplicates occur naturally everywhere, so training on them should be fine.
Train with the duplicate data. Use the representation vector, i.e. the output of the last convolutional layer. If you are using a pretrained CNN model, use its final output. Apply k-NN or clustering to the representation vectors to identify duplicates. Remove the duplicates and retrain your model.
I have images that I want to process. First features are extracted from those images and then those features are fed into a neural network for training. I do not have many images though and would like to generate more data.
1) What yields less overfitting: Should I generate more images from the original images and then feed the entire pipeline with them, or should I bring variation into the extracted features and simply train the neural network with more data this way?
The second approach would be computationally cheaper, but would it yield better results?
2) What techniques are tried and true for generating more data - either more images or the features?
It is true that when you don't have enough data, the performance of your model can be poor. So there are a few things to try:
You can modify the data that you have by applying translations, rotations, etc.; for example, shift all the pixels of the image a few pixels to the left. These are operations on the images themselves.
You can also generate more images with generative models: Restricted Boltzmann Machines, Deep Belief Networks, etc.
There is also a way to determine whether you need more training data: plot learning curves. Put the size of the training set on the x axis (10% of the full set, 20% of the full set, ..., 90% of the full set) and the score on the y axis, and draw one curve for the training data and one for the validation data. Then look at the graph: if the validation score is still climbing as the training set grows, more data is likely to help. To understand this properly, I strongly recommend Andrew Ng's Machine Learning videos (https://www.coursera.org/learn/machine-learning), specifically Week 6 (Advice for Applying Machine Learning).
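The pixel shift mentioned above is simple enough to sketch in plain Python on an image stored as a list of rows. The function name and the zero padding on the right edge are illustrative choices; an image library would normally do this for you.

```python
def shift_left(image, pixels, fill=0):
    """Shift every row `pixels` columns to the left, padding the right edge with `fill`."""
    if pixels <= 0:
        return [list(row) for row in image]
    return [list(row[pixels:]) + [fill] * min(pixels, len(row)) for row in image]


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
print(shift_left(image, 1))  # [[2, 3, 4, 0], [6, 7, 8, 0]]

# One original image turned into three training samples (shifts of 0, 1 and 2 pixels).
augmented = [shift_left(image, k) for k in range(3)]
```

Because this operates on the images, the shifted copies have to go through feature extraction again; perturbing the extracted features directly is cheaper but only makes sense if you know what realistic variation looks like in feature space.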
Are there known methods of continuous training and graceful degradation of a neural net while it shrinks or grows in size (by number of nodes, connections, whatever)?
To the best of my memory, everything I've read about neural networks is from a static perspective. You define the net and then train it.
If there is some neural network X with N nodes (neurons, whatever), is it possible to train the network (X) so that while N increases or decreases, the network is still useful and capable of performing?
In general, changing the network architecture (adding new layers, adding more neurons to existing layers) after the network has already been trained makes sense and is a rather common operation in the deep learning domain. One example is dropout: during training, a random subset of the neurons (commonly half) is switched off completely and only the remaining neurons participate in that training iteration (each iteration, i.e. each mini-batch rather than each epoch, uses a different random set of switched-off neurons). Another example is transfer learning, where you train a network on one set of input data, cut off part of the output layers, replace them with new layers and retrain the model on another dataset.
To better explain why it makes sense, let's step back for a moment. In deep networks, where you have lots of hidden layers, each layer learns some abstraction from the incoming data. Each additional layer uses the abstract representations learned by the previous layer and builds upon them, combining such abstractions to form a higher-level representation of the data. For instance, you could be trying to classify images with a DNN. The first layer will learn rather simple concepts from images, like edges or points in the data. The next layer could combine these simple concepts to learn primitives, like triangles, circles or squares. The next layer could take it further and combine these primitives to represent objects which you could find in images, like 'a car' or 'a house', and using softmax the network calculates the probabilities of the answer you are looking for (what to actually output). I need to mention that these learned representations can actually be checked. You could visualize the activations of your hidden layers and see what they learned. For example, this was done in Google's 'Inceptionism' project. With that in mind, let's get back to what I mentioned earlier.
Dropout is used to improve the generalization of the network. It forces each neuron to 'not be so sure' that some pieces of information from the previous layer will be available, and makes it try to learn representations that also rely on less favorable and less informative pieces of abstraction from the previous layer. It forces the neuron to consider all of the representations from the previous layer when making decisions, instead of putting all of its weight on the couple of neurons it 'likes most of all'. By doing this, the network is usually better prepared for new data where the input differs from the training set.
Q: "As far as you're aware is the quality of the stored knowledge (whatever training has done to the net) still usable following the dropout? Maybe random halves could be substituted by random 10ths with a single 10th dropping, that might result in less knowledge loss during the transition period."
A: Unfortunately I can't properly answer why precisely half of the neurons are switched off and not 10% (or any other fraction). Maybe there is an explanation, but I haven't seen it. In general it just works and that's it.
I also need to mention that the task of dropout is to ensure that each neuron doesn't rely on just a few of the neurons from the previous layer and is ready to make a decision even if the neurons which usually helped it make the correct decision are not available. This is used for generalization only and helps the network cope better with data it hasn't seen previously; nothing else is achieved with dropout.
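For reference, the mechanics of dropout fit in a few lines, and the drop probability is just a parameter, so dropping a tenth instead of a half is a one-argument change. The sketch below shows the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing needs to change at inference time. The function name and the toy activations are assumptions; real frameworks apply this to whole tensors.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p and rescale the rest."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # inference: every neuron is used as-is
    keep = 1.0 - p
    # A fresh random mask is drawn on every call, i.e. on every training iteration.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)
layer_output = [1.0] * 10
print(dropout(layer_output, p=0.5, rng=rng))  # roughly half zeros, survivors scaled to 2.0
print(dropout(layer_output, p=0.1, rng=rng))  # mostly survivors, scaled to about 1.11
print(dropout(layer_output, training=False))  # unchanged
```

The rescaling keeps the expected value of each activation the same with and without dropout, which is why the trained weights remain usable once dropout is switched off.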
Now let's consider transfer learning again. Suppose you have a network with 4 layers. You train it to recognize specific objects in pictures (cat, dog, table, car, etc.). Then you cut off the last layer, replace it with three additional layers, and train the resulting 6-layer network on a dataset that, for instance, pairs each image with a short sentence describing what it shows ('a cat is on the car', 'house with windows and tree nearby', etc.). What did we achieve with this operation? Our original 4-layer network was capable of telling whether some specific object is in the image we feed it. Its first 3 layers learned good representations of the images: the first layer learned about possible edges, points or some extremely primitive geometric shapes in images. The second layer learned some more elaborate geometric figures like 'circle' or 'square'. The last of them knows how to combine these to form higher-level objects: 'car', 'cat', 'house'. Now we can simply reuse this good representation, which we learned in a different domain, and add several more layers. Each of them will use the abstractions from the last (3rd) layer of the original network and learn how to combine them to create meaningful descriptions of images. While you train on the new dataset, with images as input and sentences as output, the first 3 layers taken from the original network will be adjusted too, but these adjustments will be mostly minor, while the 3 new layers will change significantly. What we achieve with transfer learning is:
1) We can learn much better data representations. We could create a network which is very good at a specific task and then build upon that network to perform something different.
2) We can save training time: the first layers of the network will already be trained well enough that the layers closer to the output receive rather good data representations from the start. So training should finish much faster with pre-trained first layers.
So the bottom line is that pre-training a network and then reusing part or all of it in another network makes perfect sense and is not uncommon.
This is something I have seen in the likes of this video...
https://youtu.be/qv6UVOQ0F44
There are links to further resources in the video description.
It is based on a process called NEAT (NeuroEvolution of Augmenting Topologies).
It uses a genetic algorithm and evolutionary process to design and evolve a neural net from scratch with no prior assumptions of structure or complexity of the neural net.
I believe this is what you are looking for.
I am working on Soil Spectral Classification using neural networks and I have data from my Professor obtained from his lab which consists of spectral reflectance from wavelength 1200 nm to 2400 nm. He only has 270 samples.
I have been unable to train the network to an accuracy of more than 74%, since the training data is very limited (only 270 samples). I was concerned that my Matlab code was not correct, but when I used the Neural Net Toolbox in Matlab, I got the same results... nothing more than 75% accuracy.
When I talked to my Professor about it, he said that he does not have any more data, but asked me to apply random perturbation to this data to obtain more. I have researched random perturbation of data online, but have come up short.
Can someone point me in the right direction for performing random perturbation on 270 samples of data so that I can get more data?
Also, since by doing this I will be constructing 'fake' data, I don't see how the neural network would be any better, because isn't the point of neural nets to train the network on actual, real, valid data?
Thanks,
Faisal.
I think trying to fabricate more data is a bad idea: you can't create anything with higher information content than you already have, unless you know the true distribution of the data to sample from. If you did, however, you'd be able to classify with the Bayes optimal error rate, which would be impossible to beat.
What I'd be looking at instead is whether you can alter the parameters of your neural net to improve performance. The thing that immediately springs to mind with small amounts of training data is your weight regulariser (are you even using regularised weights?), which can be seen as a prior on the weights if you're that way inclined. I'd also look at altering the activation functions if you're using simple linear activations, as well as the number of hidden nodes (with so few examples, I'd use very few, or even bypass the hidden layer entirely, since it's hard to learn nonlinear interactions with limited data).
While I'd not normally recommend it, you should probably use cross-validation to set these hyper-parameters given the limited size, as you're going to get unhelpful insight from a 10-20% test set size. You might hold out 10-20% for final testing, however, so as to not bias the results in your favour.
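The hold-out-then-cross-validate scheme described above can be sketched with index bookkeeping alone. The sketch assumes 270 samples, a 20% final test set and 5 folds; the function name and the seeds are arbitrary, and the model fitting itself is left as a comment because it depends on your toolbox.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


n_total = 270
all_idx = list(range(n_total))
random.Random(42).shuffle(all_idx)
test_idx, dev_idx = all_idx[:54], all_idx[54:]  # 20% held out for the final test

for train_pos, val_pos in k_fold_indices(len(dev_idx), k=5):
    train_ids = [dev_idx[p] for p in train_pos]
    val_ids = [dev_idx[p] for p in val_pos]
    # Fit one hyper-parameter setting on train_ids and score it on val_ids here.
```

Averaging the five validation scores per hyper-parameter setting gives a steadier estimate than a single small validation split, and test_idx is only touched once at the very end.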
First, some general advice:
Normalize each input and output variable to [0.0, 1.0]
When using a feedforward MLP, try to use 2 or more hidden layers
Make sure your number of neurons per hidden layer is big enough, so the network is able to tackle the complexity of your data
It should always be possible to get to 100% accuracy on a training set if the complexity of your model is sufficient. But be careful, 100% training set accuracy does not necessarily mean that your model does perform well on unseen data (generalization performance).
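As a small illustration of the normalization advice above, here is a min-max scaling sketch in plain Python. The function names and the toy numbers are made up. Fitting the ranges on the training rows only and reusing them for new data is an assumption of this sketch, made so that no information from held-out samples leaks into training.

```python
def fit_min_max(rows):
    """Per-column (min, max) computed from the training rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def apply_min_max(rows, ranges):
    """Scale each column to [0, 1] using previously fitted ranges."""
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, ranges)]
        for row in rows
    ]


train = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
ranges = fit_min_max(train)
print(apply_min_max(train, ranges))  # [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
```

Values in new data that fall outside the fitted range will land slightly outside [0, 1], which is usually harmless.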
Random perturbation of your data can improve generalization performance if the perturbation you are adding (or at least a similar one) occurs in practice. This works because it teaches your network how the data can look different and still belong to the given labels.
In the case of image classification, you could rotate, scale, add noise to, etc. the input image (the output stays the same, naturally). You will need to figure out what kind of perturbation could apply to your data. For some problems this is difficult or does not yield any improvement, so you need to try it out. If this does not work, it does not necessarily mean your implementation or data are broken.
The easiest way to add random noise to your data would be to apply gaussian noise.
I suppose your measurements have errors associated with them (a measurement without an error has almost no meaning). For each measured value M ± DeltaM you can generate a new number drawn from N(M, DeltaM), where N is the normal distribution with mean M and standard deviation DeltaM.
This will add new points by treating experimental noise as variation around the existing ones, and will help the classifier take the experimental errors in the measurements into account. I'm not sure, however, whether it's possible to know in advance how helpful this will be!
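The N(M, DeltaM) recipe above can be sketched with the standard library alone. The reflectance values, the per-wavelength errors and the function names below are invented for illustration; the real DeltaM values would have to come from the instrument or from repeated measurements.

```python
import random


def perturb(sample, sigmas, rng=random):
    """Draw one noisy copy: each value M is replaced by a draw from N(M, sigma)."""
    return [rng.gauss(m, s) for m, s in zip(sample, sigmas)]


def augment(samples, labels, sigmas, copies=5, seed=0):
    """Return the original data plus `copies` noisy versions of every sample."""
    rng = random.Random(seed)
    out_x, out_y = list(samples), list(labels)
    for x, y in zip(samples, labels):
        for _ in range(copies):
            out_x.append(perturb(x, sigmas, rng))
            out_y.append(y)  # the label is unchanged by measurement noise
    return out_x, out_y


# Two made-up spectra with three wavelengths each and made-up measurement errors.
spectra = [[0.31, 0.42, 0.38], [0.12, 0.15, 0.20]]
classes = ["clay", "sand"]
errors = [0.01, 0.01, 0.02]

more_x, more_y = augment(spectra, classes, errors, copies=5)
print(len(more_x), len(more_y))  # 12 12
```

It is safest to split off the test samples first and augment only the training portion; otherwise noisy copies of a test sample can end up in the training set and inflate the measured accuracy.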