I have created a DCGAN and already trained it on the CIFAR-10 dataset. Now I would like to train it on a custom dataset.
I have already gathered around 1200 images, and it is practically impossible to gather more. What should I do?
We are going to post a paper in the coming week(s) about stochastic deconvolutions for the generator, which can improve stability and variety for such a problem. If you are interested, I can send you the current version of the paper right now. Generally speaking, though, the idea is simple:
Build a classic GAN
For the deep layers of the generator (let's say half of them) use stochastic deconvolutions (sdeconv)
sdeconv is just a normal deconv layer, but its filters are selected randomly on the fly from a bank of filters. Your filter bank shape can be, for instance, (16, 128, 3, 3), where 16 is the number of filter sets in the bank, 128 is the number of filters in each set, and 3x3 is the filter size. At each training step you select one filter set as [i, :, :, :], with i drawn uniformly at random from 0 to 15. Unselected filters remain untrained at that step. In TensorFlow you want to select different filter sets for different images in the batch as well, because TF keeps training variables even when it is not asked to (we believe it is a bug: TF uses the last known gradients for all variables even if they are not used in the current dynamic sub-graph, so you have to utilize as many variables as you can).
That's it. With 3 sdeconv layers of 16 filter sets each, you effectively have 16x16x16 = 4096 different internal routes to produce an output.
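The selection step described above can be sketched in a few lines of NumPy. This is only an illustration of the indexing, not the authors' implementation: the names filter_bank, select_filter_set and batch_size are made up here, and the deconvolution itself is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bank: 16 filter sets, each holding 128 filters of size 3x3.
num_sets, num_filters, k = 16, 128, 3
filter_bank = rng.standard_normal((num_sets, num_filters, k, k))


def select_filter_set(bank, rng):
    """Pick one filter set uniformly at random; the other sets are not used this step."""
    idx = int(rng.integers(0, bank.shape[0]))  # index in 0..15
    return idx, bank[idx]


# One filter set for the whole training step.
idx, filters = select_filter_set(filter_bank, rng)

# Alternatively, one filter set per image in the batch (batch_size is an assumption).
batch_size = 8
batch_idx = rng.integers(0, num_sets, size=batch_size)
per_image_filters = filter_bank[batch_idx]  # shape (8, 128, 3, 3)

# Three sdeconv layers with 16 sets each give 16 ** 3 possible routes.
num_sdeconv_layers = 3
num_routes = num_sets ** num_sdeconv_layers
print(idx, filters.shape, per_image_filters.shape, num_routes)
```

Picking a set per image rather than per step is what the TensorFlow remark above asks for, since it touches more of the bank in every update.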
How does it help on a small dataset? Small datasets usually have a relatively large variance in "topics", while the dataset as a whole is of one nature (photos of cats: all are realistic photos, but of different types of cats). On such datasets a GAN collapses very quickly; with sdeconv, however:
The upper normal deconv layers learn how to reconstruct the style "realistic photo"
The lower sdeconv layers learn sub-distributions: "dark cat", "white cat", "red cat" and so on.
The model can be seen as an ensemble of weak generators: each sub-generator is weak and can collapse, but will be "supported" by another sub-generator that temporarily outperforms the discriminator.
MNIST is a great example of such a dataset: high "topics" variance, but the same style of digits.
GAN+weight norm+prelu (collapsed after 1000 steps, died after 2000, can only describe one "topic"):
GAN+weight norm+prelu+sdeconv, 4388 steps (local variety degradation of sub-topics is seen, however not collapsed globally, global visual variety preserved):
I have used the pretrained BERT base model with 512 dimensions to generate contextual features. Feeding those vectors to a random forest classifier gives 83 percent accuracy, but in various papers I have seen that BERT gives at least 90 percent.
I have some other features too, such as word2vec, lexicon, TF-IDF and punctuation features.
Even when I merged all the features I got 83 percent accuracy. The research paper I am using as my base paper reports an accuracy of 92 percent, but it uses an ensemble-based approach in which they classified with BERT and trained a random forest on the weights.
But I wanted to do something new and so did not follow that approach.
My dataset is biased towards positive reviews, so I suspect the accuracy is lower because the model is also biased towards positive labels, but I am still looking for expert advice.
Code implementation of bert
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Bert_Features.ipynb
Random forest on all features independently
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/RandomForestClassifier.ipynb
Random forest on all features jointly
https://github.com/Awais-mohammad/Sentiment-Analysis/blob/main/Merging_Feature.ipynb
Regarding the "no improvements despite adding more features": some researchers believe that the BERT word embeddings already contain all the available information present in the text, so it does not matter how fancy a classification head you add to them. Whether it is a linear model that uses the embeddings or a complicated ML algorithm with a number of other features, it will not provide significant improvements in many tasks. They argue that since BERT is a context-aware, bidirectional language model that is trained extensively on MLM and NSP tasks, it already grasps most of what additional features for punctuation, word2vec and TF-IDF could convey. The lexicon could probably help a little in the sentiment task, if it is relevant, but the one or two extra variables that you likely use to represent it probably get drowned out by all the other features.
Other than that, the accuracy of BERT-based models depends on the dataset used. Sometimes the data is simply too diverse to obtain a perfect score, e.g. if there are observations that are very similar but have different class labels. You can see in the BERT papers that the accuracy depends heavily on the task: in some tasks it is indeed 90+%, but for others, e.g. Masked Language Modeling, where the model needs to choose a particular word from a vocab of over 30K words, an accuracy of 20% could be impressive in some cases. So in order to obtain a reliable comparison with the BERT papers, you'd need to pick a dataset that they used and then compare.
Regarding the dataset balance: for deep learning models in general, the rule of thumb is that the training set should be more or less balanced w.r.t. the fraction of data covered by each class label. So if you have 2 labels, the split should be ~50-50; if 5 labels, each should make up around 20% of the training dataset, etc.
That is because most NNs work in batches, updating the model weights based on the feedback from each batch. So if one class is heavily over-represented, the batch updates will be dominated by that class, effectively worsening the quality of your training.
So, if you want to improve the accuracy of your model, balancing the dataset could be an easy fix. And if you have e.g. 5 ordered classes of differing sizes, you may consider merging some of them (e.g. reviews with ratings 1-2 as bad, 3 as neutral, 4-5 as good) and then rebalancing, if still necessary.
(Unless it's a situation where e.g. 1 class has 80% of the data and 4 classes share the remaining 20%. In such a case you should probably consider more advanced options, such as splitting the algorithm into two parts: a binary classifier predicting whether or not an instance is in class 1, and a second classifier distinguishing between the 4 underrepresented classes.)
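As a rough illustration of the merge-then-rebalance idea, here is a minimal sketch in plain Python. The star ratings and review texts are made up, the helper names merge_rating and downsample_to_balance are hypothetical, and random downsampling is just one of several ways to rebalance (oversampling or class weights are alternatives). It should be applied to the training split only.

```python
import random
from collections import Counter


def merge_rating(stars):
    """Map a 1-5 star rating to a coarser label: 1-2 bad, 3 neutral, 4-5 good."""
    if stars <= 2:
        return "bad"
    if stars == 3:
        return "neutral"
    return "good"


def downsample_to_balance(samples, labels, seed=0):
    """Randomly drop samples so that every class is as large as the smallest one."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        for sample in rng.sample(group, target):
            balanced.append((sample, label))
    rng.shuffle(balanced)
    return balanced


# Toy data skewed towards positive reviews (made-up ratings).
stars = [5] * 60 + [4] * 20 + [3] * 10 + [2] * 6 + [1] * 4
texts = [f"review {i}" for i in range(len(stars))]
labels = [merge_rating(s) for s in stars]

balanced = downsample_to_balance(texts, labels)
counts = Counter(label for _, label in balanced)
print(counts)
```

Downsampling throws data away, so with a small dataset it may be worth comparing against oversampling the minority classes instead.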
In the PyTorch transfer learning tutorial, the images in both the training and the test sets are pre-processed using the following code:
from torchvision import transforms

# Training images get random augmentation; validation images get a deterministic resize and crop.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        # ImageNet channel means and standard deviations
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}
My question is - what is the intuition behind this choice of transforms? In particular, what is the intuition behind choosing RandomResizedCrop(224) and RandomHorizontalFlip()? Wouldn't it be better to just let the neural network train on the entire image (or at least, augment the dataset using these transformations)? I understand why it is reasonable to feed the neural network only the portion of the image that contains the ants/bees, but I can't understand why it is reasonable to feed it a random crop...
Hope I managed to make all my questions clear
Thanks!
Regarding RandomResizedCrop
Why ...ResizedCrop? - This answer is straightforward. Resizing crops to the same dimensions allows you to batch your input data. Since the training images in your toy dataset have different dimensions, this is the best way to make your training more efficient.
Why Random...? - Generating different random crops per image every iteration (i.e. random center and random cropping dimensions/ratio before resizing) is a nice way to artificially augment your dataset, i.e. feeding your network different-looking inputs (extracted from the same original images) every iteration. This helps to partially avoid over-fitting for small datasets, and makes your network overall more robust.
You are however right that, since some of your training images are up to 500px wide and the semantic targets (ant/bee) sometimes cover only a small portion of the images, there is a chance that some of these random crops won't contain an insect... But as long as the chance of this happening stays relatively low, it won't really impact your training. The advantage of feeding different training crops every iteration (instead of always the same non-augmented images) vastly outweighs the side-effect of sometimes giving "empty" crops. You could verify this assertion by replacing RandomResizedCrop(224) with Resize((224, 224)) (fixed resizing; note that Resize(224) with a single integer only rescales the shorter side, so the images would still differ in size) in your code and comparing the final accuracies on the test set.
Furthermore, I would add that neural networks are smart cookies, and sometimes learn to recognize images through features you wouldn't expect (i.e. they tend to learn recognition shortcuts if your dataset or losses are biased, cf. over-fitting). I wouldn't be surprised if this toy network performs so well despite sometimes being trained on "empty" crops just because it learns e.g. to distinguish between usual "ant backgrounds" (ground, leaves, etc.) and "bee backgrounds" (flowers).
Regarding RandomHorizontalFlip
Its purpose is also to artificially augment your dataset. For the network, an image and its flipped version are two different inputs, so you are basically artificially doubling the size of your training dataset for "free".
There are plenty more operations one can use to augment training datasets (e.g. RandomAffine, ColorJitter, etc.). One has however to be careful to choose transformations which are meaningful for the target use-case and which do not alter the target semantic information (e.g. for ant/bee classification, RandomHorizontalFlip is fine, as you will probably get as many images of insects facing right as facing left; RandomVerticalFlip, however, doesn't make much sense, as you will almost certainly not get pictures of insects upside-down).
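To make the two augmentations concrete, here is a simplified NumPy sketch of what they do to a single image array. It is not torchvision's implementation: the crop here is always square (torchvision also randomizes the aspect ratio), the resize is plain nearest-neighbour, and the function names and the random test image are made up for illustration.

```python
import numpy as np


def random_resized_crop(img, out_size, rng, scale=(0.08, 1.0)):
    """Crop a random square covering a random fraction of the image area,
    then resize it to out_size x out_size with nearest-neighbour sampling."""
    h, w = img.shape[:2]
    area = rng.uniform(*scale) * h * w
    side = int(round(float(np.sqrt(area))))
    side = max(1, min(side, h, w))
    top = int(rng.integers(0, h - side + 1))
    left = int(rng.integers(0, w - side + 1))
    crop = img[top:top + side, left:left + side]
    # Nearest-neighbour resize: pick evenly spaced rows and columns of the crop.
    picks = np.arange(out_size) * side // out_size
    return crop[picks][:, picks]


def random_horizontal_flip(img, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return img[:, ::-1]
    return img


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(375, 500, 3))  # stand-in for one training photo

# Every call yields a different 224x224 view of the same photo.
views = [random_horizontal_flip(random_resized_crop(image, 224, rng), rng)
         for _ in range(4)]
print([v.shape for v in views])
```

Because the crop position and size change on every call, the network practically never sees exactly the same input twice, which is the point of the augmentation.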
I sort of understand what features are: say, for an ML algorithm that learns to detect spam, certain keywords could be features?
But in the famous MNIST digits data set, I see a matrix of numbers, is the entire matrix one single feature? Or is a feature each number in the matrix?
In my opinion, you are lacking some critical literature review.
Here are some good papers about RNNs and CNNs that can be used for image recognition applications:
https://pdfs.semanticscholar.org/86ef/e7769f2b8a0e15ca213ab09881e6705caeb0.pdf
https://arxiv.org/pdf/1506.00019.pdf
What is a feature? A feature represents one of the elements of the input vector which will be used to train the model and produce output.
The feature set is to be determined depending on the application.
Each element of the input vector is a different (dependent or independent) feature.
Look at this tutorial for example using the MNIST digit data set:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
It says:
'''
To classify images using a recurrent neural network, we consider every image
row as a sequence of pixels. Because MNIST image shape is 28*28px, we will then
handle 28 sequences of 28 steps for every sample.
'''
The RNN operates on sequences, so a 28 by 28 image can be fed to it row by row: 28 steps of 28 features each.
# Network Parameters
num_input = 28 # MNIST data input (img shape: 28*28)
timesteps = 28 # timesteps
This is what you see in the network parameters: each step of the sequence is one image row with 28 features (num_input = 28), and there are 28 such steps (timesteps = 28).
To repeat, each element of the input vector is considered a feature. Furthermore, it is the analyst's responsibility to properly define these features.
Technically, a feature is a numerical value which discriminatively represents (or attempts to discriminatively represent) the input or some part(s) of the input. In the case of MNIST, where the image size is 28 x 28, the entire image matrix is flattened (generally row-wise) into a 1D feature vector, and each element of this feature vector is a feature (in this case, simply image intensity). The type or kind of feature one wants to use is completely problem specific. For example, instead of flattening the entire MNIST digit image, you could have used the number of white pixels as your feature; however, it boils down to how discriminative such a feature is for the given problem.
In case of spam classification, generally the features are frequency of words (there are several other things involved, such as stop word elimination, stemming, etc.).
One can of course select or design multiple features for a given problem, such as stroke length, curvature, number of edges, etc., which you mentioned in the comment above. However, the main idea is that features should be discriminative enough for all the classes and should not be derived from each other (this point leads us to another problem called feature or dimensionality reduction). I suggest you read this Wikipedia page here and then go on to read an academic presentation on feature extraction and dimensionality reduction, such as this (this one is specific to images). This would help you understand the overall idea.
As an additional note, the features are combined into a compact representation called a feature vector. In this particular case, as mentioned before, you have a 1-D feature vector which contains image intensities as features.
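A small NumPy sketch may make the different feature choices concrete. The random array below is only a stand-in for a real MNIST digit, and the threshold of 127 for "white" pixels is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))  # stand-in for one MNIST digit

# Option 1: flatten row-wise; each of the 784 pixel intensities is one feature.
feature_vector = image.reshape(-1)

# Option 2 (RNN view): 28 timesteps, each carrying the 28 features of one row.
rows_as_steps = image

# Option 3: a single hand-crafted feature, the number of bright pixels.
white_pixel_count = int((image > 127).sum())

print(feature_vector.shape, rows_as_steps.shape, white_pixel_count)
```

All three are valid feature sets for the same image; they differ in how much discriminative information they keep.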
I have images that I want to process. First features are extracted from those images and then those features are fed into a neural network for training. I do not have many images though and would like to generate more data.
1) What yields less overfitting: Should I generate more images from the original images and then feed the entire pipeline with them, or should I bring variation into the extracted features and simply train the neural network with more data this way?
The second approach would be computationally cheaper, but does it yield better results?
2) What techniques are tried and true for generating more data - either more images or the features?
It is true that when you don't have enough data, the performance of your model can be poor. So you can try a few things:
You can modify the data that you have by applying translations, rotations, etc.; for example, move all the pixels of the image a few pixels to the left. These are operations on the images.
You can also generate more images through generative models: Restricted Boltzmann Machines, Deep Belief Networks, etc.
There is also a way to determine whether you need more training data: plot learning curves. On the x axis goes the size of the training subset (10% of the whole set, 20% of the whole set, ..., 90% of the whole set) and on the y axis the score, with one curve for the training data and one for the validation data. Then you look at the graph: if the validation score is still improving as the subset grows and a gap to the training score remains, more data is likely to help. To understand this well, I strongly recommend Andrew Ng's Machine Learning videos (https://www.coursera.org/learn/machine-learning), specifically Week 6 (Advice for Applying Machine Learning).
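The translation example can be sketched with NumPy as follows. This is one possible way to do it, assuming a single-channel image and zero-filling the pixels that are shifted in; the function name shift_image is made up here.

```python
import numpy as np


def shift_image(img, dx, dy):
    """Shift a 2D image by dx pixels to the right and dy pixels down
    (negative values shift left/up), filling the vacated pixels with zeros."""
    h, w = img.shape
    if abs(dx) >= w or abs(dy) >= h:
        raise ValueError("shift must be smaller than the image size")
    out = np.zeros_like(img)
    src_y = slice(max(0, -dy), min(h, h - dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_y = slice(max(0, dy), min(h, h + dy))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


original = np.arange(25).reshape(5, 5)

# Four extra training images from one original: left, right, up, down by one pixel.
augmented = [shift_image(original, dx, dy)
             for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]]
print(augmented[0])
```

Each shifted copy keeps the label of the original image, so the augmented set can be passed through the same feature-extraction pipeline.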
In this case I want to do letter recognition, where the letters are scanned from paper. The result of that process is a 5 x 5 binary matrix, so the network would use 25 input nodes. But I don't understand how to determine the total number of hidden layer nodes and output nodes for this case. I want to build the architecture of a multilayer perceptron for this case. Thanks for your help!
Every NN has three types of layers: input, hidden, and output.
Creating the NN architecture therefore means coming up with values for the number of layers of each type and the number of nodes in each of these layers.
The Input Layer
Simple--every NN has exactly one of them--no exceptions that I'm aware of.
With respect to the number of neurons comprising this layer, this parameter is completely and uniquely determined once you know the shape of your training data. Specifically, the number of neurons comprising that layer is equal to the number of features (columns) in your data. Some NN configurations add one additional node for a bias term.
The Output Layer
Like the Input layer, every NN has exactly one output layer. Determining its size (number of neurons) is simple; it is completely determined by the chosen model configuration.
Is your NN going to run in Machine Mode or Regression Mode (the ML convention of using a term that is also used in statistics but assigning a different meaning to it is very confusing)? Machine Mode returns a class label (e.g., "Premium Account"/"Basic Account"). Regression Mode returns a value (e.g., price).
If the NN is a regressor, then the output layer has a single node.
If the NN is a classifier, then it also has a single node, unless softmax is used, in which case the output layer has one node per class label in your model.
The Hidden Layers
So those few rules set the number of layers and size (neurons/layer) for both the input and output layers. That leaves the hidden layers.
How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.
Beyond that, as you probably know, there's a mountain of commentary on the question of hidden layer configuration in NNs (see the insanely thorough and insightful NN FAQ for an excellent summary of that commentary). One issue within this subject on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems.
So what about the size of the hidden layer(s), i.e. how many neurons? There are some empirically derived rules of thumb; of these, the most commonly relied on is 'the optimal size of the hidden layer is usually between the size of the input and the size of the output layers'. Jeff Heaton, author of Introduction to Neural Networks in Java, offers a few more.
In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.
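Applied to the 5 x 5 letter question, these rules can be written down as a tiny helper. The function name is hypothetical, the count of 26 classes assumes the letters A-Z (the question does not say which alphabet is used), and the mean is rounded down.

```python
def mlp_layer_sizes(num_features, num_classes):
    """Rule-of-thumb layer sizes for a one-hidden-layer classifier with softmax output."""
    n_in = num_features              # one input node per feature
    n_out = num_classes              # softmax: one output node per class label
    n_hidden = (n_in + n_out) // 2   # mean of input and output sizes, rounded down
    return n_in, n_hidden, n_out


# 25 binary pixels in, 26 letters out (assumed alphabet size).
sizes = mlp_layer_sizes(25, 26)
print(sizes)
```

This only gives a starting configuration; the hidden layer size can still be tuned afterwards, for example with the pruning approach described next.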
Optimization of the Network Configuration
Pruning describes a set of techniques to trim network size (by nodes, not layers) to improve computational performance and sometimes resolution performance. The gist of these techniques is removing nodes from the network during training by identifying those nodes which, if removed from the network, would not noticeably affect network performance (i.e., resolution of the data). (Even without using a formal pruning technique, you can get a rough idea of which nodes are not important by looking at your weight matrix after training; look for weights very close to zero, since it's the nodes on either end of those weights that are often removed during pruning.) Obviously, if you use a pruning algorithm during training, then begin with a network configuration that is more likely to have excess (i.e., 'prunable') nodes. In other words, when deciding on a network architecture, err on the side of more neurons if you add a pruning step.
Put another way, by applying a pruning algorithm to your network during training, you can approach the optimal network configuration; whether you can do that in a single "up-front" step (such as a genetic-algorithm-based algorithm) I don't know, though I do know that for now, this two-step optimization is more common.
Formula
One additional rule of thumb for supervised learning networks: the upper bound on the number of hidden neurons that won't result in over-fitting is
Nh = Ns / (alpha ∗ (Ni + No))
where Ni is the number of input neurons, No is the number of output neurons, Ns is the number of samples in the training data set, and alpha is an arbitrary scaling factor.
Others recommend setting alpha to a value between 5 and 10, but I find a value of 2 will often work without overfitting. As explained by this excellent NN Design text, you want to limit the number of free parameters in your model (its degree or number of nonzero weights) to a small portion of the degrees of freedom in your data. The degrees of freedom in your data is the number of samples * degrees of freedom (dimensions) in each sample, or Ns∗(Ni+No) (assuming they're all independent). So alpha is a way to indicate how general you want your model to be, or how much you want to prevent overfitting.
For an automated procedure you'd start with an alpha of 2 (twice as many degrees of freedom in your training data as your model) and work your way up to 10 if the error for training data is significantly smaller than for the cross-validation data set.
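As a worked example, assume the bound has the form Nh = Ns / (alpha ∗ (Ni + No)), as given in the Stack Exchange answer listed in the references. The helper name and the sample count of 1020 are made up for illustration; 25 inputs and 26 outputs correspond to the 5 x 5 letter problem under the assumption of 26 letter classes.

```python
def max_hidden_neurons(n_samples, n_inputs, n_outputs, alpha=2):
    """Rule-of-thumb upper bound: Ns / (alpha * (Ni + No))."""
    return n_samples / (alpha * (n_inputs + n_outputs))


# Hypothetical training set of 1020 scanned letters, 25 inputs, 26 outputs.
for alpha in (2, 5, 10):
    bound = max_hidden_neurons(1020, 25, 26, alpha=alpha)
    print(f"alpha={alpha}: at most {bound:.1f} hidden neurons")
```

A larger alpha gives a smaller bound, which matches the procedure above: start at 2 and move towards 10 if the training error stays well below the cross-validation error.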
References
Advameg (2016) Comp.Ai.Neural-nets FAQ, part 1 of 7: Introduction. Available at: http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016a) Available at: https://stats.stackexchange.com/a/136542
How to choose the number of hidden layers and nodes in a feedforward neural network? (2016b) Available at: https://stats.stackexchange.com/a/1097
Heaton, J. (2016) Introduction to neural networks for Java, 2nd edition. Available at: http://www.heatonresearch.com/book/programming-neural-networks-java-2.html