Biased initial dataset active learning - machine-learning

Does selecting a biased initial (seed) dataset affect the training and accuracy of a model built using active learning?

It may. Suppose the seed data sample is heavily biased and the model has not seen any examples of a particular cluster. Then, at prediction time, the model may assign those instances to some other class and do so with high certainty (i.e. it has become heavily biased). As a result, it won't feel the need to query labels for such instances and won't learn them. But when we later test the model's results against the true labels, it will show low accuracy because these were actually wrong predictions.
Having said that, we may not want a 'perfectly uniform' distribution of training data in the seed dataset either: a considerable number of outliers, labels that are incorrect due to human error, or a heavily skewed but improbable data cluster can all be undesirable and would hamper the model.
One solution is 'active cleaning' of such instances. Alternatively, we can allow the seed data to have some intentional bias (towards high-density clusters, influential labels, ensemble disagreements or model uncertainty). We then make sure to account for this introduced bias in any further decision-making based on the model's results.
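The failure mode described above can be reproduced with a toy example. This is only a sketch under stated assumptions: the one-dimensional data, the nearest-centroid "model" and the names `fit_centroids` and `predict_proba` are all hypothetical and chosen for illustration. The seed set contains classes A and B but nothing from a third cluster near 20, so the model is highly confident (and wrong) there, and least-confidence sampling never asks about it.

```python
import math

def fit_centroids(points, labels):
    """Return the mean of each class present in the (possibly biased) seed set."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_proba(centroids, x):
    """Softmax over negative squared distances to the known class centroids."""
    scores = {y: -(x - c) ** 2 for y, c in centroids.items()}
    top = max(scores.values())
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    total = sum(exps.values())
    return {y: e / total for y, e in exps.items()}

# Biased seed: no example from the cluster around x = 20.
seed_x = [0.0, 1.0, 9.0, 10.0]
seed_y = ["A", "A", "B", "B"]
centroids = fit_centroids(seed_x, seed_y)

# Unlabelled pool: 5.0 lies between A and B, 20.0 and 21.0 belong to the unseen cluster.
pool = [5.0, 20.0, 21.0]
confidence = {x: max(predict_proba(centroids, x).values()) for x in pool}

# Least-confidence sampling queries the point the model is least sure about.
query = min(pool, key=lambda x: confidence[x])
```

Here `query` is the ambiguous point between A and B, while the points from the unseen cluster are labelled B with near-total confidence and are never selected for labelling.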

Related

Is the validation accuracy always higher than testing accuracy in deep learning?

I met a professor who told me that, generally speaking, validation accuracy is always higher than testing accuracy.
He claimed that the testing dataset is used only for testing the final model. Although the validation dataset is used only to tune hyperparameters and only training data is shown to the model, the model developer can carefully pick the best model according to validation accuracy across numerous training runs.
In contrast, the number of evaluations on the testing data is generally limited. For example, in some competitions it is quite common to allow only one submission of testing results per day. This way, we can't cherry-pick the model that achieves the best accuracy on both the validation and testing datasets. Therefore, the model that achieved the best results on validation data is usually not the best one on testing data. The professor holds this view even for datasets where the ground truth (GT) of the testing dataset has been released.
I know that the data distributions of the validation dataset and the testing dataset are generally designed to be similar. However, this is not guaranteed. For example, in a general-purpose object detection dataset, the "difficulty" of objects of the same class might differ between the validation dataset and the testing dataset. To be more specific, let's assume the detection target is a person; we all know that a small, occluded or truncated person is harder to detect. However, it is practically difficult to control the distribution of size, occlusion and truncation level across the validation and testing datasets.
Therefore, it is possible for the testing accuracy to be higher than the validation accuracy when the GT of both datasets is available.
No. There is no strong clue indicating which one would be higher unless a bias of sampling is identified or introduced.
Consider an extreme case in which your model is heavily overfitted to the training set, and the data distributions $p_{train}$, $p_{val}$ and $p_{test}$ are related as below.
$$p_{train} = p_{val} \neq p_{test}$$
In this case, the validation accuracy would be significantly higher than the testing accuracy. Conversely, if $p_{train} = p_{test} \neq p_{val}$, the testing accuracy would be the higher one.
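The model-selection effect raised in the question can be checked with a small simulation. This is a sketch under strong assumptions: every candidate model has the same true accuracy, validation and test sets come from the same distribution and are independent, and the names `simulated_accuracy` and `selection_gap` and all numbers are hypothetical.

```python
import random

def simulated_accuracy(rng, true_acc, n_samples):
    """Measured accuracy on n_samples examples for a model with a given true accuracy."""
    correct = sum(rng.random() < true_acc for _ in range(n_samples))
    return correct / n_samples

def selection_gap(n_models=20, n_samples=100, true_acc=0.7, n_trials=200, seed=0):
    """Average validation and test accuracy of the model picked by validation score."""
    rng = random.Random(seed)
    val_total, test_total = 0.0, 0.0
    for _ in range(n_trials):
        # Each candidate gets an independent validation score and test score.
        runs = [
            (simulated_accuracy(rng, true_acc, n_samples),
             simulated_accuracy(rng, true_acc, n_samples))
            for _ in range(n_models)
        ]
        best_val, its_test = max(runs, key=lambda run: run[0])
        val_total += best_val
        test_total += its_test
    return val_total / n_trials, test_total / n_trials

mean_val, mean_test = selection_gap()
```

Under these assumptions the selected model's validation accuracy is optimistic, while its test accuracy stays near the true value. With `n_models=1` there is no selection and no systematic gap, which matches the answer above: without an identified bias, neither score is expected to be higher.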

How can less amount of data lead to overfitting?

I am studying the Machine Learning course by Andrew Ng, and in it he says that a larger number of features and a smaller amount of data can lead to overfitting. Can someone elaborate on this?
In general, the less data you have, the more easily your model can memorize the exceptions in your training set. This leads to high accuracy on the training set but low accuracy on the test set, since your model over-generalizes from what it has learned on the small training set.
For example, consider a Bayesian classifier. We want to predict the math grades of students based on:
their grade in science
their last year's math grade
their height
As we know, the last feature is probably irrelevant. Provided we have enough data, our model will learn that this feature is irrelevant, since there will be people with different heights getting different grades if the dataset is big enough.
Now consider a very small dataset (e.g. only one class of students). In this case it's very unlikely that students' grades are exactly uncorrelated with their heights (e.g. the tall students will happen to be better or worse than average). So our model will be able to make use of that feature. The problem is that our model has learned a correlation between grade and height that does not exist outside the training dataset.
It could also go the other way: our model might learn that everyone who got a good grade last semester will get a good grade this semester (since that might hold in small datasets) and not use the other features at all.
A more general reason, as I mentioned earlier, is that the model can memorize the dataset. There are always outlier samples, which can't be classified easily. When the dataset is small, the model can find a way to detect these outliers since there are only a few of them. However, it will not be able to predict the real outliers in the test set.
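The height example can be made concrete with a small simulation. The following is a sketch under assumptions: heights and grades are drawn independently from made-up normal distributions, so any correlation found in a sample is spurious, and the function names are hypothetical.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def mean_abs_correlation(n_students, n_trials=100, seed=0):
    """Average |correlation| between height and grade when the two are truly independent."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        heights = [rng.gauss(170, 10) for _ in range(n_students)]
        grades = [rng.gauss(70, 15) for _ in range(n_students)]
        total += abs(pearson(heights, grades))
    return total / n_trials

small = mean_abs_correlation(10)     # one small class
large = mean_abs_correlation(2000)   # a much larger dataset
```

In this setup `small` comes out clearly larger than `large`: with only ten students an apparent link between height and grade shows up by chance, and a flexible model is free to exploit it.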

Why will too many epochs cause overfitting?

I am reading the Deep Learning with Python book.
After reading chapter 4, Fighting Overfitting, I have two questions.
Why might increasing the number of epochs cause overfitting?
I know increasing the number of epochs will involve more attempts at gradient descent; will this cause overfitting?
During the process of fighting overfitting, will the accuracy be reduced?
I'm not sure which book you are reading, so some background information may help before I answer the questions specifically.
Firstly, increasing the number of epochs won't necessarily cause overfitting, but it certainly can do. If the learning rate and model parameters are small, it may take many epochs to cause measurable overfitting. That said, it is common for more training to do so.
To keep the question in perspective, it's important to remember that we most commonly use neural networks to build models we can use for prediction (e.g. predicting whether an image contains a particular object or what the value of a variable will be in the next time step).
We build the model by iteratively adjusting weights and biases so that the network can act as a function to translate between input data and predicted outputs. We turn to such models for a number of reasons, often because we just don't know what the function is/should be or the function is too complex to develop analytically. In order for the network to be able to model such complex functions, it must be capable of being highly-complex itself. Whilst this complexity is powerful, it is dangerous! The model can become so complex that it can effectively remember the training data very precisely but then fail to act as an effective, general function that works for data outside of the training set. I.e. it can overfit.
You can think of it as being a bit like someone (the model) who learns to bake by only baking fruit cake (training data) over and over again – soon they'll be able to bake an excellent fruit cake without using a recipe (training), but they probably won't be able to bake a sponge cake (unseen data) very well.
Back to neural networks! Because the risk of overfitting is high with a neural network there are many tools and tricks available to the deep learning engineer to prevent overfitting, such as the use of dropout. These tools and tricks are collectively known as 'regularisation'.
This is why we use development and training strategies involving test datasets – we pretend that the test data is unseen and monitor it during training. You can see an example of this in the plot below (image credit). After about 50 epochs the test error begins to increase as the model has started to 'memorise the training set', despite the training error remaining at its minimum value (often training error will continue to improve).
So, to answer your questions:
Allowing the model to continue training (i.e. more epochs) increases the risk of the weights and biases being tuned to such an extent that the model performs poorly on unseen (or test/validation) data. The model is now just 'memorising the training set'.
Continued epochs may well increase training accuracy, but this doesn't necessarily mean the model's predictions from new data will be accurate – often it actually gets worse. To prevent this, we use a test data set and monitor the test accuracy during training. This allows us to make a more informed decision on whether the model is becoming more accurate for unseen data.
We can use a technique called early stopping, whereby we stop training the model once test accuracy has stopped improving after a small number of epochs. Early stopping can be thought of as another regularisation technique.
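The early stopping rule can be sketched in a few lines. This is a minimal illustration, not any library's implementation: it assumes a list of held-out losses recorded once per epoch, and the function name and the `patience` parameter are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Scanning stops once the held-out loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Held-out loss improves, then drifts upwards as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.50]
stop = early_stopping_epoch(val_losses, patience=3)
```

With `patience=3` training stops before the final entry is ever seen, which shows the trade-off: a small patience saves compute but can miss a later improvement, so the value is usually chosen per problem.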
More steps of descent (a large number of epochs) can, ideally, take you very close to the global minimum of the loss function. Since we don't know anything about the test data, fitting the model so precisely to predict the class labels of the training data may cause the model to lose its generalization capability (i.e. increase its error on unseen data). No doubt we want to learn the input-output relationship from the training data, but we must not forget that the end goal is for the model to perform well on unseen data. So it is a good idea to stay close, but not too close, to the global minimum.
Still, we can ask: if I reach the global minimum, what is the problem with that? Why would it cause the model to perform badly on unseen data?
The answer may be that, in order to reach the global minimum, we would be trying to fit as much of the training data as possible, which results in a very complex model (since the particular training samples that happen to be available to us are unlikely to have a simple spatial distribution). But we can assume that a large amount of unseen data (say, for facial recognition) will have a simpler spatial distribution and will need a simpler model for better classification. (I mean that the entire world of unseen data will definitely have a pattern that we can't observe, simply because we only have access to a small fraction of it in the form of training data.)
If we incrementally observe points from a distribution (say 50, 100, 500, 1000, ...), the structure of the data will look complex until we have observed a sufficiently large number of points (at most, the entire distribution). Once we have observed enough points, we can expect to see the simpler pattern present in the data, which can be easily classified.
In short, a small fraction of the training data will appear to have a more complex structure than the entire dataset, and overfitting to the training data may cause our model to perform worse on the test data.
An analogous example from day-to-day life is as follows:
Say we have met N people in our lifetime so far. While meeting them we naturally learn from them (we become what we are surrounded with). Now, if we are heavily influenced by each individual and try to tune ourselves very closely to the behaviour of all of them, we develop a personality that closely resembles the people we have met. On the other hand, we start judging every individual who is unlike us, i.e. unlike the people we have already met. Becoming judgemental takes a toll on our ability to tune in with new groups, since we trained very hard to minimize the differences with the people we have already met (the training data). In my view this is an excellent example of overfitting and loss of generalization capability.

What to do with corrected wrongly classified random forest predictions?

I have trained a multi-class Random Forest model. Now, when the model predicts something wrong, we manually correct it. What can we do with the corrected labels to make the predictions better?
Thoughts:
Can't retrain the model again and again. (It was trained on 0.7 million rows, so it might treat the new data as noise.)
Can't train small RF models, as they will also create a mess.
Random Forest works better than NN, so I'm not thinking of going that way.
What do you mean by "manually correct" - i.e. there may be various different points in the decision trees that were executed leading to a wrong prediction, not to mention the numerous decision trees used to get your final prediction.
I think there is some misunderstanding in your first point. Unless the distribution is non-stationary (in which case your trained model is of diminished value to begin with), the new data is treated as "noise" in the sense that including it in the final model is unlikely to change future predictions all that much. As far as I can tell this is how it should be, without specifying other factors like a changing distribution, etc. That is, if future data you want to predict will look a lot more like the data you failed to predict correctly, then you would indeed want to upweight the importance of classifying this sample in your new model.
Anyway, it sounds like you're describing an online learning problem (you want a model that updates itself in response to streaming data). You can find some general ideas just by searching for online random forests, for example:
[Online random forests](http://www.ymer.org/amir/research/online-random-forests/) and [online multiclass lpboost](https://github.com/amirsaffari/online-multiclass-lpboost) describe a general framework akin to what you may have in mind: the input to the model is a stream of new observations; the forest learns on this new data by dropping those trees which perform poorly and eventually growing new trees that include the new data.
The general idea described here is used in a number of boosting algorithms (for example, AdaBoost aggregates an ensemble of "weak learners", such as individual decision trees grown on different, incomplete subsets of data, into a better whole by training subsequent weak learners specifically on formerly misclassified instances). The idea here is that those instances where your current model is wrong are the most informative for future performance improvements.
I don't know the specific details of how the linked implementations accomplish this, though the idea is in line with what you might expect.
You might try these, or other such algorithms you find from searching around.
That all said, I suspect something like the online random forest algorithm is relatively good when old data becomes obsolete over time. If it doesn't, i.e. if your future data and early data are drawn from the same distribution, it's not obvious to me that successively retraining your model (by which I mean the random forest itself and any cross-validation / model selection procedures you might use to transform forest predictions into a final assignment) on the whole batch of examples you have is a bad idea, except when the data lives in a very high-dimensional feature space or arrives really quickly.
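To make the "upweight the misclassified instances" idea from the boosting remark concrete, here is a sketch of a single AdaBoost-style weight update. It assumes binary labels in {-1, +1} for simplicity (the question is multi-class) and weights that sum to 1. The function name is hypothetical, and this is not how the linked online random forest implementations work.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One round of the classic binary AdaBoost sample-weight update.

    Returns the weak learner's vote weight (alpha) and the new,
    normalised sample weights.
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0) and division by zero
    alpha = 0.5 * math.log((1 - err) / err)
    # Correct predictions (t * p = +1) shrink, mistakes (t * p = -1) grow.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]  # the last sample is misclassified
alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right. Manually corrected predictions could be treated in a similar spirit by giving them larger sample weights when retraining.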

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. For real-life datasets it is impossible to recover the original function from the data, and therefore we approximate it using some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as well as possible using the function g(x).
Given a dataset D you can always train a model that perfectly classifies all the datapoints (you can use a hashmap to get 0 error on the train set), but that is overfitting, or memorization.
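The hashmap remark can be taken literally. Below is a sketch of such a "model" for the sin(x) task; the class name `Memorizer` and the default prediction of 0.0 for unseen inputs are assumptions made for illustration.

```python
import math

class Memorizer:
    """A lookup-table 'model': perfect on training inputs, useless elsewhere."""

    def __init__(self, default=0.0):
        self.table = {}
        self.default = default

    def fit(self, xs, ys):
        # Store every training pair verbatim.
        self.table = dict(zip(xs, ys))
        return self

    def predict(self, x):
        # Unseen inputs fall back to a fixed default value.
        return self.table.get(x, self.default)

train_x = [0.0, 0.5, 1.0, 1.5]
train_y = [math.sin(x) for x in train_x]
model = Memorizer().fit(train_x, train_y)

# Squared error on the training points and on one point the model never saw.
train_error = sum((model.predict(x) - y) ** 2 for x, y in zip(train_x, train_y))
unseen_error = (model.predict(0.75) - math.sin(0.75)) ** 2
```

The training error is exactly zero while the error at the unseen input is large, which is why a training loss of 0 says nothing on its own about whether the function was learned.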
To avoid this, you have to make sure yourself that the model learns the function rather than memorising the data. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have a large enough dataset, use Train, Validation, Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need; also, in the case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition: keep it somewhere safe and don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters), train the model again, and evaluate it on the Validation set. Keep doing this for several such models. Then select the model which got you the best validation score.
The role of the validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind that, although you did not use the Validation set to train the model directly, it was used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just received a new dataset from real life which no one has ever seen. The prediction of the model on this Test set will be an indication of how well your model has "learned", as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key not to go back and tune your model based on the Test score. Once you do this, the Test set will start contributing to your model.
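The 60/20/20 split described above could look like the following sketch. It uses plain random shuffling, so it is not stratified, and the function name and default fractions are illustrative assumptions.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    n_val = int(round(len(shuffled) * val_frac))
    test = shuffled[:n_test]                # set aside, untouched until the very end
    val = shuffled[n_test:n_test + n_val]   # used to compare hyper-parameter settings
    train = shuffled[n_test + n_val:]       # the only data the model is fitted on
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

Fixing the seed keeps the Test partition identical across experiments, which helps ensure it never leaks into training or model selection.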
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation with k=5, you split the dataset into 5 parts (also be careful about stratified sampling). Let's name the parts a, b, c, d, e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] to train and get the prediction scores on [d] only, and continue like this 5 times, each time holding out one partition and training the model on the other 4. After this, take the average of these scores. This is indicative of how your model might perform if it sees new data. It is also good practice to do this multiple times and average the results. For example, for smaller datasets, perform 10 repetitions of 10-fold cross-validation, which will give a fairly stable score (depending on the dataset) that is indicative of the prediction performance.
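The k-fold procedure can be sketched with index lists only. This is an unstratified illustration; `k_fold_indices` and `cross_val_score` are hypothetical names, and `score_fn` stands for whatever routine trains on the first index list and scores on the second.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_val_score(score_fn, n_samples, k=5, seed=0):
    """Average score_fn(train_indices, test_indices) over the k folds."""
    scores = [score_fn(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(23, k=5))
```

Every sample lands in the held-out part exactly once, so the averaged score uses all the data for evaluation without ever scoring a point the model was trained on in that fold. Repeating with different seeds gives the repeated k-fold variant mentioned above.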
Bootstrap sampling is similar, but you sample datapoints with replacement from the dataset (usually as many as the dataset contains) and use this sample to train. The sample will have some datapoints repeated (as it was drawn with replacement). Then use the datapoints that are missing from the training sample to evaluate the model. Perform this multiple times and average the performance.
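A bootstrap round can be sketched the same way, again working with indices. The function name is hypothetical, and the sample size is assumed to equal the dataset size.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the rest are 'out-of-bag'."""
    sample = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(sample)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return sample, out_of_bag

rng = random.Random(0)
sample, out_of_bag = bootstrap_split(1000, rng)
```

The model is trained on `sample` (which contains repeats) and evaluated on `out_of_bag`. For large datasets roughly 37% of the points end up out-of-bag in each round, since each point is missed with probability about 1/e.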
Others
Another way is to incorporate regularisation techniques in the classifier cost function itself. For example, in Support Vector Machines the cost function enforces conditions such that the decision boundary maintains a "margin", or gap, between the two class regions. In neural networks one can also do similar things (although it is not the same as in SVMs).
In neural networks you can use early stopping to stop the training. This trains on the Train dataset but, at each epoch, evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error on the Training dataset will keep decreasing while the error on the Validation dataset will start increasing, indicating that your model is overfitting. Based on this, one can stop training.
A large dataset from the real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (too many hidden units and layers), and the model is unnecessarily complex, it will tend to overfit. A model with fewer parameters is much less likely to overfit (though it can underfit if it has too few parameters).
In the case of your sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think the problem is that you are overfitting the data, and hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points.
Go for regularization.
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign.

Resources