Is there any technique to know in advance the amount of training examples you need to make deep learning get good performance? - machine-learning

Deep learning has been a revolution recently and its success is related with the huge amount of data that we can currently manage and the generalization of the GPUs.
So here is the problem I'm facing. I know that deep neural nets have the best performance, there is no doubt about it. However, they have a good performance when the number of training examples is huge. If the number of training examples is low it is better to use a SVM or decision trees.
But what is huge? what is low? In this paper of face recognition (FaceNet by Google) they show the performance vs the flops (which can be related with the number of training examples)
They used between 100M and 200M training examples, which is huge.
My question is:
Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.

My question is: Is there any method to predict in advance the number of training examples I need to have a good performance in deep learning??? The reason I ask this is because it is a waste of time to manually classify a dataset if the performance is not going to be good.
The short answer is no. You do not have this kind of knowledge, furthermore you will never have. These kind of problems are impossible to solve, ever.
What you can have are just some general heuristics/empirical knowledge, which will say if it is probable that DL will not work well (as it is possible to predict fail of the method, while nearly impossible to predict the success), nothing more. In current research, DL rarely works well for datasets smaller than hundreads thousands/milions of samples (I do not count MNIST because everything works well on MNIST). Furthermore, DL is heavily studied actually in just two types of problems - NLP and image processing, thus you cannot really extraplate it to any other kind of problems (no free lunch theorem).
Update
Just to make it a bit more clear. What you are asking about is to predit whether given estimator (or set of estimators) will yield a good results given a particular training set. In fact you even restrict just to the size.
The simpliest proof (based on your simplification) is as follows: for any N (sample size) I can construct N-mode (or N^2 to make it even more obvious) distribution which no estimator can reasonably estimate (including deep neural network) and I can construct trivial data with just one label (thus perfect model requires just one sample). End of proof (there are two different answers for the same N).
Now let us assume that we do have access to the training samples (without labels for now) and not just sample size. Now we are given X (training samples) of size N. Again I can construct N-mode labeling yielding impossible to estimate distribution (by anything) and trivial labeling (just a single label!). Again - two different answers for the exact same input.
Ok, so maybe given training samples and labels we can predict what will behave well? Now we cannot manipulate samples nor labels to show that there are no such function. So we have to get back to statistics and what we are trying to answer. We are asking about expected value of loss function over whole probability distribution which generated our training samples. So now again, the whole "clue" is to see, that I can manipulate the underlying distributions (construct many different ones, many of which impossible to model well by deep neural network) and still expect that my training samples come from them. This is what statisticians call the problem of having non-representible sample from a pdf. In particular, in ML, we often relate to this problem with curse of dimensionality. In simple words - in order to estimate the probability well we need enormous number of samples. Silverman shown that even if you know that your data is just a normal distribution and you ask "what is value in 0?" You need exponentialy many samples (as compared to space dimensionality). In practise our distributions are multi-modal, complex and unknown thus this amount is even higher. We are quite safe to say that given number of samples we could ever gather we cannot ever estimate reasonably well distributions with more than 10 dimensions. Consequently - whatever we do to minimize the expected error we are just using heuristics, which connect the empirical error (fitting to the data) with some kind of regularization (removing overfitting, usually by putting some prior assumptions on distributions families). To sum up we cannot construct a method able to distinguish if our model will behave good, because this would require deciding which "complexity" distribution generated our samples. There will be some simple cases when we can do it - and probably they will say something like "oh! this data is so simple even knn will work well!". You cannot have generic tool, for DNN or any other (complex) model though (to be strict - we can have such predictor for very simple models, because they simply are so limited that we can easily check if your data follows this extreme simplicity or not).
Consequently, this boils down nearly to the same question - to actually building a model... thus you will need to try and validate your approach (thus - train DNN to answer if DNN works well). You can use cross validation, bootstraping or anything else here, but all essentialy do the same - build multiple models of your desired type and validate it.
To sum up
I do not claim we will not have a good heuristics, heuristic drive many parts of ML quite well. I only answer if there is a method able to answer your question - and there is no such thing and cannot exist. There can be many rules of thumb, which for some problems (classes of problems) will work well. And we already do have such:
for NLP/2d images you should have ~100,000 samples at least to work with DNN
having lots of unlabeled instances can partially substitute the above number (thus you can have like 30,000 labeled ones + 70,000 unlabeled) with pretty reasonable results
Furthermore this does not mean that given this size of data DNN will be better than kernelized SVM or even linear model. This is exactly what I was refering to earlier - you can easily construct counterexamples of distributions where SVM will work the same or even better despite number of samples. The same applies for any other technique.
Yet still, even if you are just interested if DNN will work well (and not better than others) these are just empirical, trivial heuristics, which are based on at most 10 (!) types of problems. This could be very harmfull to treat these as rules or methods. This are just rough, first intuitions gained through extremely unstructured, random research that happened in last decade.
Ok, so I am lost now... when should I use DL? And the answer is exteremly simple:
Use deep learning only if:
You already tested "shallow" techniques and they do not work well
You have large amounts of data
You have huge computational resources
You have experience with neural networks (this are very tricky and ungreatful models, really)
You have great amount of time to spare, even if you will just get a few % better results as an effect.

Related

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a Machine Learning algorithm to identify fraud from transactions. My training dataset has lets say 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Lets say for example I take a sample of 200 good transactions which represent the dataset well(good transactions), and the 200 fraud ones and make the model using this as the training data.
The question I was asked was that how would I scale up the 200 good transactions to the whole data set of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of a scenario so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown in an interview. Information about the problem is succinct and vague (we don't know for example the number of features!). First thing you need to ask yourself is What do the interviewer wants me to respond? So, based on this context the answer has to be formulated in a similar general way. This means that we don't have to find 'the solution' but instead give arguments that show that we actually know how to approach the problem instead of solving it.
The problem we have presented with is that the minority class (fraud) is only a ~0.2% of the total. This is obviously a huge imbalance. A predictor that only predicted all cases as 'non fraud' would get a classification accuracy of 99.8%! Therefore, definitely something has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be considering what techniques we do have available to reduce imbalance. This can be done either by reducing the majority class (undersampling) or increasing the number of minority samples (oversampling). Both have drawbacks though. The first implies a severe loss of potential useful information from the dataset, while the second can present problems of overfitting. Some techniques to improve overfitting are SMOTE and ADASYN, which use strategies to improve variety in the generation of new synthetic samples.
Of course, cross-validation in this case becomes paramount. Additionally, in case we are finally doing oversampling, this has to be 'coordinated' with the cross-validation approach to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we do not want to optimise accuracy definitely. Instead, we want to train the model to optimise either ROC-AUC or specifically looking for a high recall even at a loss of precission, as we want to predict all the positive 'frauds' or at least raise an alarm even though some will prove false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
Finally, is only a matter of evaluate the model empirically to check what options and parameters are the most suitable given the dataset. Following these ideas does not guarantee 100% that we are going to be able to tackle the problem at hand. But it ensures we are in a much better position to try to learn from data and being able to get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However your data is really imbalance. In that you will probably be interested by Anomaly detection. I will let you read all the article for more details but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
You have labeled training data Anomalous and normal classes are
balanced ( say at least 1:5) Data is not autocorrelated. ( That one
data point does not depend on earlier data points. This often breaks
in time series data). If all of above is true, we do not need an
anomaly detection techniques and we can use an algorithm like Random
Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
I would also suggest another approach of the problem. In this article the author said:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However you do have a few fraud data to test if your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that will help gathering more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more question about machine learning I suggest you to use this stackexchange community
I hope it will help you :)

How to prove the reliability of a predictive model to executives?

I trained data from 500 devices to predict their performance. Then I applied my trained model to a test data set for another 500 devices and show pretty good prediction results. Now my executives want me to prove this model will work well on one million devices not only on 500. Obviously we don't have data for one million devices. And if the model is not reliable, they want me to discover the required amount of train data in order to make a reliable prediction on one million devices. How should I deal with these executives who don't have a background in statistical analysis and modeling? Any suggestions? Thanks
I have suggested to #cep to write up his comment as an answer - including providing the variance and bias calculations. In any case it could be added
"Do not be quick to assume Execs are essentially incapable in terms of
technical or mathematical concepts"
While there may be Dilbert managers out there .. somewhere I have seen few of them myself. More often managers get to their positions through hard work. They are likely to be rusty - but the abilities are likely still there.
In this case whether or not they have a "background in statistical analysis and modeling" they are applying common sense.
The first thing you might do is to provide the proper context and terminology. #cel has mentioned some of it: providing concrete values for :
assumptions
what assumptions are you making about the data.
What basis is there to consider extrapolation of the limited data
why should said extrapoated results be trusted to apply to the 99.5% of untested data
data distribution
basic descriptive statistics
your take on the apriori distribution of the data. Justify why you chose it
modeling
which models/approaches were considered and why
which model you actually chose and why
how did you arrive at the hyperparameters
how you trained the model
results
statistical measures of fit and error rate

How Did Neural Networks Overcome the Bias/Variance Dilemma?

Deep learning has been seen as a rebranding of Neural Networks.
Were the issues presented in the paper "Neural Networks and the Bias/Variance Dilemma" by Stuart Geman ever resolved in the architectures in use today?
We learned a lot about NN, in particular:
we now learn better representations due to progress in unsupervised/autoregressive learning, such as restricted boltzman machines, autoencoders, denoising autoencoders, variational autoencoders, which help as stabilize the process, learn from reasonable representations
we have better priors - not neceserly in the strict probabilistic sense, but we know, that for example in image processing a good architecture is the convolutional one, thus we have a smaller (in terms of parameters), but better suited for the problem - models. Consequently we are less prone to overfitting.
we have better optimization techniques and activation functions - which help us with underfitting (we can learn larger networks), in particular - we can learn deeper networks. Why is deep often better then wide? Because again - this is another prior, the assumption that representation should be hierarchical, and it seems to be valid prior for many modern problems (even that not all of them).
dropout, and other techniques brought as better regularization methods (than previously known and used simple weights priors) - which again limits problem with overfitting (variance).
There are many more things that changed, but in general - we were simply able to find better architectures, better assumptions, thus we now search in more narrow class of hypotheses. Consequently - we overfit less (variance), and underfit less (bias) - yet there is still lots to be done!
Next thing is, as #david pointed out, amount of data. We have huge datasets now, we often have access to more data that we can process in a reasonable time, and obviously more data means less variance - even highly overfitting models start to behave well.
Last, but not least - hardware. This is something that every single deep learning expert will tell you - our computers got stronger. We still use the same algorithms, the same architectures (with many little tweaks, but the core is the same), but our hardware is exponentially faster, and this changes a lot.
#lejlot gave a good overview. I want to point to two specific parts of the whole process.
First, neural networks are universal approximators. That means, their bias in principle can be made arbitrarily small. The problem that was rather thought to be severe was overfitting -- too large variance.
Now, a common and successful way in Machine Learning to deal with too large variance is by "averaging it away" over many different predictions -- which should be as uncorrelated as possible. This worked in Random Forests, for instance, and in this way I tend to understand current Neural Networks as well (particular the maxout+dropout stuff). Of course, this is a narrow view -- there is further this whole representational learning stuff, the not explaining-away property, etc. -- but it's one I find suitable for your question regarding the bias/variance tradeoff.
Second point: there is no better way to prevent overfitting than having very much data. And currently we're in the situation to gather a lot of data.

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to to find the most relevant variable using entropy at each level in the tree. each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one and it will be the one where the the first variable selected will have the most overall significance when considered alone.
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how to you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model to performance against the training to exceed the performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (Leave On Out Approach) then see how the model performs on the average with these.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model.For example, trying different models parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
The training data set should cover the 'significant' variable configurations
You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

Should I remove test samples that are identical to some training sample?

I've been having a bit of a debate with my adviser about this issue, and I'd like to get your opinion on it.
I have a fairly large dataset that I've used to build a classifier. I have a separate, smaller testing dataset that was obtained independently from the training set (in fact, you could say that each sample in either set was obtained independently). Each sample has a class label, along with metadata such as collection date and location.
There is no sample in the testing set that has the same metadata as any sample in the training set (as each sample was collected at a different location or time). However, it is possible that the feature vector itself could be identical to some sample in the training set. For example, there could be two virus strains that were sampled in Africa and Canada, respectively, but which both have the same protein sequence (the feature vector).
My adviser thinks that I should remove such samples from the testing set. His reasoning is that these are like "freebies" when it comes to testing, and may artificially boost the reported accuracy.
However, I disagree and think they should be included, because it may actually happen in the real world that the classifier sees a sample that it has already seen before. To remove these samples would bring us even further from reality.
What do you think?
It would be nice to know if you're talking about a couple of repetitions in million samples or 10 repetitions in 15 samples.
In general I don't find what you're doing reasonable. I think your advisor has a very good point. Your evaluation needs to be as close as possible to using your classifier outside your control -- You can't just assume your going to be evaluated on a datapoint you've already seen. Even if each data point is independent, you're going to be evaluated on never-before-seen data.
My experience is in computer vision, and it would be very highly questionable to train and test with the same picture of a one subject. In fact I wouldn't be comfortable training and testing with frames of the same video (not even the same frame).
EDIT:
There are two questions:
The distribution permits that these repetitions naturally happen. I
believe you, you know your experiment, you know your data, you're
the expert.
The issue that you're getting a boost by doing this and that this
boost is possibly unfair. One possible way to address your advisor's
concerns is to evaluate how significant a leverage you're getting
from the repeated data points. Generate 20 test cases 10 in which
you train with 1000 and test on 33 making sure there are not
repetitions in the 33, and another 10 cases in which you train with
1000 and test on 33 with repetitions allowed as they occur
naturally. Report the mean and standard deviation of both
experiments.
It depends... Your adviser suggested the common practice. You usually test a classifier on samples which have not been used for training. If the samples of the test set matching the training set are very few, your results are not going to have statistical difference because of the reappearance of the same vectors. If you want to be formal and still keep your logic, you have to prove that the reappearance of the same vectors has no statistical significance on the testing process. If you proved this theoretically, I would accept your logic. See this ebook on statistics in general, and this chapter as a start point on statistical significance and null hypothesis testing.
Hope I helped!
In as much as the training and testing datasets are representative of the underlying data distribution, I think it's perfectly valid to leave in repetitions. The test data should be representative of the kind of data you would expect your method to perform on. If you genuinely can get exact replicates, that's fine. However, I would question what your domain is where it's possible to generate exactly the same sample multiple times. Are your data synthetic? Are you using a tiny feature set with few possible values for each of your features, such that different points in input space map to the same point in feature space?
The fact that you're able to encounter the same instance multiple times is suspicious to me. Also, if you have 1,033 instances, you should be using far more than 33 of them for testing. The variance in your test accuracy will be huge. See the answer here.
Having several duplicate or very similar samples seems somewhat analogous to the distribution of the population you're attempting to classify being non-uniform. That is, certain feature combinations are more common than others, and the high occurrence of them in your data is giving them more weight. Either that, or your samples are not representative.
Note: Of course, even if a population is uniformly distributed there is always some likelihood of drawing similar samples (perhaps even identical depending on the distribution).
You could probably make some argument that identical observations are a special case, but are they really? If your samples are representative it seems perfectly reasonable that some feature combinations would be more common than others (perhaps even identical depending on your problem domain).

Resources