Testing the maximum theoretical accuracy for a data set? - machine-learning

I am applying several machine learning methods to a real-world medical data set, but I can't achieve high accuracy (it's around 80% now) on the test data set. The problem is to predict whether the disease is present or not.
Is there any way to determine the maximum accuracy that can be achieved? Or something similar that can tell me the expected accuracy of a particular machine learning model on this data set?
If not, how can I show that the accuracy I am getting is the best (or near-best) accuracy possible from the data set?

It depends on how deterministic your data is. I will illustrate with two variables, y as a function of x.
If y = x, then the theoretical best accuracy is 100%. It should be possible to get a perfect result.
Now suppose that y = x + rnorm(n, 0, sigma) where n is the number of points and you get to choose sigma. You can predict x, but you cannot predict the random part. The bigger sigma is, the worse your predictions. You can make the best possible accuracy arbitrarily low by choosing a large enough sigma.
With real data, you don't usually know how well your input variables determine the output, so you cannot state a meaningful theoretical limit; all you can say is that accuracy lies between 0 and 1.
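To make this concrete, here is a small sketch (mine, not part of the original answer) that turns the noisy-x example into a binary label. Even the rule that generated the labels cannot recover the random part, so the best achievable accuracy falls as sigma grows:

    # Sketch: noise caps the best possible accuracy, no matter the model.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    x = rng.normal(size=n)

    for sigma in (0.0, 0.5, 2.0):
        noise = rng.normal(0, sigma, size=n)
        y = (x + noise > 0).astype(int)   # true label depends on x AND the noise
        y_hat = (x > 0).astype(int)       # the best possible rule given only x
        print(f"sigma={sigma}: best achievable accuracy ~ {np.mean(y_hat == y):.3f}")

With sigma = 0 the accuracy is 1.0; as sigma grows, even the "true" rule drifts toward 0.5.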

What is the accuracy rate for the detections done by humans?
If it is close to the accuracy you get from the machine, you are doing great! Even if the machine does a bit worse, that can still be considered good.
In industry, such a question is mostly a product-management question rather than a scientific one.

Related

How does regularization prevent overfitting?

I don't understand how adding the product of lambda and the sum of squared thetas to the cost function would decrease the amount of overfitting in a data set. Can someone please explain?
Imagine two extreme cases:
You do not need to learn anything ==> you need 0 parameters (an extreme case of underfitting).
You want to memorize everything you see (in the training set) ==> you need a huge number of parameters to remember everything (an extreme case of overfitting).
The real training should happen between these two cases, so that it leads to good generalization. Good generalization gives more realistic predictions on the unseen test data.
When you try to minimize a cost function, you penalize the machine for each wrong prediction on the training set. To escape this penalization, the machine would prefer to memorize everything in the training set, which is most of the time easier than reaching a real generalization: memorizing drives the training loss down, so the machine is penalized less. This happens easily when you provide a complex network (with a large number of trainable parameters, i.e. when W is big).
To prevent the machine from playing this trick, we still force it to reduce the cost, but we also add the condition that it should not use very large parameters. That is one way to do regularization.
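As a minimal sketch of what "lambda times the sum of squared thetas" does to a cost function (the names lam and theta below are illustrative, not from the question):

    # L2-regularized cost for linear regression (illustrative sketch).
    import numpy as np

    def cost(theta, X, y, lam):
        m = len(y)
        residual = X @ theta - y
        data_term = (residual @ residual) / (2 * m)        # fit to the training data
        penalty = lam * np.sum(theta[1:] ** 2) / (2 * m)   # lambda * sum of squared thetas
        return data_term + penalty                         # large thetas are now "expensive"

With lam = 0 the machine is free to memorize; increasing lam forces it to trade a slightly worse training fit for smaller, simpler parameters, which is exactly the regularization described above.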

Python/SKlearn: Using KFold Results in big ROC_AUC Variations

Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 repetitions of a StratifiedKFold split of the training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new to machine learning, I am unsure whether such a variation in the ROC_AUC values is to be expected (with a minimum of 0.522 and a maximum of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should also run more iterations: instead of k-fold, use the ShuffleSplit function, run at least 100 iterations, and compute the average AUC with a 95% confidence interval. That will give you a better idea of how well the models perform.
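For example, a rough sketch in scikit-learn (it assumes X and y are already loaded as arrays with the 450 rows and 30 features; the test_size of 0.2 is just an illustrative choice):

    import numpy as np
    from sklearn.model_selection import ShuffleSplit, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # 100 random train/test splits instead of a single 10-fold pass
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             scoring="roc_auc", cv=cv)

    mean = scores.mean()
    ci = 1.96 * scores.std() / np.sqrt(len(scores))  # normal-approximation 95% CI
    print(f"AUC = {mean:.3f} +/- {ci:.3f}")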
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for the model to generalize properly, and the decision tree may be too powerful.
Try adding some regularization by limiting the number of splits that the trees are allowed to make. This should reduce the variance in your model, though average performance may drop somewhat.
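One possible way to constrain the tree in scikit-learn (the exact values of max_depth and min_samples_leaf below are illustrative placeholders to tune, not recommendations):

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(
        max_depth=3,          # cap the number of successive splits
        min_samples_leaf=20,  # each leaf must cover enough rows
        random_state=0,
    )
    # then fit and cross-validate as before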
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and with adding data; it has no direct relation to the actual performance "number" that you get.
Cheers

Setting up multiclass decision forest/neural network on smaller dataset

So I have a data set with 1900 rows and 22 columns. 21 of the columns are just numbers, but the crucial one that I want to train on has 3 classes: a, b, and c.
I have tried both decision trees/jungles and neural networks, and no matter how I set them up I can't get more than 55% precision.
Usually it's around 50% accuracy, and the best I was ever able to get was 55% overall accuracy and around 70% average.
Should I even use a NN on such a small dataset? As I said, I tried other ML algorithms, but they don't yield anything better.
I think there is no clear answer to your question. A low accuracy score may have a few causes. I will state some of them in the following points (a short sketch of the first point follows the list):
When you use decision trees / neural networks, low accuracy may be the result of a wrong setup of the hyperparameters (like the maximum height of a tree or the number of trees in the DT case, or the wrong topology or data preparation in the NN case). What I advise is to use a grid or random search for both NN and DT to look for the best hyperparameters for your algorithm (for "static" (non-sequential) data, packages like h2o in R or scikit-learn in Python may do a great job), and in the neural network case, normalize your data properly (e.g. subtract the mean and divide by the standard deviation for every column of x).
Your dataset might be inconsistent. If, for example, your data does not have the property that there is a functional dependency between x and y (i.e. that y = f(x) for some f), then what is learnt during training is the probability that, given x, your example belongs to some specified class. This inconsistency may seriously harm your accuracy. What I advise in this case is to try to determine whether that phenomenon occurs and then, for example, try to segment your data to solve the problem.
Your data set might simply be too small. Try to get more data in this case.
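As a rough illustration of the first point, here is a sketch using scikit-learn (it assumes X holds the 21 numeric columns and y the 3-class target; the MLP topology and the parameter grid are placeholder choices, not recommendations):

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neural_network import MLPClassifier

    # standardize columns (subtract mean, divide by std), then fit a small NN
    pipe = make_pipeline(StandardScaler(), MLPClassifier(max_iter=1000, random_state=0))
    grid = {
        "mlpclassifier__hidden_layer_sizes": [(10,), (50,), (50, 20)],
        "mlpclassifier__alpha": [1e-4, 1e-2, 1.0],
    }
    search = GridSearchCV(pipe, grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)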

what does Maximum Likelihood Estimation exactly mean?

When we are training our model we usually use MLE to estimate its parameters. I know it means that our training set is the most probable data for such a learned model. But I'm wondering whether its probability is exactly 1 or not.
You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there could still be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).
As an example (one that has been used billions of times) of what MLE does:
If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up estimating p(coin_toss == H) as high (0.80) because you see heads far more often than tails. There are good and bad things about MLE, obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
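To make the coin-toss example concrete, here is a small sketch (not part of the original answer) that scans candidate values of p for the (H, H, H, T, H) sequence; it confirms that the likelihood peaks at p = 0.8 and that the maximized probability is nowhere near 1:

    import numpy as np

    p_grid = np.linspace(0.01, 0.99, 99)
    likelihood = p_grid**4 * (1 - p_grid)**1   # P(H, H, H, T, H | p)
    best_p = p_grid[np.argmax(likelihood)]
    print(best_p)            # 0.8
    print(likelihood.max())  # ~0.082, far below 1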
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they're approximately 1m80 tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1m80 is larger than the probability of a woman being 1m80. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring parameters which are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides did my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But it might as well have been D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100), ..., so you might estimate that my dice are 6-sided. That doesn't mean it's true, but it's a reasonable estimate, in the absence of any additional information.
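If you want to check the arithmetic behind the dice example, a tiny sketch (illustrative only):

    # Probability that two fair s-sided dice sum to 7.
    def p_sum_7(sides):
        outcomes = [(a, b) for a in range(1, sides + 1) for b in range(1, sides + 1)]
        return sum(a + b == 7 for a, b in outcomes) / len(outcomes)

    for s in (4, 6, 20, 100):
        print(s, p_sum_7(s))  # 2/16, 6/36, 6/400, 6/10000 -> the D6 maximizes the likelihood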
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network on some empirical data, you can consider all possible parameterisations, and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using the most likely parameters: given that the empirical data did occur, it is reasonable to assume that it should be likely under your model.
That doesn't mean that your parameters are likely! Just that under these parameters, the observed value is likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1); instead we need to find the best solution according to some metric. Likelihood is such a metric, and it is widely used because it has some interesting properties:
It makes intuitive sense
It's reasonably simple to compute, fit and optimise, for a large family of models
For normal variables (which tend to crop up everywhere) MLE gives the same results as other methods, such as least-squares estimations
Its formulation in terms of conditional probabilities makes it easy to use/manipulate it in Bayesian frameworks

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k-nearest neighbors).
After trying out kNN, I plan to try out Perceptron and SVM techniques, if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use the Euclidean distance as a measure for them, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about the Hamming distance, but I am still unclear on how to merge the two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but they were not specifically helpful -
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value of K with the highest accuracy. If a merely "good" value of K is fine, then you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e. 50, 51, 52, ...), though this may not be optimal.
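Putting the last two points together, here is a sketch using scikit-learn (it assumes X and y have already been encoded as numeric arrays; the range of k values is just an illustrative grid):

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import MinMaxScaler

    # Min-max scaling puts every feature between 0 and 1 so no single
    # feature dominates the distance; GridSearchCV cross-validates each k.
    pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
    grid = {"kneighborsclassifier__n_neighbors": list(range(1, 52, 5))}
    search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)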
I'm looking at the exact same problem.
Regarding the choice of k, it is recommended to use an odd value to avoid "tie votes".
I hope to expand this answer in the future.
