Machine learning where labelling of training data might not be 100% accurate - machine-learning

I have a dataset consisting of people who have diabetes and people who do not. Using this data, I want to train a model to calculate a risk probability for people whose diabetes status is unknown. I know that the majority of people in the training data who have not been diagnosed with diabetes really do not have it, but it is likely that some of them have undiagnosed diabetes.
This appears to present a catch-22. I want to identify people who are at risk or who potentially have undiagnosed diabetes, yet I know some of the people in my training dataset are incorrectly labelled as not having diabetes simply because they have not yet been diagnosed. Has anyone encountered such a problem? Can one still proceed on the basis that there may be some incorrectly labelled data, if it only accounts for a small percentage of the data?

There might be several approaches to solving your problem.
First - it might not be a problem after all. If the mislabeled data accounts for only a small part of your training set, it might not matter much. In fact, there are cases where adding mislabeled data or plain random noise improves the robustness and generalization power of a classifier.
Second - you might want to train the classifier on the training set and then inspect the data points for which the classifier gives an incorrect classification. It is possible that the classifier was actually right and is pointing you to the incorrectly labeled data. These points can then be checked manually, if that is feasible (a sketch of this idea follows the third point).
Third - you can filter the data up front using methods like consensus filters. This article might be a good starting point for your research on the topic: Identifying Mislabeled Training Data - C.E. Brodley and M.A. Friedl.
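A minimal sketch of the second idea, using cross-validated predictions to flag suspicious "no diabetes" labels; X and y are assumed to be your feature matrix and the (possibly noisy) 0/1 labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# X, y assumed: features and the possibly noisy labels (1 = diagnosed diabetes)
clf = RandomForestClassifier(n_estimators=300, random_state=0)
preds = cross_val_predict(clf, X, y, cv=5)  # out-of-fold predictions for every sample

# people labelled "no diabetes" whom the model predicts as diabetic:
# candidates for manual review or for an "at risk / possibly undiagnosed" flag
suspect = (y == 0) & (preds == 1)
print(X[suspect])
```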

Related

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning: I have to use a machine learning algorithm to identify fraud among transactions. My training dataset has, let's say, 100,200 transactions, of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to build the model because it would be a biased (heavily imbalanced) dataset and the model would be a very bad one.
Let's say, for example, I take a sample of 200 good transactions that represent the dataset (the good transactions) well, together with the 200 fraud ones, and build the model using this as the training data.
The question I was asked was: how would I scale up from the 200 good transactions to the whole set of 100,000 good records, so that my result can be mapped to all types of transactions? I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown at you in an interview. Information about the problem is succinct and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to respond? Based on this context, the answer has to be formulated in a similarly general way. This means we don't have to find 'the solution', but instead give arguments that show we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that simply labelled every case as 'non fraud' would achieve a classification accuracy of 99.8%! Therefore, something definitely has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be to consider what techniques we have available to reduce the imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though: the first implies a severe loss of potentially useful information from the dataset, while the second can lead to overfitting. Techniques such as SMOTE and ADASYN mitigate this by using strategies that add variety when generating new synthetic samples.
Of course, cross-validation becomes paramount in this case. Additionally, if we do end up oversampling, it has to be 'coordinated' with the cross-validation approach - oversampling only within the training folds, never the validation folds - to make the most of these two ideas (see the sketch below). Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
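A minimal sketch of that coordination, using the imbalanced-learn pipeline so that SMOTE is applied only to the training folds inside cross-validation (X and y are assumptions, with y == 1 marking the 200 frauds):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# The pipeline re-runs SMOTE on each training fold; validation folds stay untouched.
pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", RandomForestClassifier(random_state=0)),
])
auc = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5), scoring="roc_auc")
print(auc.mean())
```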
Apart from these sampling ideas, when selecting our learner we should note that many ML methods can be trained/optimised for specific metrics. In our case we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, a high recall even at the cost of precision, since we want to catch all the positive 'frauds', or at least raise an alarm, even though some will turn out to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this blog for more information about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
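For instance, one simple way to trade precision for recall is to tune the decision threshold on out-of-fold probabilities; a sketch assuming the same X and y, with a hypothetical 90% recall target:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Out-of-fold fraud probabilities, so the threshold is not tuned on training predictions.
probas = cross_val_predict(RandomForestClassifier(random_state=0), X, y,
                           cv=5, method="predict_proba")[:, 1]

precision, recall, thresholds = precision_recall_curve(y, probas)
ok = recall[:-1] >= 0.90                            # thresholds that still catch 90% of frauds
threshold = thresholds[ok].max() if ok.any() else 0.5
flagged = probas >= threshold                       # transactions to raise an alarm on
```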
Finally, it is just a matter of evaluating the model empirically to check which options and parameters are most suitable for the given dataset. Following these ideas does not guarantee 100% that we will be able to tackle the problem at hand, but it puts us in a much better position to learn from the data and get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraudulent. However, your data is really imbalanced. Given that, you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
1) You have labeled training data.
2) Anomalous and normal classes are balanced (say at least 1:5).
3) Data is not autocorrelated (i.e. one data point does not depend on earlier data points; this often breaks in time series data).
If all of the above is true, we do not need anomaly detection techniques and we can use an algorithm like Random Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when you can find them, most anomalies are 1:1000 to 1:10^6 events where classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by resampling the data many times. The idea is to first create new datasets by taking all anomalous data points and adding a subset of normal data points (e.g. 4 times as many as the anomalous data points). Then a classifier is built for each dataset using SVM or Random Forest, and those classifiers are combined using ensemble learning. This approach has worked well and produced very good results.
If the data points are autocorrelated with each other, then simple classifiers would not work well. We handle those use cases using time series classification techniques or Recurrent Neural Networks.
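A minimal sketch of the resampling-ensemble idea quoted above, assuming a feature matrix X and labels y where y == 1 marks the fraud cases, with a 4:1 normal-to-fraud ratio per resampled set:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
fraud_idx = np.where(y == 1)[0]
normal_idx = np.where(y == 0)[0]

models = []
for _ in range(10):  # ten resampled datasets, each with all frauds + a fresh normal subset
    sampled_normal = rng.choice(normal_idx, size=4 * len(fraud_idx), replace=False)
    idx = np.concatenate([fraud_idx, sampled_normal])
    models.append(RandomForestClassifier(random_state=0).fit(X[idx], y[idx]))

def fraud_score(X_new):
    # average the predicted fraud probability across the ensemble
    return np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
```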
I would also suggest another approach of the problem. In this article the author said:
If you do not have training data, it is still possible to do anomaly detection using unsupervised learning and semi-supervised learning. However, after building the model, you will have no idea how well it is doing, as you have nothing to test it against. Hence, the results of those methods need to be tested in the field before placing them in the critical path.
However, you do have a few fraud examples with which to test whether your unsupervised algorithm is doing well or not, and if it does a good enough job, it can be a first solution that helps gather more data to train a supervised classifier later.
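A minimal sketch of that check, using scikit-learn's IsolationForest as one possible unsupervised detector and the few known fraud labels only for evaluation (X and y assumed as above):

```python
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

iso = IsolationForest(contamination=0.002, random_state=0)
iso.fit(X)                       # purely unsupervised: labels are not used for fitting
scores = -iso.score_samples(X)   # higher score = more anomalous

# sanity-check the ranking against the ~200 known fraud cases
print(roc_auc_score(y, scores))
```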
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest using this Stack Exchange community.
I hope it will help you :)

Multiple cross-validation + testing on a small dataset to improve confidence

I am currently working on a very small dataset of about 25 samples (200 features) and I need to perform model selection and also obtain a reliable classification accuracy. I was planning to split the dataset into a training set (for a 4-fold CV) and a test set (for testing on unseen data). The main problem is that the resulting accuracy obtained from the test set is not reliable enough.
So, would performing the cross-validation and testing multiple times solve the problem?
I was planning to repeat this process multiple times in order to have better confidence in the classification accuracy. For instance: I would run one cross-validation plus testing, and the output would be one "best" model plus its accuracy on the test set. On the next run I would perform the same process; however, the "best" model may not be the same. By performing this process multiple times I would eventually end up with one predominant model, and the accuracy would be the average of the accuracies obtained with that model.
Since I have never heard of a testing framework like this one, does anyone have any suggestions or critiques of the proposed algorithm?
Thanks in advance.
The algorithm seems interesting, but you need to make lots of passes through the data and ensure that one specific model really is dominant (that it surfaces in a clear majority of the runs, not just 'more than the others'). In general, the real problem in ML is having too little data. As anyone will tell you, it is not the team with the most complicated algorithm that wins, but the team with the biggest amount of data.
In your case I would also suggest one additional approach - bootstrapping. Details are here:
what is the bootstrapped data in data mining?
Or it can be googled. Long story short, it is sampling with replacement, which should help you expand your dataset from 25 samples into something more interesting.
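A minimal sketch of one way to use bootstrapping here: repeatedly resample the 25 points with replacement, fit on each bootstrap sample, and score on the left-out (out-of-bag) points to get a distribution of accuracies (SVC is just a placeholder model; X and y are assumed):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.utils import resample

rng = np.random.RandomState(0)
scores = []
for _ in range(200):
    idx = resample(np.arange(len(y)), replace=True, random_state=rng)  # bootstrap sample
    oob = np.setdiff1d(np.arange(len(y)), idx)                         # out-of-bag points
    if oob.size == 0 or len(np.unique(y[idx])) < 2:
        continue  # skip degenerate resamples
    clf = SVC().fit(X[idx], y[idx])
    scores.append(clf.score(X[oob], y[oob]))

print(np.mean(scores), np.percentile(scores, [2.5, 97.5]))  # mean accuracy and a rough interval
```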
When the data is as small as yours, you should consider LOOCV, or leave-one-out cross-validation. In this case you partition the data into 25 different folds, where in each one a single, different observation is held out. Performance is then calculated using the 25 individual held-out predictions.
This will allow you to use the most data in your modeling and you will still have a good measure of performance.
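A minimal LOOCV sketch with scikit-learn, again using SVC as a placeholder model and assuming X and y hold the 25 samples:

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

scores = cross_val_score(SVC(), X, y, cv=LeaveOneOut())
print(scores.mean())  # fraction of the 25 held-out predictions that were correct
```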

Working with inaccurate (incorrect) dataset

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have the following questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
How can we evaluate our errors in this study? Since we have an unlabeled dataset, should I look for labeling techniques? Or should I use unsupervised methods? Or should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please point me to any ideas or references that could help me with this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science?
A: Use feature engineering to fix the unreliable data (apply transformations that make it reliable) or drop it completely - bad features can significantly decrease the quality of the model.
Q: Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
A: ML algorithms are not magic wands; they can't figure anything out unless you tell them what you are looking for. Can you describe what 'unreliable' means? If yes, you can, as I mentioned, use feature engineering or write code that fixes the data. Otherwise no ML algorithm will be able to help you without a description of what exactly you want to achieve.
Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
A: I don't think so, simply because the question itself is too open-ended. What does 'the quality of the data' mean?
Generally, here are a couple of things for you to consider:
1) Spend some time googling feature engineering guides. They cover how to prepare your data for your ML algorithms, refine it, and fix it. Good data with good features dramatically improves results.
2) You don't need to use all the features from the original data. Some features of the original dataset are meaningless and you don't need them. Try running a gradient boosting machine or a random forest classifier from scikit-learn on your dataset to perform classification (or regression, if that is your task). These algorithms also evaluate the importance of each feature of the original dataset. Some of your features will have extremely low importance, so you may wish to drop them completely or try to combine unimportant features into something more meaningful (a sketch of this is given below).
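A minimal sketch of inspecting feature importances with scikit-learn, assuming X is a DataFrame of household features and y is whichever target you can define from the reliable part of the data:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
importances = pd.Series(clf.feature_importances_, index=X.columns).sort_values()
print(importances)  # features at the top (lowest importance) are candidates to drop or combine
```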

Evaluating models on the entire training set with no cross-validation

We have a dataset with 10,000 manually labeled instances, and a classifier that was trained on all of this data.
The classifier was then evaluated on ALL of this data to obtain a 95% success rate.
What exactly is wrong with this approach? Is it just that the statistic 95% is not very informative in this setup? Can there still be some value in this 95% number? While I understand that, theoretically, it is not a good idea, I don't have enough experience in this area to be sure by myself. Also note that I have neither built nor evaluated the classifier in question.
Common sense aside, could someone give me a very solid, authoritative reference, saying that this setup is somehow wrong?
For example, this page does say
Evaluating model performance with the data used for training is not acceptable in data mining because it can easily generate overoptimistic and overfitted models.
However, this is hardly an authoritative reference. In fact, this quote is plainly wrong, as the evaluation has nothing to do with generating overfitted models. It could generate overoptimistic data scientists who would choose the wrong model, but a particular evaluation strategy does not have anything to do with overfitting models per se.
The problem is the possibility of overfitting. That does not mean there is no value in the accuracy reported on the entire dataset, as it can be considered an estimate of the upper bound on the classifier's performance on new data.
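To see how optimistic that number can be, one can compare accuracy on the training data with a cross-validated estimate; a minimal sketch assuming features X and labels y, with a decision tree as an example of a high-capacity model:

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=0)
train_acc = clf.fit(X, y).score(X, y)              # evaluated on the data it was trained on
cv_acc = cross_val_score(clf, X, y, cv=10).mean()  # evaluated on held-out folds
print(train_acc, cv_acc)                           # train_acc is typically much more optimistic
```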
It is subjective to say what constitutes a "very solid, authoritative reference"; however, Machine Learning by Tom Mitchell (ISBN 978-0070428072) is a widely read and oft-cited text that discusses the problem of overfitting in general and specifically with regard to decision trees and artificial neural networks. In addition to the discussion of overfitting, the text also covers the training/validation set methodology and its variants (e.g., cross-validation).

How can I know training data is enough for machine learning

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a way to measure this?
It is not easy to know how many samples you need to collect. However, you can follow these steps:
For solving a typical ML problem:
Build a dataset with a few samples. How many? It will depend on the kind of problem you have; don't spend a lot of time on this now.
Split your dataset into training, cross-validation, and test sets, and build your model.
Now that you've built the ML model, you need to evaluate how good it is: calculate your test error.
If your test error does not meet your expectations, collect new data and repeat steps 1-3 until you reach a test error rate you are comfortable with.
This method will work if your model is not suffering from "high bias".
This video from Coursera's Machine Learning course explains it.
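One way to judge whether collecting more data is likely to help (high variance) or not (high bias) is to plot a learning curve; a minimal sketch with scikit-learn, assuming features X, labels y, and an SVM as the model:

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

sizes, train_scores, val_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(sizes)
print(train_scores.mean(axis=1))  # high train but low validation score: more data may help
print(val_scores.mean(axis=1))    # both scores low: high bias, more data alone will not help
```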
Unfortunately, there is no simple method for this.
The rule of thumb is the bigger, the better, but in practice you have to gather a sufficient amount of data. By sufficient I mean covering as large a part of the modeled space as you consider acceptable.
Also, the amount is not everything. The quality of the samples is very important too; for example, the training samples should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. Then, if the classifier quality is not acceptable, I gather more data, and so on.
Here is a piece of research about estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see
https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956

Resources