Working with inaccurate (incorrect) dataset - machine-learning

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have below questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
-How can we evaluate our errors in this study? Since we have unlabeled dataset, should I look for labeling techniques? Or, should I use unsupervised methods? Or, should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please introduce me any ideas or references which can help me in this issue.
Thanks in advance.

Q: How should we deal with unreliable data in data science
A: Use feature engineering to fix unreliable data (make some transformations on unreliable data to make it reliable) or drop them out completely - bad features could significantly decrease the quality of the model
Q: Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
A: ML algorithms are not magic sticks, they can't figure out anything unless you tell them what you are looking for. Can you describe what means 'unreliable'? If yes, you can, as I mentioned, use feature engineering or write a code which will fix the data. Otherwise no ML algorithm will be able to help you, without the description of what exactly you want to achieve
Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
A: I don't think so just because the question itself is too open-ended. What means 'the quality of the data'?
Generally, here are couple of things for you to consider:
1) Spend some time on googling feature engineering guides. They cover how to prepare your data for you ML algorithms, refine it, fix it. Good data with good features dramatically increase the results.
2) You don't need to use all of features from original data. Some of features of original dataset are meaningless and you don't need to use them. Try to run gradient boosting machine or random forest classifier from scikit-learn on your dataset to perform classification (or regression, if you do regression). These algorithms also evaluate importance of each feature of original dataset. Part of your features will have extremely low importance for classification, so you may wish to drop them out completely or try to combine unimportant features together somehow to produce something more important.

Related

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a Machine Learning algorithm to identify fraud from transactions. My training dataset has lets say 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Lets say for example I take a sample of 200 good transactions which represent the dataset well(good transactions), and the 200 fraud ones and make the model using this as the training data.
The question I was asked was that how would I scale up the 200 good transactions to the whole data set of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of a scenario so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown in an interview. Information about the problem is succinct and vague (we don't know for example the number of features!). First thing you need to ask yourself is What do the interviewer wants me to respond? So, based on this context the answer has to be formulated in a similar general way. This means that we don't have to find 'the solution' but instead give arguments that show that we actually know how to approach the problem instead of solving it.
The problem we have presented with is that the minority class (fraud) is only a ~0.2% of the total. This is obviously a huge imbalance. A predictor that only predicted all cases as 'non fraud' would get a classification accuracy of 99.8%! Therefore, definitely something has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be considering what techniques we do have available to reduce imbalance. This can be done either by reducing the majority class (undersampling) or increasing the number of minority samples (oversampling). Both have drawbacks though. The first implies a severe loss of potential useful information from the dataset, while the second can present problems of overfitting. Some techniques to improve overfitting are SMOTE and ADASYN, which use strategies to improve variety in the generation of new synthetic samples.
Of course, cross-validation in this case becomes paramount. Additionally, in case we are finally doing oversampling, this has to be 'coordinated' with the cross-validation approach to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we do not want to optimise accuracy definitely. Instead, we want to train the model to optimise either ROC-AUC or specifically looking for a high recall even at a loss of precission, as we want to predict all the positive 'frauds' or at least raise an alarm even though some will prove false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
Finally, is only a matter of evaluate the model empirically to check what options and parameters are the most suitable given the dataset. Following these ideas does not guarantee 100% that we are going to be able to tackle the problem at hand. But it ensures we are in a much better position to try to learn from data and being able to get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However your data is really imbalance. In that you will probably be interested by Anomaly detection. I will let you read all the article for more details but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
You have labeled training data Anomalous and normal classes are
balanced ( say at least 1:5) Data is not autocorrelated. ( That one
data point does not depend on earlier data points. This often breaks
in time series data). If all of above is true, we do not need an
anomaly detection techniques and we can use an algorithm like Random
Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
I would also suggest another approach of the problem. In this article the author said:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However you do have a few fraud data to test if your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that will help gathering more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more question about machine learning I suggest you to use this stackexchange community
I hope it will help you :)

How to prove the reliability of a predictive model to executives?

I trained data from 500 devices to predict their performance. Then I applied my trained model to a test data set for another 500 devices and show pretty good prediction results. Now my executives want me to prove this model will work well on one million devices not only on 500. Obviously we don't have data for one million devices. And if the model is not reliable, they want me to discover the required amount of train data in order to make a reliable prediction on one million devices. How should I deal with these executives who don't have a background in statistical analysis and modeling? Any suggestions? Thanks
I have suggested to #cep to write up his comment as an answer - including providing the variance and bias calculations. In any case it could be added
"Do not be quick to assume Execs are essentially incapable in terms of
technical or mathematical concepts"
While there may be Dilbert managers out there .. somewhere I have seen few of them myself. More often managers get to their positions through hard work. They are likely to be rusty - but the abilities are likely still there.
In this case whether or not they have a "background in statistical analysis and modeling" they are applying common sense.
The first thing you might do is to provide the proper context and terminology. #cel has mentioned some of it: providing concrete values for :
assumptions
what assumptions are you making about the data.
What basis is there to consider extrapolation of the limited data
why should said extrapoated results be trusted to apply to the 99.5% of untested data
data distribution
basic descriptive statistics
your take on the apriori distribution of the data. Justify why you chose it
modeling
which models/approaches were considered and why
which model you actually chose and why
how did you arrive at the hyperparameters
how you trained the model
results
statistical measures of fit and error rate

Random forest algorithms able to switch data sets

I'm curious as to whether research been done into random forests that combine unsupervised with supervised learning in a way allowing a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this, and have come up empty. Can anyone point me in the right direction?
Note: I have already asked this question in the Data Sciences forum, but it's basically a dead forum so I came here.
(also read the comments and will incorporate the content in my answer)
From what I read between the lines is that you want to use Deep networks in a transfer learning setting. However, this would not be based on decision trees.
http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf
There are many elements in your question:
1.) Machine learning algorithms, in general, don't care about the source of your data set. So basically you can feed the learning algorithms 20 different data sets and it will use all of them. However, the data should have the same underlying concept (except in the transfer learning case see below). This means: if you combine cats/dogs data with bills data this will not work or make it much harder for the algorithms. At least all input features need to be identical (exceptions exists), e.g, it is hard to combine images with text.
2.) labeled/unlabeled: Two important terms: a data set is a set of data points with a fixed number of dimensions. Datapoint i might be described as {Xi1,....Xin} where each Xi might for example be a pixel. A label Yi is from another domain, e.g., cats and dogs
3.) unsupervised learning data without any labels. (I have the gut feeling that this is not what you want.
4.) semi-supervised learning: The idea is basically that you combine data where you have labels with data without labels. Basically you have a set of images labeled as cats and dogs {Xi1,..,Xin,Yi} and a second set which contains images with cats/dogs but no labels {Xj1,..,Xjn}. The algorithm can use this information to build better classifiers as the unlabeld data provide information on how images look in general.
3.) transfer learning (I think this come closest to what you want). The Idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards you want to train the classifier with images of cats/dogs/hamster. The training does not need to start from scratch but can use the cats/dogs classifier to converge much faster
4.) feature generation / feature construction The idea is that the algoritm learns features like "eyes". This features are used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning. Where the algoritm learns in the first step concepts like edges and constructs increasingly complex features like faces cats intolerant it can describe things like "the man on the elephant. This combined with transfer learning is probably what you want. However deep learning is based on Neural networks besides a few exceptions.
5.) outlier detection you provide a data set of cats/dogs as known images. When you provide the cats/dogs/hamster classifier. The classifier tells you that it has never seen something like a hamster before.
6.) active learning The idea is that you don't provide labels for all examples (Data points) beforehand, but that the algorithms asks you to label certain data points. This way you need to label much less data.

Use feedback or reinforcement in machine learning?

I am trying to solve some classification problem. It seems many classical approaches follow a similar paradigm. That is, train a model with some training set and than use it to predict the class labels for new instances.
I am wondering if it is possible to introduce some feedback mechanism into the paradigm. In control theory, introducing a feedback loop is an effective way to improve system performance.
Currently a straight forward approach on my mind is, first we start with a initial set of instances and train a model with them. Then each time the model makes a wrong prediction, we add the wrong instance into the training set. This is different from blindly enlarge the training set because it is more targeting. This can be seen as some kind of negative feedback in the language of control theory.
Is there any research going on with the feedback approach? Could anyone shed some light?
There are two areas of research that spring to mind.
The first is Reinforcement Learning. This is an online learning paradigm that allows you to get feedback and update your policy (in this instance, your classifier) as you observe the results.
The second is active learning, where the classifier gets to select examples from a pool of unclassified examples to get labelled. The key is to have the classifier choose the examples for labelling which best improve its accuracy by choosing difficult examples under the current classifier hypothesis.
I have used such feedback for every machine-learning project I worked on. It allows to train on less data (thus training is faster) than by selecting data randomly. The model accuracy is also improved faster than by using randomly selected training data. I'm working on image processing (computer vision) data so one other type of selection I'm doing is to add clustered false (wrong) data instead of adding every single false data. This is because I assume I will always have some fails, so my definition for positive data is when it is clustered in the same area of the image.
I saw this paper some time ago, which seems to be what you are looking for.
They are basically modeling classification problems as Markov decision processes and solving using the ACLA algorithm. The paper is much more detailed than what I could write here, but ultimately they are getting results that outperform the multilayer perceptron, so this looks like a pretty efficient method.

Best approach to what I think is a machine learning problem [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I am wanting some expert guidance here on what the best approach is for me to solve a problem. I have investigated some machine learning, neural networks, and stuff like that. I've investigated weka, some sort of baesian solution.. R.. several different things. I'm not sure how to really proceed, though. Here's my problem.
I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.
I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.
Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.
Thanks!
Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.
Sounds like you're a candidate for Support Vector Machines.
Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and is short.
Basically, you're going to take your events, and format them like:
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.
Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.
If you have some classified data - a bunch of sample problems paired with their correct answers -, start by training some simple algorithms like K-Nearest-Neighbor and Perceptron and seeing if anything meaningful comes out of it. Don't bother trying to solve it optimally until you know if you can solve it simply or at all.
If you don't have any classified data, or not very much of it, start researching unsupervised learning algorithms.
It sounds like any kind of classifier should work for this problem: find the best class (your dependent variable) for an instance (your events). A simple starting point might be Naive Bayes classification.
This is definitely a machine learning problem. Weka is an excellent choice if you know Java and want a nice GPL lib where all you have to do is select the classifier and write some glue. R is probably not going to cut it for that many instances (events, as you termed it) because it's pretty slow. Furthermore, in R you still need to find or write machine learning libs, though this should be easy given that it's a statistical language.
If you believe that your features (independent variables) are conditionally independent (meaning, independent given the dependent variable), naive Bayes is the perfect classifier, as it is fast, interpretable, accurate and easy to implement. However, with 100,000 instances and only 30-50 features you can likely implement a fairly complex classification scheme that captures a lot of the dependency structure in your data. Your best bet would probably be a support vector machine (SMO in Weka) or a random forest (Yes, it's a silly name, but it helped random forest catch on.) If you want the advantage of easy interpretability of your classifier even at the expense of some accuracy, maybe a straight up J48 decision tree would work. I'd recommend against neural nets, as they're really slow and don't usually work any better in practice than SVMs and random forest.
The book Programming Collective Intelligence has a worked example with source code of a price predictor for laptops which would probably be a good starting point for you.
SVM's are often the best classifier available. It all depends on your problem and your data. For some problems other machine learning algorithms might be better. I have seen problems that neural networks (specifically recurrent neural networks) were better at solving. There is no right answer to this question since it is highly situationally dependent but I agree with dsimcha and Jay that SVM's are the right place to start.
I believe your problem is a regression problem, not a classification problem. The main difference: In classification we are trying to learn the value of a discrete variable, while in regression we are trying to learn the value of a continuous one. The techniques involved may be similar, but the details are different. Linear Regression is what most people try first. There are lots of other regression techniques, if linear regression doesn't do the trick.
You mentioned that you have 30-50 independent variables, and some are more important that the rest. So, assuming that you have historical data (or what we called a training set), you can use PCA (Principal Componenta Analysis) or other dimensionality reduction methods to reduce the number of independent variables. This step is of course optional. Depending on situations, you may get better results by keeping every variables, but add a weight to each one of them based on relevant they are. Here, PCA can help you to compute how "relevant" the variable is.
You also mentioned that events that are occured more recently should be more important. If that's the case, you can weight the recent event higher and the older event lower. Note that the importance of the event doesn't have to grow linearly accoding to time. It may makes more sense if it grow exponentially, so you can play with the numbers here. Or, if you are not lacking of training data, perhaps you can considered dropping off data that are too old.
Like Yuval F said, this does look more like a regression problem rather than a classification problem. Therefore, you can try SVR (Support Vector Regression), which is regression version of SVM (Support Vector Machine).
some other stuff you can try are:
Play around with how you scale the value range of your independent variables. Say, usually [-1...1] or [0...1]. But you can try other ranges to see if they help. Sometimes they do. Most of the time they don't.
If you suspect that there are "hidden" feature vector with a lower dimension, say N << 30 and it's non-linear in nature, you will need non-linear dimensionality reduction. You can read up on kernel PCA or more recently, manifold sculpting.
What you described is a classic classification problem. And in my opinion, why code fresh algorithms at all when you have a tool like Weka around. If I were you, I would run through a list of supervised learning algorithms (I don't completely understand whey people are suggesting unsupervised learning first when this is so clearly a classification problem) using 10-fold (or k-fold) cross validation, which is the default in Weka if I remember, and see what results you get! I would try:
-Neural Nets
-SVMs
-Decision Trees (this one worked really well for me when I was doing a similar problem)
-Boosting with Decision trees/stumps
-Anything else!
Weka makes things so easy and you really can get some useful information. I just took a machine learning class and I did exactly what you're trying to do with the algorithms above, so I know where you're at. For me the boosting with decision stumps worked amazingly well. (BTW, boosting is actually a meta-algorithm and can be applied to most supervised learning algs to usually enhance their results.)
A nice thing aobut using Decision Trees (if you use the ID3 or similar variety) is that it chooses the attributes to split on in order of how well they differientiate the data - in other words, which attributes determine the classification the quickest basically. So you can check out the tree after running the algorithm and see what attribute of a comic book most strongly determines the price - it should be the root of the tree.
Edit: I think Yuval is right, I wasn't paying attention to the problem of discretizing your price value for the classification. However, I don't know if regression is available in Weka, and you can still pretty easily apply classification techniques to this problem. You need to make classes of price values, as in, a number of ranges of prices for the comics, so that you can have a discrete number (like 1 through 10) that represents the price of the comic. Then you can easily run classification it.

Resources