Removing duplicates for ML training set? - machine-learning

I am wondering what is the common practice (if there is any) for handling duplicate observations for machine learning training sets.
Dropping duplicate observations would surely speed up the computations so that's a benefit.
But would it not throw the model off by simplifying it? Do models take the number of duplicates into account? I have a feeling it depends on the model, but am not able to find a clear answer.

I can imagine this differing very much for your specific use case, your data, and the type of models you use.
Many models would tend towards getting a certain record right if there are many duplicates of that record, whether it's the C4.5 algorithm behind many decision trees or the stochastic gradient descent behind neural networks.
Removing duplicates could be a very legitimate thing to do if you learn that the duplicates are a result of faulty training data, because in that case you'd want to modify your data to represent the real world as accurately as possible.
Though if the nature of your data is just that many records are identical, but they're still legitimate data points, then for many applications you'd want your model to weigh those data points appropriately, because in the end, that's what your real-world data would look like as well.
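If the duplicates are legitimate but you still want the speed-up from fewer rows, one option is to collapse them into unique rows plus a count and use the count as a weight. The snippet below is only a sketch on made-up toy data (the `rows` list and the `weighted_mean` helper are my own illustration, not from any library); it shows that a weighted statistic on the deduplicated rows matches the full data, while naively dropping duplicates shifts it. Many libraries accept per-sample weights when fitting, but check whether your particular model supports that.

```python
from collections import Counter

# Hypothetical toy data: (feature, label) pairs with legitimate duplicates.
rows = [(1.0, 0), (1.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (3.0, 1)]

# Collapse identical rows into unique rows plus a count (the "weight").
counts = Counter(rows)
unique_rows = list(counts.keys())
weights = [counts[r] for r in unique_rows]


def weighted_mean(values, weights):
    """Mean of values where each value counts `weight` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Mean of the feature over the full data, duplicates included.
full_mean = sum(x for x, _ in rows) / len(rows)

# Same statistic from the deduplicated rows, using the counts as weights.
dedup_weighted_mean = weighted_mean([x for x, _ in unique_rows], weights)

# What you get if the duplicates are simply dropped.
dedup_plain_mean = sum(x for x, _ in unique_rows) / len(unique_rows)

print(full_mean, dedup_weighted_mean, dedup_plain_mean)
```

The first two numbers agree and the third does not, which is the "throwing the model off" effect the question worries about.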

Related

Train/Test Datasets in Machine Learning

I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (eg: state, year, Sex, etc.). We used all of our data to build these models; meaning we never split the data into training and test data sets.
If I were to go back in time to this job and split the data into training and test data sets, what would the advantages of that approach be, besides assessing the prediction accuracy of our models? What is an argument for not splitting the data and then fitting the model? I never really thought about it too much until now, and I'm curious as to why we didn't take that approach.
Thanks!
The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.
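To make the benchmark idea concrete, here is a minimal sketch of a hold-out split that compares a simple model against a trivial baseline on the held-out rows. Everything in it is an assumption for illustration: the synthetic data, the `holdout_split` and `mse` helpers, and the choice of a through-the-origin least squares fit as "the model".

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split off a held-out test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mse(predict, rows):
    """Mean squared error of a prediction function on (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


# Hypothetical data: y is roughly 2x plus a little noise.
noise = random.Random(1)
data = [(float(x), 2.0 * x + noise.gauss(0, 1)) for x in range(100)]
train, test = holdout_split(data)

# Baseline: always predict the mean of the training targets.
mean_y = sum(y for _, y in train) / len(train)

# Simple model: least squares slope through the origin, fitted on train only.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

baseline_mse = mse(lambda x: mean_y, test)
model_mse = mse(lambda x: slope * x, test)
print(baseline_mse, model_mse)
```

Without the held-out rows there would be no honest number to tell you whether the model beats the baseline at all.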

How to treat outliers if you have data set with ~2,000 features and can't look at each feature individually

I'm wondering how one goes about treating outliers at scale. In my experience, I usually need to understand why there are outliers in the first place: what causes them, whether there are any patterns, or whether they just happen randomly. I know that, theoretically, we usually define outliers as data points more than 3 standard deviations from the mean. But when the data is so big that you can't treat each feature one by one, and you don't know whether the 3 standard deviation rule still applies because of sparsity, how do we most effectively treat the outliers?
My intuition about high-dimensional data is that the data is sparse, so the definition of "outliers" is harder to determine. Do you think we would be able to get away with using ML algorithms that are more robust to outliers (tree-based models, robust SVM, etc.) instead of trying to treat outliers during the preprocessing step? And if we really want to treat them, what is the best way to do it?
I would first propose a framework for understanding the data. Imagine you are handed a dataset with no explanation of what it is. Analytics can actually be used to gain that understanding. Usually rows are observations and columns are parameters of some sort regarding the observations. You first want a framework for what you are trying to achieve. No matter what is going on, all data centers around the interests of people; that is why we decided to record it in some format. Given that, we are at most interested in:
1.) Object
2.) Attributes of object
3.) Behaviors of object
4.) Preferences of object
5.) Behaviors and preferences of object over time
6.) Relationships of object to other objects
7.) Effects of attributes, behaviors, preferences and other objects on object
So you want to identify these items. You open a data set and maybe you instantly recognize a time stamp. You then see some categorical variables and start doing relationship analysis for what is one to one, one to many, many to many. You then identify continuous variables. These all come together to give a foundation for identifying what is an outlier.
If we are evaluating objects over time, is the rare event indicative of something that happens rarely but that we want to know about? Forest fires are outlier events, but they are events of great concern. If I am analyzing machine data and seeing rare events, and these rare events are tied to machine failure, then they matter. Basically, does the rare event or parameter show evidence that it correlates with something you care about?
Now if you have so many dimensions that the above approach is not feasible in your judgement, then you should look at dimension reduction alternatives. I am currently employing Singular Value Decomposition as a technique. I am already seeing situations where I achieve the same level of predictive ability with 25% of the data. Which segues into my final thought: find a benchmark to decide whether the outliers matter or not.
Begin by leaving them in and running your analysis, then run the work again with them removed. What were the effects? I believe that when you are in doubt, simply do both and see how different the results are. If there is little difference, then maybe you are good to go. If there is a significant difference, then you want to take an evidence-based approach to why the outlier occurs. Simply because it is rare in your data does not mean it is rare. Think of certain types of crime that are under-reported (via arrest records). Lack of data showing politicians being arrested for insider trading does not mean that politicians are not doing insider trading en masse.
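The "do both and compare" idea can be automated across many columns. The following is a rough sketch under my own assumptions: a made-up two-column table, the 3 standard deviation rule from the question as the flagging criterion, and the column mean as the quantity being compared. In practice you would compare whatever result you actually care about (model score, coefficients, and so on).

```python
import statistics


def flag_outliers(values, k=3.0):
    """Mark values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mean) > k * sd for v in values]


# Hypothetical feature table: column name -> values.
columns = {
    "steady": [10.0, 11.0, 9.0, 10.0, 10.5, 9.5] * 5,
    "spiky": [1.0, 1.2, 0.9, 1.1, 1.0, 0.8] * 5 + [50.0],
}

# For every column, compute the statistic with and without flagged points.
report = {}
for name, values in columns.items():
    flags = flag_outliers(values)
    kept = [v for v, flagged in zip(values, flags) if not flagged]
    report[name] = {
        "n_flagged": sum(flags),
        "mean_with": statistics.fmean(values),
        "mean_without": statistics.fmean(kept),
    }

for name, row in report.items():
    print(name, row)
```

Columns where the two numbers barely differ can be left alone; the ones where they diverge are the short list worth investigating by hand.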

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a machine learning algorithm to identify fraud from transactions. My training dataset has, let's say, 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to build the model because it would be a biased dataset and the model would be a very bad one.
Let's say, for example, I take a sample of 200 good transactions which represent the good transactions in the dataset well, plus the 200 fraud ones, and build the model using this as the training data.
The question I was asked was how I would scale up from the 200 good transactions to the whole data set of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of scenario, so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown out in an interview. Information about the problem is brief and vague (we don't know, for example, the number of features!). The first thing you need to ask yourself is: what does the interviewer want me to respond? Based on this context, the answer has to be formulated in a similarly general way. This means that we don't have to find 'the solution' but instead give arguments that show that we actually know how to approach the problem.
The problem we are presented with is that the minority class (fraud) is only ~0.2% of the total. This is obviously a huge imbalance. A predictor that labelled all cases as 'non fraud' would get a classification accuracy of 99.8%! Therefore, something definitely has to be done.
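The 99.8% figure is easy to check with the class counts from the question. This is just a small sketch; the `confusion_counts` helper is my own and not from any library. It also shows why accuracy is the wrong yardstick here: the same predictor has zero recall on the fraud class.

```python
# Class counts taken from the question: 100,000 legal and 200 fraud.
n_legit, n_fraud = 100_000, 200
y_true = [0] * n_legit + [1] * n_fraud

# A "model" that labels every transaction as not fraud.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 means fraud."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.4f} recall={recall:.4f}")
```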
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be considering what techniques we have available to reduce imbalance. This can be done either by reducing the majority class (undersampling) or by increasing the number of minority samples (oversampling). Both have drawbacks, though. The first implies a severe loss of potentially useful information from the dataset, while the second can lead to overfitting. Some techniques to mitigate overfitting are SMOTE and ADASYN, which generate new synthetic samples with more variety than plain duplication.
Of course, cross-validation becomes paramount in this case. Additionally, if we do end up oversampling, this has to be 'coordinated' with the cross-validation approach: oversample only the training folds, so that copies of a sample never end up in both the training and validation data. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
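Here is a minimal sketch of what coordinating oversampling with cross-validation can look like: the split happens first, and only the training part of each fold is oversampled. It uses plain random duplication rather than SMOTE or ADASYN, and the data, `k_fold_indices` and `oversample_minority` are all hypothetical names made up for this example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oversample_minority(rows, seed=0):
    """Randomly duplicate minority-class rows until both classes match."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return rows[:]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra


# Hypothetical data: (row_id, label) with 5% positives.
data = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]

splits = []
for test_idx in k_fold_indices(len(data), k=5):
    held_out = set(test_idx)
    train = [data[i] for i in range(len(data)) if i not in held_out]
    test = [data[i] for i in test_idx]
    # Oversample AFTER splitting, and only the training part.
    splits.append((oversample_minority(train), test))

for train_balanced, test in splits:
    print(len(train_balanced), len(test))
```

Oversampling the whole dataset before splitting would put copies of the same fraud row on both sides of the split and make the validation score look better than it is.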
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we definitely do not want to optimise accuracy. Instead, we want to train the model to optimise either ROC-AUC or, more specifically, high recall even at a loss of precision, as we want to catch all the positive 'frauds' or at least raise an alarm, even though some will prove to be false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
Finally, it is only a matter of evaluating the model empirically to check which options and parameters are the most suitable given the dataset. Following these ideas does not guarantee 100% that we will be able to tackle the problem at hand. But it ensures we are in a much better position to learn from the data and get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However, your data is really imbalanced. Given that, you will probably be interested in anomaly detection. I will let you read the whole article for more details, but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
1.) You have labeled training data
2.) Anomalous and normal classes are balanced (say at least 1:5)
3.) Data is not autocorrelated. (That one data point does not depend
on earlier data points. This often breaks in time series data.)
If all of above is true, we do not need an anomaly detection
techniques and we can use an algorithm like Random Forests or Support
Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
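The resampling ensemble described in the quote can be sketched roughly as follows. To keep it self-contained I use a deliberately trivial stand-in classifier (a threshold halfway between the class means of a single feature) instead of SVM or Random Forest, and I assume anomalies have higher feature values than normal points. The data and all function names are made up for illustration.

```python
import random
import statistics


def make_balanced_subsets(normal, anomalous, n_subsets=5, ratio=4, seed=0):
    """Each subset holds ALL anomalies plus a random sample of normal rows."""
    rng = random.Random(seed)
    k = min(len(normal), ratio * len(anomalous))
    return [anomalous + rng.sample(normal, k) for _ in range(n_subsets)]


def fit_threshold(rows):
    """Stand-in classifier: threshold halfway between the two class means."""
    mean_normal = statistics.fmean([x for x, y in rows if y == 0])
    mean_anomalous = statistics.fmean([x for x, y in rows if y == 1])
    return (mean_normal + mean_anomalous) / 2


def ensemble_predict(thresholds, x):
    """Majority vote over the per-subset classifiers (1 means anomalous)."""
    votes = sum(1 for t in thresholds if x > t)
    return 1 if votes > len(thresholds) / 2 else 0


# Hypothetical 1-D data: 1000 normal points and 10 anomalies.
rng = random.Random(42)
normal = [(rng.gauss(0, 1), 0) for _ in range(1000)]
anomalous = [(rng.gauss(6, 1), 1) for _ in range(10)]

subsets = make_balanced_subsets(normal, anomalous)
thresholds = [fit_threshold(s) for s in subsets]

print(ensemble_predict(thresholds, 0.0), ensemble_predict(thresholds, 6.0))
```

Every normal row gets a chance to appear in some subset, so less information is thrown away than with a single undersampled training set.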
I would also suggest another approach to the problem. In this article the author says:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However, you do have a few fraud cases to test whether your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that will help gather more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more questions about machine learning, I suggest you use this stackexchange community
I hope it will help you :)

Multiple Naive Bayes classifiers

I'm looking at implementing a Naive Bayes classifier for a review site in order to identify spam reviews and have a couple of questions.
It occurs to me there are multiple types of spam, such as outright marketing rubbish with nothing to do with the thing they are reviewing, versus a deceptive review. Would it be wise to implement multiple classifiers for different purposes so that one gets better at general spam detection, whilst the other learns deceptive reviews?
In a similar vein, there are multiple categories of items being reviewed, so for the "deceptive review" classifier, would it be best to have just one classifier that tries to learn from all reviews? Or would it be better to have a classifier per category so that it may be able to learn nuances within those categories?
I know these won't be fool proof, it's just about flagging potential reviews for manual checking, but I'm just unclear on the best setup.
As long as you're using any sufficiently complex algorithm, you should be able to discriminate "good" vs "bad" data with either method. If you do it all with one model, you'll simply need to increase the model size so that the comprehensive model can build (at worst) independent paths to the two decisions, "spam" and "deception".
If you're training this on three separate classifications: good, spam, and deceptive; then you're doing fine either way. Note, however, that your model size is smaller with separate trainings, and your training times will be shorter, as there will be fewer inaccurate guesses on the way.
On the other hand, using two models for later actual use will likely slow down detection, since each output that passes the first model must be run through the second. For most applications, this time is not a significant factor.
Most of all, I would start with a separate model for each class: any problems with implementation and training will be faster to find and easier to isolate.

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I need just the data. I found a website that offers football game data for free and I tried it in Weka but the predictions were pretty bad so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that the data set you have is too small. Suppose you have ten variables and each variable has a range of 10 values. There are 10^10 possible configurations of these variables. It is unlikely your data set will be this large let alone cover all of the possible configurations. The trick is to narrow down the variables to the most relevant to avoid this large potential search space.
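To put numbers on this, here is a tiny sketch. The figure of 380 rows is only an assumed data set size for illustration, not something measured from your file.

```python
n_variables = 10
values_per_variable = 10
n_configurations = values_per_variable ** n_variables

n_rows = 380  # assumed number of rows, purely illustrative
max_coverage = n_rows / n_configurations
print(n_configurations, max_coverage)

# Keeping only the k most relevant variables shrinks the space quickly.
for k in (10, 5, 3, 2):
    space = values_per_variable ** k
    print(k, space, min(1.0, n_rows / space))
```

Even in the best case, where every row is a distinct configuration, the full space is barely touched, while two or three variables can be covered reasonably well.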
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to find the most relevant variable using entropy at each level in the tree. Each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one, and it will be the one where the first variable selected has the most overall significance when considered alone.
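As a rough illustration of "most relevant variable using entropy", the sketch below computes plain information gain for each attribute on a made-up table and picks the best one for the first split. Note the assumptions: the rows and attribute names are invented, and C4.5 (which J48 implements) actually normalizes this quantity into a gain ratio, which I leave out here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical match data, invented for this example.
rows = [
    {"venue": "home", "form": "good", "result": "win"},
    {"venue": "home", "form": "bad", "result": "win"},
    {"venue": "away", "form": "good", "result": "loss"},
    {"venue": "away", "form": "bad", "result": "loss"},
]

gains = {a: information_gain(rows, a, "result") for a in ("venue", "form")}
best_attribute = max(gains, key=gains.get)
print(gains, best_attribute)
```

Each attribute is scored on its own, which is exactly why a variable that only matters in combination with others can be passed over at the top of the tree.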
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how you find the proper algorithm. The answer is the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly with the training set then it will likely have poor performance with unseen data. In general, you should expect the model's performance against the training set to exceed its performance against unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best case scenario would be to build the model on all cases except one (the leave-one-out approach), repeat this for every case, and then see how the model performs on average on the held-out cases.
But this assumes the data at hand are not in some way biased.
A second pitfall is to let the test results bias the way you build the model. For example, trying different model parameters until you get an acceptable test response. With J48 it's not easy to allow this bias to creep in, but if it did then you have just used your test set as an auxiliary training set.
Continue collecting more data; testing as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly then it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
1.) The training data set should cover the 'significant' variable configurations
2.) You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.
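To show why testing against the training set leads to overconfidence, here is a minimal leave-one-out sketch. It is not J48 or Weka: I use a 1-nearest-neighbour rule on a single made-up feature as a stand-in model, and the rows and function names are purely illustrative.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the training row whose feature is closest to x."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def training_set_accuracy(rows):
    """Accuracy when the model is tested on the rows it was built from."""
    hits = sum(nearest_neighbor_predict(rows, x) == y for x, y in rows)
    return hits / len(rows)


def leave_one_out_accuracy(rows):
    """Hold out each row in turn, build on the rest, test on the held-out row."""
    hits = 0
    for i, (x, y) in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += nearest_neighbor_predict(train, x) == y
    return hits / len(rows)


# Hypothetical data: (feature, outcome) with "H" = home win, "A" = away win.
rows = [(0.0, "A"), (1.0, "A"), (2.1, "H"), (3.0, "A"), (4.2, "H"), (5.0, "H")]

print(training_set_accuracy(rows), leave_one_out_accuracy(rows))
```

On the training set this model looks perfect because every row is its own nearest neighbour; the leave-one-out figure is noticeably lower and is the more honest estimate.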
