Good data set for Pre-processing - preprocessor

I am enrolled in an under-graduate course in Data Mining and I've got an assignment to code a Data Mining Pre-processor. I have the liberty to choose the programming language and the data set. I was wondering if anybody could suggest a good data set to use. I have been going through the UCI Repository and I've found many more such resources. But being a beginner I am not sure which data set would be a good choice. The preprocessor should be dealing with the following stuff:
Data cleaning
Missing Values
Errors
Outliers
Nomralization
De-duplication
Data Reduction
Sampling Techniques
Dimensionality Reduction
What kind of properties should I consider when choosing the data set? Any specific data set you would suggest?

You answered your own question. Choose list of data-set with the properties that you have mentioned as UCI repository has categorized dataset. You can chose anyone to start playing with it.
So to start with, if I were you,I would proceed step wise, have a feel how each of those look like and its effect on classifier performance and choose some of the popular dataset as they are used as benchmark dataset in most of the research paper. Much of those that you have listed are separate machine learning problems with lots of researches being conducted.
I would start with something like this :
for missing values : Iris, Voting,Heart disease
for Duplicate:921,810 song dataset(not form UCI I think)
Normalization : Any continuous valued dataset with different range for features
Sampling technique : Pima
Dimensionality reduction : Swiss Roll
Further, another best approach to look for the data set would be to refer some of respective publications. Such as , for dimensionality reduction, you can look into papers of PCA, ISOMAP etc, for sampling see SMOTE paper etc and see what type of data do they use for their experiments and proceed accordingly.

Related

Simple machine learning for website classification

I am trying to generate a Python program that determines if a website is harmful (porn etc.).
First, I made a Python web scraping program that counts the number of occurrences for each word.
result for harmful websites
It's a key value dictionary like
{ word : [ # occurrences in harmful websites, # of websites that contain these words] }.
Now I want my program to analyze the words from any websites to check if the website is safe or not. But I don't know which methods will suit to my data.
The key thing here is your training data. You need some sort of supervised learning technique where your training data consists of website's data itself (text document) and its label (harmful or safe).
You can certainly use the RNN but there also other natural language processing techniques and much faster ones.
Typically, you should use a proper vectorizer on your training data (think of each site page as a text document), for example tf-idf (but also other possibilities; if you use Python I would strongly suggest scikit that provides lots of useful machine learning techniques and mentioned sklearn.TfidfVectorizer is already within). The point is to vectorize your text document in enhanced way. Imagine for example the English word the how many times it typically exists in text? You need to think of biases such as these.
Once your training data is vectorized you can use for example stochastic gradient descent classifier and see how it performs on your test data (in machine learning terminology the test data means to simply take some new data example and test what your ML program outputs).
In either case you will need to experiment with above options. There are many nuances and you need to test your data and see where you achieve the best results (depending on ML algorithm settings, type of vectorizer, used ML technique itself and so on). For example Support Vector Machines are great choice when it comes to binary classifiers too. You may wanna play with that too and see if it performs better than SGD.
In any case, remember that you will need to obtain quality training data with labels (harmful vs. safe) and find the best fitting classifier. On your journey to find the best one you may also wanna use cross validation to determine how well your classifier behaves. Again, already contained in scikit-learn.
N.B. Don't forget about valid cases. For example there may be a completely safe online magazine where it only mentions the harmful topic in some article; it doesn't mean the website itself is harmful though.
Edit: As I think of it, if you don't have any experience with ML at all it could be useful to take any online course because despite the knowledge of API and libraries you will still need to know what it does and the math behind the curtain (at least roughly).
What you are trying to do is called sentiment classification and is usually done with recurrent neural networks (RNNs) or Long short-term memory networks (LSTMs). This is not an easy topic to start with machine learning. If you are new you should have a look into linear/logistic regression, SVMs and basic neural networks (MLPs) first. Otherwise it will be hard to understand what is going on.
That said: there are many libraries out there for constructing neural networks. Probably easiest to use is keras. While this library simplifies a lot of things immensely, it isn't just a magic box that makes gold from trash. You need to understand what happens under the hood to get good results. Here is an example of how you can perform sentiment classification on the IMDB dataset (basically determine whether a movie review is positive or not) with keras.
For people who have no experience in NLP or ML, I recommend using TFIDF vectorizer instead of using deep learning libraries. In short, it converts sentences to vector, taking each word in vocabulary to one dimension (degree is occurrence).
Then, you can calculate cosine similarity to resulting vector.
To improve performance, use stemming / lemmatizing / stopwords supported in NLTK libraires.

How to scale up a model in a training dataset to cover all aspects of training data

I was asked in an interview to solve a use case with the help of machine learning. I have to use a Machine Learning algorithm to identify fraud from transactions. My training dataset has lets say 100,200 transactions, out of which 100,000 are legal transactions and 200 are fraud.
I cannot use the dataset as a whole to make the model because it would be a biased dataset and the model would be a very bad one.
Lets say for example I take a sample of 200 good transactions which represent the dataset well(good transactions), and the 200 fraud ones and make the model using this as the training data.
The question I was asked was that how would I scale up the 200 good transactions to the whole data set of 100,000 good records so that my result can be mapped to all types of transactions. I have never solved this kind of a scenario so I did not know how to approach it.
Any kind of guidance as to how I can go about it would be helpful.
This is a general question thrown in an interview. Information about the problem is succinct and vague (we don't know for example the number of features!). First thing you need to ask yourself is What do the interviewer wants me to respond? So, based on this context the answer has to be formulated in a similar general way. This means that we don't have to find 'the solution' but instead give arguments that show that we actually know how to approach the problem instead of solving it.
The problem we have presented with is that the minority class (fraud) is only a ~0.2% of the total. This is obviously a huge imbalance. A predictor that only predicted all cases as 'non fraud' would get a classification accuracy of 99.8%! Therefore, definitely something has to be done.
We will define our main task as a binary classification problem where we want to predict whether a transaction is labelled as positive (fraud) or negative (not fraud).
The first step would be considering what techniques we do have available to reduce imbalance. This can be done either by reducing the majority class (undersampling) or increasing the number of minority samples (oversampling). Both have drawbacks though. The first implies a severe loss of potential useful information from the dataset, while the second can present problems of overfitting. Some techniques to improve overfitting are SMOTE and ADASYN, which use strategies to improve variety in the generation of new synthetic samples.
Of course, cross-validation in this case becomes paramount. Additionally, in case we are finally doing oversampling, this has to be 'coordinated' with the cross-validation approach to ensure we are making the most of these two ideas. Check http://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation for more details.
Apart from these sampling ideas, when selecting our learner, many ML methods can be trained/optimised for specific metrics. In our case, we do not want to optimise accuracy definitely. Instead, we want to train the model to optimise either ROC-AUC or specifically looking for a high recall even at a loss of precission, as we want to predict all the positive 'frauds' or at least raise an alarm even though some will prove false alarms. Models can adapt internal parameters (thresholds) to find the optimal balance between these two metrics. Have a look at this nice blog for more info about metrics: https://www.analyticsvidhya.com/blog/2016/02/7-important-model-evaluation-error-metrics/
Finally, is only a matter of evaluate the model empirically to check what options and parameters are the most suitable given the dataset. Following these ideas does not guarantee 100% that we are going to be able to tackle the problem at hand. But it ensures we are in a much better position to try to learn from data and being able to get rid of those evil fraudsters out there, while perhaps getting a nice job along the way ;)
In this problem you want to classify transactions as good or fraud. However your data is really imbalance. In that you will probably be interested by Anomaly detection. I will let you read all the article for more details but I will quote a few parts in my answer.
I think this will convince you that this is what you are looking for to solve this problem:
Is it not just Classification?
The answer is yes if the following three conditions are met.
You have labeled training data Anomalous and normal classes are
balanced ( say at least 1:5) Data is not autocorrelated. ( That one
data point does not depend on earlier data points. This often breaks
in time series data). If all of above is true, we do not need an
anomaly detection techniques and we can use an algorithm like Random
Forests or Support Vector Machines (SVM).
However, often it is very hard to find training data, and even when
you can find them, most anomalies are 1:1000 to 1:10^6 events where
classes are not balanced.
Now to answer your question:
Generally, the class imbalance is solved using an ensemble built by
resampling data many times. The idea is to first create new datasets
by taking all anomalous data points and adding a subset of normal data
points (e.g. as 4 times as anomalous data points). Then a classifier
is built for each data set using SVM or Random Forest, and those
classifiers are combined using ensemble learning. This approach has
worked well and produced very good results.
If the data points are autocorrelated with each other, then simple
classifiers would not work well. We handle those use cases using time
series classification techniques or Recurrent Neural networks.
I would also suggest another approach of the problem. In this article the author said:
If you do not have training data, still it is possible to do anomaly
detection using unsupervised learning and semi-supervised learning.
However, after building the model, you will have no idea how well it
is doing as you have nothing to test it against. Hence, the results of
those methods need to be tested in the field before placing them in
the critical path.
However you do have a few fraud data to test if your unsupervised algorithm is doing well or not, and if it is doing a good enough job, it can be a first solution that will help gathering more data to train a supervised classifier later.
Note that I am not an expert and this is just what I've come up with after mixing my knowledge and some articles I read recently on the subject.
For more question about machine learning I suggest you to use this stackexchange community
I hope it will help you :)

Random forest algorithms able to switch data sets

I'm curious as to whether research been done into random forests that combine unsupervised with supervised learning in a way allowing a single algorithm to find patterns in, and work with, multiple different data sets. I have googled every possible way to find research on this, and have come up empty. Can anyone point me in the right direction?
Note: I have already asked this question in the Data Sciences forum, but it's basically a dead forum so I came here.
(also read the comments and will incorporate the content in my answer)
From what I read between the lines is that you want to use Deep networks in a transfer learning setting. However, this would not be based on decision trees.
http://jmlr.csail.mit.edu/proceedings/papers/v27/mesnil12a/mesnil12a.pdf
There are many elements in your question:
1.) Machine learning algorithms, in general, don't care about the source of your data set. So basically you can feed the learning algorithms 20 different data sets and it will use all of them. However, the data should have the same underlying concept (except in the transfer learning case see below). This means: if you combine cats/dogs data with bills data this will not work or make it much harder for the algorithms. At least all input features need to be identical (exceptions exists), e.g, it is hard to combine images with text.
2.) labeled/unlabeled: Two important terms: a data set is a set of data points with a fixed number of dimensions. Datapoint i might be described as {Xi1,....Xin} where each Xi might for example be a pixel. A label Yi is from another domain, e.g., cats and dogs
3.) unsupervised learning data without any labels. (I have the gut feeling that this is not what you want.
4.) semi-supervised learning: The idea is basically that you combine data where you have labels with data without labels. Basically you have a set of images labeled as cats and dogs {Xi1,..,Xin,Yi} and a second set which contains images with cats/dogs but no labels {Xj1,..,Xjn}. The algorithm can use this information to build better classifiers as the unlabeld data provide information on how images look in general.
3.) transfer learning (I think this come closest to what you want). The Idea is that you provide a data set of cats and dogs and learn a classifier. Afterwards you want to train the classifier with images of cats/dogs/hamster. The training does not need to start from scratch but can use the cats/dogs classifier to converge much faster
4.) feature generation / feature construction The idea is that the algoritm learns features like "eyes". This features are used in the next step to learn the classifier. I'm mainly aware of this in the context of deep learning. Where the algoritm learns in the first step concepts like edges and constructs increasingly complex features like faces cats intolerant it can describe things like "the man on the elephant. This combined with transfer learning is probably what you want. However deep learning is based on Neural networks besides a few exceptions.
5.) outlier detection you provide a data set of cats/dogs as known images. When you provide the cats/dogs/hamster classifier. The classifier tells you that it has never seen something like a hamster before.
6.) active learning The idea is that you don't provide labels for all examples (Data points) beforehand, but that the algorithms asks you to label certain data points. This way you need to label much less data.

Working with inaccurate (incorrect) dataset

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have below questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
-How can we evaluate our errors in this study? Since we have unlabeled dataset, should I look for labeling techniques? Or, should I use unsupervised methods? Or, should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please introduce me any ideas or references which can help me in this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science
A: Use feature engineering to fix unreliable data (make some transformations on unreliable data to make it reliable) or drop them out completely - bad features could significantly decrease the quality of the model
Q: Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
A: ML algorithms are not magic sticks, they can't figure out anything unless you tell them what you are looking for. Can you describe what means 'unreliable'? If yes, you can, as I mentioned, use feature engineering or write a code which will fix the data. Otherwise no ML algorithm will be able to help you, without the description of what exactly you want to achieve
Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
A: I don't think so just because the question itself is too open-ended. What means 'the quality of the data'?
Generally, here are couple of things for you to consider:
1) Spend some time on googling feature engineering guides. They cover how to prepare your data for you ML algorithms, refine it, fix it. Good data with good features dramatically increase the results.
2) You don't need to use all of features from original data. Some of features of original dataset are meaningless and you don't need to use them. Try to run gradient boosting machine or random forest classifier from scikit-learn on your dataset to perform classification (or regression, if you do regression). These algorithms also evaluate importance of each feature of original dataset. Part of your features will have extremely low importance for classification, so you may wish to drop them out completely or try to combine unimportant features together somehow to produce something more important.

How to identify in advance the key features from large data set where most of the data goes under one category using supervised learning

I have a very large data set extracted from machine(stream data) where most of the data fall under one category. if I train a classifier using the current data the accuracy will be very low. how to identify the key features in the giving data? also how can I measure the probability of some previous features in the time series?
Typical methods for identifying important features include PCA and ICA. However, even more valuable than these methods is having an understanding of the underlying system your data is representing.
It's difficult to answer without more information about the data structure. The best classification approach depends on the structure of your data and the aims of your analysis. There are some classifiers which can cope quite well with skewed data, I'd suggest that you have a look at some of the ensemble methods such as boosting and random or rotation forests. Some of these classification methods, such as rotation forests, provide information about variable importance as part of the training process. If you just want to work out which features are most important, you could try using CART/random forests. If you want detailed help, though, I'd strongly suggest that you provide more information about your data structure and what you'd like to achieve.

Resources