How to categorize continuous data? - machine-learning

I have two dependent continuous variables, and I want to use their combined values to predict the value of a third, binary variable. How do I go about discretizing/categorizing the values? I am not looking for clustering algorithms; I'm specifically interested in obtaining 'meaningful' discrete categories I can subsequently use in a Bayesian classifier.
Pointers to papers, books, online courses, all very much appreciated!

That is the essence of machine learning, and one of its most studied problems.
Least-squares regression, logistic regression, SVMs, and random forests are widely used for this type of problem, which is called binary classification.
If your goal is simply to classify your data in practice, several libraries are available, like scikit-learn in Python and Weka in Java. Both have great documentation.
But if you want to understand the inner workings of machine learning, just search (here or on Google) for machine learning resources.
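To make the original question concrete, here is a minimal sketch (made-up data, scikit-learn assumed) of one common approach: quantile-based binning of the two continuous predictors, followed by a categorical Naive Bayes classifier trained on the resulting bins.

```python
# Minimal sketch (toy data): discretize two continuous features, then fit Naive Bayes.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                     # two continuous predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # binary target (toy rule)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'quantile' gives equal-frequency bins; 'uniform' or 'kmeans' are alternatives.
disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_train_binned = disc.fit_transform(X_train).astype(int)   # fit bins on training data only
X_test_binned = disc.transform(X_test).astype(int)

clf = CategoricalNB().fit(X_train_binned, y_train)
print("test accuracy:", clf.score(X_test_binned, y_test))
```

Whether the bins are 'meaningful' still depends on the domain; the number of bins and the binning strategy are choices you would normally validate on held-out data.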

If you wanted to be a real nerd, you could generate a bunch of different candidate discretizations, train a classifier on each, then characterize the discretizations by features and run a classifier on those, and see which sorts of discretizations work best.
In general, discretization is more of an art, and it relies on a good understanding of what the input variable ranges mean.

Related

How to determine whether to use machine learning algorithms or data mining techniques for a given scenario?

I have been reading many articles on machine learning and data mining over the past few weeks: articles about the differences between ML and DM, their similarities, and so on. But I still have one question, and it may look like a silly one:
How do I determine when to use ML algorithms and when to use DM?
I have done some practical DM work using Weka for time series analysis (future population prediction, sales prediction), text mining using R/Python, etc. The same can be done with ML algorithms too, e.g. future population prediction using linear regression.
So how do I determine whether, for a given problem, ML or DM is the better fit?
Thanks in advance.
Probably the closest thing to the quite arbitrary and meaningless separation of ML and DM is unsupervised methods vs. supervised learning.
Choose ML if you have training data for your target function.
Choose DM when you need to explore your data.

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic regression
Naive Bayes
Random Forest
Adaboost
I have read a lot about the information gain technique, and it seems to be independent of the machine learning algorithm used; it is more like a preprocessing technique.
My question is: is it best practice to perform feature importance separately for each algorithm, or just use information gain? If the former, which techniques are used for each algorithm?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual-information-based feature selection (e.g. here), independent of the classifier (a sketch follows this list).
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
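As referenced in the first item above, here is a minimal sketch (toy data) of classifier-independent feature selection with mutual information, with the scores computed on the training split only:

```python
# Minimal sketch (toy data): mutual-information feature selection before any classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=30, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_train_sel = selector.fit_transform(X_train, y_train)  # scores estimated on training data
X_test_sel = selector.transform(X_test)                 # same features kept at test time

print("selected feature indices:", selector.get_support(indices=True))
```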
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (e.g. p-value < 0.05). (The same holds for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
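As an illustration of that last point, a minimal sketch (toy data) of reading a random forest's built-in, impurity-based importance ranking:

```python
# Minimal sketch (toy data): rank features by a random forest's impurity-based importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=5, random_state=0)
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

ranking = np.argsort(forest.feature_importances_)[::-1]   # most to least important
for idx in ranking[:5]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```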
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you are satisfied with its performance, and robust, meaning that you use a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of my classifier) that can be used for both purposes.
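A rough sketch of that idea with the SHAP library (toy data; the exact return format of shap_values differs between SHAP versions, so treat this as indicative rather than definitive):

```python
# Minimal sketch (toy data): per-feature contributions from a tree model via SHAP.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # per-sample, per-feature contributions

# Depending on the SHAP version, a binary classifier yields either a list of two
# arrays (one per class) or a single 3-D array; keep the positive-class values.
sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
mean_abs = np.abs(sv).mean(axis=0)       # global importance per feature
print(sorted(enumerate(mean_abs), key=lambda t: -t[1])[:5])
```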
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course, where lessons 2/3/4/5 cover this subject.
Hope it helps!

Why use support vector machines?

I have some questions about SVM :
1. Why use SVMs? In other words, what motivated their development?
2. What is the state of the art (2017)?
3. What improvements have been made?
SVMs work very well. In many applications, they are still among the best-performing algorithms.
We've seen some progress in particular on linear SVMs, which can be trained much faster than kernel SVMs.
Read more of the literature. Don't expect an exhaustive answer in this Q&A format, and show more effort on your part.
SVMs are most commonly used for classification problems where labeled data is available (supervised learning), and they are useful for modeling with limited data. For problems with unlabeled data (unsupervised learning), support vector clustering is a commonly employed algorithm. SVMs are natively binary classifiers, so they are most naturally suited to binary classification problems; multi-class problems require schemes such as one-vs-rest, each adding its own decision boundary. Your 2nd and 3rd questions are very ambiguous (and need a lot of work!), but suffice it to say that SVMs have found wide applicability in medical data science. Here's a link to explore more about this: Applications of Support Vector Machine (SVM) Learning in Cancer Genomics
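To make the supervised-classification point concrete, a minimal sketch (toy data, scikit-learn assumed) of fitting a kernel SVM for binary classification:

```python
# Minimal sketch (toy data): a kernel SVM for binary classification.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RBF kernel; C and gamma are the usual knobs to tune (e.g. with a grid search).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```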

When to use supervised or unsupervised learning?

What are the fundamental criteria for choosing supervised or unsupervised learning?
When is one better than the other?
Are there specific cases where you can only use one of them?
Thanks
If you have a labeled dataset you can use both. If you have no labels, you can only use unsupervised learning.
It's not a question of "better"; it's a question of what you want to achieve. E.g. clustering data is usually unsupervised: you want the algorithm to tell you how your data is structured. Categorizing is supervised, since you need to teach your algorithm what is what in order to make predictions on unseen data.
See 1.
On a side note: These are very broad questions. I suggest you familiarize yourself with some ML foundations.
Good podcast for example here: http://ocdevel.com/podcasts/machine-learning
Very good book / notebooks by Jake VanderPlas: http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/Index.ipynb
It depends on your needs. If you have a set of existing data including the target values that you wish to predict (labels), then you probably need supervised learning (e.g. is something true or false, or does this data represent a fish, a cat, or a dog? Simply put, you already have examples of right answers and you are just telling the algorithm what to predict). You also need to distinguish whether you need classification or regression. Classification is when you need to categorize the predicted values into given classes (e.g. is it likely that this person develops diabetes, yes or no? In other words, discrete values), and regression is when you need to predict continuous values (1, 2, 4.56, 12.99, 23, etc.). There are many supervised learning algorithms to choose from (k-nearest neighbors, naive Bayes, SVM, ridge, ...).
Conversely, use unsupervised learning if you don't have the labels (or target values). You're simply trying to identify clusters in the data as they come (e.g. k-means, DBSCAN, spectral clustering).
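A minimal sketch (toy data) contrasting the two settings: a supervised classifier fitted against labels versus an unsupervised clustering run on the same features without any labels.

```python
# Minimal sketch (toy data): supervised vs. unsupervised learning on the same features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised: labels are used during training; performance is measured on held-out labels.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("supervised test accuracy:", knn.score(X_test, y_test))

# Unsupervised: no labels are given; the algorithm just groups similar points.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster assignments for the first 10 points:", km.labels_[:10])
```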
So it depends and there's no exact answer but generally speaking you need to:
Collect and inspect your data. You need to know your data; only then can you decide which way to go or which algorithm will best suit your needs.
Train your algorithm. Be sure to have clean, good data, and bear in mind that in the case of unsupervised learning you can skip this step, as you don't have target values; you test your algorithm right away.
Test your algorithm. Run it and see how well it behaves. In the case of supervised learning, you can hold out some of the training data to evaluate how well your algorithm is doing.
There are many books online about machine learning and many online lectures on the topic as well.
It depends on the dataset that you have.
If you have a target feature in hand, then you should go for supervised learning. If you don't, then it is an unsupervised problem.
Supervised learning is like teaching the model with examples. Unsupervised learning is mainly used to group similar data; it also plays a major role in feature engineering.
Thank you.

How to approach a machine learning programming competition

Many machine learning competitions are held on Kaggle, where you are given a training set with a set of features and a test set whose output labels must be predicted using the training set.
It is pretty clear that supervised learning algorithms like decision trees, SVMs, etc. are applicable here. My question is: how should I start approaching such problems? Should I begin with a decision tree, an SVM, or some other algorithm, or is there a different approach altogether? In short, how do I decide?
So, I had never heard of Kaggle until reading your post. Thank you so much; it looks awesome. Upon exploring their site, I found a section that will guide you well. On the competitions page (click "all competitions"), you will see Digit Recognizer and Facial Keypoints Detection, both of which are competitions but exist for educational purposes, with tutorials provided (a tutorial isn't available for Facial Keypoints Detection yet, as that competition is in its infancy). In addition to the general forums, each competition has its own forum, which I imagine is very helpful.
If you're interested in the mathematical foundations of machine learning and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheat sheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM, where the results of each are combined with predefined weights. And test, test, test ;)
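A minimal sketch (toy data) of that kind of weighted combination, using scikit-learn's VotingClassifier over logistic regression and an SVM:

```python
# Minimal sketch (toy data): combine logistic regression and an SVM with predefined weights.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),        # probability=True enables soft voting
    ],
    voting="soft",
    weights=[2, 1],                            # predefined weights, as suggested above
)
ensemble.fit(X_train, y_train)
print("test accuracy:", ensemble.score(X_test, y_test))
```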
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
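A minimal sketch (toy data) of shrinking the attribute count with PCA before the classifier, wrapped in a pipeline so the projection is learned on the training data only:

```python
# Minimal sketch (toy data): PCA to reduce dimensionality before an SVM, in one pipeline.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=50, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale, project onto the top 10 principal components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```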
