Machine learning algorithm for few samples and features - machine-learning

I intend to build a yes/no classifier. The data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, all of which are continuous numeric variables. I know the dataset is quite small. I have two questions:
A) What would be the best machine learning algorithm for this? SVM? a neural network? All that I have read seems to require a big dataset.
B) I could make the dataset a little bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case. Is this possible with every machine learning algorithm? (I have seen them used with SVM.)
Thanks a lot for your help!!!

My recommendation is to use a simple and straightforward algorithm, like a decision tree or logistic regression, although the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.

Naive Bayes is a good choice for a situation when there are few training examples. When compared to logistic regression, it was shown by Ng and Jordan that Naive Bayes converges towards its optimum performance faster with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models a joint probability distribution that performs better in this situation.
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.
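Since the answers above disagree, one way to settle it for a particular dataset is to cross-validate the candidates side by side. The following is only a minimal sketch: the data is a synthetic stand-in generated with make_classification (150 samples, 3 continuous features), and the depth limit on the tree and the 5-fold setup are assumptions chosen for illustration, not recommendations from the answers.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 150 samples, 3 continuous features.
X, y = make_classification(
    n_samples=150, n_features=3, n_informative=3, n_redundant=0,
    random_state=0,
)

# Simple candidates; the tree depth is capped to limit overfitting.
candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "gaussian naive bayes": GaussianNB(),
    "decision tree (depth <= 3)": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# Stratified folds keep the yes/no ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 150 samples the fold-to-fold spread is usually large, so small differences between the mean scores should not be over-interpreted.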

Related

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
Adaboost
I read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used. It is like a preprocessing technique.
My question is: is it best practice to compute feature importance separately for each algorithm, or to just use Information Gain? If it should be done per algorithm, which technique is used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (e.g. p-value < 0.05). The same applies to two-class Linear Discriminant Analysis.
Random Forest: can return a variable importance index that ranks the variables from most to least important.
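Two of the approaches listed above can be sketched with scikit-learn: a classifier-independent ranking by mutual information, and the importance index of a Random Forest. This is an illustrative sketch on synthetic data (30 features, an assumed 5 of them informative); the variable names and the choice of top_k = 5 are hypothetical. Both rankings are computed on the training split only, and the same columns are then selected from the test split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 features, of which 5 carry signal (an assumption).
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Classifier-independent ranking: mutual information with the label,
# estimated on the training data only.
mi = mutual_info_classif(X_train, y_train, random_state=0)
mi_ranking = np.argsort(mi)[::-1]

# Classifier-specific ranking: impurity-based Random Forest importances.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
rf_ranking = np.argsort(forest.feature_importances_)[::-1]

# Select the same columns from train and test, based on the training ranking.
top_k = 5
selected = mi_ranking[:top_k]
X_train_sel = X_train[:, selected]
X_test_sel = X_test[:, selected]

print("Top features by mutual information:", mi_ranking[:top_k])
print("Top features by Random Forest importance:", rf_ranking[:top_k])
```

The two rankings need not agree exactly, which is consistent with the point that different methods give somewhat different answers.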
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model. Good in the sense that you are satisfied with its performance and that it is robust, meaning that you should evaluate it on a validation and/or test set. These points are very important because we will analyse how the model makes its decisions, so a bad model will give you bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) that can be used for both purposes.
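As a rough illustration of the dataset-level analysis, here is a sketch that uses scikit-learn's permutation importance rather than SHAP. This is a swapped-in technique, chosen only because it needs no extra dependency: it measures how much the validation score drops when one feature is shuffled, and unlike SHAP it does not explain individual predictions. The data is synthetic and all names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 features, 4 of them informative (an assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_redundant=0,
    random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 1: build the model and check it on held-out data first.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
valid_accuracy = model.score(X_valid, y_valid)
print(f"validation accuracy: {valid_accuracy:.3f}")

# Step 2: shuffle each feature in turn on the validation set and record
# the average drop in accuracy over several repeats.
result = permutation_importance(
    model, X_valid, y_valid, n_repeats=10, random_state=0
)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```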
For detailed instructions on this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.
Hope it helps!

Classifiers of machine learning for different parameters of dataset

Why does the behaviour of different classifiers differ across datasets?
Based on which parameters can we choose a good classifier for a particular dataset?
For some datasets Naive Bayes gives better accuracy than an SVM classifier,
and for other datasets SVM performs better than Naive Bayes. Why is that
so? What is the reason?
Those are completely different classifiers. If one classifier were always better than the other, why would you need the "bad one" at all?
First Google hit about when SVMs are not the best choice:
https://www.quora.com/For-what-kind-of-classification-problems-is-SVM-a-bad-approach
There is no general answer to this question. To understand which classifier to use when, you need to understand the algorithm behind the classification procedure.
For instance, logistic regression models the log-odds of a binary y as a linear combination of the features (it does not assume a normally distributed y). It is generally useful when no single parameter is decisive on its own but the combined weighting of the factors makes a difference, for instance in text classification.
A decision tree, on the other hand, splits on the parameter that gives the most information. So if you have a set of parameters that are highly correlated with the label, it makes more sense to use decision-tree-based classifiers.
SVMs work by identifying suitable separating hyperplanes. They are generally useful when the data cannot be separated linearly in its original space but becomes easy to separate after being projected into a higher-dimensional space. This is a nice tutorial on SVM https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
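The point about projecting into a higher-dimensional space can be illustrated with a toy example. The sketch below assumes a synthetic dataset of two concentric circles (make_circles), which no straight line can separate, and compares a linear-kernel SVM with an RBF-kernel SVM under default hyperparameters.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, noise=0.05, factor=0.4, random_state=0)

# A linear kernel looks for a separating line in the original space.
linear_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()

# The RBF kernel implicitly maps the points to a higher-dimensional space
# in which a separating hyperplane exists.
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(f"linear kernel accuracy: {linear_acc:.3f}")
print(f"RBF kernel accuracy:    {rbf_acc:.3f}")
```

On this kind of data the linear kernel should stay close to chance level while the RBF kernel separates the rings almost perfectly.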
In short the only way to learn which classifier will be better in which situation is to understand how they work, and then figure out if they are best for your situation.
Another, cruder way would be to try every classifier and pick the best one, but I don't think you are interested in that.

Linear Discriminant Analysis vs Naive Bayes

What are the advantages and disadvantages of LDA vs Naive Bayes in
terms of machine learning classification?
I know some of the differences like Naive Bayes assumes variables to be independent, while LDA assumes Gaussian class-conditional density models, but I don't understand when to use LDA and when to use NB depending on the situation?
Both methods are pretty simple, so it's hard to say which one is going to work much better. It's often faster just to try both and calculate the test accuracy. But here is a list of characteristics that usually indicate that a given method is less likely to give good results. It all boils down to the data.
Naive Bayes
The first disadvantage of the Naive Bayes classifier is the feature independence assumption. In practice, the data is multi-dimensional and different features do correlate. Because of this, the results can be pretty bad, though not always significantly so. If you know for sure that the features are dependent (e.g. pixels of an image), don't expect Naive Bayes to shine.
Another problem is data scarcity. For any possible value of a feature, a likelihood is estimated by a frequentist approach. This can result in probabilities being close to 0 or 1, which in turn leads to numerical instabilities and worse results.
A third problem arises for continuous features. The basic Naive Bayes classifier works with categorical variables, so continuous features either have to be discretized, which throws away a lot of information, or modeled with an assumed per-class distribution (as Gaussian Naive Bayes does), which can fit poorly. If the data contains continuous variables that are far from that assumed distribution, it's a sign against Naive Bayes.
Linear Discriminant Analysis
LDA does not work well if the classes are not balanced, i.e. if the numbers of objects in the classes differ greatly. The solution is to get more data, which can be pretty easy or almost impossible, depending on the task.
Another disadvantage of LDA is that it's not applicable for non-linear problems, e.g. separation of donut-shape point clouds, but in high dimensional spaces it's hard to spot it right away. Usually you understand this after you see LDA not working, but if the data is known to be very non-linear, this is a strong sign against LDA.
In addition, LDA can be sensitive to overfitting and needs careful validation / testing.
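Because it is often faster just to try both, a side-by-side check could look like the sketch below. It assumes continuous features and therefore uses Gaussian Naive Bayes, the variant that models each feature with a per-class normal distribution. The data is synthetic, generated so that the features are correlated within each class, which is the setting where the independence assumption is expected to hurt.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in: 6 informative, mutually correlated continuous features.
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=6, n_redundant=0,
    random_state=0,
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Gaussian Naive Bayes": GaussianNB(),
}

# 5-fold cross-validated accuracy for each method.
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Which method comes out ahead depends on the data, so the numbers from this toy example say nothing about a real dataset.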

What classifier to use while performing unsupervised learning

I am new to Machine learning and I have this basic question. As I am weak in Math part of the algorithm I find it difficult to understand this.
When you are given a task to design a classifier (keep it simple: a 2-class classifier) using unsupervised learning (no training samples), how do you decide what type of classifier (linear or non-linear) to use? If we do not know this, then feature selection (which indirectly means knowing what the data set is) becomes very critical.
Am I thinking in the right direction, or is there something big that I don't know? Insight into this topic would be greatly appreciated.
Classification is by definition a "supervised learning" problem. Such models require examples of points within given classes to understand how to separate the classes from one another. If you are simply looking for relationships between unlabeled data points, you're solving an unsupervised problem. Look into clustering algorithms. k-means is where a lot of people start.
Hope this helps!
This is a huge problem. Yes, the term "clustering" is the best entry point for googling about that, but I understand that you want to train a classifier, where "training" means optimizing an objective function with parameters. The first choice is definitely not discriminative classifiers (such as linear ones), because with them, the standard maximum likelihood (ML) objective does not work without labels. If you absolutely want to use linear classifiers, then you have to tweak the ML objective, or better use another objective (approximating the classifier risk). But an easier choice is to rather look at generative models, such as HMMs, Naive Bayes, Latent Dirichlet Allocation, ... for which the ML objective works without labels.
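Both suggestions can be tried on toy data: k-means as the clustering entry point, and a Gaussian mixture as a simple generative model fitted by maximum likelihood without labels. The sketch below assumes two well-separated synthetic blobs; the hidden labels are used only afterwards, to check how well the discovered groups match them, and never during fitting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Two well-separated groups; y_hidden is NOT shown to the models.
X, y_hidden = make_blobs(
    n_samples=300, centers=[[-5.0, -5.0], [5.0, 5.0]],
    cluster_std=1.0, random_state=0,
)

# Clustering: k-means assigns each point to the nearest of 2 centroids.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Generative model: a 2-component Gaussian mixture fitted without labels.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
gmm_labels = gmm.predict(X)

# The adjusted Rand index ignores how the cluster ids are named.
kmeans_ari = adjusted_rand_score(y_hidden, kmeans_labels)
gmm_ari = adjusted_rand_score(y_hidden, gmm_labels)
print(f"k-means ARI: {kmeans_ari:.3f}")
print(f"Gaussian mixture ARI: {gmm_ari:.3f}")
```

Note that the cluster ids themselves are arbitrary: without labels, nothing tells the model which group is "class 1".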

How to approach a machine learning programming competition

Many machine learning competitions are held on Kaggle, where you are given a training set with a set of features and a test set whose output labels must be predicted using the training set.
It is pretty clear that supervised learning algorithms like decision trees, SVM, etc. are applicable here. My question is: how should I start to approach such problems? Should I start with a decision tree, an SVM, or some other algorithm, or is there any other approach? In other words, how do I decide?
So, I had never heard of Kaggle until reading your post. Thank you so much, it looks awesome. Upon exploring their site, I found a portion that will guide you well. On the competitions page (click all competitions), you see Digit Recognizer and Facial Keypoints Detection, both of which are competitions but are there for educational purposes, and tutorials are provided (a tutorial isn't available for Facial Keypoints Detection yet, as the competition is in its infancy). In addition to the general forums, competitions have their own forums, which I imagine are very helpful.
If you're interested in the mathematical foundations of machine learning, and are relatively new to it, may I suggest Bayesian Reasoning and Machine Learning. It's no cakewalk, but it's much friendlier than its counterparts, without a loss of rigor.
EDIT:
I found the tutorials page on Kaggle, which seems to be a summary of all of their tutorials. Additionally, scikit-learn, a Python library, offers a ton of descriptions/explanations of machine learning algorithms.
This cheatsheet http://peekaboo-vision.blogspot.pt/2013/01/machine-learning-cheat-sheet-for-scikit.html is a good starting point. In my experience, using several algorithms at the same time can often give better results, e.g. logistic regression and SVM, where the results of each one have a predefined weight. And test, test, test ;)
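One way to combine logistic regression and an SVM with predefined weights is scikit-learn's VotingClassifier with soft voting, which averages the predicted class probabilities. This is a sketch under assumptions: the data is synthetic, and the weights [1, 2] are arbitrary placeholders that would normally be tuned by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a competition training set.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)

# Soft voting averages class probabilities, so the SVM needs probability=True.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(),
                                 LogisticRegression(max_iter=1000))),
        ("svm", make_pipeline(StandardScaler(),
                              SVC(probability=True, random_state=0))),
    ],
    voting="soft",
    weights=[1, 2],  # placeholder weights: the SVM counts twice as much
)

# "Test, test, test": estimate the ensemble's accuracy by cross-validation.
ensemble_acc = cross_val_score(ensemble, X, y, cv=5).mean()
print(f"weighted ensemble accuracy: {ensemble_acc:.3f}")
```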
There is No Free Lunch in data mining. You won't know which methods work best until you try lots of them.
That being said, there is also a trade-off between understandability and accuracy in data mining. Decision Trees and KNN tend to be understandable, but less accurate than SVM or Random Forests. Kaggle looks for high accuracy over understandability.
It also depends on the number of attributes. Some learners can handle many attributes, like SVM, whereas others are slow with many attributes, like neural nets.
You can shrink the number of attributes by using PCA, which has helped in several Kaggle competitions.
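As a sketch of the PCA suggestion, the pipeline below standardizes the attributes, keeps as many principal components as are needed to explain 95% of the variance, and then fits a classifier on them. The 95% threshold, the synthetic data with 50 attributes, and the choice of logistic regression as the downstream model are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 50 attributes, many of them redundant.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=8, n_redundant=20,
    random_state=0,
)

# Putting PCA inside the pipeline means it is refitted on the training
# part of every fold, so no information leaks from the held-out part.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),  # keep 95% of the variance
    LogisticRegression(max_iter=1000),
)
pca_acc = cross_val_score(pipeline, X, y, cv=5).mean()

# Fit once on all data to see how many components were kept.
pipeline.fit(X, y)
n_kept = pipeline.named_steps["pca"].n_components_
print(f"components kept: {n_kept} of {X.shape[1]}")
print(f"cross-validated accuracy: {pca_acc:.3f}")
```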
