Machine learning: RandomForest data pre-processing

Before fitting a RandomForest, what should be done with continuous features? Should they be standard-scaled?

No. Decision trees, and Random Forests for that matter, don't really care whether they are dealing with continuous or discrete data: splits depend only on the ordering of feature values, not on their scale. So even if you don't standardize, it won't be an issue.
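A minimal sketch (assuming scikit-learn) of this invariance: fitting the same forest on raw and on standard-scaled features yields identical predictions, because scaling preserves the ordering of each feature's values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Fit on the raw features.
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Fit on standard-scaled features with the same random seed.
X_scaled = StandardScaler().fit_transform(X)
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_scaled, y)

# The predictions agree on every sample.
print(np.array_equal(raw.predict(X), scaled.predict(X_scaled)))  # True
```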

Related

Machine learning classifiers for different dataset characteristics

Why does the behaviour of different classifiers differ for different data?
Based on what properties of a dataset can we decide which classifier is good for it?
For some datasets naive Bayes gives better accuracy than an SVM classifier,
and for other datasets the SVM performs better than naive Bayes. Why is
that? What is the reason?
Those are completely different classifiers. If one classifier were always better than the other, why would you need the "bad" one at all?
First Google hit on when SVMs are not the best choice:
https://www.quora.com/For-what-kind-of-classification-problems-is-SVM-a-bad-approach
There is no general answer to this question. To understand which classifier to use when, you need to understand the algorithm behind each classification procedure.
For instance, logistic regression models the log-odds of the label as a linear function of the features. It is generally useful when no single feature is decisive on its own but a weighted combination of the features is, for instance in text classification.
A decision tree, on the other hand, splits on the feature that gives the most information. So if you have a set of features that are highly correlated with the label, it makes more sense to use decision-tree-based classifiers.
SVMs work by identifying adequate separating hyperplanes. They are generally useful when the data cannot be separated in the original space, but projecting it into a higher-dimensional space makes it easy to separate. This is a nice tutorial on SVMs: https://blog.statsbot.co/support-vector-machines-tutorial-c1618e635e93
In short, the only way to learn which classifier will be better in which situation is to understand how they work, and then figure out whether they fit your situation.
Another, cruder way is to try every classifier and pick the best one, but I don't think you are interested in that.
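For completeness, a minimal sketch (assuming scikit-learn) of that "crude" approach: run a few classifiers through the same cross-validation and compare. The dataset here is just a stand-in for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, clf in candidates.items():
    # Scaling helps logistic regression and the SVM; trees and NB are unaffected.
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```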

Linear Discriminant Analysis vs Naive Bayes

What are the advantages and disadvantages of LDA vs Naive Bayes in
terms of machine learning classification?
I know some of the differences, like Naive Bayes assuming variables to be independent while LDA assumes Gaussian class-conditional density models, but I don't understand when to use LDA and when to use NB, depending on the situation.
Both methods are pretty simple, so it's hard to say in advance which one will work much better. It's often faster to just try both and compute the test accuracy. But here is a list of characteristics that usually indicate whether a given method is less likely to give good results. It all boils down to the data.
Naive Bayes
The first disadvantage of the Naive Bayes classifier is the feature-independence assumption. In practice, data is multi-dimensional, and different features do correlate. Because of this, the result can be pretty bad, though not always significantly so. If you know for sure that features are dependent (e.g. pixels of an image), don't expect Naive Bayes to shine.
Another problem is data scarcity. The likelihood of each possible feature value is estimated from frequency counts. With scarce data, these estimates can end up at exactly 0 or 1, which in turn leads to numerical instabilities and worse results. (Laplace smoothing mitigates this.)
A third problem arises with continuous features. The classical Naive Bayes classifier works with categorical variables, so continuous features have to be discretized, throwing away a lot of information; variants such as Gaussian Naive Bayes model continuous features directly instead. A continuous variable in the data is a sign against plain categorical Naive Bayes.
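A small sketch (assuming scikit-learn) of the two ways to handle continuous features with Naive Bayes: model them directly with the Gaussian variant, or discretize them first and use the categorical variant with Laplace smoothing to avoid zero-frequency probabilities. The iris data is a placeholder.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_iris(return_X_y=True)

# Continuous features modelled directly with per-class Gaussians.
print("Gaussian NB:", cross_val_score(GaussianNB(), X, y, cv=5).mean())

# Continuous features binned into 5 categories, then treated as categorical.
binned = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform"),
    CategoricalNB(alpha=1.0, min_categories=5),  # alpha = Laplace smoothing
)
print("Binned NB:", cross_val_score(binned, X, y, cv=5).mean())
```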
Linear Discriminant Analysis
LDA does not work well if the classes are imbalanced, i.e. the numbers of objects in the various classes are highly different. The solution is to get more data, which can be pretty easy or nearly impossible, depending on the task.
Another disadvantage of LDA is that it's not applicable to non-linear problems, e.g. separating donut-shaped point clouds, though in high-dimensional spaces this is hard to spot right away. Usually you only realize it after seeing LDA fail, but if the data is known to be very non-linear, that's a strong sign against LDA.
In addition, LDA can be sensitive to overfitting and needs careful validation / testing.
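A quick sketch (assuming scikit-learn) of the donut case: LDA, being a linear method, sits near chance level on concentric circles, while a kernel method separates them easily.

```python
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two concentric rings: not separable by any single hyperplane.
X, y = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=0)

print("LDA:", cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean())
print("RBF SVM:", cross_val_score(SVC(), X, y, cv=5).mean())
```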

Machine learning algorithm for few samples and features

I intend to build a yes/no classifier. The problem is that the data does not come from me, so I have to work with what I have been given. I have around 150 samples; each sample contains 3 features, which are continuous numeric variables. I know the dataset is quite small. I would like to ask two questions:
A) What would be the best machine learning algorithm for this? An SVM? A neural network? Everything I have read seems to require a big dataset.
B) I could make the dataset a little bigger by adding some samples that do not contain all the features, only one or two. I have read that you can use sparse vectors in this case; is this possible with every machine learning algorithm? (I have seen them used with SVMs.)
Thanks a lot for your help!!!
My recommendation is to use a simple and straightforward algorithm, like a decision tree or logistic regression, although the ones you refer to should work equally well.
The dataset size shouldn't be a problem, given that you have far more samples than variables. But having more data always helps.
Naive Bayes is a good choice when there are few training examples. Compared with logistic regression, Ng and Jordan showed that Naive Bayes converges towards its optimal performance faster, i.e. with fewer training examples. (See section 4 of this book chapter.) Informally speaking, Naive Bayes models the joint probability distribution, which performs better in this situation.
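A rough sketch (assuming scikit-learn) of that observation in a setting like the asker's, roughly 150 samples with 3 continuous features. The synthetic data is purely illustrative; on any single draw the comparison can go either way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=150, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Compare test accuracy as the number of training examples grows.
for n in (20, 50, len(X_train)):
    nb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression().fit(X_train[:n], y_train[:n])
    print(f"n={n}: NB {nb.score(X_test, y_test):.2f}, "
          f"LR {lr.score(X_test, y_test):.2f}")
```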
Do not use a decision tree in this situation. Decision trees have a tendency to overfit, a problem that is exacerbated when you have little training data.

Sweep through all machine learning classifiers?

I'm using Weka to perform classification, clustering, and some regression on a few large data sets. I'm currently trying out all the classifiers (decision tree, SVM, naive Bayes, etc.).
Is there a way (in Weka or other machine learning toolkit) to sweep through all the available classifier algorithms to find the one that produces the best cross-validated accuracy or other metric?
I'd also like to find the best clustering algorithm for a separate clustering problem; perhaps the one with the lowest sum-of-squared error?
Isn't that a kind of overfitting, too: trying tons of classifiers and choosing the best one?
Also note that preprocessing is usually very important, and different classifiers may need different preprocessing; and each classifier has in turn a dozen or so parameters...
The same goes for clustering: don't choose a clustering algorithm by such a metric, because if you choose e.g. "lowest sum-of-squares", k-means will win. Not because it is better, but because it is more overfit to your evaluation method: k-means optimizes the sum-of-squares. The results may be poor on other metrics, but on SSQ they are, by design, a local optimum.
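A small sketch (assuming scikit-learn) of why "lowest sum-of-squares" is a rigged contest: k-means optimizes exactly that objective, so it tends to score best on it even when its clusters are not the meaningful ones (here, two interleaved moons that neither method recovers).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

def ssq(X, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
               for k in np.unique(labels))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
ag = AgglomerativeClustering(n_clusters=2).fit(X)

# k-means typically scores better on its own objective, by construction.
print("k-means SSQ:       ", ssq(X, km.labels_))
print("agglomerative SSQ: ", ssq(X, ag.labels_))
```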
Data mining is not something you can automate to a push-button level.
It's a skill that requires experience in how to preprocess, choose algorithms, adjust parameters, and evaluate the actual outcome. Otherwise, there would already be software on the market where you just feed in your data and get the optimal classifier out.

training a decision tree

I am trying to get started with machine learning. I have some training data representing pixel values of digits in images, and I am trying to train a decision tree on it. What would be a good way to get started? What tools should I consider (pointers to related documentation would help)? I also want to train a random forest on the data to compare its performance against the decision tree. Any guidance would be of great help.
The best way to get started is probably Weka. Apart from offering implementations of a random forest classifier as well as several decision trees (among lots of other algorithms), it also provides tools for processing and visualizing the data. It comes with a relatively easy-to-use GUI.
A random forest uses trees, so I'd counsel you to get trees working first. Once you know all about trees, you can read about forests and it will be very straightforward. However, you should start by trying to learn about machine learning rather than just jumping into a library. I would start by understanding how to use decision trees on Boolean features (much simpler), choosing splits by maximizing information gain (i.e. reducing entropy). Once you understand that algorithm well enough to run it by hand on a small dataset, read up on how to use decision trees on real-valued features. Then check out the library.
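A tiny sketch of the by-hand exercise suggested above: computing the information gain of a single Boolean feature split. The toy counts are made up for illustration.

```python
import math

def entropy(pos, neg):
    """Shannon entropy (in bits) of a binary label distribution."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

# Toy data: 10 samples, 6 labelled positive and 4 negative overall.
parent = entropy(6, 4)

# A Boolean feature sends 5 samples to each branch:
# left branch 5 pos / 0 neg, right branch 1 pos / 4 neg.
left, right = entropy(5, 0), entropy(1, 4)
gain = parent - (5 / 10) * left - (5 / 10) * right
print(f"information gain: {gain:.3f} bits")  # ~0.610
```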
