Random Forest - Max Features - machine-learning

I have a question and need your support. I have a data set which I am analyzing, and I need to predict a target. To do this I did some data cleaning, among other things dropping highly linearly correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am stuck a bit, since I cannot really grasp the meaning of max_features and therefore what value to choose for it.
I found the following answer, where it is written:
max_features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n_features) or max_features=n_features.
Can anyone give me a good explanation of how to approach this parameter?
That would be really great.

max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it will depend on the features you have, their correlations and importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep the n_estimators on the high end.
PS: this procedure is basically a grid search optimization for one parameter, and is usually done via Cross Validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.
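A minimal sketch of this OOB-based search, assuming scikit-learn and stand-in data (replace X, y with your own prepared features and target):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; replace with your own prepared X and y.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

best_score, best_m = -np.inf, None
for m in range(2, X.shape[1] + 1):          # try every candidate max_features
    rf = RandomForestRegressor(
        n_estimators=500,                   # keep n_estimators on the high end
        max_features=m,
        oob_score=True,                     # score on out-of-bag samples
        n_jobs=-1,
        random_state=0,
    )
    rf.fit(X, y)
    if rf.oob_score_ > best_score:          # oob_score_ is the OOB R^2
        best_score, best_m = rf.oob_score_, m

print(f"best max_features = {best_m}, OOB R^2 = {best_score:.3f}")
```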

Related

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on optimizing a manufacturing-based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings for these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable and to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest explaining how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g. linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis - there is no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method quantifying how much each covariate affected the model over the training set.
There are a number of methods to estimate feature importance from a trained model. For Random Forest, the best-known are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy, i.e. permutation importance). Many popular ML libraries support feature importance estimation out of the box for Random Forest.
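For example, with scikit-learn a fitted random forest exposes MDI importances as feature_importances_, and permutation importance (an MDA-style measure) is available via sklearn.inspection; a small sketch with stand-in data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data; replace with your manufacturing dataset.
X, y = make_regression(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)

# MDI: impurity-based importances computed during training.
print("MDI importances:", rf.feature_importances_)

# MDA-style: drop in score when each feature is shuffled on held-out data.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
print("Permutation importances:", perm.importances_mean)
```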

Python/SKlearn: Using KFold Results in big ROC_AUC Variations

Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 repetitions of a StratifiedKFold split into training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new to machine learning, I am unsure whether such a variation in the ROC_AUC values is to be expected (with a minimum of 0.522 and a maximum of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem, the AUCs are usually low.
0.71 is an outlier, you were just lucky there (probably).
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations: instead of k-fold, use the ShuffleSplit function, run at least 100 iterations, and compute the average AUC with a 95% confidence interval. That will give you a better idea of how well the models perform.
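A rough sketch of that procedure with scikit-learn (synthetic stand-in data; swap in your own X, y and classifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; replace with your 450-row, 30-feature dataset.
X, y = make_classification(n_samples=450, n_features=30, random_state=0)

cv = ShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
aucs = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                       cv=cv, scoring="roc_auc")

mean = aucs.mean()
# Normal-approximation 95% confidence interval over the 100 iterations.
ci = 1.96 * aucs.std(ddof=1) / np.sqrt(len(aucs))
print(f"AUC = {mean:.3f} +/- {ci:.3f}")
```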
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for the model to generalize properly, and a fully grown decision tree may be too powerful for it.
Try adding some regularization by limiting the number of splits the tree is allowed to make. This should reduce the variance of your model, though possibly at the cost of a somewhat lower average performance.
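A sketch of what such regularization might look like in scikit-learn; the exact values are placeholders to tune, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Both parameters limit how many splits the tree can make, reducing variance.
tree = DecisionTreeClassifier(
    max_depth=4,            # shallow tree -> fewer splits, less overfitting
    min_samples_leaf=10,    # each leaf must cover at least 10 samples
    random_state=0,
)
```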
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and added data; it has no direct relation to the actual performance "number" that you get.
Cheers

What are the metrics to evaluate a machine learning algorithm

I would like to know which techniques and metrics are used to evaluate how accurate/good an algorithm is, and how to use a given metric to derive a conclusion about an ML model.
One way to do this is to use precision and recall, as defined here on Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model.
A while ago I compiled a list of metrics used to evaluate classification and regression algorithms, in the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation coefficient, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
The general technique is to run an algorithm on some data to get a model, then apply that model to new, previously unseen data, evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics; a tiny cross-validation sketch follows the list):
Jackknife
Cross-validation
K-fold cross-validation
Bootstrap
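As a tiny illustration of one of these, k-fold cross-validation in scikit-learn might look like this (the data and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)  # placeholder data
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="accuracy")
print(scores.mean(), scores.std())  # average metric and its spread across folds
```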
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to recognize words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding - for the moment - what your algorithm would look like.
Now, on the one hand - depending on the algorithm - if you feed it one of the remaining repetitions, it may give you a certainty estimate that characterizes the recognition of that single repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm: pass each repetition to the algorithm and compare the expected output with the actual output. In the end you will have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good starting point for further reading would be Pattern Recognition and Machine Learning by Christopher M. Bishop.
There are various metrics for evaluating the performance of an ML model, and there is no rule that only 20 or 30 metrics exist. You can create your own metrics depending on your problem; in many real-world problems you will need to define a custom metric.
Coming to the existing ones, they are already listed in the first answer; I will just highlight the merits and demerits of a few to give a better understanding.
Accuracy is the simplest of the metrics and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. It is not preferred when the dataset is imbalanced, because a model that always predicts the majority class can still score highly, and the single number is not very interpretable.
Log loss is a metric computed on probability scores, which gives you a better sense of how confidently a specific point is assigned to class 1. A nice property is that it is the loss optimized by logistic regression, a well-known ML technique.
The confusion matrix is best used for a 2-class classification problem: it gives four numbers, and the diagonal entries help you see how good your model is. From it, other interpretable metrics such as precision, recall and F1-score are derived.
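A small sketch computing several of these metrics with scikit-learn (the model and data are stand-ins):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             log_loss, precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Stand-in imbalanced binary dataset; replace with your own data.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)          # probability scores, needed for log loss

print("accuracy :", accuracy_score(y_te, y_pred))
print("precision:", precision_score(y_te, y_pred))
print("recall   :", recall_score(y_te, y_pred))
print("f1       :", f1_score(y_te, y_pred))
print("log loss :", log_loss(y_te, y_prob))
print("confusion matrix:\n", confusion_matrix(y_te, y_pred))
```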

Using weka to classify sensor data

I am working on a classification problem involving different sensors. Each sensor collects a set of numeric values.
I think it is a classification problem and want to use Weka as the ML tool. But I am not sure how to make Weka handle the input values, and which classifier will best fit this problem (one instance of a feature is a set of numeric values)?
For example, I have three sensors A, B and C. Can I define the collected data from all sensors as one instance? Such as: one instance of A is {1,2,3,4,5,6,7}, one instance of B is {3,434,534,213,55,4,7}, and one of C is {424,24,24,13,24,5,6}.
Thanks a lot for your time on reviewing my question.
Commonly the first classifier to try is Naive Bayes (you can find it under the "Bayes" directory in Weka), because it is fast, parameterless, and its classification accuracy is hard to beat whenever the training sample is small.
Random Forest (you can find it under the "Tree" directory in Weka) is another pleasant classifier, since it can process almost any data. Just run it and see whether it gives better results. It may only be necessary to increase the number of trees from the default 10 to some higher value; since you have 7 attributes, 100 trees should be enough.
Then I would try k-NN (you can find it under the "Lazy" directory in Weka; it's called "IBk"), because it commonly ranks among the best single classifiers for a wide range of datasets. The only issues with k-NN are that it scales badly for large datasets (> 1 GB) and that you need to fine-tune k, the number of neighbors. This value is set to 1 by default, but with an increasing number of training samples it is commonly better to set it to some higher integer value in the range from 2 to 60.
And finally, for some datasets where both Naive Bayes and k-NN perform poorly, it's best to use an SVM (under "Functions", it's called "LibSVM"). However, it can be a hassle to set up all the parameters of the SVM to get competitive results, hence I leave it until the end, when I already know what classification accuracies to expect. This classifier may not be the most convenient if you have more than two classes to classify.

progressive random forest?

I am considering using random forest for a classification problem. The data comes in sequences. I plan to use the first N (500) samples to train the classifier. Then I will use the classifier to classify the data that comes after that. It will make mistakes, and those mistakes can sometimes be recorded.
My question is: can I use those misclassified data points to retrain the original classifier, and how? If I simply add the misclassified ones to the original training set of size N, then their importance will be exaggerated, as the correctly classified ones are ignored. Do I have to retrain the classifier using all the data? What other classifiers can do this kind of learning?
What you describe is a basic version of the Boosting meta-algorithm.
It's better if your underlying learner has a natural way to handle sample weights. I have not tried boosting random forests (generally boosting is used on individual shallow decision trees with a depth limit between 1 and 3); it might work, but it will likely be very CPU intensive.
Alternatively, you can train several independent ensembles of boosted decision stumps in parallel with different PRNG seed values and then aggregate the final decision function as you would with a random forest (e.g. voting or averaging the class probability assignments).
If you are using Python, you should have a look at the scikit-learn documentation on the topic.
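As a hedged sketch of the second idea (several independently seeded boosted-stump ensembles whose class probabilities are averaged), using scikit-learn's AdaBoostClassifier; the number of ensembles and estimators are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)   # stand-in data

# Train a few boosted ensembles with different PRNG seeds.
# AdaBoost's default base learner is a depth-1 decision tree (a stump).
ensembles = [
    AdaBoostClassifier(n_estimators=100, random_state=seed).fit(X, y)
    for seed in range(5)
]

# Aggregate by averaging class-probability assignments, as in a random forest.
avg_proba = np.mean([e.predict_proba(X) for e in ensembles], axis=0)
y_pred = avg_proba.argmax(axis=1)
```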
Disclaimer: I am a scikit-learn contributor.
Here is my understanding of your problem.
You have a dataset and create two sub-datasets from it, say a training dataset and an evaluation dataset. How can you use the evaluation dataset to improve classification performance?
The point of this problem isn't to find a better classifier, but to find a good way to do the evaluation and then have a good classifier in the production environment.
Evaluation purpose
As the evaluation dataset has been set aside for evaluation, there is no way to do this. You must use another approach for training and evaluation.
A common way to do this is cross-validation:
Randomize the samples in your dataset and create ten partitions from it. Then do ten iterations of the following:
Take all partitions except the n-th for training, and evaluate on the n-th.
After this, take the median of the errors of the ten runs.
This will give you the error rate of your classifier.
The worst run gives you the worst case.
Production purpose
(no more evaluation)
You don't care about evaluation any more. So take all the samples of your dataset and give them to your classifier for training (re-run a complete, simple training). The result can be used in the production environment, but it can't be evaluated any more with any of your data. It should be at least as good as the worst case from the previous partition runs.
Flow sample processing
(production or learning)
When you are in a flow where new samples are produced over time, you will face cases where some samples correct earlier errors. This is the desired behavior, because we want the system to improve itself. But if you just patch the erroneous leaves in place, after some time your classifier will have nothing in common with the original random forest. You will be doing a form of greedy learning, like a meta tabu search. Clearly we don't want this.
If we try to reprocess the whole dataset plus the new sample every time a new sample is available, we will suffer terrible latency. The solution is like a human's: a background process runs from time to time (when the service is under low usage), all the data gets a complete re-learning, and at the end the old and new classifiers are swapped.
Sometimes the idle time is too short for a complete re-learning, so you have to use a cluster of compute nodes. This costs a lot of development effort, because you probably need to rewrite the algorithms; but by then you already have the biggest computer you could find.
Note: the swap process is very important to master. You should already have it in your production plan. What do you do if you want to change algorithms? Backup? Benchmark? Power cut? Etc.
I would simply add the new data and retrain the classifier periodically if it weren't too expensive.
A simple way to keep things in balance is to add weights.
If you weight all positive samples by 1/n_positive and all negative samples by 1/n_negative (including all the new negative samples you're getting), then you don't have to worry about the classifier getting out of balance.
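A sketch of that weighting with scikit-learn (the classifier and data are illustrative; the same sample_weight idea works with most estimators):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in imbalanced data; replace with your accumulated samples.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

# Weight each sample by the inverse frequency of its class.
n_pos, n_neg = (y == 1).sum(), (y == 0).sum()
sample_weight = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y, sample_weight=sample_weight)
# Equivalent shortcut: RandomForestClassifier(class_weight="balanced")
```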
