Could someone explain the intuition of the parameter "bootstrap" for the random forest model?
When looking at the scikit-learn page https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html:
bootstrap bool, default=True
Whether bootstrap samples are used when building trees. If False, the
whole dataset is used to build each tree.
I am even more confused because I thought that random forest was already a technique using bootstrap, so why is there a parameter to control it?
Roughly speaking, bootstrap sampling is just sampling with replacement, which naturally leads to some samples of the original dataset being left out while others are present more than once.
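To make the "left out / present more than once" effect concrete, here is a minimal sketch using only the standard library. The helper name `bootstrap_indices` and the dataset size are made up for illustration. For a large dataset the share of rows that never get drawn is expected to approach 1/e (about 0.368).

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices uniformly at random, with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

rng = random.Random(0)
n_rows = 10_000
sample = bootstrap_indices(n_rows, rng)

unique_fraction = len(set(sample)) / n_rows   # rows drawn at least once
left_out_fraction = 1.0 - unique_fraction     # rows this sample never saw

print(f"rows drawn at least once: {unique_fraction:.3f}")
print(f"rows left out:            {left_out_fraction:.3f}")
```

The left-out rows are what the out-of-bag estimate is computed on.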
I thought that random forest was already a technique using bootstrap
You are right in that the original RF algorithm as suggested by Breiman indeed incorporates bootstrap sampling by default (this is actually an inheritance from bagging, which is used in RF).
Nevertheless, implementations like the scikit-learn one understandably prefer to leave available the option not to use bootstrap sampling (i.e. sampling with replacement) and to use the whole dataset instead; from the docs:
The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
The situation is similar in the standard R implementation (there the respective parameter is called replace and, as in scikit-learn, it is set to TRUE by default).
So, nothing really strange here, beyond the (generally desirable) design choice of leaving room and flexibility for the practitioner to be able to select bootstrap sampling or not. In the RF early days, bootstrap sampling offered the extra possibility to calculate out-of-bag (OOB) error without using cross-validation, an idea that (I think...) eventually fell out of favor, and "freed" the practitioners to try leaving out the bootstrap sampling option, if this leads to better performance.
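A small sketch of the two settings in scikit-learn, on synthetic data (the dataset and the number of trees are arbitrary choices for illustration, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Default behaviour: each tree is fit on a bootstrap sample, so the rows a
# tree never saw can be used to compute an out-of-bag (OOB) score.
rf_boot = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
).fit(X, y)
print("OOB accuracy with bootstrap=True:", rf_boot.oob_score_)

# bootstrap=False: every tree is fit on the whole dataset; the trees can still
# differ because of the random feature subsets considered at each split.
rf_full = RandomForestClassifier(
    n_estimators=100, bootstrap=False, random_state=0
).fit(X, y)
print("training accuracy with bootstrap=False:", rf_full.score(X, y))

# Without bootstrapping there are no left-out rows, so scikit-learn rejects
# a request for an OOB score in that configuration.
oob_rejected = False
try:
    RandomForestClassifier(bootstrap=False, oob_score=True).fit(X, y)
except ValueError as err:
    oob_rejected = True
    print("ValueError:", err)
```

Whether bootstrap=False actually helps is something to check on held-out data for the problem at hand; the sketch only shows what each setting makes available.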
You may also find parts of my answer in Why is Random Forest with a single tree much better than a Decision Tree classifier? useful.
I am trying to figure out which features are being considered in each subsampling in my classification problem. For this, I am assuming that there is a random subset of features of length max_features that is considered when building every tree.
I am interested in this because I am using two different types of features for my problem, so I want to make sure that both types of features are being used for every node split in every tree. One way to at least make each tree consider all features is to set the max_features parameter to None. So one question here would be:
Does that mean that both types of features are being considered for every node split?
Another one derived from the previous question is:
Since Random Forest makes a subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well? Besides, can this subsampling be done by group of rows instead of randomly?
Besides, it does not seem to be a good assumption to use all the features in the max_features parameter, either for decision trees or for random forests, since it goes against the whole point and definition of random forest in terms of correlation among trees (I am not completely sure about this statement).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Any suggestion or comment is very welcome.
Feel free to correct any assumption.
In the source code I have been reading about this but could not find where this might be defined.
Source code inspected so far:
splitter.py code from decision tree
forest.py code from random forest
Does that mean that both types of features are being considered for every node split?
Given that you have correctly pointed out above that setting max_features to None will indeed force the algorithm to consider all features in every split, it is unclear what exactly you are asking here: all means all and, from the point of view of the algorithm, there are not different "types" of features.
Since Random Forest makes a subsampling for every tree, is this subsampling among cases (rows) or among columns (features) as well?
Both. But, regarding the rows, it is not exactly subsampling, it is actually bootstrap sampling, i.e. sampling with replacement, which means that, in each sample, some rows will be missing while others will be present multiple times.
Random forest is in fact the combination of two independent ideas: bagging, and random selection of features. The latter corresponds essentially to "column subsampling", while the former includes the bootstrap sampling I have just described.
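To see the "column subsampling" part in scikit-learn, one can inspect the fitted trees. This is only a sketch on synthetic data with 16 features (an arbitrary choice, so that the square root is a whole number). Note that, in scikit-learn, the random feature subset is drawn again at each split rather than once per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# max_features="sqrt": each split looks at a random subset of the 16 columns.
rf_subset = RandomForestClassifier(
    n_estimators=20, max_features="sqrt", random_state=0
).fit(X, y)

# max_features=None: every split considers all 16 columns, which reduces the
# forest to plain bagging of decision trees.
rf_all = RandomForestClassifier(
    n_estimators=20, max_features=None, random_state=0
).fit(X, y)

# Each fitted tree records how many features it evaluates per split.
per_split_subset = rf_subset.estimators_[0].max_features_
per_split_all = rf_all.estimators_[0].max_features_
print("features per split with 'sqrt':", per_split_subset)
print("features per split with None:  ", per_split_all)
```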
Besides, can this subsampling be done by group of rows instead of randomly?
AFAIK no, at least in the standard implementations (including scikit-learn).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Everything can be modified in the source code, literally; now, if it is really necessary (or even a good idea) to do so is a different story...
Besides, it does not seem to be a good assumption to use all the features in the max_features parameter
It does not indeed, as this is the very characteristic that distinguishes RF from the simpler bagging approach (short for bootstrap aggregating). Experiments have indeed shown that adding this random selection of features at each step boosts performance relative to simple bagging.
Although your question (and issue) sounds rather vague, my advice would be to "sit back and relax", letting the (powerful enough) RF algorithm do its job with your data...
I am trying to maximize precision in a binary classification problem (there is a high cost to false positives). The data set is really unbalanced as well. Would it make sense to run a DRF or XGBOOST model twice, using the weights column the second time in order to counter-act false positives?
Are there other methods within these H2O algorithms to maximize precision (rather than log-loss) besides this potential method? I am also going to use an ensemble (which does seem to increase precision). Cross-validation does not appear to help.
Firstly I would use balance_classes (set it to true). That will help, a bit, with unbalanced data. (Also look at class_sampling_factors and max_after_balance_size if you need to take fine control.)
My hunch would be that your suggestion to use the output of one model to weight a second model is dangerous. It sounds a bit like the idea of a stacked ensemble, but hand-coded, and custom code is more likely to have bugs. (But, if you do try it, it would be interesting to see the code and the results.)
To maximize precision I'd go with an ensemble, and put my effort into making 3 or 4 models that have different strengths and weaknesses. E.g. a GBM, a GLM, a deep learning model with all defaults, then a deep learning model using dropout (and more hidden nodes, to compensate).
Random search is one possibility for hyperparameter optimization in machine learning. I have applied random search to find the best hyperparameters of an SVM classifier with an RBF kernel. In addition to the continuous Cost and gamma parameters, I have one discrete parameter and also an equality constraint over some parameters.
Now, I would like to develop random search further, e.g. through adaptive random search. That means for example adaptation of the search direction or of the search range.
Does somebody have an idea how this can be done, or could somebody point to existing work on this? Other ideas for improving random search are also welcome.
Why try to reinvent the wheel? Hyperparameter optimization is a well-studied topic, with at least a few state-of-the-art methods that simply solve the problem for SVMs, including:
Bayesian optimization (usually through modeling model quality with Gaussian processes), see for example bayesopt http://rmcantin.bitbucket.org/html/
Tree of Parzen Estimators (sometimes better for discrete, complex hyperparameter spaces), included (in particular) in hyperopt http://hyperopt.github.io/hyperopt/
To improve the random search procedure, you can refer to Hyperband.
Hyperband is a method proposed by the UC Berkeley AMP Lab, aiming to improve the efficiency of tuning methods like random search.
I'd like to add that Bayesian optimization is a perfect example of an adaptive random search, so looks like it's exactly what you want to apply.
The idea of Bayesian optimization is to model the target function using Gaussian Processes (GP), select the best next point according to the current model, and update the model after seeing the actual outcome. So, effectively, Bayesian optimization starts like a random search, gradually builds a picture of what the function looks like, and shifts its focus to the most promising areas (note that "promising" can be defined differently by different particular methods: PI, EI, UCB, etc). There are further techniques to help it find the right balance between exploration and exploitation, for example a portfolio strategy. If that's what you mean by adaptive, then Bayesian optimization is your choice.
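The loop described above can be sketched in a few lines of NumPy. This is a toy illustration under several assumptions of mine, not a production implementation: a single hyperparameter on [0, 1], a GP with a fixed RBF length scale and unit prior variance, a UCB acquisition with an arbitrary coefficient of 2.0, and a hypothetical `objective` standing in for a cross-validated score.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_tq = rbf_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)        # K^-1 k(x_train, x_query)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))

def objective(x):
    """Hypothetical stand-in for a validation score (higher is better)."""
    return -(x - 0.7) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(0.0, 1.0, 201)

# Start like plain random search: a few random evaluations.
x_seen = rng.uniform(0.0, 1.0, size=3)
y_seen = objective(x_seen)

for _ in range(15):
    # Fit the GP to the centred scores observed so far.
    mean, std = gp_posterior(x_seen, y_seen - y_seen.mean(), candidates)
    ucb = mean + 2.0 * std                 # upper confidence bound acquisition
    x_next = candidates[np.argmax(ucb)]    # most promising point under the model
    x_seen = np.append(x_seen, x_next)
    y_seen = np.append(y_seen, objective(x_next))

best_x = x_seen[np.argmax(y_seen)]
print("best x found:", best_x)
```

Early iterations are dominated by the `std` term (exploration); as the model becomes more certain, the `mean` term takes over (exploitation). Discrete parameters and equality constraints would need extra handling that this sketch does not attempt.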
If you'd like to extend your code without external libraries, it's totally possible because Bayesian optimization is not that hard to implement. You can take a look at sample code that I used in my research, for example here is the bulk of GP-related code.
I am using scikit-learn's LogisticRegression object for regularized binary classification. I've read the documentation on intercept_scaling but I don't understand how to choose this value intelligently.
The datasets look like this:
10-20 features, 300-500 replicates
Highly non-Gaussian, in fact most observations are zeros
The output classes are not necessarily equally likely. In some cases they are almost 50/50, in other cases they are more like 90/10.
Typically C=0.001 gives good cross-validated results.
The documentation contains warnings that the intercept itself is subject to regularization, like every other feature, and that intercept_scaling can be used to address this. But how should I choose this value? One simple answer is to explore many possible combinations of C and intercept_scaling and choose the parameters that give the best performance. But this parameter search will take quite a while and I'd like to avoid that if possible.
Ideally, I would like to use the intercept to control the distribution of output predictions. That is, I would like to ensure that the probability that the classifier predicts "class 1" on the training set is equal to the proportion of "class 1" data in the training set. I know that this is the case under certain circumstances, but this is not the case in my data. I don't know if it's due to the regularization or to the non-Gaussian nature of the input data.
Thanks for any suggestions!
Have you tried oversampling the positive class by setting class_weight="auto"? That effectively oversamples the underrepresented classes and undersamples the majority class.
(The current stable docs are a bit confusing since they seem to have been copy-pasted from SVC and not edited for LR; that's just changed in the bleeding edge version.)
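If a small search is acceptable after all, something like the following sketch can show how the predicted share of "class 1" moves with intercept_scaling and class weighting. The data here is synthetic (roughly 90/10), not the asker's. In later scikit-learn releases the "auto" option was replaced by class_weight="balanced", which is what the sketch uses, and intercept_scaling only has an effect with the liblinear solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the data described in the question: ~90/10 classes.
X, y = make_classification(
    n_samples=400, n_features=15, weights=[0.9, 0.1], random_state=0
)
print("share of class 1 in the training data:", y.mean())

results = {}
for scaling in (1.0, 10.0, 100.0):
    for class_weight in (None, "balanced"):
        clf = LogisticRegression(
            solver="liblinear",         # intercept_scaling only applies here
            C=0.001,
            intercept_scaling=scaling,  # larger values weaken the penalty on the intercept
            class_weight=class_weight,
        ).fit(X, y)
        # Share of training rows predicted as class 1 under this setting.
        results[(scaling, class_weight)] = clf.predict(X).mean()

for key, share in results.items():
    print(key, round(float(share), 3))
```

Comparing the printed shares with the true class proportion gives a quick check of whether the intercept is being shrunk too hard.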
In a particular application I was in need of machine learning (I know the things I studied in my undergraduate course). I used Support Vector Machines and got the problem solved. It's working fine.
Now I need to improve the system. The problems are:
1. I get additional training examples every week. Right now the system retrains from scratch on the updated set (old examples + new examples). I want to make the learning incremental, using previous knowledge (instead of previous examples) together with the new examples to get a new model (knowledge).
2. Right now every training example belongs to one of 3 classes. I want the functionality of an "Unknown" class: anything that doesn't fit these 3 classes must be marked as "unknown". But I can't treat "Unknown" as a new class and provide examples for it too.
3. Assuming the "unknown" class is implemented: when the class is "unknown", the user of the application inputs what he thinks the class might be. Now I need to incorporate the user input into the learning. I have no idea how to do this either. Would it make any difference if the user inputs a new class (i.e. a class that is not already in the training set)?
Do I need to choose a new algorithm, or can Support Vector Machines do this?
PS: I'm using libsvm implementation for SVM.
I just wrote my Answer using the same organization as your Question (1., 2., 3).
1. Can SVMs do this--i.e., incremental learning? Multi-Layer Perceptrons of course can--because the subsequent training instances don't affect the basic network architecture, they'll just cause adjustment in the values of the weight matrices. But SVMs? It seems to me that (in theory) one additional training instance could change the selection of the support vectors. But again, I don't know.
2. I think you can solve this problem quite easily by configuring LIBSVM in one-against-many--i.e., as a one-class classifier. SVMs are one-class classifiers; application of an SVM for multi-class means that it has been coded to perform multiple, step-wise one-against-many classifications, but again the algorithm is trained (and tested) one class at a time. If you do this, then what's left after step-wise execution against the test set, is "unknown"--in other words, whatever data is not classified after performing multiple, sequential one-class classifications, is by definition in that 'unknown' class.
3. Why not make the user's guess a feature (i.e., just another independent variable)? The only other option is to make it the class label itself, and you don't want that. So you would, for instance, add a column "user class guess" to your data matrix. For the data points not in the 'unknown' category (for which the user will not offer a guess), populate it with some value most likely to have no effect; this value could be '0' or '1', but really it depends on how you have your data scaled and normalized.
Your first item will likely be the most difficult, since there are essentially no good incremental SVM implementations in existence.
A few months ago, I also researched online or incremental SVM algorithms. Unfortunately, the current state of implementations is quite sparse. All I found was a Matlab example, OnlineSVR (a thesis project only implementing regression support), and SVMHeavy (only binary class support).
I haven't used any of them personally. They all appear to be at the "research toy" stage. I couldn't even get SVMHeavy to compile.
For now, you can probably get away with doing periodic batch training to incorporate updates. I also use LibSVM, and it's quite fast, so it should be a good substitute until a proper incremental version is implemented.
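If a linear model is acceptable, one alternative to retraining from scratch is scikit-learn's SGDClassifier, which supports incremental updates through partial_fit. To be clear, this is a different technique from LibSVM's kernel SVM: it is a linear classifier with hinge loss trained by stochastic gradient descent. A rough sketch on synthetic data, with made-up "weekly" batches of 100 examples:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
classes = np.unique(y)   # every label must be declared on the first call

# loss="hinge" gives a linear SVM objective, optimised incrementally.
model = SGDClassifier(loss="hinge", random_state=0)

# Pretend the data arrives in weekly batches of 100 examples.
for start in range(0, len(X), 100):
    X_week = X[start:start + 100]
    y_week = y[start:start + 100]
    model.partial_fit(X_week, y_week, classes=classes)

accuracy = model.score(X, y)
print("accuracy on all data seen so far:", accuracy)
```

Because the set of classes has to be fixed at the first partial_fit call, a genuinely new class entered by a user later on would still require retraining from scratch.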
I also don't think SVMs can model the concept of an "unknown" sample by default. They typically work as a series of boolean classifiers, so a sample ends up being positively classified as something, even if that sample is drastically different from anything seen previously. A possible workaround would be to model the ranges of your features, randomly generate samples that exist outside of these ranges, and then add these to your training set.
For example, if you have an attribute called "color", which has a minimum value of 4 and a maximum value of 123, then you could add these to your training set
[({'color':3},'unknown'),({'color':125},'unknown')]
to give your SVM an idea of what an "unknown" color means.
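A sketch of how such out-of-range samples could be generated automatically. The helper `make_unknown_samples` and its `margin` parameter are hypothetical, and for simplicity it pushes every feature outside its observed range, which is only one way to read the idea:

```python
import random

def make_unknown_samples(feature_ranges, n_samples, margin=0.1, seed=0):
    """Generate samples whose every feature lies outside its observed range."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        features = {}
        for name, (low, high) in feature_ranges.items():
            span = max(high - low, 1.0) * margin
            offset = span * (1.0 - rng.random())   # lies in (0, span]
            # Place the value just below the minimum or just above the maximum.
            below = rng.random() < 0.5
            features[name] = low - offset if below else high + offset
        samples.append((features, "unknown"))
    return samples

ranges = {"color": (4, 123)}   # observed min and max from the example above
unknowns = make_unknown_samples(ranges, n_samples=5)
for features, label in unknowns:
    print(features, label)
```

Such samples only cover the region just outside the training range; anything unusual inside the range would still be assigned to one of the known classes.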
There are algorithms to train an SVM incrementally, but I don't think libSVM implements this. I think you should consider whether you really need this feature. I see no problem with your current approach, unless the training process is really too slow. If it is, could you retrain in batches (e.g. after every 100 new examples)?
You can get libSVM to produce probabilities of class membership. I think this can be done for multiclass classification, but I'm not entirely sure about that. You will need to decide some threshold at which the classification is not certain enough and then output 'Unknown'. I suppose something like setting a threshold on the difference between the most likely and second most likely class would achieve this.
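The thresholding rule suggested above is simple enough to write down directly. A minimal sketch, assuming the per-class probabilities have already been obtained from the classifier; the function name and the 0.2 margin are arbitrary illustrations and the margin would need tuning on real data:

```python
def predict_with_unknown(probabilities, labels, min_margin=0.2):
    """Return the top label, or 'unknown' when the top two classes are too close."""
    ranked = sorted(zip(probabilities, labels), reverse=True)
    (best_p, best_label), (second_p, _) = ranked[0], ranked[1]
    return best_label if best_p - second_p >= min_margin else "unknown"

labels = ["A", "B", "C"]
confident = predict_with_unknown([0.80, 0.15, 0.05], labels)
ambiguous = predict_with_unknown([0.40, 0.35, 0.25], labels)
print(confident)   # clear winner, so the label "A" is returned
print(ambiguous)   # top two classes are close, so "unknown" is returned
```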
I think libSVM scales to any number of new classes. The accuracy of your model may well suffer by adding new classes, however.
Even though this question is probably out of date, I feel obliged to give some additional thoughts.
Since your first question has been answered by others (there is no production-ready SVM which implements incremental learning, even though it is possible), I will skip it. ;)
Adding 'Unknown' as a class is not a good idea. Depending on its use, the reasons are different.
If you are using the 'Unknown' class as a tag for "this instance has not been classified, but belongs to one of the known classes", then your SVM is in deep trouble. The reason is that libsvm builds several binary classifiers and combines them. So if you have three classes - let's say A, B and C - the SVM builds the first binary classifier by splitting the training examples into "classified as A" and "any other class". The latter will obviously contain all examples from the 'Unknown' class. When trying to build a hyperplane, examples in 'Unknown' (which really belong to the class 'A') will probably cause the SVM to build a hyperplane with a very small margin that will poorly recognize future instances of A, i.e. its generalization performance will diminish. That's due to the fact that the SVM will try to build a hyperplane which puts most instances of A (those officially labeled as 'A') on one side of the hyperplane and some instances (those officially labeled as 'Unknown') on the other side.
Another problem occurs if you are using the 'Unknown' class to store all examples whose class is not yet known to the SVM. For example, the SVM knows the classes A, B and C, but you recently got example data for two new classes D and E. Since these examples are not classified and the new classes are not known to the SVM, you may want to temporarily store them in 'Unknown'. In that case the 'Unknown' class may cause trouble, since it possibly contains examples with enormous variation in the values of its features. That will make it very hard to create good separating hyperplanes, and therefore the resulting classifier will poorly recognize new instances of D or E as 'Unknown'. Probably the classification of new instances belonging to A, B or C will be hindered as well.
To sum up: Introducing an 'Unknown' class which contains examples of known classes or examples of several new classes will result in a poor classifier. I think it's best to ignore all unclassified instances when training the classifier.
I would recommend that you solve this issue outside the classification algorithm. I was asked for this feature myself and implemented a single webpage, which shows an image of the object in question and a button for each known class. If the object in question belongs to a class which is not known yet, the user can fill out another form to add a new class. If he goes back to the classification page, another button for that class will magically appear. After the instances have been classified, they can be used for training the classifier. (I used a database to store the known classes and to record which example belongs to which class. I implemented an export function to make the data SVM-ready.)