Scikit-Learn Post pruning in RandomForestClassifier - machine-learning

Does the RandomForestClassifier() in scikit-learn support post-pruning? There are parameters such as max_depth, but they are more on the pre-pruning side.
Is it possible to build out each tree as far as possible and then prune it afterwards in order to avoid overfitting?
Any advice would be appreciated, thanks.
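For reference, a minimal sketch of one option: since version 0.22, scikit-learn exposes minimal cost-complexity pruning through the ccp_alpha parameter, which RandomForestClassifier also accepts. Each tree is grown fully and then pruned. The synthetic data and the value ccp_alpha=0.01 below are assumptions chosen only for illustration; in practice the value would need tuning, for example by cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used here only so the sketch is self-contained.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same seed for both forests, so they grow identical trees before pruning.
unpruned = RandomForestClassifier(n_estimators=50, random_state=0)
pruned = RandomForestClassifier(n_estimators=50, ccp_alpha=0.01, random_state=0)
unpruned.fit(X_train, y_train)
pruned.fit(X_train, y_train)


def mean_node_count(forest):
    """Average number of nodes per tree in a fitted forest."""
    counts = [tree.tree_.node_count for tree in forest.estimators_]
    return sum(counts) / len(counts)


print("nodes per tree, unpruned:", mean_node_count(unpruned))
print("nodes per tree, pruned:  ", mean_node_count(pruned))
print("test accuracy, unpruned: ", unpruned.score(X_test, y_test))
print("test accuracy, pruned:   ", pruned.score(X_test, y_test))
```

Whether pruning actually helps a forest is problem dependent, since averaging over many trees already reduces variance; comparing held-out accuracy as above is one way to check.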

Related

Visualize trees in H2O XGBoost model

I was looking at this answer on visualizing the gradient boosting tree model in H2O. It says the method used for GBM can be applied to XGBoost as well:
Finding contribution by each feature into making particular prediction by h2o ensemble model
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/productionizing.html
But when I try to use the method it mentions on an H2O XGBoost MOJO, it fails.
I checked the source code of hex.genmodel.tools.PrintMojo: https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/tools/PrintMojo.java
It seems that it only works on random forest and GBM models, not on XGBoost models.
Is there anyone who knows how to visualize trees in H2O XGBoost model? Thanks!
This is a feature H2O is currently adding; you can track its progress here: https://0xdata.atlassian.net/browse/PUBDEV-5743.
Note that in the ticket there is a suggestion in the comments on how to visualize the trees with native xgboost.
I finally found the solution. It does not seem to be documented for XGBoost, but it is indeed the same as for the other tree-based algorithms.
Just run this command to generate the first 50 trees from your model:
for tn in {1..50}
do
java -cp h2o-3.24.0.1/h2o.jar hex.genmodel.tools.PrintMojo --tree $tn -i <your mojo model> -o XGBOOST_$tn.gv
dot -Tpng XGBOOST_$tn.gv -o xgboost_$tn.png
done

Regression Model Comparison

I'm looking for metrics to compare various regression models (e.g. SVM, decision tree, neural network, etc.) in order to decide the merits of each for solving a specific problem.
For my problem I have just over 80,000 training samples with 12 variables, all of which are independent and identically distributed.
I've done most of my research into neural networks but I'm drawing a blank when trying to compare them against other models.
Any input (including reading suggestions) would be greatly appreciated, thanks!
You can compare regression models by calculating the mean squared error of each model over the same held-out test set. The best model will simply be the one with the lowest error.
Sadly, there is nothing like ROC curves for regression models, unless your output is a binary variable, as with logistic regression.
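As a minimal sketch of that comparison: the model names and all numbers below are made up for illustration, and in practice each prediction list would come from a fitted model evaluated on the same test set.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test targets and per-model predictions (illustrative values only).
y_test = [3.0, -0.5, 2.0, 7.0]
predictions = {
    "svm": [2.5, 0.0, 2.0, 8.0],
    "decision_tree": [3.0, -1.0, 1.0, 7.0],
    "neural_network": [2.0, 0.0, 3.0, 6.0],
}

# One score per model, all computed on the same test set.
scores = {name: mean_squared_error(y_test, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)

for name, mse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MSE = {mse}")
print("lowest error:", best)
```

With these toy numbers the decision tree has the lowest error (0.3125, against 0.375 and 0.8125).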

Train MFCC using Machine Learning Algorithm

I have a dataset of MFCCs that I know is good. I know how to put a row vector into a machine learning algorithm. My question is how to do it with an MFCC, as it is a matrix. For example, how would I put this into a machine learning algorithm?
http://archive.ics.uci.edu/ml/machine-learning-databases/00195/Test_Arabic_Digit.txt
Any algorithm will work. I am looking at a binary classifier, but will be looking into it more. Scikit seems like a good resource. For now I would just like to know how to input MFCC into an algorithm. Step by step would help me a lot! I have looked in a lot of places but have not found an answer.
Thank you
In Python, you can easily flatten a matrix into a vector, for example with NumPy's flatten function. Another idea that comes to mind (it is just an idea and may or may not work) is to use convolutions, which work very well with 2D structures.
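A minimal sketch of the flattening idea with NumPy. One caveat it has to deal with: utterances usually have different numbers of frames, so plain flattening would give vectors of different lengths. Padding or truncating to a fixed frame count is an assumption added here, not part of the suggestion above, and the helper mfcc_to_vector, the value n_frames=40, and the random data are all hypothetical.

```python
import numpy as np


def mfcc_to_vector(mfcc, n_frames=40):
    """Zero-pad or truncate an (frames, coefficients) matrix, then flatten it."""
    mfcc = np.asarray(mfcc, dtype=float)
    fixed = np.zeros((n_frames, mfcc.shape[1]))
    keep = min(n_frames, mfcc.shape[0])
    fixed[:keep] = mfcc[:keep]
    return fixed.flatten()  # row vector of length n_frames * coefficients


# Random stand-ins for two utterances with 13 coefficients per frame.
rng = np.random.default_rng(0)
utterances = [rng.normal(size=(35, 13)), rng.normal(size=(48, 13))]

# One row per utterance: a regular feature matrix that any classifier accepts.
X = np.vstack([mfcc_to_vector(m) for m in utterances])
print(X.shape)
```

The resulting X can be passed to a scikit-learn estimator's fit method together with one label per utterance.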

JAVA library for SVM-HMM or other sequential based classifiers

Is there an implementation in Java of an SVM classifier based on a Hidden Markov Model?
In other words, I'm looking for a Java implementation of a sequence-based classifier for words in a sentence, each with some features.
Any help?
Thanks
Mallet is a good package for sequence tagging. You can use Mallet-LibSVM to get Support Vector Machines as well.

Scalable or online out-of-core multi-label classifiers

I have been blowing my brains out over the past 2-3 weeks on this problem.
I have a multi-label (not multi-class) problem where each sample can belong to several of the labels.
I have around 4.5 million text documents as training data and around 1 million as test data. There are around 35K labels.
I am using scikit-learn. For feature extraction I was previously using TfidfVectorizer, which didn't scale at all; now I am using HashingVectorizer, which is better but still not that scalable given the number of documents I have.
vect = HashingVectorizer(strip_accents='ascii', analyzer='word', stop_words='english', n_features=(2 ** 10))
scikit-learn provides a OneVsRestClassifier into which I can feed any estimator. For multi-label I found only LinearSVC and SGDClassifier to work correctly. According to my benchmarks, SGD outperforms LinearSVC in both memory and time. So I have something like this:
clf = OneVsRestClassifier(SGDClassifier(loss='log', penalty='l2', n_jobs=-1), n_jobs=-1)
But this suffers from some serious issues:
1. OneVsRest does not have a partial_fit method, which makes out-of-core learning impossible. Are there any alternatives for that?
2. HashingVectorizer and TfidfVectorizer both work on a single core and don't have any n_jobs parameter. It's taking too much time to hash the documents. Any alternatives or suggestions? Also, is the value of n_features correct?
3. I tested on 1 million documents. The hashing takes 15 minutes, and when it comes to clf.fit(X, y) I receive a MemoryError because OvR internally uses LabelBinarizer, which tries to allocate a matrix of dimensions (y x classes) that is fairly impossible to allocate. What should I do?
4. Are there any other libraries out there which have reliable and scalable multi-label algorithms? I know of gensim and Mahout, but neither of them has anything for multi-label situations.
I would do the multi-label part by hand. The OneVsRestClassifier treats them as independent problems anyhow. You can just create the n_labels many classifiers and then call partial_fit on them. You can't use a pipeline if you only want to hash once (which I would advise), though.
Not sure about speeding up the hashing vectorizer. You'll have to ask @larsmans and @ogrisel for that ;)
Having partial_fit on OneVsRestClassifier would be a nice addition, and I don't see a particular problem with it, actually. You could also try to implement that yourself and send a PR.
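A minimal sketch of doing the multi-label part by hand, under stated assumptions: the batches generator is a hypothetical stand-in for streaming documents and a 0/1 label-indicator matrix from disk, the toy data has 3 labels rather than 35K, and SGDClassifier is left at its default loss rather than the log loss from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

n_labels = 3  # toy value; the question has around 35K labels
vect = HashingVectorizer(n_features=2 ** 18)

# One independent binary classifier per label, as OneVsRestClassifier would build.
classifiers = [SGDClassifier(random_state=0) for _ in range(n_labels)]


def batches(n_passes=5):
    """Hypothetical stand-in for reading (documents, indicator matrix) chunks from disk."""
    docs = [
        "cheap flights to paris",
        "python sklearn tutorial",
        "paris travel guide written in python",
    ]
    Y = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])  # rows: documents, columns: labels
    for _ in range(n_passes):
        yield docs, Y


for docs, Y in batches():
    X = vect.transform(docs)  # hash each batch once, reuse it for every label
    for k, clf in enumerate(classifiers):
        # classes must be given so a batch with only one class present still works
        clf.partial_fit(X, Y[:, k], classes=np.array([0, 1]))

# Prediction: stack the per-label decisions back into an indicator matrix.
X_new = vect.transform(["flights to paris"])
pred = np.column_stack([clf.predict(X_new) for clf in classifiers])
print(pred)
```

Only one batch and its hashed features are held in memory at a time, and no dense (samples x labels) matrix is ever built for the whole training set.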
1. The algorithm that OneVsRestClassifier implements is very simple: it just fits K binary classifiers when there are K classes. You can do this in your own code instead of relying on OneVsRestClassifier. You can also do this on at most K cores in parallel: just run K processes. If you have more classes than processors in your machine, you can schedule training with a tool such as GNU parallel.
2. Multi-core support in scikit-learn is work in progress; fine-grained parallel programming in Python is quite tricky. There are potential optimizations for HashingVectorizer, but I (one of the hashing code's authors) haven't come round to it yet.
3. If you follow my (and Andreas') advice to do your own one-vs-rest, this shouldn't be a problem anymore.
4. The trick in (1.) applies to any classification algorithm.
As for the number of features, it depends on the problem, but for large scale text classification 2^10 = 1024 seems very small. I'd try something around 2^18 - 2^22. If you train a model with L1 penalty, you can call sparsify on the trained model to convert its weight matrix to a more space-efficient format.
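A minimal sketch of those two suggestions together, assuming a toy binary problem: the documents and labels are made up, 2 ** 18 is just the low end of the suggested range, and the default SGDClassifier loss is used.

```python
from scipy import sparse
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy documents and binary labels, for illustration only.
docs = [
    "cheap flights to paris",
    "paris hotel deals",
    "python sklearn tutorial",
    "numpy array broadcasting",
]
y = [1, 1, 0, 0]

# A much larger hashed feature space than 2 ** 10; the matrix stays sparse.
vect = HashingVectorizer(n_features=2 ** 18)
X = vect.transform(docs)

# The L1 penalty drives many weights to exactly zero.
clf = SGDClassifier(penalty="l1", random_state=0)
clf.fit(X, y)

# Convert the dense weight matrix to a scipy sparse matrix in place.
clf.sparsify()

print(X.shape, type(clf.coef_).__name__, clf.coef_.nnz)
print(clf.predict(vect.transform(["flights to paris"])))
```

The sparse weights only save memory when most coefficients are zero, which is what the L1 penalty is for; with a mostly non-zero weight matrix the sparse format can be larger than the dense one.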
My argument for scalability is that instead of using OneVsRest, which is just the simplest of baselines, you should use a more advanced ensemble of problem-transformation methods. In my paper I provide a scheme for dividing the label space into subspaces and transforming the subproblems into multi-class single-label classifications using Label Powerset. To try this, just use the following code, which utilizes scikit-multilearn, a multi-label library built on top of scikit-learn:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from skmultilearn.cluster import IGraphLabelCooccurenceClusterer
from skmultilearn.problem_transform import LabelPowerset
from sklearn.linear_model import SGDClassifier
# base multi-class classifier SGD
base_classifier = SGDClassifier(loss='log', penalty='l2', n_jobs=-1)
# problem transformation from multi-label to single-label multi-class
transformation_classifier = LabelPowerset(base_classifier)
# clusterer dividing the label space using fast greedy modularity maximizing scheme
clusterer = IGraphLabelCooccurenceClusterer('fastgreedy', weighted=True, include_self_edges=True)
# ensemble
clf = LabelSpacePartitioningClassifier(transformation_classifier, clusterer)
clf.fit(x_train, y_train)
prediction = clf.predict(x_test)
A partial_fit() method for OneVsRestClassifier was recently added to sklearn, so hopefully it will be available in the upcoming release (it is already in the master branch).
The size of your problem makes it attractive to tackle with neural networks. Have a look at magpie; it should give much better results than linear classifiers.
