I want to figure out why, in binary classification, we only need one tree per boosting round, while in n-class multi-class classification we need n trees per round. I think XGBoost needs to compute the sum of scores for each class in every round and pass them to a softmax function, so binary classification should use two trees per round. Is there anything wrong with this reasoning?
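As a worked illustration of the premise (a numpy sketch, not XGBoost internals): for two classes, the softmax depends only on the difference of the two scores, so fixing one class's score at zero and boosting a single score reproduces exactly what two trees would give.

```python
# Sketch (not XGBoost internals): for two classes, softmax over scores
# [s0, s1] depends only on s1 - s0, so fixing s0 = 0 and learning a single
# score per round is equivalent to learning two scores.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = 0.7                                       # illustrative boosted score for the positive class
two_class = softmax(np.array([0.0, s]))       # as if a second tree's score were fixed at 0
print(two_class[1], sigmoid(s))               # both print the same probability
```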
Suppose I want to use a multilayer perceptron to classify 3 classes. When it comes to the number of output neurons, anybody would instantly say: use 3 output neurons with softmax activation. But what if I use 2 output neurons with sigmoid activations to output [0,0] for class 1, [0,1] for class 2, and [1,0] for class 3? Basically, I get a binary-encoded output with each bit produced by one output neuron. Wouldn't this technique decrease the number of output neurons (and hence the number of parameters) by a lot? A 100-class word classification task for a simple NLP application would require 100 output neurons with softmax, whereas you could cover it with 7 output neurons using the above technique. One disadvantage is that you won't get probability scores for all the classes. My question is: is this approach correct? If so, would you consider it more efficient than softmax for datasets with a large number of classes?
You could do this, but then you would have to rethink your loss function. The cross-entropy loss used in training a model for classification is the negative log-likelihood of a categorical distribution, which assumes you have a probability associated with every class. That loss function requires 3 output probabilities, and you only have 2 output values.
However, there are ways to do it anyway: you could use a binary cross-entropy loss on each element of your output, but this implies a different probabilistic assumption about your model. You'd be assuming that your classes share some characteristics: [0,0] and [0,1] share a value in the first bit. The decreased degrees of freedom will probably give you marginally worse performance (though other parts of the MLP may pick up the slack).
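To make that loss concrete, here is a minimal numpy sketch with made-up outputs: each of the 2 sigmoid units gets its own binary cross-entropy against one bit of the class code, instead of a 3-way categorical cross-entropy.

```python
# Sketch, assuming 3 classes coded as [0,0], [0,1], [1,0] and a network with
# 2 sigmoid outputs. Predicted values below are illustrative only.
import numpy as np

codes = np.array([[0, 0], [0, 1], [1, 0]])    # class index -> 2-bit code
y = np.array([0, 2, 1])                       # true class labels
targets = codes[y]                            # per-bit targets, shape (3, 2)

p = np.array([[0.1, 0.2],                     # sigmoid outputs of the 2 units
              [0.8, 0.3],
              [0.2, 0.9]])

eps = 1e-12
bce = -(targets * np.log(p + eps) + (1 - targets) * np.log(1 - p + eps))
loss = bce.sum(axis=1).mean()                 # sum over bits, mean over samples
print(loss)
```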
If you're really worried about the parameter cost of the final layer, then you might be better off not training it at all. This paper shows that a fixed Hadamard matrix in the final layer is as good as training it.
Imagine we have a classification problem on a dataset where the examples are only positive (or, equivalently, only negative). For instance, a problem where the winning class is specified by position (e.g. think of a tennis dataset where the first player listed is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea would be to generate negative examples by exchanging the positions of the features that are tied to each of the classes. Do you think this will give an unbiased dataset? Could we create negative duplicates of our original dataset this way and train a supervised learning algorithm on the doubled dataset?
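A sketch of the swap idea described in the question, using hypothetical player_1_* / player_2_* columns (the names and values are made up, not from any real dataset):

```python
# Sketch: build negatives by swapping the feature blocks tied to each
# position and flipping the label. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "player_1_rank": [1, 5, 10],
    "player_2_rank": [3, 2, 40],
})
df["label"] = 1                               # first position always wins in the raw data

swapped = df.rename(columns={
    "player_1_rank": "player_2_rank",
    "player_2_rank": "player_1_rank",
})[df.columns]                                # restore the original column order
swapped["label"] = 0                          # after the swap, the "first" player loses

augmented = pd.concat([df, swapped], ignore_index=True)
print(augmented)
```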
My crime classification dataset has indicator features, such as has_rifle.
The job is to train a model and predict whether data points are criminals or not. The metric is a weighted mean absolute error: if the person is a criminal and the model predicts him/her as not, the weight is as large as 5; if the person is not a criminal and the model predicts that he/she is, the weight is 1; otherwise the model predicts correctly and the weight is 0.
I've used the classif.multinom method in mlr in R and tuned the threshold to 1/6. The result is not that good. AdaBoost is slightly better, though neither is perfect.
I'm wondering which methods are typically used for this kind of binary classification problem with a sparse {0,1} feature matrix, and how to improve performance as measured by the weighted mean absolute error metric?
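To make the metric and the 1/6 threshold concrete, here is a small sketch (labels and scores are invented, and this is one reasonable reading of the metric): with a false-negative cost of 5 and a false-positive cost of 1, predicting "criminal" whenever P(criminal) >= 1/(1+5) = 1/6 minimizes the expected cost.

```python
# Sketch: the weighted error described above (FN weight 5, FP weight 1,
# correct prediction weight 0) and the cost-derived threshold.
import numpy as np

y_true = np.array([1, 0, 1, 0, 1])                 # illustrative labels (1 = criminal)
p_crim = np.array([0.30, 0.10, 0.05, 0.60, 0.90])  # illustrative predicted P(criminal)

threshold = 1.0 / (1.0 + 5.0)                      # predict positive when 5*p >= 1*(1-p)
y_pred = (p_crim >= threshold).astype(int)

weights = np.where((y_true == 1) & (y_pred == 0), 5,       # missed criminal
           np.where((y_true == 0) & (y_pred == 1), 1, 0))  # false alarm / correct
weighted_error = weights.mean()
print(threshold, weighted_error)
```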
Dealing with sparse data is not a trivial task. The lack of information makes it difficult to capture properties such as variance. I would suggest looking into subspace clustering methods or, more specifically, soft subspace clustering. The latter usually identifies relevant/irrelevant data dimensions, and it is a good approach when you want to improve classification accuracy.
Suppose I have a dataset that has only one continuous variable, and I try to use a decision tree algorithm to build a model that classifies the +ve and -ve labels in the dataset. I run 10-fold cross-validation.
How is the AUC calculated for the decision tree classifier? Will the algorithm check different threshold values of the classifier and determine the AUC from those?
What if I have more than 2 continuous variables?
Thanks!
Off topic, but hey:
AUC only makes sense for binary classification. The number of predictors does not matter.
Decision trees do not inherently have a 'threshold', but typically, in a classification problem, the leaves contain a probability distribution over the 2 classes, and so does the tree's prediction. So you could conceive of picking the positive class only if its probability is >= p, not just >= 0.5. Then you could draw a ROC curve and compute the AUC.
So it's a little unnatural to apply this to a decision tree, but it can be done.
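In scikit-learn terms (a sketch on synthetic data, not necessarily the setup in the question), the ROC AUC is computed from the tree's leaf probabilities via predict_proba, and this works the same with one continuous feature or many:

```python
# Sketch: cross-validated AUC for a decision tree, scored on the
# leaf-node class probabilities. Data here is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=1, n_informative=1,
                           n_redundant=0, n_clusters_per_class=1, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
aucs = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")  # uses predict_proba internally
print(aucs.mean())
```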
I am interested in understanding how probability estimates are calculated by random forests, both in general and specifically in Python's scikit-learn library (where probability estimates are returned by the predict_proba function).
Thanks,
Guy
The probabilities returned by a forest are the mean probabilities returned by the trees in the ensemble (docs).
The probabilities returned by a single tree are the normalized class histograms of the leaf a sample lands in.
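A quick way to see both points in scikit-learn (a sketch on toy data):

```python
# Sketch: a forest's predict_proba is the average of its trees' predict_proba,
# and each tree's probabilities come from class fractions in its leaves.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

forest_proba = rf.predict_proba(X[:5])
tree_mean = np.mean([t.predict_proba(X[:5]) for t in rf.estimators_], axis=0)
print(np.allclose(forest_proba, tree_mean))   # True
```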
In addition to what Andreas/Dougal said,
when you train the RF, you can inspect classifier.feature_importances_ afterwards to see which features occur high up in the RF's trees. (In current scikit-learn this attribute is available by default; the old compute_importances=True flag has been removed.)
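For example (a sketch on toy data):

```python
# Sketch: rank features by impurity-based importance after fitting the forest.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

ranked = np.argsort(rf.feature_importances_)[::-1]
for i in ranked:
    print(f"feature {i}: importance {rf.feature_importances_[i]:.3f}")
```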