I have a problem where inputs can have more than one label, i.e., a multi-label classification problem. I used scikit-learn's Decision Tree classifier for this and it gives pretty good results initially. But I am wondering how it works under the hood: how is the split done in a Decision Tree for multi-label classification? The key question is how a model that is initialized once can be trained on two different sets of labels at the same time. How does the Decision Tree solve the optimization task for both sets of labels?
Under the hood, every node in your decision tree carries the same set of labels as the root node; what differs from node to node is the probability assigned to each label. When you run model.predict(), the model returns the label with the highest probability as the prediction. You can use model.predict_proba() to see the probability of each label separately. This code collects those probabilities into a labelled table:
all_probs = pd.DataFrame(model.predict_proba(X_test), columns=model.classes_)
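For the truly multi-label case, scikit-learn fits one tree over a 2-D label matrix, and predict_proba then returns one probability array per label column (so the one-liner above applies per output). A minimal sketch, with toy data and illustrative variable names:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 4)                      # toy feature matrix
Y = (X[:, :2] > 0.5).astype(int)          # two binary labels per sample

model = DecisionTreeClassifier(random_state=0).fit(X, Y)

# For multi-output trees, predict_proba returns a list with one array
# per label column, and model.classes_ is a list of per-output classes.
for i, probs in enumerate(model.predict_proba(X[:5])):
    print(pd.DataFrame(probs, columns=model.classes_[i]))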
I'm confused about the intuition behind decision trees when used to predict continuous targets in machine learning.
I understand that decision trees use splits based on feature values to decide which branch of the tree to go down to reach a leaf.
That intuitively makes sense to me for classification with nominal targets: each leaf has a specific value (label), so after going down enough branches one eventually arrives at a discrete value, which is the label.
But if we're doing regression, where the model predicts a value on a continuum, for example a real number between 0 and 100, how could there be enough leaves to allow the model to output any real number between 0 and 100?
Regression trees are only what you could call "pseudo-continuous", in contrast to, for example, linear regression models. Each leaf outputs a constant value, so the prediction is constant over certain ranges of the independent variable(s), with the ranges determined by the splits you mention.
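A small sketch of that piecewise-constant behaviour (toy data, illustrative only): even with a perfectly continuous target, a depth-3 tree can output at most 2^3 = 8 distinct values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 100, 500).reshape(-1, 1)   # inputs on [0, 100]
y = X.ravel()                                 # continuous target

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Only the leaf means can ever be predicted: at most 8 distinct outputs.
print(np.unique(tree.predict(X)))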
However, there exists some academic work that fits (regression) models in the nodes (...). See the accepted answer here:
https://stats.stackexchange.com/questions/439756/decision-tree-that-fits-a-regression-at-leaf-nodes
I have some questions regarding decision trees and random forest classifiers.
Question 1: Is a trained Decision Tree unique?
I believe it should be unique, since it maximizes information gain at each split. But if it is unique, why is there a random_state parameter in the decision tree classifier? If the tree is unique, it is reproducible every time, so there should be no need for random_state.
Question 2: What does a decision tree actually predict?
While going through the random forest algorithm I read that it averages the class probabilities from its individual trees. But as far as I know, a decision tree predicts a class, not a probability for each class.
Even without checking out the code, you will see this note in the docs:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
For splitter='best', this is happening here:
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
               random_state)
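The practical consequence is easy to demonstrate: fixing random_state makes fitting reproducible even though the features are permuted at each split. A small sketch with toy data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=8, random_state=0)

tree_a = DecisionTreeClassifier(random_state=42).fit(X, y)
tree_b = DecisionTreeClassifier(random_state=42).fit(X, y)

# Same seed, same tree; without random_state, ties between equally good
# splits may be broken differently from run to run.
print(export_text(tree_a) == export_text(tree_b))  # True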
And for your other question, read this:
...
Just build the tree so that the leaves contain not just a single class estimate, but also a probability estimate as well. This could be done simply by running any standard decision tree algorithm, and running a bunch of data through it and counting what portion of the time the predicted label was correct in each leaf; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
...
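In scikit-learn terms, that leaf-fraction estimate is exactly what predict_proba returns: the class fractions of the training samples that fall into the same leaf. A small sketch with toy data:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 2)
y = (X[:, 0] + 0.3 * rng.randn(300) > 0.5).astype(int)   # noisy labels

# A shallow tree has impure leaves, so probabilities are not just 0 or 1.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(tree.apply(X[:5]))          # leaf id each sample lands in
print(tree.predict_proba(X[:5]))  # class fractions of those leaves
print(tree.predict(X[:5]))        # argmax of those fractions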
I have a question about how to approach a multilabel classification problem.
Based on a literature review, I found that one of the most commonly used approaches is the problem transformation approach. It transforms the multilabel problem into a number of single-label problems, and the classification result is simply the union of the outputs of the single-label classifiers, following the binary relevance approach.
Since a single-label problem can be categorized as either binary classification (if there are two labels) or multiclass classification (if there are more than two labels), the current transformation approaches all seem to turn the multilabel problem into a number of binary problems. But this causes a data imbalance issue, because the negative class may have far more documents than the positive class.
So my question is: why not transform to a number of multiclass problems instead, and then apply direct multiclass classification algorithms to avoid the data imbalance problem? In this case, for a test document, each trained single-label multiclass classifier would predict whether to assign its label, and the union of all these predictions would be the final set of labels for that document.
In summary, compared to transforming a multilabel classification problem into a number of binary classification problems, transforming it into a number of multiclass classification problems could avoid the data imbalance problem. Everything else stays the same in the two methods: you construct |L| single-label (either binary or multiclass) classifiers, where |L| is the total number of distinct labels in the problem; you prepare |L| sets of training and test data; you run each single-label classifier on the test document; and the union of their predictions is the final label set for that document.
I hope someone can help clarify my confusion, thanks very much!
What you describe is a known strategy for transforming to multi-class problems, called the Label Powerset (LP) transformation strategy.
Drawbacks of this method:
The LP transformation may lead to up to 2^|L| transformed labels.
Class imbalance problem.
Refer to:
Cherman, Everton Alvares, Maria Carolina Monard, and Jean Metz. "Multi-label problem transformation methods: a case study." CLEI Electronic Journal 14.1 (2011): 4-4.
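A minimal sketch of the LP transformation itself (illustrative only; libraries such as scikit-multilearn ship a ready-made LabelPowerset wrapper): each distinct combination of labels becomes one class of an ordinary multiclass problem.

import numpy as np

# Toy multilabel matrix: rows are documents, columns are labels.
Y = np.array([[1, 0, 1],
              [1, 0, 1],
              [0, 1, 0],
              [1, 1, 0]])

# Label Powerset: map every distinct row (label combination) to a class id.
combos, y_lp = np.unique(Y, axis=0, return_inverse=True)
print(y_lp)         # [1 1 0 2] -- an ordinary multiclass target
print(len(combos))  # number of classes, bounded by 2**|L|

# Train any multiclass classifier on y_lp; map a predicted class c back
# to its label set with combos[c].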
I have a question about some basic concepts of machine learning. The examples I have seen give only a brief overview. For training the system, a feature vector is given as input. In supervised learning, the dataset is labelled. My confusion is about the labelling. For example, if I have to distinguish between two types of pictures, I will provide a feature vector and, on the output side, 1 for type A and 2 for type B. But what if I want to extract a region of interest (ROI) from a dataset of images? How should I label my data to extract the ROI using an SVM? I hope I am able to convey my confusion. Thanks in anticipation.
In supervised learning, such as SVMs, the dataset should be composed as follows:
<i-th feature vector><i-th label>
where i goes from 1 to the number of patterns (also called examples or observations) in your training set; each such tuple represents a single record that can be used to train the SVM classifier.
So you basically have a set of such tuples, and if you have just 2 labels (a binary classification problem) you can easily use an SVM. The SVM model is trained on the training set and its labels, and once the training phase has finished you can use another set (called the validation set or test set), structured in the same way as the training set, to test the accuracy of your SVM.
In other words the SVM workflow should be structured as follows:
train the SVM using the training set and the training labels
predict the labels for the validation set using the model trained in the previous step
if you know the actual validation labels, you can compare them with the predicted labels and check how many were correctly predicted. The ratio between the number of correctly predicted labels and the total number of labels in the validation set is a scalar in [0, 1] and is called the accuracy of your SVM model.
if you're interested in the ROI, you might want to check the trained SVM parameters (mainly the weights and bias) to reconstruct the separation hyperplane
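A minimal sketch of this workflow using scikit-learn (toy data; the linear kernel and variable names are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                       # toy feature vectors
y = (X[:, 0] > 0.5).astype(int)            # labels: type A (0) vs type B (1)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

svm = SVC(kernel='linear').fit(X_train, y_train)   # 1. train
y_pred = svm.predict(X_val)                        # 2. predict

print(np.mean(y_pred == y_val))   # 3. accuracy, a scalar in [0, 1]

# For a linear kernel, the separating hyperplane's weights and bias:
print(svm.coef_, svm.intercept_)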
It is also important to know that the training records must be correctly labelled a priori: if the training labels are not correct, the SVM will never be able to correctly predict the output for previously unseen patterns. You do not label your data according to the ROI you want to extract; the data must be correctly labelled beforehand: the SVM will see the entire set of type A pictures and the set of type B pictures and will learn the decision boundary that separates them. You must not trick the labels: if you do, you're not doing classification, machine learning, or pattern recognition; you're basically tricking the results.
I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has a probability for every label. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems that predict a single label such as wind or hot. But in this problem, each training instance has probabilities for all the labels. I have previously used the Mahout naive Bayes classifier, which takes a single specific label per text to build the model. How can I feed these per-label probabilities into a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in naive Bayes, for instance, when estimating parameters, instead of each word getting a count of one for the class to which the document belongs, it gets a fractional count equal to the document's probability of belonging to that class. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture of multinomials model using EM, where the probabilities you have play the role of the membership/indicator variables for your instances.
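A minimal sketch of this fractional-count estimation with Laplace smoothing (toy data, illustrative only):

import numpy as np

# Toy corpus: rows are documents, columns are word counts (bag of words).
X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 1]], dtype=float)

# Soft labels: each row is a probability distribution over 2 classes.
P = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.5]])

# Each word occurrence contributes P[d, c] to class c instead of a hard 1.
word_counts = P.T @ X                       # shape (n_classes, n_words)
word_probs = (word_counts + 1) / (word_counts.sum(axis=1, keepdims=True) + X.shape[1])
class_priors = P.mean(axis=0)

print(class_priors)
print(word_probs)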
Alternatively, if your classifier were a neural net with a softmax output, instead of the target output being a one-hot vector (a single 1 and lots of zeros), the target output becomes the probability vector you're supplied with.
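Concretely, cross-entropy against a soft target needs no change to the loss itself (a framework-agnostic sketch in plain numpy):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1.0, 0.2, -0.5])   # network output for one tweet
target = np.array([0.21, 0.79, 0.0])  # supplied probabilities, not one-hot

# Cross-entropy with a soft target: -sum_c target_c * log(pred_c).
pred = softmax(logits)
print(-np.sum(target * np.log(pred + 1e-12)))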
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features and labels 1, ..., k, and assign them weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
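A minimal sketch of that expansion, here using scikit-learn's sample_weight rather than Vowpal Wabbit (toy data; the choice of logistic regression is illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.3, 1.2],
              [2.0, 0.1]])              # two toy tweets, two features
P = np.array([[0.21, 0.79],
              [0.90, 0.10]])            # probabilities over k = 2 classes

k = P.shape[1]
X_exp = np.repeat(X, k, axis=0)         # each instance duplicated k times
y_exp = np.tile(np.arange(k), len(X))   # labels 0, ..., k-1 for each copy
w_exp = P.ravel()                       # weight p_c for the copy labelled c

clf = LogisticRegression().fit(X_exp, y_exp, sample_weight=w_exp)
print(clf.predict_proba(X))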