Decision Tree Splitting strategy - machine-learning

I have a dataset with 4 categorical features (Cholesterol, Systolic Blood pressure, diastolic blood pressure, and smoking rate). I use a decision tree classifier to find the probability of stroke.
I am trying to verify my understanding of the splitting procedure done by Python Sklearn.
Since it is a binary tree, there are three possible ways to split the first feature which is either to group categories {0 and 1 to a leaf, 2 to another leaf} or {0 and 2, 1}, or {0, 1 and 2}. What I know (please correct me here) is that the chosen split is the one with the least information gain (Gini impurity).
I have calculated the information gain for each of the three grouping scenarios:
{0 + 1 , 2} --> 0.17
{0 + 2 , 1} --> 0.18
{1 + 2 , 0} --> 0.004
However, sklearn's decision tree chose the first scenario instead of the third (please check the picture).
Can anyone please help clarify the reason for the selection? is there a priority for splits that results in pure nodes. thus selecting such a scenario although it has less information gain?

The algorithm does splits based on maximizing the information gain (=minimizing the entropy):
https://scikit-learn.org/stable/modules/tree.html#tree-algorithms-id3-c4-5-c5-0-and-cart

Related

How can I get the index of features for each score of rfecv?

I am useing recursive feature elimination and cross-validated (rfecv) in order to find the best accuracy score for features.
As I see _grid_scoresis the score the estimator produced when trained with the i-th subset of features. Is there any way to get the index of subset features for each score in the _grid_score?
I can get the index of the selected features for highest score using get_support ( 5 subset of features).
subset_features, scores
5 , 0.976251
4 , 0.9762072
3 , 0.97322212
How can I get the indexes of 4 or 3 subset of features?
I checked the output of rfecv.ranking_ and the 5 features have rank =1 , but the Rank= 2 only has one feature and so on.
A (single) subset of 3 (or 4) features was (probably) never chosen!
This seems to be a common misconception on how RFECV works; see How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?. There's an RFE for each cross-validation fold (say 5), and each will produce its own set of 3 features (probably different). Unfortunately (in this case at least), those RFE objects are not saved, so you cannot identify which sets of features each fold has selected; only the score is saved (source pt1, pt2) for choosing the optimal number of features, and then another RFE is trained on the entire dataset to reduce to the final set of features.

Ternary Classification of the type 'A', 'B' or 'any'?

For any general machine learning model (though I am currently working with neural networks), for the task of
classifying the elements of a set into three groups ('A' or 'B' or 'any'),
(here, labeling as 'A' means that the only valid label is 'A' (similarly 'B'), and 'any' means that both the tags 'A' and 'B' are equally valid), what kind of loss function should be used?
This can be solved using the techniques related to the more general problem of "ternary classification," but I think I'll lose some information by this generalization.
For the sake of example, let's say we want to classify verbs (English language) according to their tense forms (let us only consider the present and past tense)
Then the model should classify
{"work", "eat", "sing", ...} as "present tense"
{"worked", "ate", "sang", ...} as "past tense"
and,
{"read", "put", "cut", ...} as "any"
(note that the pronunciation is different for the present and past tense of 'read', but we are considering text-based classification)
This is different from the task that I am working on but probably should work as a valid example for this particular question.
PS: I am a student, and only have a basic understanding of this field, so if needed, please ask for any clarification regarding the question.
I think that you are in the situation of multi-label classification and not multi-class classifcation.
As stated here:
In machine learning, multi-label classification and the strongly
related problem of multi-output classification are variants of the
classification problem where multiple labels may be assigned to each
instance
Which means that instances can have more than 1 class associated to them.
Usually, when you work with a binary classification (e.g. 0, 1 classes) you can have as final layer of your network one neuron, which will output continues values between 0 and 1, using as activation function the sigmoid one, and as loss the binary cross-entropy
Given your situation you could decide to use:
two neurons as output of your neural network
for each one you can use the sigmoid activation function
and as loss the binary-cross entropy
in this way, each instance can be associated with both classes with a specific probability by the model.
This means that for each instance, you should associate two classes, or rather "labels".
For example, for your verbs you should have "past", "present" classes:
present past
work: 1 0
worked: 0 1
read 1 1
And your model will try to output two probabilities, with the architecture explained before:
present past sum
work: 0.9 0.3 1.2
worked: 0.21 0.8 1.01
read 0.86 0.7 1.5
Basically, you have two independent probabilites (if you check, the sum of one row is not 1), and therefore you can associate to one instance both classes.
Instead, if you wanted a mutually exclusive classification, with more than 2 classes, you should have used the categorical crossentropy as loss, and the softmax activation function in your last layer, the which will basically handle the outputs to generate a vector of probabilities that sums to 1. Example
present past both sum
work: 0.7 0.2 0.1 1
worked: 0.21 0.7 0.19 1
read 0.33 0.33 0.33 1
Check here to see an extensive example

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

Store textual dataset for binary classification

I am currently working on a machine learning project, and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, of varying length from 1 sentence to around 50 sentences(including punctuation). What is the best way to store this data to then pre-process and use for machine learning using python?
In most cases, you can use a method called Bag of Word, however, in some cases when you are performing more complicated task like similarity extraction or want to make comparison between sentences, you should use Word2Vec
Bag of Word
You may use the classical Bag-Of-Word representation, in which you encode each sample into a long vector indicating the count of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are(order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithms.
Word2Vec
Word2Vec is a more complicated method, which attempts to find the relationship between words by training a embedding neural network underneath. An embedding, in plain english, can be thought of the mathematical representation of a word, in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec are the vector representation(embeddings) of all the words shown in all the samples. The amazing thing is that we can perform algorithmic operations on the vector. A cool example is: Queen - Woman + Man = King reference here
To use Word2Vec, we can use a package called gensim, here is a basic setup:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data, size is the dimension of the embeddings, the larger size is, the more space is used to represent a word, and there is always overfitting we should think about. window is the size of the context we are cared about, it is the number of words before the target word we are looking at when we are predicting the target from its context, when training.
One common way is to create your dictionary(all the posible words) and then encode every of your examples in function of this dictonary, for example(this is a very small and limited dictionary just for example) you could have a dictionary : hello ,world, from, python . Every word will be associated to a position, and in every of your examples you define a vector with 0 for inexistence and 1 for existence, for example for the example "hello python" you would encode it as: 1,0,0,1

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup where there are 3 parameters, and for each, we have only 2 choices. Together, it produces 8 points as depicted on the graph. In practice, I intend to have thousands of possible combinations of 100s of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: Same maximal difference of TPR and FPR
Parameter score = TPR - FPR * cost_ratio
Overall performance: Average of all "parameter scores"
I have found a lot of reference material for the ROC curve with a single threshold and while there are other techniques available to determine the performance, the ones mentioned in this question is definitely considered a standard approach. I found no such reading material for the scenario presented on the right.
Bottomline, the question here is two-fold: (1) Provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, (2) Provide reference that claims the suggested methods to be a standard approach for the given scenario.
P.S.: I had first posted this question on the "Cross Validated" forum, but didn't get any responses, in fact, got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a Grid Search. As with any tuning of a model it's best to optimise hyper-parameters using one portion of the data and evaluate those parameters using another proportion of the data, so GridSearchCV is best for this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about what parameters you may want to optimise. A cross-validated grid search is a computational expensive process, so the smaller the search space the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
'max_features': [5, 8, 10],
'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Now I just have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example. Though you can chose from many default metrics in scikit or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for all possible combinations of parameters that I have given it, using 5-fold cross-validation evaluate how well those parameters performed using the ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print gs.best_params_
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print metrics.classification_report(y_train, clf.predict(X_train))
print metrics.classification_report(y_test, clf.predict(X_test))
Outputs:
precision recall f1-score support
0 1.00 1.00 1.00 1707
1 1.00 1.00 1.00 1793
avg / total 1.00 1.00 1.00 3500
precision recall f1-score support
0 0.51 0.46 0.48 780
1 0.47 0.52 0.50 720
avg / total 0.49 0.49 0.49 1500
We can see that this model has overtrained by the poor score on the test set. But this is not surprising as the data is just random noise! Hopefully when performing these methods on data with a signal you will end up with a well-tuned model.
EDIT
This is one of those situations where 'everyone does it' but there's no real clear reference to say this is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on. For example using Google Scholar to search for "grid search" "SVM" "gene expression"
I feeeeel like we're talking about Grid Search in scikit-learn. It (1), provides methods to evaluate optimal (hyper)parameters and (2), is implemented in a massively popular and well referenced statistical software package.

Resources