How to get the Gini coefficient using random forests in the caret R package?

I'm trying to understand the difference between the random forest implementation in the randomForest package and in the caret package.
For example, this specifies 2000 trees with mtry = 2 in randomForest and I show the Gini coefficient for each predictor:
library(randomForest)
library(tidyr)
library(dplyr)   ## %>%, add_rownames() and rename() come from dplyr
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(RF = sort(importance(rf1)[, "MeanDecreaseGini"], decreasing = TRUE)) %>%
  add_rownames() %>%
  rename(Predictor = rowname)
# Predictor RF
# 1 Petal.Width 45.57974
# 2 Petal.Length 41.61171
# 3 Sepal.Length 9.59369
# 4 Sepal.Width 2.47010
I'm trying to get the same info in caret, but I don't know how to specify the number of trees, or how to get the Gini coefficient:
rf2 <- train(Species ~ ., data = iris, method = "rf",
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2))
varImp(rf2) # not the Gini coefficient
# Overall
# Petal.Length 100.000
# Petal.Width 99.307
# Sepal.Width 0.431
# Sepal.Length 0.000
Also, the confusion matrix of rf1 shows some class errors while that of rf2 doesn't. What parameter is causing this difference?
# rf1 Confusion matrix:
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 4 46 0.08
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
This is quick and dirty. I know this isn't the right way to test the performance of the classifier, but I don't understand the difference in the results.

This might help to answer the question - see 2nd post:
caret: using random forest and include cross-validation
randomForest samples with replacement (a bootstrap). If you use method = "rf" in caret, you need to specify trControl in caret::train(); you want the same resampling method to be used in caret, i.e. a bootstrap, so you should set trControl = trainControl(method = "oob"). trainControl returns a list of values that defines how train() acts; its method can also be set to "cv" for cross-validation, "repeatedcv" for repeated cross-validation, etc. See the caret package documentation for more info.
You should get the same result as using randomForest, but do remember to set the seeds properly.
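For example, a minimal sketch of that setup (untested here; the seed value is arbitrary, and ntree is passed through train()'s ... on to randomForest):
library(caret)
set.seed(2015)                                   ## any fixed seed, set before train()
rf_oob <- train(Species ~ ., data = iris,
                method = "rf",
                metric = "Kappa",
                tuneGrid = data.frame(mtry = 2),
                ntree = 2000,                    ## handed through to randomForest
                trControl = trainControl(method = "oob"))  ## OOB estimates instead of 25 bootstraps
rf_oob$finalModel$importance                     ## MeanDecreaseGini of the underlying randomForest fit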

I was also recently looking for a way to get the MeanDecreaseGini variable from the caret implementation of randomForest. I realize this was posted long ago, so perhaps caret has been updated and my advice is no longer necessary, but I struggled to find a solution, so hopefully someone finds this useful.
To set the number of trees in caret you pass the ntree = xx argument during training, just like you would with randomForest (it is handed through to randomForest). Then, to output the MeanDecreaseGini in caret, specify type = 2 (1 = MeanDecreaseAccuracy [default], 2 = MeanDecreaseGini) and scale = FALSE. Full code with results below (across several runs there are minor fluctuations in the magnitude of the results, which I assume is just random variation, but the rank of the variables is consistent):
library(randomForest)
library(tidyr)
library(caret)
##randomForest
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree = 2000, mtry = 2,
                    importance = TRUE)
data.frame(Gini = sort(importance(rf1, type = 2)[, ], decreasing = TRUE))
# Gini
# Petal.Width 43.924705
# Petal.Length 43.293731
# Sepal.Length 9.717544
# Sepal.Width 2.320682
##caret
rf2 <- train(Species ~ .,
             data = iris,
             method = "rf",
             ntree = 2000,       ## same as randomForest (passed through to it)
             importance = TRUE,  ## same as randomForest
             metric = "Kappa",
             tuneGrid = data.frame(mtry = 2),
             trControl = trainControl(method = "none"))  ## stop the default 25 bootstrap resamples
varImp(rf2, type=2, scale=FALSE)
# rf variable importance
#
# Overall
# Petal.Width 44.475
# Petal.Length 43.401
# Sepal.Length 9.140
# Sepal.Width 2.267
Then, in terms of the confusion-matrix confusion (confusing phrasing, I know), this seems to be a byproduct of the way you were calculating the confusion matrices. When I used the predict function on the training data for both models, I got 100% accuracy, versus the errors shown by the models' stored confusion matrices:
rf1$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 3 47 0.06
table(predict(rf1, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
rf2$finalModel$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 5 45 0.10
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
However, I am not sure if rf1$confusion and rf2$finalModel$confusion both represent the last tree's predictions. Perhaps someone with a better grasp of this could help out.
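If I read the randomForest documentation correctly, those stored matrices are out-of-bag confusion matrices rather than any single tree's predictions; a quick way to probe this (a sketch, assuming that predict() on a randomForest object with no newdata returns the OOB predictions) is:
table(predict(rf1), iris$Species)         ## OOB predictions: should roughly match rf1$confusion
table(predict(rf1, iris), iris$Species)   ## re-predicts the training data: near-perfect, as shown above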

Related

How to calculate accuracy in a segmentation model?

I evaluate a segmentation model using a bounding-box technique. Then I sum the values of TP, FP, TN, and FN for each image. There were 10 images in total (the rows in the table below). I need to calculate the accuracy of this model.
The equation for accuracy is (TP+TN)/(TP+FP+FN+TN), where (TP+FP+FN+TN) is the total number of cases. I am confused about what the total is here (actual vs. predicted).
The question is: what is the value of the total number in this case, and why?
imgNo TP FP TN FN
1 4 0 0 0
2 6 1 1 0
3 2 3 0 0
4 1 1 1 0
5 5 0 0 0
6 3 1 0 0
7 0 3 1 0
8 1 0 0 0
9 3 2 1 0
10 4 1 1 0
I appreciate any help.
TP: True Positive is the number of objects you correctly identified in the image.
FP: False Positives are objects you identified, but that is actually a mistake, because there is no such object in the ground truth.
TN: True Negative is when the algorithm doesn't identify any object and that is indeed the case in the ground truth, i.e. a correct negative identification.
FN: False Negative is when your algorithm fails to identify objects (i.e. the ground truth contains objects in the image(s), but your algorithm marks them as background); in other words, you missed identifying an object.
It's 0 anyway in your experiments.
So TP+TN = the truly correct cases. Don't include FN, because that is a wrong detection.
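For what it's worth, a quick worked calculation (sketched in R, simply summing the per-image counts from the table and plugging them into the accuracy formula from the question):
tp <- c(4, 6, 2, 1, 5, 3, 0, 1, 3, 4)   ## TP column of the table
fp <- c(0, 1, 3, 1, 0, 1, 3, 0, 2, 1)   ## FP column
tn <- c(0, 1, 0, 1, 0, 0, 1, 0, 1, 1)   ## TN column
fn <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0)   ## FN column
total <- sum(tp) + sum(fp) + sum(tn) + sum(fn)   ## 29 + 12 + 5 + 0 = 46 cases in total
(sum(tp) + sum(tn)) / total                      ## accuracy = 34 / 46 ≈ 0.74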
You can use a heat map to visually analyze the results of a logistic regression. roc_curve returns the false positive and true positive rates, and the confusion matrix returns the tp, fp, fn, and tn aggregates.
from sklearn.metrics import roc_curve, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# y_test, y_preds_proba_lr_df, pipeline, X_train, y_train and predictions come from your own model fitting
fpr, tpr, thresholds = roc_curve(y_test, y_preds_proba_lr_df)
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
accuracy = round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="g")

How are the precision and recall calculated in the classification report?

Confusion Matrix :
[[4 2]
[1 3]]
Accuracy Score : 0.7
Report :
precision recall f1-score support
0 0.80 0.67 0.73 6
1 0.60 0.75 0.67 4
avg / total 0.72 0.70 0.70 10
From the formula precision = true positives / (true positives + false positives):
4/(4+2) = 0.667
But this value shows up under recall.
The formula for recall is true positives / (true positives + false negatives):
4/(4+1) = 0.80
I don't seem to get the difference.
Hard to say for sure without seeing the code, but my guess is that you are using sklearn and did not pass labels into your confusion matrix. Without labels, it makes its own decision about the ordering, which can lead to false positives and false negatives being swapped when you interpret the confusion matrix.
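Either way, the numbers in the report line up once you apply scikit-learn's documented convention that the rows of the confusion matrix are the true labels and the columns are the predicted labels; a quick check of the arithmetic (sketched in R):
cm <- matrix(c(4, 2,
               1, 3),
             nrow = 2, byrow = TRUE,
             dimnames = list(true = c("0", "1"), predicted = c("0", "1")))
cm["0", "0"] / sum(cm[, "0"])   ## precision for class 0: 4 / (4 + 1) = 0.80
cm["0", "0"] / sum(cm["0", ])   ## recall for class 0:    4 / (4 + 2) = 0.67
cm["1", "1"] / sum(cm[, "1"])   ## precision for class 1: 3 / (2 + 3) = 0.60
cm["1", "1"] / sum(cm["1", ])   ## recall for class 1:    3 / (1 + 3) = 0.75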

When doing classification, why do I get different precision for the same testing data?

I am testing a dataset with two labels, 'A' and 'B', on a decision tree classifier. I accidentally found out that the model gets different precision results on the same testing data, and I want to know why.
Here is what I do: I train the model and then test it on
1. the testing set,
2. only the data labelled 'A' in the testing set,
3. and only the data labelled 'B'.
Here is what I got:
for testing dataset
precision recall f1-score support
A 0.94 0.95 0.95 25258
B 0.27 0.22 0.24 1963
for data only labelled 'A' in testing dataset
precision recall f1-score support
A 1.00 0.95 0.98 25258
B 0.00 0.00 0.00 0
for data only labelled 'B' in testing dataset
precision recall f1-score support
A 0.00 0.00 0.00 0
B 1.00 0.22 0.36 1963
The training dataset and model are the same, and the data in the 2nd and 3rd tests are the same data as in the 1st. Why do the precision values for 'A' and 'B' differ so much? What is the real precision of this model? Thank you very much.
You sound confused, and it is not at all clear why you are interested in metrics where you have completely removed one of the two labels from your evaluation set.
Let's explore the issue with some reproducible dummy data:
from sklearn.metrics import classification_report
import numpy as np
y_true = np.array([0, 1, 0, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1])
target_names = ['A', 'B']
print(classification_report(y_true, y_pred, target_names=target_names))
Result:
precision recall f1-score support
A 0.50 0.50 0.50 4
B 0.33 0.33 0.33 3
avg / total 0.43 0.43 0.43 7
Now, let's keep only class A in our y_true:
indA = np.where(y_true==0)
print(indA)
print(y_true[indA])
print(y_pred[indA])
Result:
(array([0, 2, 5, 6], dtype=int64),)
[0 0 0 0]
[0 1 0 1]
Now, here is the definition of precision from the scikit-learn documentation:
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
For class A, a true positive (tp) would be a case where the true class is A (0 in our case) and we have indeed predicted A (0); from the above, it is apparent that tp = 2.
The tricky part is the false positives (fp): they are the cases where we have predicted A (0) but the true label is B (1). It is apparent here that we cannot have any such cases, since we have (intentionally) removed all the B's from our y_true (why would we want to do such a thing? I don't know; it does not make any sense at all); hence fp = 0 in this (weird) setting. So, our precision for class A will be tp / (tp + 0) = tp/tp = 1.
Which is the exact same result given by the classification report:
print(classification_report(y_true[indA], y_pred[indA], target_names=target_names))
# result:
precision recall f1-score support
A 1.00 0.50 0.67 4
B 0.00 0.00 0.00 0
avg / total 1.00 0.50 0.67 4
and obviously the case for B is identical.
Why is the precision not 1 in case #1 (for both A and B)? The data are the same.
No, they are very obviously not the same - the ground truth is altered!
Bottom line: removing classes from your y_true before computing precision etc. does not make any sense at all (i.e. your reported results in cases #2 and #3 are of no practical use whatsoever); but, since for whatever reason you decided to do so, your reported results are exactly as expected.

How to know if I get a good Weka result

I have used Weka to train on my data set, but I don't know whether the result I got is good.
Can someone give me some ideas?
This is my result:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 2823 97.9188 %
Incorrectly Classified Instances 60 2.0812 %
Kappa statistic 0
Mean absolute error 0.0208
Root mean squared error 0.1443
Relative absolute error 50.6234 %
Root relative squared error 101.0568 %
Coverage of cases (0.95 level) 97.9188 %
Mean rel. region size (0.95 level) 50 %
Total Number of Instances 2883
Ignored Class Unknown Instances 119
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0.000 0.000 0.000 0.000 0.000 0.000 0.500 0.020 0
1.000 1.000 0.979 1.000 0.989 0.000 0.500 0.940 1
Weighted Avg. 0.979 0.979 0.959 0.979 0.969 0.000 0.500 0.921
=== Confusion Matrix ===
a b <-- classified as
0 60 | a = 0
0 2823 | b = 1
Two classes were used. Every instance tagged with class b (= 1) was classified correctly, so the Correctly Classified Instances rate is 2823/(2823+60) = 97.9188% and the TP Rate for class 1 is 1.000. But no instance tagged with class a (= 0) was classified correctly, so the Incorrectly Classified Instances rate is 60/(2823+60) = 2.0812% and the TP Rate for class 0 is 0.000.
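As a quick sanity check of that arithmetic (sketched in R, with rows as the actual class and columns as the predicted class, as in the Weka output):
cm <- matrix(c(0, 60,
               0, 2823),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
sum(diag(cm)) / sum(cm)                  ## 2823 / 2883 = 0.979188 (Correctly Classified Instances)
(sum(cm) - sum(diag(cm))) / sum(cm)      ## 60 / 2883 = 0.020812 (Incorrectly Classified Instances)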

Decision Trees (Random Forest and Random Tree) classification on a small data set. Something wrong?

I performed classification on a small data set (65x9) using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes, and 65 instances.
My application is in assistive robotics, so I'm extracting some parameters from my sensor data that I think are relevant for classifying the user's run while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, like turning 180 degrees, and give the user a mark (from 1 to 4). From the sensor package and the software I extracted parameters like velocity, distance, time, standard deviation of the velocity, etc. that are relevant for classifying the user's run. So my data are all numeric.
When I performed the decision tree classification I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 c1
1 0 1 1 1 1 c2
0.952 0 1 0.952 0.976 1 c3
1 0.019 0.917 1 0.957 1 c4
Weighted Avg. 0.985 0.003 0.986 0.985 0.985 1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?
