Is the XGBoost rank objective position sensitive? - machine-learning

I'm running XGBoost with the rank:ndcg objective, and the input dataset needs to be in libsvm format.
I'm wondering whether the libsvm file is position sensitive for this objective, since we know that position matters in search (a higher position means a higher probability of a click).
A libsvm example of one query group:
0 qid:6 1:1 3:5 4:65 5:281 6:2 7:15
0 qid:6 1:2 3:15 4:68 5:13 6:2 7:14
1 qid:6 1:3 3:75 4:65 5:11 6:2 7:9
0 qid:6 1:4 3:20 4:65 5:113 6:2 7:10
2 qid:6 1:5 3:5 4:68 5:83 6:2 7:51
0 qid:6 1:6 3:20 4:65 5:116 6:2 7:3
1 qid:6 1:7 3:25 4:65
Is the ordering of the rows relevant to position?
I know that for NDCG the position is important; how is position taken into account in the XGBoost implementation?
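For reference, here is a minimal sketch (not part of the question; the file name and group size are hypothetical) of how query groups reach XGBoost's ranking objective. To my understanding, rank:ndcg uses only the labels and features of the documents in each group; the row order within the file does not encode a click position.

import xgboost as xgb

# Hypothetical file name; the qid:6 group from the example above has 7 rows.
# In older versions a plain path works; newer versions accept the URI form below.
dtrain = xgb.DMatrix("train.libsvm?format=libsvm")
dtrain.set_group([7])  # one query group containing 7 documents (can also be inferred from qid tags)

params = {"objective": "rank:ndcg", "eta": 0.1, "max_depth": 6}
bst = xgb.train(params, dtrain, num_boost_round=100)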

Related

How to calculate accuracy in a segmentation model?

I evaluate a segmentation model using a bounding-box technique. Then I sum the values of TP, FP, TN, and FN for each image. There are 10 images in total (the rows in the table below). I need to calculate the accuracy of this model.
The equation for accuracy is accuracy = (TP+TN)/(TP+FP+FN+TN),
where (TP+FP+FN+TN) is the total number of cases. I'm confused about what the total is here (actual vs. predicted).
The question is: what is the value of the total number in this case, and why?
imgNo TP FP TN FN
1 4 0 0 0
2 6 1 1 0
3 2 3 0 0
4 1 1 1 0
5 5 0 0 0
6 3 1 0 0
7 0 3 1 0
8 1 0 0 0
9 3 2 1 0
10 4 1 1 0
I appreciate any help.
TP: True Positives are the objects you correctly identified in the image.
FP: False Positives are objects you identified that are actually mistakes, because there is no such object in the ground truth.
TN: True Negatives are when the algorithm doesn't identify any object and indeed that is the case in the ground truth, i.e. a correct negative identification.
FN: False Negatives are when your algorithm fails to identify objects (i.e. the ground truth contains objects in the image, but your algorithm marks them as background). In other words, you missed an object. It's 0 anyway in your experiments.
So TP + TN is the count of correct cases. Don't include FN in that count, because those are wrong (missed) detections.
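A quick worked example (my addition, simply plugging the summed counts of the table above into the formula from the question):

# Sums of the per-image counts from the table: TP=29, FP=12, TN=5, FN=0
TP, FP, TN, FN = 29, 12, 5, 0

total = TP + FP + TN + FN           # denominator: every counted case, 46 here
accuracy = (TP + TN) / total        # (29 + 5) / 46 ≈ 0.739
print(f"accuracy = {accuracy:.3f}")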
You can use a heat map to visually analyze the results of a logistic regression. roc_curve returns the false positive and true positive rates, and confusion_matrix returns the TN, FP, FN, and TP aggregates.
# Assumes a fitted pipeline with a logistic regression step named 'lr',
# plus X_train, y_train, y_test, predicted probabilities, and class predictions.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, confusion_matrix

# ROC curve from predicted probabilities
fpr, tpr, thresholds = roc_curve(y_test, y_preds_proba_lr_df)
plt.plot([0, 1], [0, 1], 'k--')  # diagonal reference line
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

# Training accuracy of the logistic regression step
accuracy = round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))

# Confusion matrix as a heat map
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="g")

Extract feature values from Solr

I have 80,000 questions & answers indexed using Solr, and a feature file.
I'm trying to extract the feature values for each Q&A pair in order to use them for training an algorithm (such as LambdaMART).
The training algorithm takes input in this format:
<label> qid:<qid> <feature>:<value> ... <feature>:<value> # <info>
For example:
3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
Can anyone help me to extract those feature values?
Thanks!
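Not part of the original question, but as a minimal sketch of producing the target format: assuming the per-pair feature values have already been retrieved from Solr (for example with the LTR plugin's [features] transformer, or with your own queries), writing them out as RankLib/libsvm-style training lines could look like this. All names and values here are hypothetical.

# Hypothetical rows of (label, qid, {feature_id: value}, info) already pulled from Solr
rows = [
    (3, 1, {1: 1, 2: 1, 3: 0, 4: 0.2, 5: 0}, "1A"),
    (2, 1, {1: 0, 2: 0, 3: 1, 4: 0.1, 5: 1}, "1B"),
]

with open("training.txt", "w") as f:
    for label, qid, feats, info in rows:
        feat_str = " ".join(f"{k}:{v}" for k, v in sorted(feats.items()))
        f.write(f"{label} qid:{qid} {feat_str} # {info}\n")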

Artificial Neural Network Topology

I am currently revising for my final-year exams and came across this question. I have looked everywhere in my lecture slides for help and cannot find any. Any insight into how to solve this question would be appreciated (I am not just asking for the answer, I need to comprehend the topic). Furthermore, do I assume that all inputs are equal to 1? Do I include 7 inputs in the input layer? I'm at a loss as to how to answer.
The question is as follows:
b) Determine, with justification, the simplest type and topology (i.e. number of neurons & layers) of artificial neural network that could learn the data set below.
Click here for picture of the dataset.
If I'm not mistaken, you have two inputs X1, X2, and one target output. For each input, consisting of two numbers X1, X2, the appropriate output ("target") is given.
As a first step, you could sketch the seven data points - just draw the 3 ones and 4 zeroes at the right places on the square (X1, X2) ∈ [0, 1.05] × [0, 1]. Maybe you remember something similar from the lecture, possibly near a mention of "XOR".
The edit queue is full, so adding data from the linked image here
Pattern X1 X2 Target
1 0.01 -0.1 1
2 0.90 0.09 0
3 0.89 -0.05 0
4 1.05 0.95 1
5 -0.01 0.12 0
6 1.05 0.97 1
7 0.98 0.10 0
It looks like 1 possible solution is X1 >= 1.0 OR X2 <= -0.1
Alternatively, if you round each of X1 and X2, it becomes
Pattern X1 X2 Target
1 0 0 1
2 1 0 0
3 1 0 0
4 1 1 1
5 0 0 0
6 1 1 1
7 1 0 0
Then it IS XOR, and the solution is round(X1) XOR round(X2). In that case you can use one non-linear activation (like step/round, ReLU, or sigmoid), one hidden layer of 2 neurons, and one output layer of 1 neuron.
See this stackoverflow post for a detail of how to solve XOR with a neural net.
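As an illustration (not from the linked post), here is a minimal hand-wired sketch of that 2-2-1 topology with a step activation; the weights are chosen by hand rather than learned.

import numpy as np

def step(x):
    # Heaviside step activation: 1 if the input is positive, else 0
    return (x > 0).astype(int)

# Hidden layer of 2 neurons: h1 fires for OR(x1, x2), h2 fires for AND(x1, x2)
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])  # OR needs x1+x2 > 0.5, AND needs x1+x2 > 1.5

# Output layer of 1 neuron: fires when OR is true but AND is false, i.e. XOR
W_out = np.array([1.0, -1.0])
b_out = -0.5

def xor_net(x):
    h = step(W_hidden @ x + b_hidden)  # hidden layer
    return step(W_out @ h + b_out)     # output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x)))     # prints 0, 1, 1, 0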

How to get the Gini coefficient using random forests in the caret R package?

I'm trying to understand the difference between the random forest implementation in the randomForest package and in the caret package.
For example, this specifies 2000 trees with mtry = 2 in randomForest and I show the Gini coefficient for each predictor:
library(randomForest)
library(tidyr)
library(dplyr)  # for %>%, add_rownames() and rename()
rf1 <- randomForest(Species ~ ., data = iris,
ntree = 2000, mtry = 2,
importance = TRUE)
data.frame(RF = sort(importance(rf1)[, "MeanDecreaseGini"], decreasing = TRUE)) %>%
  add_rownames() %>%
  rename(Predictor = rowname)
# Predictor RF
# 1 Petal.Width 45.57974
# 2 Petal.Length 41.61171
# 3 Sepal.Length 9.59369
# 4 Sepal.Width 2.47010
I'm trying to get the same info in caret, but I don't know how to specify the number of trees, or how to get the Gini coefficient:
rf2 <- train(Species ~ ., data = iris, method = "rf",
metric = "Kappa",
tuneGrid = data.frame(mtry = 2))
varImp(rf2) # not the Gini coefficient
# Overall
# Petal.Length 100.000
# Petal.Width 99.307
# Sepal.Width 0.431
# Sepal.Length 0.000
Also, the confusion matrix of rf1 has some errors and that of rf2 doesn't. What parameter is causing this difference?:
# rf1 Confusion matrix:
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 4 46 0.08
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
This is quick and dirty. I know this isn't the right way to test the performance of the classifier, but I don't understand the difference in the results.
This might help to answer the question - see 2nd post:
caret: using random forest and include cross-validation
randomForest samples with replacement. If you use "rf" in caret, you need to specify trControl in caret::train(); you want the same resampling method to be used in caret, i.e. the out-of-bag estimate, so set trControl = trainControl(method = "oob"). trControl defines how the train function acts; the method can be set to "cv" for cross-validation, "repeatedcv" for repeated cross-validation, etc. See the caret package documentation for more info.
You should get the same result as using randomForest, but do remember to set the seeds properly.
I was also recently looking for a way to get the MeanDecreaseGini variable importance from the caret implementation of randomForest. I realize this was posted long ago, so perhaps caret has been updated and my advice is no longer necessary, but I struggled to find a solution, so hopefully someone finds this useful.
To set the number of trees in caret you use the ntree = xx argument during training, just like you would with randomForest. Then, to output the MeanDecreaseGini in caret, specify type=2 (1 = MeanDecreaseAccuracy [default], 2 = MeanDecreaseGini) and scale=FALSE. Full code with results is below (after several runs there are minor fluctuations in the magnitude of the results, which I suspect is random chance, but the ranking of the variables is consistent):
library(randomForest)
library(tidyr)
library(caret)
##randomForest
rf1 <- randomForest(Species ~ ., data = iris,
ntree = 2000, mtry = 2,
importance = TRUE)
data.frame(Gini=sort(importance(rf1, type=2)[,], decreasing=T))
# Gini
# Petal.Width 43.924705
# Petal.Length 43.293731
# Sepal.Length 9.717544
# Sepal.Width 2.320682
##caret
rf2 <- train(Species ~ .,
data = iris,
method = "rf",
ntree = 2000, ## same as randomForest
importance=TRUE, ##same as randomForest
metric = "Kappa",
tuneGrid = data.frame(mtry = 2),
trControl = trainControl(method = "none")) ##Stop the default bootstrap=25
varImp(rf2, type=2, scale=FALSE)
# rf variable importance
#
# Overall
# Petal.Width 44.475
# Petal.Length 43.401
# Sepal.Length 9.140
# Sepal.Width 2.267
As for the confusion matrix discrepancy, this seems to be a byproduct of how you were calculating the confusion matrices. When I use the predict function on the training data, both models reach 100% accuracy, versus the matrices reported by the other methods:
rf1$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 3 47 0.06
table(predict(rf1, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
rf2$finalModel$confusion
# setosa versicolor virginica class.error
# setosa 50 0 0 0.00
# versicolor 0 47 3 0.06
# virginica 0 5 45 0.10
table(predict(rf2, iris), iris$Species)
# setosa versicolor virginica
# setosa 50 0 0
# versicolor 0 50 0
# virginica 0 0 50
However, I am not sure if rf1$confusion and rf2$finalModel$confusion both represent the last tree's predictions. Perhaps someone with a better grasp of this could help out.

Decision Trees (Random Forest and Random Tree) classification on a small data set. Something wrong?

I performed classification on a small data set (65x9) using decision trees (Random Forest and Random Tree). I have four classes, 8 attributes, and 65 instances.
My application is in assistive robotics. I'm extracting parameters from my sensor data that I think are relevant for classifying a user's run while they perform some task. I get the movement data from the sensor package deployed on the wheelchair. I classify certain actions, like turning 180 degrees, and give the user a mark (from 1 to 4). From the sensor package and the software I extracted parameters such as velocity, distance, time, standard deviation of the velocity, etc. that are relevant for classifying the user's run. So my data are all numbers.
When I ran the decision tree classification I got these results:
=== Classifier model (full training set) ===
Random forest of 10 trees, each constructed while considering 4 random features.
Out of bag error: 0.5231
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 64 98.4615 %
Incorrectly Classified Instances 1 1.5385 %
Kappa statistic 0.9791
Mean absolute error 0.0715
Root mean squared error 0.1243
Relative absolute error 19.4396 %
Root relative squared error 29.0038 %
Total Number of Instances 65
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure ROC Area Class
1 0 1 1 1 1 c1
1 0 1 1 1 1 c2
0.952 0 1 0.952 0.976 1 c3
1 0.019 0.917 1 0.957 1 c4
Weighted Avg. 0.985 0.003 0.986 0.985 0.985 1
=== Confusion Matrix ===
a b c d <-- classified as
14 0 0 0 | a = c1
0 19 0 0 | b = c2
0 0 20 1 | c = c3
0 0 0 11 | d = c4
This is too good. Am I doing something wrong?
