Generally, in statistics, the R² score is between 0 and 1. But it can be negative in BigQuery ML training results when using XGBoost (model type BOOSTED_TREE_REGRESSOR).
So, what is the coefficient of determination R² in the evaluation of models in BigQuery ML?
The R² score can be negative. Despite the name, R² is not literally the square of anything, so it can take a negative value without violating any rules of math: it is computed as 1 - SS_res/SS_tot, which drops below zero whenever the model's squared error exceeds that of simply predicting the mean. In other words, R² is negative only when the chosen model does not follow the trend of the data, fitting worse than a horizontal line.
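A minimal sketch of why this happens, using the standard definition R² = 1 - SS_res/SS_tot (the data and the constant-prediction "model" here are made up for illustration):

```python
import numpy as np

def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot: it drops below 0 as soon as the model's
    # squared error exceeds that of always predicting the mean of y_true.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 3.0, 3.0])  # a model that ignores the trend
print(r2_score(y_true, y_pred))     # -1.5
```

A perfect prediction gives R² = 1, predicting the mean gives 0, and anything worse than the mean goes negative, which is exactly what a badly fit boosted-tree regressor can produce.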
How can I calculate the false positive rate for an object detection algorithm, where I can have multiple objects per image?
In my data, a given image may have many objects. I am counting a predicted box as a true positive if its IOU with a truth box is above a certain threshold, and as a false positive otherwise. For example:
I have 2 prediction bounding boxes and 2 ground-truth bounding boxes:
I computed IoU for each pair of prediction and ground-truth bounding boxes:
IoU = 0.00, 0.60, 0.10, 0.05
threshold = 0.50
In this case do I have a TP example or not? Could you explain it?
Summary, specific: Yes, you have a TP; you also have a FP and a FN.
Summary, detailed: Your prediction model correctly identified one GT (ground truth) box. It missed the other. It incorrectly identified a third box.
Classification logic:
At the very least, your IoU figures should form a matrix, not a linear sequence. For M predictions and N GT boxes, you will have an N×M matrix. Yours looks like this:
0.00 0.60
0.10 0.05
Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix. This leaves you with a rather boring matrix:
0.10
Since this value is below the threshold, you are out of matches. You have one prediction and one GT remaining. With the one "hit", you have three objects in your classification set: two expected objects, and a third created by the predictor. You code your gt and pred lists like this:
gt = [1, 1, 0] // The first two objects are valid; the third is a phantom.
pred = [1, 0, 1] // Identified one actual box and the phantom.
Is that clear enough?
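The greedy matching procedure described above can be sketched as follows (a toy helper of my own, not from any library; rows are GT boxes, columns are predictions):

```python
import numpy as np

def greedy_match(iou, thresh=0.5):
    # Repeatedly take the largest IoU in the matrix; if it clears the
    # threshold, count a TP and drop that GT row and prediction column.
    iou = np.asarray(iou, dtype=float)
    tp = 0
    while iou.size and iou.max() >= thresh:
        r, c = np.unravel_index(np.argmax(iou), iou.shape)
        iou = np.delete(np.delete(iou, r, axis=0), c, axis=1)
        tp += 1
    fn, fp = iou.shape  # leftover GT boxes and leftover predictions
    return tp, fp, fn

print(greedy_match([[0.00, 0.60],
                    [0.10, 0.05]]))  # (1, 1, 1): one TP, one FP, one FN
```

Running it on the question's matrix reproduces the answer above: one match, one missed GT box, one phantom prediction.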
You can use an algorithm (e.g. Hungarian algorithm aka Kuhn–Munkres algorithm aka Munkres algorithm) to assign detections to ground truths. You might incorporate the ability to not assign a detection to ground truth & vice versa (e.g. allow for false alarms and missed detections).
After assigning the detections to ground truths, just apply the standard definitions of the rates; see the Wikipedia page on sensitivity (aka TPR) and specificity (aka TNR).
I provide this answer since I think @Prune provided an answer which uses a greedy algorithm to perform the assignment of detections to ground truths (i.e. "Now, find the largest value in the matrix, 0.60. This is above the threshold, so you declare the match and eliminate both that prediction and that GT box from the matrix."). This greedy assignment method will not work well in all scenarios. For example, imagine this matrix of IoU values between detections and ground-truth bounding boxes:
       det1   det2
pred1  0.4    0.0
pred2  0.6    0.4
The Greedy algorithm would assign pred2 to det1 and pred1 to det2 (or pred1 to nothing if accounting for possibility of false alarms). However, the Hungarian algorithm would assign pred1 to det1 and pred2 to det2, which might be better in some cases.
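For reference, a small sketch of the optimal assignment using SciPy's implementation of the Hungarian algorithm (linear_sum_assignment minimises cost, so we negate the IoU matrix to maximise total overlap):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

iou = np.array([[0.4, 0.0],   # rows: pred1, pred2
                [0.6, 0.4]])  # cols: det1, det2

pred_idx, det_idx = linear_sum_assignment(-iou)  # maximise total IoU
print(pred_idx.tolist(), det_idx.tolist())  # [0, 1] [0, 1]: pred1->det1, pred2->det2
```

The optimal assignment achieves total IoU 0.4 + 0.4 = 0.8, whereas the greedy choice of 0.6 forces the remaining pair to 0.0 for a total of only 0.6.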
So I've got the following results from Naïve Bayes classification on my data set:
I am stuck however on understanding how to interpret the data. I am wanting to find and compare the accuracy of each class (a-g).
I know accuracy is found using this formula:
However, let's take class a. If I take the number of correctly classified instances (313) and divide it by the total number of 'a' instances (4953) from row a, this gives ~6.32%. Would this be the accuracy?
EDIT: if we use the column instead of the row, we get 313/1199 which gives ~26.1% which seems a more reasonable number.
EDIT 2: I have calculated the accuracy of a in Excel, which gives me 84%, using the accuracy formula shown above:
This doesn't seem right, as the overall classification accuracy is ~24%.
No -- all you've calculated is tp/(tp+fn), the total correct identifications of class a, divided by the total of actual a examples. This is recall, not accuracy. You need to include the other two figures.
fp is the rest of the a column; tn is all of the other figures in the non-a rows and columns, the 6x6 sub-matrix. This will reduce all 35K+ trials to a 2x2 matrix with labels a and not a, the 2x2 confusion matrix with which you're already familiar.
Yes, you get to repeat that reduction for each of the seven classes. I recommend doing it programmatically.
RESPONSE TO OP UPDATE
Your accuracy is that high: you have a huge quantity of true negatives, not-a samples that were properly classified as not-a.
Perhaps it doesn't feel right because our experience focuses more on the class in question. There are other statistics that handle that focus.
Recall is tp / (tp+fn) -- of all items actually in class a, what percentage did we properly identify? This is the 6.32% figure.
Precision is tp / (tp + fp) -- of all items identified as class a, what percentage were actually in that class? This is the 26.1% figure you calculated.
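That per-class reduction is easy to do programmatically; a sketch with a made-up 3-class confusion matrix (rows = actual, columns = predicted):

```python
import numpy as np

def per_class_stats(cm, k):
    # Reduce the full matrix to a 2x2 "class k vs not class k" view.
    tp = cm[k, k]
    fn = cm[k, :].sum() - tp   # rest of row k: actual k, predicted other
    fp = cm[:, k].sum() - tp   # rest of column k: predicted k, actually other
    tn = cm.sum() - tp - fn - fp
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / cm.sum()
    return recall, precision, accuracy

cm = np.array([[5, 2, 0],    # hypothetical counts, not the OP's data
               [1, 6, 1],
               [0, 2, 8]])
print(per_class_stats(cm, 0))
```

Accuracy counts the (usually plentiful) true negatives, which is why it comes out much higher than recall or precision for any one class.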
I have a few distance functions which return the distance between two images. I want to combine these distances into a single distance using a weighted score, e.g. a*x1 + b*x2 + c*x3 + d*x4, and I want to learn these weights automatically such that my test error is minimised.
For this purpose I have a labelled dataset of various triplets of images (a, b, c) such that a has less distance to b than it has to c,
i.e. d(a,b) < d(a,c).
I want to learn weights so that this ordering of the triplets is preserved as accurately as possible (i.e. the weighted linear score is smaller for a&b and larger for a&c).
What sort of machine learning algorithm can be used for this task, and how can it be achieved?
Hopefully I understand your question correctly, but it seems this could be solved more easily with constrained optimization directly, rather than classical machine learning (whose algorithms are themselves often implemented via constrained optimization; see e.g. SVMs).
As an example, a possible objective function could be:
argmin_{w} || e ||_2 + lambda || w ||_2
where w is your weight vector (Oh god why is there no latex here), e is the vector of errors (one component per training triplet), lambda is some tunable regularizer constant (could be zero), and your constraints could be:
max{ d(I_p,I_q) - d(I_p,I_r), 0 } <= e_j for the jth triplet (p,q,r) in T, whose label asserts d(I_p,I_q) <= d(I_p,I_r)
for the jth constraint, where I_i is image i, T is the training set, and
d(u,v) = sum_{w_i in w} w_i * d_i(u,v)
with d_i being your ith distance function.
Notice that e measures how far your chosen weights are from satisfying all the triplets in the training set. If the weights preserve the ordering of label j, then d(I_p,I_q) - d(I_p,I_r) < 0 and so e_j = 0. If they don't, then e_j measures the amount of violation of training label j. Solving the optimization problem gives the best w, i.e. the one with the lowest error.
If you're not familiar with linear/quadratic programming, convex optimization, etc... then start googling :) Many libraries exist for this type of thing.
On the other hand, if you would prefer a machine learning approach, you may be able to adapt some metric learning approaches to your problem.
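As a rough sketch of the optimization view above (synthetic data, a simple hinge-style penalty, and scipy.optimize.minimize standing in for a proper QP solver; the variable names and the toy data are all assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic setup: 4 base distance functions. For each triplet j,
# close[j] holds the vector (d_1(a,b), ..., d_4(a,b)) and far[j]
# holds (d_1(a,c), ..., d_4(a,c)), with the label d(a,b) < d(a,c).
rng = np.random.default_rng(0)
close = rng.random((50, 4))
far = close + rng.random((50, 4))  # made strictly larger for this toy data

lam = 0.1  # tunable regulariser constant

def objective(w):
    # ||e||_2 + lambda * ||w||_2, where e_j is the hinge-style
    # violation of triplet j under the weighted distance close/far @ w.
    viol = np.maximum(close @ w - far @ w, 0.0)
    return np.linalg.norm(viol) + lam * np.linalg.norm(w)

res = minimize(objective, x0=np.ones(4), bounds=[(0.0, None)] * 4)
w = res.x  # learned non-negative weights
```

With a real dataset you would build close/far from your actual distance functions; a dedicated convex solver would also give cleaner guarantees than this generic minimiser.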
I've been playing with some SVM implementations and I am wondering - what is the best way to normalize feature values to fit into one range? (from 0 to 1)
Let's suppose I have 3 features with values in ranges of:
3 to 5
0.02 to 0.05
10 to 15
How do I convert all of those values into range of [0,1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then in the converted range, it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has seen during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is most often used in the machine learning field. Note that when your test data comes in, it makes more sense to use the mean and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is safe to assume it obeys a normal distribution, so the probability that new test data falls far out of range won't be that high. Refer to this post for more details.
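A minimal standardization sketch (toy numbers loosely based on the ranges in the question), reusing the training mean and standard deviation on new data:

```python
import numpy as np

X_train = np.array([[3.0, 0.02, 10.0],
                    [5.0, 0.05, 15.0],
                    [4.0, 0.03, 12.0]])

mu = X_train.mean(axis=0)     # statistics come from the training set only
sigma = X_train.std(axis=0)

def standardize(X):
    return (X - mu) / sigma

# A value "in the wild" above the training maximum is not a hard failure:
# it simply maps to a z-score larger than those seen during training.
print(standardize(np.array([[7.0, 0.04, 13.0]])))
```

Unlike min-max scaling to [0, 1], z-scores are unbounded, so out-of-range test values degrade gracefully instead of breaking the scheme.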
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200], it will normalise to the same unit vector as above. The magnitudes of the features are "cancelled out" by the normalisation and we are left with relative values between 0 and 1.
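A quick sketch of that unit-vector normalisation:

```python
import numpy as np

def to_unit(v):
    # Divide every component by the Euclidean length of the vector.
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

a = to_unit([4, 0.02, 12])
b = to_unit([400, 2, 1200])   # 100x the magnitude, same direction
print(np.round(a, 3))         # [0.316 0.002 0.949]
```

Both inputs map to the same unit vector, which is exactly the "magnitude cancels out" behaviour described above.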
Two questions concerning machine learning algorithms like linear/logistic regression, ANNs, and SVMs:
The models in these algorithms deal with data sets where each example has a number of features and one output value (e.g. predicting the price of a house from its features f). But what if the features are enough to produce more than one piece of information about the item of interest, i.e. more than one output? Consider this example: a data set about cars where each example (car) has the features initial velocity, acceleration, and time. In the real world these features are enough to determine two quantities: velocity via v = v_i + a*t, and distance via s = v_i*t + 0.5*a*t^2. So I want example X with features (x1, x2, ..., xn) to have outputs y1 and y2 in the same step, so that after training the model, if a new car example is given with initial velocity, acceleration, and time, the model will be able to predict the velocity and distance at the same time. Is this possible?
In the house-price prediction example, where example X is given with features (x1, x2, x3) and the model predicts the price, can the process be reversed by any means? Meaning, if I give the model example X with features x1, x2 and price y, can it predict the feature x3?
Depends on the model. A linear model such as linear regression cannot reliably learn the distance formula, since it is a cubic function of the given variables. You'd need to add v_i×t and a×t² as features to get a good prediction of the distance. A non-linear model such as a cubic-kernel SVM regression or a multi-layer ANN should be able to learn this from the given features, though, given enough data.
More generally, predicting multiple values with a single model sometimes works and sometimes doesn't -- when in doubt, just fit several models.
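A sketch of both points with scikit-learn (synthetic physics data; the engineered product features v_i*t, a*t, and a*t² are my addition so that both targets become exact linear combinations of the inputs):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 200
v0 = rng.uniform(0, 30, n)   # initial velocity
a = rng.uniform(0, 5, n)     # acceleration
t = rng.uniform(0, 10, n)    # time

# Engineered features make velocity and distance linear in the inputs.
X = np.column_stack([v0, a, t, v0 * t, a * t, a * t**2])
Y = np.column_stack([v0 + a * t,                  # velocity
                     v0 * t + 0.5 * a * t**2])    # distance

model = LinearRegression().fit(X, Y)  # one model, two outputs at once
```

LinearRegression accepts a 2-D target, so `model.predict` returns both velocity and distance in a single call; without the engineered columns, a purely linear model would fail on the cubic a*t² term.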
You can try. Whether it'll work depends on the relation between the variables and the model.