I have a very basic question about calculating RMSE in an NB classification scenario. My training data X has some 1000-odd reviews with ratings in [1,5] which are the class labels Y.
So what I am doing is something like this:
model = nb_classifier_train(trainingX,Y)
Yhat = nb_classifier_test(model,testingX)
My testing data has some 400-odd reviews with missing ratings (whose labels/ratings I need to predict. Now to calculate RMSE
RMSE = sqrt(mean((Y - Yhat).^2))
What is the Y in this scenario? I understand RMSE is calculated using difference between predicted values and actual values. What are the actual values here? Or is there something missing?
Y in this case is the labels for your training data, so the RMSE you're calculating does not make much sense since you are making a prediction on the test examples and comparing against the training labels. In fact, there is no reason that Y and Yhat vectors would even be the same length. Instead you should replace the Y with your test labels, and if you don't have test labels then you simply have no way of calculating your test error.
Related
I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)
I have a classification task. The training data has 50 different labels. The customer wants to differentiate the low probability predictions, meaning that, I have to classify some test data as Unclassified / Other depending on the probability (certainty?) of the model.
When I test my code, the prediction result is a numpy array (I'm using different models, this is one is pre-trained BertTransformer). The prediction array doesn't contain probabilities such as in Keras predict_proba() method. These are numbers generated by prediction method of pretrained BertTransformer model.
[[-1.7862008 -0.7037363 0.09885322 1.5318055 2.1137428 -0.2216074
0.18905772 -0.32575375 1.0748093 -0.06001111 0.01083148 0.47495762
0.27160102 0.13852511 -0.68440574 0.6773654 -2.2712054 -0.2864312
-0.8428862 -2.1132915 -1.0157436 -1.0340284 -0.35126117 -1.0333195
9.149789 -0.21288703 0.11455813 -0.32903734 0.10503325 -0.3004114
-1.3854568 -0.01692022 -0.4388664 -0.42163098 -0.09182278 -0.28269592
-0.33082992 -1.147654 -0.6703184 0.33038092 -0.50087476 1.1643585
0.96983343 1.3400391 1.0692116 -0.7623776 -0.6083422 -0.91371405
0.10002492]]
I'm using numpy.argmax() to identify the correct label. The prediction works just fine. However, since these are not probabilities, I cannot compare the best result with a threshold value.
My question is, how can I define a threshold (say, 0.6), and then compare the probability of the argmax() element of the BertTransformer prediction array so that I can classify the prediction as "Other" if the probability is less than the threshold value?
Edit 1:
We are using 2 different models. One is Keras, and the other is BertTransformer. We have no problem in Keras since it gives the probabilities so I'm skipping Keras model.
The Bert model is pretrained. Here is how it is generated:
def model(self, data):
number_of_categories = len(data['encoded_categories'].unique())
model = BertForSequenceClassification.from_pretrained(
"dbmdz/bert-base-turkish-128k-uncased",
num_labels=number_of_categories,
output_attentions=False,
output_hidden_states=False,
)
# model.cuda()
return model
The output given above is the result of model.predict() method. We compare both models, Bert is slightly ahead, therefore we know that the prediction works just fine. However, we are not sure what those numbers signify or represent.
Here is the Bert documentation.
BertForSequenceClassification returns logits, i.e., the classification scores before normalization. You can normalize the scores by calling F.softmax(output, dim=-1) where torch.nn.functional was imported as F.
With thousands of labels, the normalization can be costly and you do not need it when you are only interested in argmax. This is probably why the models return the raw scores only.
I use the following simple IsolationForest algorithm to detect the outliers of given dataset X of 20K samples and 16 features, I run the following
train_X, tesy_X, train_y, test_y = train_test_split(X, y, train_size=.8)
clf = IsolationForest()
clf.fit(X) # Notice I am using the entire dataset X when fitting!!
print (clf.predict(X))
I get the result:
[ 1 1 1 -1 ... 1 1 1 -1 1]
This question is: Is it logically correct to use the entire dataset X when fitting into IsolationForest or only train_X?
Yes, it is logically correct to ultimately train on the entire dataset.
With that in mind, you could measure the test set performance against the training set's performance. This could tell you if the test set is from a similar distribution as your training set.
If the test set scores anomalous as compared to the training set, then you can expect future data to be similar. In this case, I would like more data to have a more complete view of what is 'normal'.
If the test set scores similarly to the training set, I would be more comfortable with the final Isolation Forest trained on all data.
Perhaps you could use sklearn TimeSeriesSplit CV in this fashion to get a sense for how much data is enough for your problem?
Since this is unlabeled data to the anomaly detector, the more data the better when defining 'normal'.
I am using the LogisticRegression() method in scikit-learn on a highly unbalanced data set. I have even turned the class_weight feature to auto.
I know that in Logistic Regression it should be possible to know what is the threshold value for a particular pair of classes.
Is it possible to know what the threshold value is in each of the One-vs-All classes the LogisticRegression() method designs?
I did not find anything in the documentation page.
Does it by default apply the 0.5 value as threshold for all the classes regardless of the parameter values?
There is a little trick that I use, instead of using model.predict(test_data) use model.predict_proba(test_data). Then use a range of values for thresholds to analyze the effects on the prediction;
pred_proba_df = pd.DataFrame(model.predict_proba(x_test))
threshold_list = [0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,.7,.75,.8,.85,.9,.95,.99]
for i in threshold_list:
print ('\n******** For i = {} ******'.format(i))
Y_test_pred = pred_proba_df.applymap(lambda x: 1 if x>i else 0)
test_accuracy = metrics.accuracy_score(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1))
print('Our testing accuracy is {}'.format(test_accuracy))
print(confusion_matrix(Y_test.as_matrix().reshape(Y_test.as_matrix().size,1),
Y_test_pred.iloc[:,1].as_matrix().reshape(Y_test_pred.iloc[:,1].as_matrix().size,1)))
Best!
Logistic regression chooses the class that has the biggest probability. In case of 2 classes, the threshold is 0.5: if P(Y=0) > 0.5 then obviously P(Y=0) > P(Y=1). The same stands for the multiclass setting: again, it chooses the class with the biggest probability (see e.g. Ng's lectures, the bottom lines).
Introducing special thresholds only affects in the proportion of false positives/false negatives (and thus in precision/recall tradeoff), but it is not the parameter of the LR model. See also the similar question.
Yes, Sci-Kit learn is using a threshold of P>=0.5 for binary classifications. I am going to build on some of the answers already posted with two options to check this:
One simple option is to extract the probabilities of each classification using the output from model.predict_proba(test_x) segment of the code below along with class predictions (output from model.predict(test_x) segment of code below). Then, append class predictions and their probabilities to your test dataframe as a check.
As another option, one can graphically view precision vs. recall at various thresholds using the following code.
### Predict test_y values and probabilities based on fitted logistic
regression model
pred_y=log.predict(test_x)
probs_y=log.predict_proba(test_x)
# probs_y is a 2-D array of probability of being labeled as 0 (first
column of
array) vs 1 (2nd column in array)
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(test_y, probs_y[:,
1])
#retrieve probability of being 1(in second column of probs_y)
pr_auc = metrics.auc(recall, precision)
plt.title("Precision-Recall vs Threshold Chart")
plt.plot(thresholds, precision[: -1], "b--", label="Precision")
plt.plot(thresholds, recall[: -1], "r--", label="Recall")
plt.ylabel("Precision, Recall")
plt.xlabel("Threshold")
plt.legend(loc="lower left")
plt.ylim([0,1])
we can use a wrapper as follows:
model = LogisticRegression()
model.fit(X, y)
def custom_predict(X, threshold):
probs = model.predict_proba(X)
return (probs[:, 1] > threshold).astype(int)
new_preds = custom_predict(X=X, threshold=0.4)
Assuming after performing median frequency balancing for images used for segmentation, we have these class weights:
class_weights = {0: 0.2595,
1: 0.1826,
2: 4.5640,
3: 0.1417,
4: 0.9051,
5: 0.3826,
6: 9.6446,
7: 1.8418,
8: 0.6823,
9: 6.2478,
10: 7.3614,
11: 0.0}
The idea is to create a weight_mask such that it could be multiplied by the cross entropy output of both classes. To create this weight mask, we can broadcast the values based on the ground_truth labels or the predictions. Some mathematics in my implementation:
Both labels and logits are of shape [batch_size, height, width, num_classes]
The weight mask is of shape [batch_size, height, width, 1]
The weight mask is broadcasted to the num_classes number of channels of the multiplication between the softmax of the logit and the labels to give an output shape of [batch_size, height, width, num_classes]. In this case, num_classes is 12.
Reduce sum for each example in a batch, then perform reduce mean for all examples in one batch to get a single scalar value of loss.
In this case, should we create the weight mask based on the predictions or the ground truth?
If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way.
But if we build it based on the predictions, then for whatever logit predictions that are produced, if the predicted label (from taking the argmax of the logit) is dominant, then the logit values for that pixel will all be reduced by a significant amount.
--> Although this means the maximum logit will still be the maximum since all of the logits in the 12 channels will be scaled by the same value, the final softmax probability of the label predicted (which is still the same before and after scaling), will be lower than before scaling (did some simple math to estimate). --> a lower loss is predicted
But the problem is this: If a lower loss is predicted as a result of this weighting, then wouldn't it contradict the idea that predicting dominant labels should give you a greater loss?
The impression I get in total for this method is that:
For the dominant labels, they are penalized and rewarded much lesser.
For the less dominant labels, they are rewarded highly if the predictions are correct, but they're also penalized heavily for a wrong prediction.
So how does this help to tackle the issue of class-balancing? I don't quite get the logic here.
IMPLEMENTATION
Here is my current implementation for calculating the weighted cross entropy loss, although I'm not sure if it is correct.
def weighted_cross_entropy(logits, onehot_labels, class_weights):
if not logits.dtype == tf.float32:
logits = tf.cast(logits, tf.float32)
if not onehot_labels.dtype == tf.float32:
onehot_labels = tf.cast(onehot_labels, tf.float32)
#Obtain the logit label predictions and form a skeleton weight mask with the same shape as it
logit_predictions = tf.argmax(logits, -1)
weight_mask = tf.zeros_like(logit_predictions, dtype=tf.float32)
#Obtain the number of class weights to add to the weight mask
num_classes = logits.get_shape().as_list()[3]
#Form the weight mask mapping for each pixel prediction
for i in xrange(num_classes):
binary_mask = tf.equal(logit_predictions, i) #Get only the positions for class i predicted in the logits prediction
binary_mask = tf.cast(binary_mask, tf.float32) #Convert boolean to ones and zeros
class_mask = tf.multiply(binary_mask, class_weights[i]) #Multiply only the ones in the binary mask with the specific class_weight
weight_mask = tf.add(weight_mask, class_mask) #Add to the weight mask
#Multiply the logits with the scaling based on the weight mask then perform cross entropy
weight_mask = tf.expand_dims(weight_mask, 3) #Expand the fourth dimension to 1 for broadcasting
logits_scaled = tf.multiply(logits, weight_mask)
return tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits_scaled)
Could anyone verify whether my concept of this weighted loss is correct, and whether my implementation is correct? This is my first time getting acquainted with a dataset with imbalanced class, and so I would really appreciate it if anyone could verify this.
TESTING RESULTS: After doing some tests, I found the implementation above results in a greater loss. Is this supposed to be the case? i.e. Would this make the training harder but produce a more accurate model eventually?
SIMILAR THREADS
Note that I have checked a similar thread here: How can I implement a weighted cross entropy loss in tensorflow using sparse_softmax_cross_entropy_with_logits
But it seems that TF only has a sample-wise weighting for loss but not a class-wise one.
Many thanks to all of you.
Here is my own implementation in Keras using the TensorFlow backend:
def class_weighted_pixelwise_crossentropy(target, output):
output = tf.clip_by_value(output, 10e-8, 1.-10e-8)
with open('class_weights.pickle', 'rb') as f:
weight = pickle.load(f)
return -tf.reduce_sum(target * weight * tf.log(output))
where weight is just a standard Python list with the indexes of the weights matched to those of the corresponding class in the one-hot vectors. I store the weights as a pickle file to avoid having to recalculate them. It is an adaptation of the Keras categorical_crossentropy loss function. The first line simply clips the value to make sure we never take the log of 0.
I am unsure why one would calculate the weights using the predictions rather than the ground truth; if you provide further explanation I can update my answer in response.
Edit: Play around with this numpy code to understand how this works. Also review the definition of cross entropy.
import numpy as np
weights = [1,2]
target = np.array([ [[0.0,1.0],[1.0,0.0]],
[[0.0,1.0],[1.0,0.0]]])
output = np.array([ [[0.5,0.5],[0.9,0.1]],
[[0.9,0.1],[0.4,0.6]]])
crossentropy_matrix = -np.sum(target * np.log(output), axis=-1)
crossentropy = -np.sum(target * np.log(output))