I'm building Kmeans in pytorch using gradient descent on centroid locations, instead of expectation-maximisation. Loss is the sum of square distances of each point to its nearest centroid. To identify which centroid is nearest to each point, I use argmin, which is not differentiable everywhere. However, pytorch is still able to backprop and update weights (centroid locations), giving similar performance to sklearn kmeans on the data.
Any ideas how this is working, or how I can figure this out within pytorch? Discussion on pytorch github suggests argmax is not differentiable:
Example code below (on random pts):
import numpy as np
import torch

num_pts, batch_size, n_dims, num_clusters, lr = 1000, 100, 200, 20, 1e-5

# generate random points
vector = torch.from_numpy(np.random.rand(num_pts, n_dims)).float()

# randomly pick starting centroids
idx = np.random.choice(num_pts, size=num_clusters)
kmean_centroids = vector[idx][:, None, :]  # [num_clusters, 1, n_dims]
kmean_centroids = kmean_centroids.clone().detach().requires_grad_(True)

for t in range(4001):
    # get batch
    idx = np.random.choice(num_pts, size=batch_size)
    vector_batch = vector[idx]

    distances = vector_batch - kmean_centroids   # [num_clusters, #pts, #dims]
    distances = torch.sum(distances**2, dim=2)   # [num_clusters, #pts]

    # argmin
    membership = torch.min(distances, 0)[1]      # [#pts]

    # cluster distances
    cluster_loss = 0
    for i in range(num_clusters):
        subset = torch.transpose(distances, 0, 1)[membership == i]
        if len(subset) != 0:  # to prevent NaN
            cluster_loss += torch.sum(subset[:, i])

    # backprop through the selected distances (the argmin indices act as constants)
    cluster_loss.backward()

    with torch.no_grad():
        kmean_centroids -= lr * kmean_centroids.grad
        kmean_centroids.grad.zero_()  # reset the accumulated gradient
As alvas noted in the comments, argmax is not differentiable. However, once you compute it and assign each datapoint to a cluster, the derivative of loss with respect to the location of these clusters is well-defined. This is what your algorithm does.
Why does it work? If you had only one cluster (so that the argmax operation didn't matter), your loss function would be quadratic, with minimum at the mean of the data points. Now with multiple clusters, you can see that your loss function is piecewise (in higher dimensions think volumewise) quadratic - for any set of centroids [C1, C2, C3, ...] each data point is assigned to some centroid CN and the loss is locally quadratic. The extent of this locality is given by all alternative centroids [C1', C2', C3', ...] for which the assignment coming from argmax remains the same; within this region the argmax can be treated as a constant, rather than a function and thus the derivative of loss is well-defined.
Now, in reality, it's unlikely you can treat argmax as constant, but you can still treat the naive "argmax-is-a-constant" derivative as pointing approximately towards a minimum, because the majority of data points are likely to indeed belong to the same cluster between iterations. And once you get close enough to a local minimum such that the points no longer change their assignments, the process can converge to a minimum.
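To make the "argmax-is-a-constant" derivative concrete, here is a minimal 1-D sketch in plain Python (the points and centroids are made up for illustration, and this is not the PyTorch code from the question). It computes the gradient with the assignments frozen and compares it with a finite-difference estimate of the full loss, argmin included. As long as no point sits on a boundary between two centroids, the two should agree.

```python
def kmeans_loss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min((x - c) ** 2 for c in centroids) for x in points)


def frozen_assignment_grad(points, centroids):
    """Gradient of the loss with the argmin assignments held constant."""
    grad = [0.0] * len(centroids)
    for x in points:
        # the argmin is computed once and then treated as a constant index
        k = min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
        grad[k] += 2.0 * (centroids[k] - x)
    return grad


# hypothetical data: two well-separated groups of points
points = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0]
centroids = [0.5, 9.0]

analytic = frozen_assignment_grad(points, centroids)

# central finite differences of the full loss (argmin included)
eps = 1e-6
numeric = []
for k in range(len(centroids)):
    plus = list(centroids)
    minus = list(centroids)
    plus[k] += eps
    minus[k] -= eps
    numeric.append((kmeans_loss(points, plus) - kmeans_loss(points, minus)) / (2 * eps))

print("frozen-assignment gradient:", analytic)
print("finite-difference gradient:", numeric)
```

The perturbation is far too small to move any point to a different centroid, which is exactly the "locality" described above.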
Another, more theoretical way to look at it is that you're doing an approximation of expectation maximization. Normally, you would have the "compute assignments" step, which is mirrored by argmax, and the "minimize" step which boils down to finding the minimizing cluster centers given the current assignments. The minimum is given by d(loss)/d([C1, C2, ...]) == 0, which for a quadratic loss is given analytically by the means of data points within each cluster. In your implementation, you're solving the same equation but with a gradient descent step. In fact, if you used a 2nd order (Newton) update scheme instead of 1st order gradient descent, you would be implicitly reproducing exactly the baseline EM scheme.
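The Newton remark can be checked with a small sketch (plain Python, made-up 1-D data, assignments held fixed during the step). With a quadratic loss, the per-centroid gradient is 2 * sum(c - x) and the second derivative is 2 * n, so one Newton step should land on the mean of the assigned points, which is what the EM "minimize" step computes.

```python
def assign(points, centroids):
    """Index of the nearest centroid for every point (the argmin step)."""
    return [min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
            for x in points]


def newton_step(points, centroids):
    """One Newton update per centroid with the assignments held fixed."""
    labels = assign(points, centroids)
    updated = []
    for k, c in enumerate(centroids):
        members = [x for x, lab in zip(points, labels) if lab == k]
        if not members:          # an empty cluster keeps its centroid
            updated.append(c)
            continue
        grad = sum(2.0 * (c - x) for x in members)   # d(loss)/dc
        hess = 2.0 * len(members)                    # d^2(loss)/dc^2
        updated.append(c - grad / hess)
    return updated


# hypothetical data
points = [0.0, 1.0, 2.0, 9.0, 10.0, 11.0]
centroids = [0.5, 9.0]

new_centroids = newton_step(points, centroids)
print(new_centroids)
```

A plain gradient step with a small learning rate moves in the same direction but only part of the way, which is why the gradient-descent version needs many iterations.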
Imagine this:
t = torch.tensor([-0.0627,  0.1373,  0.0616, -1.7994,  0.8853,
                  -0.0656,  1.0034,  0.6974, -0.2919, -0.0456])
torch.argmax(t).item() # outputs 6
Suppose we increase t[0] by some δ close to 0. Will this change the argmax? It will not, so the gradient is 0 wherever it is defined. Just ignore this layer, or assume it is frozen.
The same holds for argmin, or any other function whose output takes discrete values.
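A quick way to see this without any framework is to bump each element by a small δ and look at the finite difference of the argmax. This is a plain-Python sketch using the same values as above; the helper names are made up for illustration.

```python
t = [-0.0627, 0.1373, 0.0616, -1.7994, 0.8853,
     -0.0656, 1.0034, 0.6974, -0.2919, -0.0456]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def finite_difference_of_argmax(values, index, delta=1e-3):
    """Change in argmax per unit change of values[index]."""
    bumped = list(values)
    bumped[index] += delta
    return (argmax(bumped) - argmax(values)) / delta


grads = [finite_difference_of_argmax(t, i) for i in range(len(t))]
print(argmax(t))
print(grads)
```

The output index only changes when a perturbation is large enough to overtake the current maximum, and at that point it jumps, so there is no useful slope to follow.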
Assuming after performing median frequency balancing for images used for segmentation, we have these class weights:
class_weights = {0: 0.2595,
                 1: 0.1826,
                 2: 4.5640,
                 3: 0.1417,
                 4: 0.9051,
                 5: 0.3826,
                 6: 9.6446,
                 7: 1.8418,
                 8: 0.6823,
                 9: 6.2478,
                 10: 7.3614,
                 11: 0.0}
The idea is to create a weight_mask such that it could be multiplied by the cross entropy output of both classes. To create this weight mask, we can broadcast the values based on the ground_truth labels or the predictions. Some mathematics in my implementation:
Both labels and logits are of shape [batch_size, height, width, num_classes]
The weight mask is of shape [batch_size, height, width, 1]
The weight mask is broadcast across the num_classes channels of the product of the softmax of the logits and the labels, giving an output of shape [batch_size, height, width, num_classes]. In this case, num_classes is 12.
Reduce sum for each example in a batch, then perform reduce mean for all examples in one batch to get a single scalar value of loss.
In this case, should we create the weight mask based on the predictions or the ground truth?
If we build it based on the ground_truth, then it means no matter what the predicted pixel labels are, they get penalized based on the actual labels of the class, which doesn't seem to guide the training in a sensible way.
But if we build it based on the predictions, then for whatever logit predictions that are produced, if the predicted label (from taking the argmax of the logit) is dominant, then the logit values for that pixel will all be reduced by a significant amount.
--> Although this means the maximum logit will still be the maximum since all of the logits in the 12 channels will be scaled by the same value, the final softmax probability of the label predicted (which is still the same before and after scaling), will be lower than before scaling (did some simple math to estimate). --> a lower loss is predicted
But the problem is this: If a lower loss is predicted as a result of this weighting, then wouldn't it contradict the idea that predicting dominant labels should give you a greater loss?
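To make the effect of scaling the logits concrete, here is a small plain-Python sketch for a single pixel. The logits and the weight are made-up numbers, and this only illustrates what multiplying logits by a weight below 1 does to softmax and cross entropy; it is not the TensorFlow code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Cross entropy of the softmax output against a single true class."""
    return -math.log(softmax(logits)[true_class])


logits = [3.0, 1.0, 0.0]   # hypothetical logits; class 0 is the predicted label
weight = 0.2               # hypothetical small weight for a dominant class
scaled = [weight * z for z in logits]

p_before = softmax(logits)[0]
p_after = softmax(scaled)[0]

# loss if the prediction (class 0) is correct
loss_correct_before = cross_entropy(logits, 0)
loss_correct_after = cross_entropy(scaled, 0)

# loss if the true class is actually class 2
loss_wrong_before = cross_entropy(logits, 2)
loss_wrong_after = cross_entropy(scaled, 2)

print(p_before, p_after)
print(loss_correct_before, loss_correct_after)
print(loss_wrong_before, loss_wrong_after)
```

In this example the scaling flattens the softmax: the predicted class stays the same, the loss for a correct prediction goes up, and the loss for a wrong prediction goes down. Whether the loss is lower after scaling therefore depends on whether the prediction was right.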
The impression I get in total for this method is that:
For the dominant labels, predictions are both penalized and rewarded much less.
For the less dominant labels, predictions are rewarded highly if they are correct, but they're also penalized heavily if they are wrong.
So how does this help to tackle the issue of class-balancing? I don't quite get the logic here.
Here is my current implementation for calculating the weighted cross entropy loss, although I'm not sure if it is correct.
import tensorflow as tf

def weighted_cross_entropy(logits, onehot_labels, class_weights):
    if not logits.dtype == tf.float32:
        logits = tf.cast(logits, tf.float32)
    if not onehot_labels.dtype == tf.float32:
        onehot_labels = tf.cast(onehot_labels, tf.float32)

    # Obtain the logit label predictions and form a skeleton weight mask with the same shape as it
    logit_predictions = tf.argmax(logits, -1)
    weight_mask = tf.zeros_like(logit_predictions, dtype=tf.float32)

    # Obtain the number of class weights to add to the weight mask
    num_classes = logits.get_shape().as_list()[3]

    # Form the weight mask mapping for each pixel prediction
    for i in range(num_classes):
        binary_mask = tf.equal(logit_predictions, i)             # Get only the positions for class i predicted in the logits prediction
        binary_mask = tf.cast(binary_mask, tf.float32)           # Convert boolean to ones and zeros
        class_mask = tf.multiply(binary_mask, class_weights[i])  # Multiply only the ones in the binary mask with the specific class_weight
        weight_mask = tf.add(weight_mask, class_mask)            # Add to the weight mask

    # Multiply the logits with the scaling based on the weight mask then perform cross entropy
    weight_mask = tf.expand_dims(weight_mask, 3)  # Expand the fourth dimension to 1 for broadcasting
    logits_scaled = tf.multiply(logits, weight_mask)
    return tf.losses.softmax_cross_entropy(onehot_labels=onehot_labels, logits=logits_scaled)
Could anyone verify whether my concept of this weighted loss is correct, and whether my implementation is correct? This is my first time getting acquainted with a dataset with imbalanced class, and so I would really appreciate it if anyone could verify this.
TESTING RESULTS: After doing some tests, I found the implementation above results in a greater loss. Is this supposed to be the case? i.e. Would this make the training harder but produce a more accurate model eventually?
Note that I have checked a similar thread here: How can I implement a weighted cross entropy loss in tensorflow using sparse_softmax_cross_entropy_with_logits
But it seems that TF only has a sample-wise weighting for loss but not a class-wise one.
Many thanks to all of you.
Here is my own implementation in Keras using the TensorFlow backend:
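import pickle
import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    # clip the predictions so that log(0) is never taken
    output = tf.clip_by_value(output, 10e-8, 1.-10e-8)
    # load the precomputed per-class weights
    with open('class_weights.pickle', 'rb') as f:
        weight = pickle.load(f)
    return -tf.reduce_sum(target * weight * tf.log(output))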
where weight is just a standard Python list with the indexes of the weights matched to those of the corresponding class in the one-hot vectors. I store the weights as a pickle file to avoid having to recalculate them. It is an adaptation of the Keras categorical_crossentropy loss function. The first line simply clips the value to make sure we never take the log of 0.
I am unsure why one would calculate the weights using the predictions rather than the ground truth; if you provide further explanation I can update my answer in response.
Edit: Play around with this numpy code to understand how this works. Also review the definition of cross entropy.
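import numpy as np

weights = [1, 2]

# one-hot targets and predicted probabilities, shape [batch, pixels, classes]
target = np.array([[[0.0, 1.0], [1.0, 0.0]]])
output = np.array([[[0.5, 0.5], [0.9, 0.1]]])

# per-pixel cross entropy, then the total over all pixels
crossentropy_matrix = -np.sum(target * np.log(output), axis=-1)
crossentropy = -np.sum(target * np.log(output))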
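To see what the weights actually do in the weighted version, here is a dependency-free sketch with the same illustrative weights and a two-pixel example. The helper name pixel_losses is made up for this illustration; it mirrors the -sum(target * weight * log(output)) expression from the loss function above, using ground-truth weighting.

```python
import math

class_weights = [1.0, 2.0]          # illustrative weights, one per class
target = [[0.0, 1.0], [1.0, 0.0]]   # one-hot ground truth for two pixels
output = [[0.5, 0.5], [0.9, 0.1]]   # predicted probabilities for the same pixels


def pixel_losses(target, output, class_weights=None):
    """Per-pixel cross entropy, optionally weighted by the true class."""
    losses = []
    for t_row, p_row in zip(target, output):
        loss = 0.0
        for c, (t, p) in enumerate(zip(t_row, p_row)):
            w = 1.0 if class_weights is None else class_weights[c]
            # only the true class (t == 1) contributes, scaled by its weight
            loss -= w * t * math.log(p)
        losses.append(loss)
    return losses


unweighted = pixel_losses(target, output)
weighted = pixel_losses(target, output, class_weights)

print(unweighted)
print(weighted)
```

Because the target is one-hot, the weight of the true class is the only one that matters for each pixel: the pixel whose ground truth is class 1 has its loss doubled, while the pixel whose ground truth is class 0 is unchanged.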
I'm using the ScikitLearn flavour of the DecisionTree.jl package to create a random forest model for a binary classification problem of one of the RDatasets data sets (see bottom of the DecisionTree.jl home page for what I mean by ScikitLearn flavour). I'm also using the MLBase package for model evaluation.
I have built a random forest model of my data and would like to create a ROC Curve for this model. Reading the documentation available, I do understand what a ROC curve is in theory. I just can't figure out how to create one for a specific model.
From the Wikipedia page the last part of the first sentence that I have marked in bold italics below is the one that is causing my confusion: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." There is more on the threshold value throughout the article but this still confuses me for binary classification problems. What is the threshold value and how do I vary it?
Also, in the MLBase documentation on ROC Curves it says "Compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres." But doesn't mention this threshold anywhere else really.
Example code for my project is given below. Basically, I want to create a ROC curve for the random forest but I'm not sure how to or if it's even appropriate.
using DecisionTree
using RDatasets
using MLBase
quakes_data = dataset("datasets", "quakes");
# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)
# Getting features and labels where label = 1 is mag > 5 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels.==0] = 2
# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)
# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)
# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))
# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
# p::T # positive in ground-truth
# n::T # negative in ground-truth
# tp::T # correct positive prediction
# tn::T # correct negative prediction
# fp::T # (incorrect) positive prediction when ground-truth is negative
# fn::T # (incorrect) negative prediction when ground-truth is positive
I also read this question and didn't find it helpful really.
The task in binary classification is to give a 0/1 (or true/false, red/blue) label to a new, unlabeled data point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points with known or predicted label 1, and lower for points with known or predicted label 0. To use this value to generate a 0/1 prediction, an additional threshold is used. Points with a value higher than the threshold are predicted to be labeled 1, and points with a value lower than the threshold are predicted to be labeled 0.
Why is this setup useful? Because sometimes mispredicting a 0 instead of a 1 is more costly, and then you can set the threshold low, making the algorithm predict 1s more often.
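As a tiny illustration of the threshold (in Python rather than Julia, with made-up scores), the same set of classifier outputs gives different 0/1 predictions depending on where the threshold is placed:

```python
# hypothetical continuous outputs of a classifier for six data points
scores = [0.10, 0.35, 0.40, 0.55, 0.80, 0.95]


def predict(scores, threshold):
    """Label a point 1 if its score is above the threshold, else 0."""
    return [1 if s > threshold else 0 for s in scores]


high = predict(scores, 0.7)            # strict: few points labeled 1
low = predict(scores, 0.3)             # lenient: most points labeled 1
never = predict(scores, float("inf"))  # extreme: nothing is labeled 1

print(high)
print(low)
print(never)
```

Lowering the threshold never turns a 1 into a 0; it can only add more 1s, which is what trades one kind of error for the other.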
In an extreme case when predicting 0 instead of a 1 costs nothing for the application, you can set the threshold at infinity, making it always output 0 (which is obviously the best solution, since it incurs no cost).
The threshold trick cannot eliminate errors from the classifier - no classifier in real-world problems is perfect or free from noise. What it can do is change the ratio between the 0-when-really-1 errors and 1-when-really-0 errors for the final classification.
As you increase the threshold, more points are classified with a 0 label. Consider a chart with the fraction of points classified with 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot a point for the resulting classifier on this chart. Plotting a point for all thresholds you get a curve. This is (some variant of) the ROC curve, which summarizes the abilities of the classifier. An often used metric for quality of classification is the AUC or area-under-curve of this chart, but in fact, the whole curve can be of interest in applications.
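Here is a minimal sketch of the threshold sweep in plain Python, with made-up labels and scores. Note that it uses the standard ROC axes (false positive rate on x, true positive rate on y) rather than the variant described above, and that it labels a point 1 when its score is greater than or equal to the threshold so that each observed score can serve as a threshold.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, one per threshold, from strictest to loosest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # start above every score (nothing predicted 1), then lower the threshold
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for thr in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# hypothetical ground truth and classifier scores
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.55, 0.80, 0.95]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

This also shows why hard 0/1 predictions are not enough to draw a curve: with only two distinct output values there is only one non-trivial threshold, so you need continuous scores (for a random forest, something like the fraction of trees voting for class 1) to get more than a single point.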
A summary like this appears in many texts on machine learning, which are a google query away.
Hope this clarifies the role of the threshold and its relation to ROC curves.
I know that a decision tree isn't affected by scaling the data, but when I scale the data for my decision tree it gives me bad performance (bad recall, precision and accuracy).
But when I don't scale, the decision tree gives me an amazing result on all the performance metrics. How can this be?
Note: I use GridSearchCV but I don't think that the cross validation is the reason for my problem. Here is my code:
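from sklearn import tree
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler

scaled = MinMaxScaler()
pca = PCA()
bestK = SelectKBest()
combined_transformers = FeatureUnion([("scale", scaled), ("best", bestK),
                                      ("pca", pca)])
clf = tree.DecisionTreeClassifier(class_weight="balanced")
pipeline = Pipeline([("features", combined_transformers), ("tree", clf)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__best__k=[1, 2, 3],
                  tree__max_depth=[4, 5])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring='f1')
# the first argument of this call is missing in the post; `features` is assumed to be the feature matrix
grid_search.fit(features, labels)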
With the scale function MinMaxScaler() my performance is:
f1 = 0.837209302326
recall = 1.0
precision = 0.72
accuracy = 0.948148148148
But without scaling:
f1 = 0.918918918919
recall = 0.944444444444
precision = 0.894736842105
accuracy = 0.977777777778
I am not familiar with scikit-learn, so excuse me if I misunderstand something.
First of all, does PCA standardize features? If it does not, it will give different results for scaled and non-scaled input.
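To illustrate the first point: PCA centers the data but, as far as I know, scikit-learn's PCA does not rescale features to a common range, so the feature with the largest variance dominates. The sketch below is plain Python on made-up 2-D data (not scikit-learn); it uses the closed-form angle of the first principal axis of a 2x2 covariance matrix to show how min-max scaling changes the direction PCA picks.

```python
import math


def first_pc_angle_deg(xs, ys):
    """Angle (degrees from the x-axis) of the first principal component of 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / n
    cyy = sum((y - my) ** 2 for y in ys) / n
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # orientation of the largest-variance axis of [[cxx, cxy], [cxy, cyy]]
    return math.degrees(0.5 * math.atan2(2.0 * cxy, cxx - cyy))


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# hypothetical data: x has a much larger range than y
xs = [0.0, 100.0, 200.0, 300.0, 400.0]
ys = [0.0, 1.0, 0.2, 0.9, 0.1]

angle_raw = first_pc_angle_deg(xs, ys)
angle_scaled = first_pc_angle_deg(min_max(xs), min_max(ys))

print(angle_raw)
print(angle_scaled)
```

On the raw data the first component is almost exactly the x-axis; after min-max scaling it points mostly along y. A tree trained on PCA outputs therefore sees different features in the two cases, even though the tree itself is insensitive to monotone rescaling of its inputs.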
Second, due to the randomness in splitting the samples, CV may give different results on each run. This will affect the results especially for a small sample size. In addition, if you have a small sample size, the results may not be that different after all.
I have the following suggestions:
Scaling can be treated as an additional hyperparameter, which can be optimized by CV.
Perform an extra CV (called nested CV) or hold-out to estimate performance. This is done by keeping a test set, selecting your model using CV on the training data and then evaluate its performance on the test set (in case of nested CV you do this repeatedly for all folds and average the performance estimates). Of course, your final model should be trained on the whole dataset. In general, you should not use the performance estimate of the CV used for model selection, as it will be overly optimistic.
I can't understand why dropout works like this in TensorFlow. The CS231n blog says that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the picture (taken from the same site).
From the TensorFlow site: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."
Now, why is the input element scaled up by 1/keep_prob? Why not keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
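As a rough sketch of the behaviour described in the documentation quote (plain Python, not TensorFlow's actual implementation; the function name and the use of a seeded random.Random are choices made for this illustration):

```python
import random


def inverted_dropout(values, keep_prob, rng):
    """Keep each element with probability keep_prob and scale kept elements by 1/keep_prob."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


rng = random.Random(0)              # seeded for reproducibility
activations = [1.0, 2.0, 3.0, 4.0]  # hypothetical layer outputs

train_out = inverted_dropout(activations, 0.5, rng)  # training: some zeros, the rest doubled
eval_out = inverted_dropout(activations, 1.0, rng)   # evaluation: identity

print(train_out)
print(eval_out)
```

With keep_prob == 1.0 nothing is dropped and nothing is rescaled, which is why the same function (or graph) can serve for both training and evaluation.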
Let's say the network had n neurons and we applied a dropout rate of 1/2.
Training phase: we would be left with n/2 neurons. So if you were expecting output x with all the neurons, now you will get only x/2. So for every batch, the network weights are trained according to this x/2.
Testing/inference/validation phase: we don't apply any dropout, so the output is x. The output would then be x and not x/2, which would give you an incorrect result. So what you can do is scale it to x/2 during testing.
Rather than applying the above scaling only in the testing phase, TensorFlow's dropout layer scales the output so that the sum stays constant, whether dropout is applied or not (training or testing).
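The x versus x/2 argument can be checked numerically with a short plain-Python sketch (made-up all-ones activations and a seeded generator, so the numbers are only illustrative):

```python
import random

rng = random.Random(0)           # seeded for reproducibility
n, keep_prob = 10_000, 0.5
activations = [1.0] * n          # hypothetical outputs of n neurons

# keep each neuron with probability keep_prob
mask = [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n)]

full_sum = sum(activations)                                   # "x": no dropout
dropped_sum = sum(a * m for a, m in zip(activations, mask))   # roughly x/2
rescaled_sum = dropped_sum / keep_prob                        # roughly x again

print(full_sum, dropped_sum, rescaled_sum)
```

The rescaled sum is not exactly x for a single random mask, only close to it; the match is exact in expectation.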
Here is a quick experiment to dispel any remaining confusion.
Statistically, the weights of an NN layer follow a distribution that is usually close to normal (but not necessarily). Even when you sample from a perfect normal distribution, there is always some sampling error in practice.
Then consider the following experiment:
from collections import defaultdict

import numpy as np

DIM = 1_000_000                      # set our dims for weights and input
x = np.ones((DIM, 1))                # our input vector
#x = np.random.rand(DIM,1)*2-1.0     # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3]         # define dropout probs

W = np.random.normal(size=(DIM, 1))  # sample normally distributed weights
print("x-mean = ", x.mean())
print("W-mean = ", W.mean())         # note the mean is not perfect --> sampling error!

h = defaultdict(list)
for i in range(1000):
    for p in probs:
        M = np.random.rand(DIM, 1)
        M = (M < p).astype(int)      # keep each weight with probability p
        Wp = W * M
        a = np.dot(Wp.T, x)          # linear activation with the masked weights
        h[str(p)].append(a)

for k, v in h.items():
    print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the Expected value, E(aL), will be probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0 which will be equal to p*aL + (1-p)*0 = p*aL. In the same network, during testing time there will be no dropout. Hence the layer L+1 will receive aL simply. But its weights were trained to expect p*aL as input. Therefore, during testing time you will have to multiply the activations with p. But instead of doing this, you can multiply the activations with 1/p during training only. This is called inverted dropout.
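Written out as a short derivation (a sketch, where m is an assumed symbol for the random keep/drop mask and p = keep_prob), the expectation argument looks like this:

```latex
\begin{align}
m &\sim \mathrm{Bernoulli}(p) \\
\mathbb{E}[m \, a_L] &= p \cdot a_L + (1 - p) \cdot 0 = p \, a_L
  && \text{(plain dropout: layer } L+1 \text{ sees } p \, a_L \text{ on average)} \\
\mathbb{E}\!\left[\frac{m \, a_L}{p}\right] &= \frac{1}{p} \, \mathbb{E}[m \, a_L] = a_L
  && \text{(inverted dropout: layer } L+1 \text{ sees } a_L \text{ on average)}
\end{align}
```

With the 1/p factor applied during training, the expected input to layer L+1 is already aL, so nothing needs to change at test time.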
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.