I'm confused about logistic loss and cross entropy loss in the binary classification scenario.
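According to Wikipedia (https://en.wikipedia.org/wiki/Loss_functions_for_classification), the logistic loss is defined as:
$$ V(v) = \frac{1}{\ln 2} \ln\left(1 + e^{-v}\right) $$
where v=y*y_hat
The cross entropy loss is defined as:
$$ V(y, \hat{y}) = -y \ln \hat{y} - (1-y) \ln(1-\hat{y}) $$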
From the Wikipedia (https://en.wikipedia.org/wiki/Loss_functions_for_classification):
It's easy to check that the logistic loss and binary cross entropy loss (Log loss) are in fact the same (up to a multiplicative constant 1/log(2))
However, when I tested it with some code, I found they are not the same. Here is the Python code:
from numpy import exp
from math import log
def cross_entropy_loss(y, yp):
    return -log(1-yp) if y==0 else -log(yp)
def logistic_loss(y, yp):
    return log(1+exp(-y*yp))/log(2)
y, yp = 0, 0.3 # y= {0, 1} for cross_entropy_loss
l1 = cross_entropy_loss(y, yp)
y, yp = -1, 0.3 # y = {-1, 1} for logistic loss
l2 = logistic_loss(y, yp)
print(l1, l2, l1/l2)
y, yp = 1, 0.9
l1 = cross_entropy_loss(y, yp)
l2 = logistic_loss(y, yp)
print(l1, l2, l1/l2)
The output shows that the loss values are not the same, nor is the ratio between them constant:
0.35667494393873245 1.2325740743522222 0.2893740436056004
0.10536051565782628 0.49218100325603786 0.21406863523949665
Could somebody explain why they are "in fact the same"?
In wikipedia v is defined as
v = -yf(x).
What is not defined in wikipedia is \hat{y} (i.e. the predicted label). It should be defined as (logistic function):
\hat{y} = 1/(1+exp(-f(x))).
By substituting the above definition into the logistic loss formula from Wikipedia, you should be able to recover the cross entropy loss. Note that the cross entropy loss equation you presented above is formulated for y={0,1}, while the equations from the Wikipedia article are for y={-1,1}. In your code, yp is a probability, but the logistic loss expects the raw score f(x), which is why the two values disagree.
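The substitution described above can be checked numerically. This is a minimal sketch, assuming the natural logarithm in both losses; the helper names `sigmoid` and `loss_ratio` and the sample scores are illustrative and not taken from the question.

```python
from math import exp, log

def sigmoid(f):
    """Logistic function mapping a raw score f(x) to a probability."""
    return 1.0 / (1.0 + exp(-f))

def cross_entropy_loss(y01, y_hat):
    """Binary cross entropy (natural log) for labels in {0, 1}."""
    return -log(y_hat) if y01 == 1 else -log(1.0 - y_hat)

def logistic_loss(ypm, f):
    """Logistic loss for labels in {-1, +1} and a raw score f(x)."""
    return log(1.0 + exp(-ypm * f)) / log(2.0)

def loss_ratio(ypm, f):
    """Ratio cross entropy / logistic loss for one (label, score) pair."""
    y01 = (ypm + 1) // 2  # map {-1, +1} -> {0, 1}
    return cross_entropy_loss(y01, sigmoid(f)) / logistic_loss(ypm, f)

# Arbitrary raw scores f(x); the probability is derived from them via sigmoid.
for ypm, f in [(-1, 0.3), (1, 0.9), (1, -2.0), (-1, -1.5)]:
    print(ypm, f, loss_ratio(ypm, f))
```

Once the probability is computed from the raw score rather than passed in directly, every ratio should come out as ln 2 (about 0.693) up to floating-point error, which is the constant factor the Wikipedia article refers to.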
I'm trying to solve the PCA problem:
For some number k and a dataset X, I'm trying to find w (the PCA matrix) such that:
w = argmax( E(WXX^T))
(I might be wrong with the formulation of the optimization goal. Please correct me).
I want to solve the optimization goal with gradient descent rather than with the usual SVD.
I'm basing my code on this Stats SE post:
How can one implement PCA using gradient descent?.
But, my code doesn't seem to find an optimal solution.
import torch
def get_gd_pca(X, w):
    k = w.shape[-1]
    LEARNING_RATE = 0.1
    EPOCHS = 200
    cov = torch.cov(X)
    lam = torch.rand(1, requires_grad=True)
    optimizer = torch.optim.SGD([w, lam], lr=LEARNING_RATE, maximize=True)
    for epoch in range(EPOCHS):
        optimizer.zero_grad()
        left_side_loss = torch.matmul(w.T, torch.matmul(cov, w))
        right_side_loss = torch.matmul((lam * torch.eye(k)), (torch.matmul(w.T, w) - 1))
        loss = torch.sum(left_side_loss - right_side_loss)
        loss.backward()
        optimizer.step()
        # Normalizing current_P
        w.data = w.data / torch.norm(w.data, dim=0)
    print('current_P', w)
I have two problems:
If I'm not using normalization (last row of the for loop) the w values just explode.
If I do use it, it only works with k=1. After that, the returned vectors are the same as the first one.
Any ideas what I have done wrong? I think it's connected to not using the Lagrangian correctly, but I don't understand why.
I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to keep the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0). Instead, compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance weighted estimate of the standard single-sample gradient, so the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
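The claim in the edit above rests on one identity: the gradient of the log-mean-exp objective with respect to each log-weight is exactly the normalized importance weight. The sketch below checks this with finite differences in plain Python. The log-weight values are hypothetical and the helper names are illustrative. In PyTorch, the stop_gradient would correspond to calling `.detach()` on `ws_norm` before multiplying.

```python
from math import exp, log

def log_mean_exp(log_w):
    """Stable log((1/K) * sum_k exp(log_w[k])): the IW-ELBO for one datapoint."""
    m = max(log_w)
    return m + log(sum(exp(v - m) for v in log_w)) - log(len(log_w))

def normalized_weights(log_w):
    """Self-normalized importance weights w_k / sum_j w_j (ws_norm in the question)."""
    m = max(log_w)
    w = [exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def numeric_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of fn at the list x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((fn(hi) - fn(lo)) / (2 * eps))
    return grad

# Hypothetical log importance weights for one datapoint with K = 4 samples.
log_w = [-120.3, -118.9, -125.0, -119.4]
weights = normalized_weights(log_w)

# Gradient of the true objective with respect to each log-weight.
grad_iwelbo = numeric_grad(log_mean_exp, log_w)

# Gradient of the surrogate sum_k c_k * log_w[k] with the weights c_k frozen,
# which is what a stop-gradient around the normalized weights achieves.
grad_surrogate = numeric_grad(
    lambda lw: sum(c * v for c, v in zip(weights, lw)), log_w
)

print(grad_iwelbo)
print(grad_surrogate)
print(weights)
```

All three printed lists should agree up to finite-difference error. If the weights are not frozen, the surrogate picks up an extra gradient term through the weights themselves and no longer matches the IW-ELBO gradient.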
I wrote a vanilla autoencoder using only Dense layer.
Below is my code:
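from keras.datasets import mnist
from keras.layers import Dense, Input
from keras.models import Model
iLayer = Input ((784,))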
layer1 = Dense(128, activation='relu' ) (iLayer)
layer2 = Dense(64, activation='relu') (layer1)
layer3 = Dense(28, activation ='relu') (layer2)
layer4 = Dense(64, activation='relu') (layer3)
layer5 = Dense(128, activation='relu' ) (layer4)
layer6 = Dense(784, activation='softmax' ) (layer5)
model = Model (iLayer, layer6)
model.compile(loss='binary_crossentropy', optimizer='adam')
(trainX, trainY), (testX, testY) = mnist.load_data()
print ("shape of the trainX", trainX.shape)
trainX = trainX.reshape(trainX.shape[0], trainX.shape[1]* trainX.shape[2])
print ("shape of the trainX", trainX.shape)
model.fit (trainX, trainX, epochs=5, batch_size=100)
Questions:
1) softmax provides a probability distribution. Understood. This means I would have a vector of 784 values, each a probability between 0 and 1. For example [0.02, 0.03, ... up to 784 items]; summing all 784 elements gives 1.
2) I don't understand how the binary crossentropy works with these values. Binary cross entropy is for two values of output, right?
In the context of autoencoders the input and output of the model are the same. So, if the input values are in the range [0,1], then it is acceptable to use sigmoid as the activation function of the last layer. Otherwise, you need to use an appropriate activation function for the last layer (e.g. linear, which is the default one).
As for the loss function, it again depends on the values of the input data. If the input data are only zeros and ones (with no values in between), then binary_crossentropy is acceptable as the loss function. Otherwise, you need to use another loss function such as 'mse' (i.e. mean squared error) or 'mae' (i.e. mean absolute error). Note that when the input values are in the range [0,1] you can still use binary_crossentropy, as is commonly done (e.g. Keras autoencoder tutorial and this paper). However, don't expect the loss value to reach zero, since binary_crossentropy does not return zero when the label is strictly between zero and one, even if the prediction equals the label. Here is a video from Hugo Larochelle where he explains the loss functions used in autoencoders (the part about using binary_crossentropy with inputs in range [0,1] starts at 5:30).
Concretely, in your example, you are using the MNIST dataset. So by default the values of MNIST are integers in the range [0, 255]. Usually you need to normalize them first:
trainX = trainX.astype('float32')
trainX /= 255.
Now the values would be in range [0,1]. So sigmoid can be used as the activation function and either of binary_crossentropy or mse as the loss function.
Why can binary_crossentropy be used even when the true label values (i.e. ground truth) are in the range [0,1]?
Note that we are trying to minimize the loss function in training. So if the loss function we have used reaches its minimum value (which is not necessarily zero) when the prediction is equal to the true label, then it is an acceptable choice. Let's verify this is the case for binary cross-entropy, which is defined as follows:
bce_loss = -y*log(p) - (1-y)*log(1-p)
where y is the true label and p is the predicted value. Let's consider y as fixed and see what value of p minimizes this function: we need to take the derivative with respect to p (I have assumed the log is the natural logarithm function for simplicity of calculations):
bce_loss_derivative = -y*(1/p) - (1-y)*(-1/(1-p)) = 0 =>
-y/p + (1-y)/(1-p) = 0 =>
-y*(1-p) + (1-y)*p = 0 =>
-y + y*p + p - y*p = 0 =>
p - y = 0 => y = p
As you can see, binary cross-entropy has its minimum value when y=p, i.e. when the true label is equal to the predicted label, and this is exactly what we are looking for.
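The derivation above can also be checked numerically. This is a small sketch using a grid search over p, assuming the natural logarithm; the function names `bce` and `argmin_bce` are illustrative.

```python
from math import log

def bce(y, p):
    """Binary cross entropy with natural log; y may be any value in [0, 1]."""
    return -y * log(p) - (1.0 - y) * log(1.0 - p)

def argmin_bce(y, steps=999):
    """Grid search for the p in (0, 1) that minimizes bce(y, p)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]  # 0.001 ... 0.999
    return min(grid, key=lambda p: bce(y, p))

for y in (0.1, 0.5, 0.8):
    p_star = argmin_bce(y)
    # Prints the label, the minimizing prediction, and the loss at the minimum.
    print(y, p_star, bce(y, p_star))
```

For each soft label the minimizing p coincides with y, while the loss at that point stays strictly positive. That is why the training loss of an autoencoder on [0,1] inputs does not go to zero even with a perfect reconstruction.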
I am new to machine learning and data science. Sorry, if it is a very stupid question.
I see there is an inbuilt function for cross-validation but not for a fixed validation set. I have a dataset with 50,000 samples labeled with years from 1990 to 2010. I need to train different classifiers on 1990-2008 samples, then validate on 2009 samples, and test on 2010 samples.
EDIT:
After @Quan Tran's answer, I tried this. Is this how it should be?
# Fit a decision tree
estimator1 = DecisionTreeClassifier( max_depth = 9, max_leaf_nodes=9)
estimator1.fit(X_train, y_train)
print estimator1
# validate using validation set
acc = np.zeros((20,20)) # store accuracy
for i in range(20):
    for j in range(20):
        estimator1 = DecisionTreeClassifier(max_depth = i+1, max_leaf_nodes=j+2)
        estimator1.fit(X_valid, y_valid)
        y_pred = estimator1.predict(X_valid)
        acc[i,j] = accuracy_score(y_valid, y_pred)
best_mod = np.where(acc == acc.max())
print best_mod
print acc[best_mod]
# Predict target values
estimator1 = DecisionTreeClassifier(max_depth = int(best_mod[0]) + 1, max_leaf_nodes= int(best_mod[1]) + 2)
estimator1.fit(X_valid, y_valid)
y_pred = estimator1.predict(X_test)
confusion = metrics.confusion_matrix(y_test, y_pred)
TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
# Classification Accuracy
print "======= ACCURACY ========"
print((TP + TN) / float(TP + TN + FP + FN))
print accuracy_score(y_valid, y_pred)
# store the predicted probabilities for class
y_pred_prob = estimator1.predict_proba(X_test)[:, 1]
# plot a ROC curve for y_test and y_pred_prob
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for DecisionTreeClassifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
print("======= AUC ========")
print(metrics.roc_auc_score(y_test, y_pred_prob))
I get this answer, which is not the best accuracy.
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=9,
max_features=None, max_leaf_nodes=9, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
(array([5]), array([19]))
[ 0.8489011]
======= ACCURACY ========
0.574175824176
0.538461538462
======= AUC ========
0.547632099893
In this case, there are three separate sets. The train set, the test set and the validation set.
The train set is used to fit the parameters of the classifier. For example:
clf = DecisionTreeClassifier(max_depth=2)
clf.fit(trainfeatures, labels)
The validation set is used to tune the hyperparameters of the classifier or to find the cutoff point for the training procedure. For example, in the case of a decision tree, max_depth is a hyperparameter. You will need to find a good set of hyperparameters by experimenting with different values (tuning) and comparing the performance measures (accuracy, precision, ...) on the validation set, while always fitting the model on the train set.
The test set is used to estimate the error rate on unseen data. After having the performance measures on the test set, the model must not be trained/tuned any further.
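The three roles described above can be sketched for a year-based split as follows. This is a minimal illustration on synthetic data, not the asker's dataset: the feature matrix, labels, the `years` array and the depth range are all assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: features, binary labels and a year per sample.
rng = np.random.default_rng(0)
n_samples = 2100
X = rng.normal(size=(n_samples, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 0).astype(int)
years = rng.integers(1990, 2011, size=n_samples)  # 1990..2010 inclusive

# Fixed split by year: train on 1990-2008, validate on 2009, test on 2010.
train, valid, test = years <= 2008, years == 2009, years == 2010

# Tune max_depth: always fit on the train set, score on the validation set.
best_depth, best_acc = None, -1.0
for depth in range(1, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X[train], y[train])
    acc = accuracy_score(y[valid], clf.predict(X[valid]))
    if acc > best_acc:
        best_depth, best_acc = depth, acc

# Refit with the chosen hyperparameter and report once on the untouched test set.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X[train], y[train])
test_acc = accuracy_score(y[test], final.predict(X[test]))
print(best_depth, best_acc, test_acc)
```

The key point is that `fit` only ever sees the train rows; the validation rows are used for scoring candidates and the test rows for a single final estimate. If a built-in search is preferred, scikit-learn's `PredefinedSplit` can express the same fixed validation fold for `GridSearchCV`.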
When using an RBF kernel in an SVM, why do the decision values of test samples far away from the training ones tend to equal the negative of the bias term b?
A consequence is that, once the SVM model is generated, if I set the bias term to 0, the decision values of test samples far away from the training ones tend to 0. Why does this happen?
In LibSVM, the bias term b is called rho. The decision value is the distance from the hyperplane.
I need to understand what defines this behavior. Does anyone understand that?
Running the following R script, you can see this behavior:
library(e1071)
library(mlbench)
data(Glass)
set.seed(2)
writeLines('separating training and testing samples')
testindex <- sort(sample(1:nrow(Glass), trunc(nrow(Glass)/3)))
training.samples <- Glass[-testindex, ]
testing.samples <- Glass[testindex, ]
writeLines('normalizing samples according to training samples between 0 and 1')
fnorm <- function(ran, data) {
  (data - ran[1]) / (ran[2] - ran[1])
}
minmax <- data.frame(sapply(training.samples[, -10], range))
training.samples[, -10] <- mapply(fnorm, minmax, training.samples[, -10])
testing.samples[, -10] <- mapply(fnorm, minmax, testing.samples[, -10])
writeLines('making the dataset binary')
training.samples$Type <- factor((training.samples$Type == 1) * 1)
testing.samples$Type <- factor((testing.samples$Type == 1) * 1)
writeLines('training the SVM')
svm.model <- svm(Type ~ ., data=training.samples, cost=1, gamma=2**-5)
writeLines('predicting the SVM with outlier samples')
points = c(0, 0.8, 1, # non-outliers
1.5, -0.5, 2, -1, 2.5, -1.5, 3, -2, 10, -9) # outliers
outlier.samples <- t(sapply(points, function(p) rep(p, 9)))
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
writeLines('')
writeLines('<< setting svm.model$rho to 0 >>')
writeLines('predicting the SVM with outlier samples')
svm.model$rho <- 0
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
Comments about the code:
It uses a dataset of 9 dimensions.
It splits the dataset into training and testing.
It normalizes the samples between 0 and 1 for all dimensions.
It makes the problem binary.
It fits an SVM model.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be the negative of the bias term b generated by the model.
It sets the bias term b to 0.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be 0.
Do you mean negative of the bias term instead of inverse?
The decision function of the SVM is sign(w^T x - rho), where rho is the bias term, w is the weight vector, and x is the input. But that's in the primal space / linear form. In the kernelized form, w^T x is replaced by a weighted sum of kernel evaluations between x and the support vectors, and the kernel in this case is the RBF kernel.
The RBF kernel is defined as K(x, x') = exp(-γ ||x - x'||^2). If the distance between two points is very large, squaring it gives a huge number. γ is a positive number, so multiplying by -γ turns that huge value into a hugely negative one. exp(-10) is already on the order of 5*10^-5, so for far away points the RBF kernel becomes essentially zero. If a sample is far away from all of your training data, then all of the kernel values will be nearly zero, which means the term that replaces w^T x will be nearly zero. What you are left with is essentially a decision value of 0 - rho, i.e. the negative of your bias term, and the predicted class is sign(-rho).
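The argument above can be reproduced with a toy kernel SVM decision function in plain Python. The support vectors, coefficients and rho below are hypothetical values chosen for illustration, not the ones fitted by the R script; only gamma matches the question's setting of 2**-5.

```python
from math import exp

def rbf_kernel(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) for two equal-length vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return exp(-gamma * sq_dist)

def decision_value(x, support_vectors, coefs, rho, gamma):
    """Kernel SVM decision value: sum_i coef_i * K(sv_i, x) - rho."""
    kernel_sum = sum(
        c * rbf_kernel(sv, x, gamma) for sv, c in zip(support_vectors, coefs)
    )
    return kernel_sum - rho

# Hypothetical model in 9 dimensions with features scaled to [0, 1].
support_vectors = [[0.2] * 9, [0.5] * 9, [0.9] * 9]
coefs = [1.0, -0.7, 0.4]  # alpha_i * y_i for each support vector
rho = 0.35
gamma = 2 ** -5

# Move the query point further and further away from the training range.
for p in (0.5, 1.5, 3.0, 10.0, 50.0):
    x = [p] * 9
    print(p, decision_value(x, support_vectors, coefs, rho, gamma))
```

As p moves away from [0, 1], every kernel term shrinks towards zero and the printed decision value approaches -rho. Passing rho=0.0 instead makes the same far-away points approach 0, which matches the second observation in the question.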