I was reading this question and the discussion makes sense to me: when all weights are initialized to zero, gradient descent can't tell where the error came from, so it can't update those weights.
What I don't understand is why I can't see this empirically. I'm running the following piece of code (runnable here):
w = tf.Variable(tf.zeros([2,1]))
b = tf.Variable(tf.zeros([1]))
x = tf.placeholder(tf.float32, shape=[1, 2])
y = tf.placeholder(tf.float32, shape=[1])
pred = tf.sigmoid(tf.matmul(x, w) + b)
loss = tf.reduce_mean(tf.square(pred - y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(100):
        for x_ex, y_ex in dataset:
            sess.run(train_step, feed_dict={x: x_ex, y: y_ex})
            print(sess.run(w))
And the output I'm seeing is like:
[[ 0.]
[ 0.]]
[[ 0.02530853]
[ 0. ]]
[[ 0.02530853]
[ 0.02499614]]
[[-0.00059909]
[-0.00091148]]
[[-0.00059909]
[-0.00091148]]
[[ 0.02472398]
[-0.00091148]]
[[ 0.02472398]
[ 0.02410331]]
If the weights start out as zero, why is gradient descent able to update them at all?
As a follow-up question: if a weight is randomly initialized to be positive, but the optimal value for that weight is negative, do we just have to trust that an update step won't accidentally set the weight to exactly 0 (and thus halt the weight's updatability)? I know the odds of weight + update step being exactly 0 are almost negligible, but it could still be an issue, especially with millions of weights in a NN.
It's not necessarily a problem in gradient descent itself, but in how the partial derivatives are calculated with backpropagation.
How backpropagation computes the partial derivative for the weights in layer l:
$$ \frac{\partial}{\partial \Theta^{l}_{ij}} = a^{l}_{j}\,\delta^{l+1}_{i} $$
where the activation a applies the non-linear function g (e.g. sigmoid, tanh, ReLU) to the neuron's output:
$$ a^{l}_{j} = g(\Theta^{l-1} a^{l-1}) $$
and where delta is the difference propagated backwards from the successive layer:
$$ \delta^{l} = (\Theta^{l})^{T} \delta^{l+1} \mathbin{.*} g'(\Theta^{l-1} a^{l-1}) $$
The .* stands for element-wise multiplication.
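So if you look at how the activation is computed, all-zero weights
give every neuron an input of zero, so every activation is the constant g(0)
(zero for tanh and ReLU, 0.5 for sigmoid) regardless of the data. The delta of
a layer is also multiplied by the zero weights of the layer after it, so it
vanishes for every layer except the last one.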
There are other ways to calculate the gradient which do not have this issue!
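To make the formulas above concrete, here is a minimal sketch in plain Python of the gradients at an all-zero initialization. The single training example, the squared-error loss, and the helper names are illustrative assumptions, not taken from the question's dataset. It contrasts a single-layer sigmoid model like the one in the question with a network that has one hidden layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def single_layer_grad(x, y, w, b):
    """Gradient of (sigmoid(w.x + b) - y)^2 with respect to w."""
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    delta = 2.0 * (pred - y) * pred * (1.0 - pred)  # dLoss/dz at the output
    return [xi * delta for xi in x]

def hidden_layer_grads(x, y, W1, w2):
    """Gradients for a two-layer sigmoid network without biases.

    W1[i][j] connects input j to hidden unit i; w2[i] connects hidden
    unit i to the single output.
    """
    z1 = [sum(wij * xj for wij, xj in zip(row, x)) for row in W1]
    a1 = [sigmoid(z) for z in z1]
    pred = sigmoid(sum(wi * ai for wi, ai in zip(w2, a1)))
    delta_out = 2.0 * (pred - y) * pred * (1.0 - pred)
    grad_w2 = [ai * delta_out for ai in a1]
    # Hidden delta: (w2 * delta_out) .* g'(z1), with g'(z) = a * (1 - a)
    delta_hidden = [wi * delta_out * ai * (1.0 - ai) for wi, ai in zip(w2, a1)]
    grad_W1 = [[xj * dh for xj in x] for dh in delta_hidden]
    return grad_W1, grad_w2

x, y = [1.0, 0.0], 1.0  # made-up example
g_single = single_layer_grad(x, y, w=[0.0, 0.0], b=0.0)
g_W1, g_w2 = hidden_layer_grads(x, y, W1=[[0.0, 0.0], [0.0, 0.0]], w2=[0.0, 0.0])
print(g_single)  # non-zero wherever the input is non-zero
print(g_W1)      # all zeros: the hidden delta is multiplied by zero weights
print(g_w2)      # identical entries: the hidden units stay symmetric
```

In the single-layer case the weight gradient is the input times the output delta, so it is non-zero whenever the input is, which is consistent with the weights in the question moving away from zero. With a hidden layer, the first-layer gradient vanishes and both hidden units receive the same second-layer gradient.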
I'm trying to solve the PCA problem:
For some number k and a dataset X, I'm trying to find w (the PCA matrix) such that:
w = argmax( E(WXX^T))
(I might be wrong with the formulation of the optimization goal. Please correct me).
I want to solve the optimization goal with gradient descent rather than with SVD as usual.
I'm basing my code on this Stats SE post:
How can one implement PCA using gradient descent?.
But, my code doesn't seem to find an optimal solution.
import torch

def get_gd_pca(X, w):
    k = w.shape[-1]
    LEARNING_RATE = 0.1
    EPOCHS = 200
    cov = torch.cov(X)
    lam = torch.rand(1, requires_grad=True)
    optimizer = torch.optim.SGD([w, lam], lr=LEARNING_RATE, maximize=True)
    for epoch in range(EPOCHS):
        optimizer.zero_grad()
        left_side_loss = torch.matmul(w.T, torch.matmul(cov, w))
        right_side_loss = torch.matmul((lam * torch.eye(k)), (torch.matmul(w.T, w) - 1))
        loss = torch.sum(left_side_loss - right_side_loss)
        loss.backward()
        optimizer.step()
        # Normalizing current_P
        w.data = w.data / torch.norm(w.data, dim=0)
    print('current_P', w)
I have two problems:
If I'm not using normalization (last row of the for loop) the w values just explode.
If I do use it, it only works with k=1. After that, the returned vectors are the same as the first one.
Any ideas what I have done wrong? I think it's connected to not using the Lagrangian correctly, but I don't understand why.
I've been implementing VAE and IWAE models on the caltech silhouettes dataset and am having an issue where the VAE outperforms IWAE by a modest margin (test LL ~120 for VAE, ~133 for IWAE!). I don't believe this should be the case, according to both theory and experiments produced here.
I'm hoping someone can find some issue in how I'm implementing that's causing this to be the case.
The network I'm using to approximate q and p is the same as that detailed in the appendix of the paper above. The calculation part of the model is below:
data_k_vec = data.repeat_interleave(K,0) # Generate K samples (in my case K=50 is producing this behavior)
mu, log_std = model.encode(data_k_vec)
z = model.reparameterize(mu, log_std) # z = mu + torch.exp(log_std)*epsilon (epsilon ~ N(0,1))
decoded = model.decode(z) # this is the sigmoid output of the model
log_prior_z = torch.sum(-0.5 * z ** 2, 1)-.5*z.shape[1]*T.log(torch.tensor(2*np.pi))
log_q_z = compute_log_probability_gaussian(z, mu, log_std) # Definitions below
log_p_x = compute_log_probability_bernoulli(decoded,data_k_vec)
if model_type == 'iwae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, K)
elif model_type =='vae':
    log_w_matrix = (log_prior_z + log_p_x - log_q_z).view(-1, 1)*1/K
log_w_minus_max = log_w_matrix - torch.max(log_w_matrix, 1, keepdim=True)[0]
ws_matrix = torch.exp(log_w_minus_max)
ws_norm = ws_matrix / torch.sum(ws_matrix, 1, keepdim=True)
ws_sum_per_datapoint = torch.sum(log_w_matrix * ws_norm, 1)
loss = -torch.sum(ws_sum_per_datapoint) # value of loss that gets returned to training function. loss.backward() will get called on this value
Here are the likelihood functions. I had to fuss with the bernoulli LL in order to not get nan during training
def compute_log_probability_gaussian(obs, mu, logstd, axis=1):
    return torch.sum(-0.5 * ((obs-mu) / torch.exp(logstd)) ** 2 - logstd, axis)-.5*obs.shape[1]*T.log(torch.tensor(2*np.pi))
def compute_log_probability_bernoulli(theta, obs, axis=1): # Add 1e-18 to avoid nan appearances in training
    return torch.sum(obs*torch.log(theta+1e-18) + (1-obs)*torch.log(1-theta+1e-18), axis)
In this code there's a "shortcut" being used in that the row-wise importance weights are being calculated in the model_type=='iwae' case for the K=50 samples in each row, while in the model_type=='vae' case the importance weights are being calculated for the single value left in each row, so that it just ends up calculating a weight of 1. Maybe this is the issue?
Any and all help is huge - I thought that addressing the nan issue would permanently get me out of the weeds but now I have this new problem.
EDIT:
Should add that the training scheme is the same as that in the paper linked above. That is, for each of i=0....7 rounds train for 2**i epochs with a learning rate of 1e-4 * 10**(-i/7)
The K-sample importance weighted ELBO is
$$ \textrm{IW-ELBO}(x,K) = \log \frac{1}{K} \sum_{k=1}^K \frac{p(x \vert z_k) p(z_k)}{q(z_k;x)}$$
For the IWAE there are K samples originating from each datapoint x, so you want to keep the same latent statistics mu_z, Sigma_z obtained through the amortized inference network, but sample z K times for each x.
So it's computationally wasteful to compute the forward pass for data_k_vec = data.repeat_interleave(K,0). You should compute the forward pass once for each original datapoint, then repeat the statistics output by the inference network for sampling:
mu = torch.repeat_interleave(mu,K,0)
log_std = torch.repeat_interleave(log_std,K,0)
Then sample z_k. And now repeat your datapoints data_k_vec = data.repeat_interleave(K,0), and use the resulting tensor to efficiently evaluate the conditional p(x |z_k) for each importance sample z_k.
Note you may also want to use the logsumexp operation when calculating the IW-ELBO for numerical stability. I can't quite figure out what's going on with the log_w_matrix calculation in your post, but this is what I would do:
log_pz = ...
log_qzCx = ....
log_pxCz = ...
log_iw = log_pxCz + log_pz - log_qzCx
log_iw = log_iw.reshape(-1, K)
iwelbo = torch.logsumexp(log_iw, dim=1) - np.log(K)
EDIT: Actually, after thinking about it a bit and using the score function identity, you can interpret the IWAE gradient as an importance-weighted estimate of the standard single-sample gradient. So the method in the OP for calculating the importance weights is equivalent (if a bit wasteful), provided you place a stop_gradient operator around the normalized importance weights, which you call ws_norm. So I think the main problem is the absence of this stop_gradient operator.
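As a rough illustration of why logsumexp is worth using here, the following plain-Python sketch computes the bound for a single datapoint from made-up log importance weights. The function names and the numbers are assumptions for illustration only; in the real model the log weights come from log p(x|z_k) + log p(z_k) - log q(z_k;x), and you would use torch.logsumexp on tensors.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) obtained by factoring out the maximum."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def iw_elbo(log_weights):
    """log((1/K) * sum_k exp(log_w_k)) for one datapoint."""
    return logsumexp(log_weights) - math.log(len(log_weights))

def averaged_elbo(log_weights):
    """Plain average of the log weights (the K-sample VAE bound)."""
    return sum(log_weights) / len(log_weights)

# Made-up log importance weights of a realistic magnitude for image data.
log_w = [-1000.0, -1001.0, -1002.0]

naive_sum = sum(math.exp(v) for v in log_w)  # every term underflows
bound_iw = iw_elbo(log_w)
bound_vae = averaged_elbo(log_w)
print(naive_sum, bound_iw, bound_vae)
```

Exponentiating the weights directly underflows, so taking the log of that sum would fail, while the logsumexp form stays finite. By Jensen's inequality the importance-weighted bound is never below the averaged one for the same weights, which is a quick sanity check to run on a trained model.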
I wrote a vanilla autoencoder using only Dense layer.
Below is my code:
iLayer = Input ((784,))
layer1 = Dense(128, activation='relu' ) (iLayer)
layer2 = Dense(64, activation='relu') (layer1)
layer3 = Dense(28, activation ='relu') (layer2)
layer4 = Dense(64, activation='relu') (layer3)
layer5 = Dense(128, activation='relu' ) (layer4)
layer6 = Dense(784, activation='softmax' ) (layer5)
model = Model (iLayer, layer6)
model.compile(loss='binary_crossentropy', optimizer='adam')
(trainX, trainY), (testX, testY) = mnist.load_data()
print ("shape of the trainX", trainX.shape)
trainX = trainX.reshape(trainX.shape[0], trainX.shape[1]* trainX.shape[2])
print ("shape of the trainX", trainX.shape)
model.fit (trainX, trainX, epochs=5, batch_size=100)
Questions:
1) softmax provides probability distribution. Understood. This means, I would have a vector of 784 values with probability between 0 and 1. For example [ 0.02, 0.03..... upto 784 items], summing all 784 elements provides 1.
2) I don't understand how the binary crossentropy works with these values. Binary cross entropy is for two values of output, right?
In the context of autoencoders the input and output of the model are the same. So, if the input values are in the range [0,1], then it is acceptable to use sigmoid as the activation function of the last layer. Otherwise, you need to use an appropriate activation function for the last layer (e.g. linear, which is the default one).
As for the loss function, it again comes down to the values of the input data. If the input data consist only of zeros and ones (with no values in between), then binary_crossentropy is acceptable as the loss function. Otherwise, you would typically use other loss functions such as 'mse' (i.e. mean squared error) or 'mae' (i.e. mean absolute error). Note that when the input values lie in the range [0,1] you can still use binary_crossentropy, as is commonly done (e.g. Keras autoencoder tutorial and this paper). However, don't expect the loss value to become zero: binary_crossentropy returns zero only when the prediction and the label are both zero or both one, and for labels strictly between zero and one it stays positive even when the prediction equals the label. Here is a video from Hugo Larochelle where he explains the loss functions used in autoencoders (the part about using binary_crossentropy with inputs in range [0,1] starts at 5:30)
Concretely, in your example, you are using the MNIST dataset. So by default the values of MNIST are integers in the range [0, 255]. Usually you need to normalize them first:
trainX = trainX.astype('float32')
trainX /= 255.
Now the values would be in range [0,1]. So sigmoid can be used as the activation function and either of binary_crossentropy or mse as the loss function.
Why can binary_crossentropy be used even when the true label values (i.e. ground-truth) are in the range [0,1]?
Note that we are trying to minimize the loss function in training. So if the loss function we use reaches its minimum value (which is not necessarily zero) when the prediction is equal to the true label, then it is an acceptable choice. Let's verify this is the case for binary cross-entropy, which is defined as follows:
bce_loss = -y*log(p) - (1-y)*log(1-p)
where y is the true label and p is the predicted value. Let's consider y as fixed and see what value of p minimizes this function: we need to take the derivative with respect to p (I have assumed the log is the natural logarithm function for simplicity of calculations):
bce_loss_derivative = -y*(1/p) - (1-y)*(-1/(1-p)) = 0 =>
-y/p + (1-y)/(1-p) = 0 =>
-y*(1-p) + (1-y)*p = 0 =>
-y + y*p + p - y*p = 0 =>
p - y = 0 => y = p
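As you can see, binary cross-entropy has its minimum value when y=p, i.e. when the true label is equal to the predicted label, and this is exactly what we are looking for. (The second derivative, y/p^2 + (1-y)/(1-p)^2, is positive, so this critical point is indeed a minimum.)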
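The derivation can also be checked numerically. This is a small sketch in plain Python, where the label value 0.3 and the grid resolution are arbitrary choices for illustration: it scans predictions p in (0, 1) and reports where the loss is smallest and how large it still is there.

```python
import math

def bce(y, p):
    """Binary cross-entropy for one label y and one prediction p."""
    return -y * math.log(p) - (1.0 - y) * math.log(1.0 - p)

y = 0.3  # a "soft" label strictly between 0 and 1
grid = [i / 1000.0 for i in range(1, 1000)]  # candidate predictions in (0, 1)

best_p = min(grid, key=lambda p: bce(y, p))
min_loss = bce(y, best_p)
print(best_p, min_loss)
```

The minimizing prediction coincides with the label, while the loss at that point remains clearly above zero (it equals the entropy of the label), which matches the remark that the loss need not reach zero for inputs in [0,1].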
Using an RBF kernel in an SVM, why does the decision value of test samples far away from the training ones tend to be equal to the negative of the bias term b?
A consequence is that, once the SVM model is generated, if I set the bias term to 0, the decision value of test samples far away from the training ones tends to 0. Why does this happen?
Using the LibSVM, the bias term b is the rho. The decision value is the distance from the hyperplane.
I need to understand what defines this behavior. Does anyone understand that?
Running the following R script, you can see this behavior:
library(e1071)
library(mlbench)
data(Glass)
set.seed(2)
writeLines('separating training and testing samples')
testindex <- sort(sample(1:nrow(Glass), trunc(nrow(Glass)/3)))
training.samples <- Glass[-testindex, ]
testing.samples <- Glass[testindex, ]
writeLines('normalizing samples according to training samples between 0 and 1')
fnorm <- function(ran, data) {
(data - ran[1]) / (ran[2] - ran[1])
}
minmax <- data.frame(sapply(training.samples[, -10], range))
training.samples[, -10] <- mapply(fnorm, minmax, training.samples[, -10])
testing.samples[, -10] <- mapply(fnorm, minmax, testing.samples[, -10])
writeLines('making the dataset binary')
training.samples$Type <- factor((training.samples$Type == 1) * 1)
testing.samples$Type <- factor((testing.samples$Type == 1) * 1)
writeLines('training the SVM')
svm.model <- svm(Type ~ ., data=training.samples, cost=1, gamma=2**-5)
writeLines('predicting the SVM with outlier samples')
points = c(0, 0.8, 1, # non-outliers
1.5, -0.5, 2, -1, 2.5, -1.5, 3, -2, 10, -9) # outliers
outlier.samples <- t(sapply(points, function(p) rep(p, 9)))
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
writeLines('')
writeLines('<< setting svm.model$rho to 0 >>')
writeLines('predicting the SVM with outlier samples')
svm.model$rho <- 0
svm.pred <- predict(svm.model, testing.samples[, -10], decision.values=TRUE)
svm.pred.outliers <- predict(svm.model, outlier.samples, decision.values=TRUE)
writeLines('') # printing
svm.pred.dv <- c(attr(svm.pred, 'decision.values'))
svm.pred.outliers.dv <- c(attr(svm.pred.outliers, 'decision.values'))
names(svm.pred.outliers.dv) <- points
writeLines('test sample decision values')
print(head(svm.pred.dv))
writeLines('non-outliers and outliers decision values')
print(svm.pred.outliers.dv)
writeLines('svm.model$rho')
print(svm.model$rho)
Comments about the code:
It uses a dataset of 9 dimensions.
It splits the dataset into training and testing.
It normalizes the samples between 0 and 1 for all dimensions.
It makes the problem binary.
It fits an SVM model.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be the negative of the bias term b generated by the model.
It sets the bias term b to 0.
It predicts the testing samples, getting the decision values.
It predicts some synthetic (outlier) samples outside [0, 1] in the feature space, getting the decision values.
It shows that the decision value for outliers tends to be 0.
Do you mean negative of the bias term instead of inverse?
The decision function of the SVM is sign(w^T x - rho), where rho is the bias term, w is the weight vector, and x is the input. But that's in the primal space / linear form. With a kernel, w^T x is replaced by a weighted sum of kernel evaluations between x and the support vectors, which in this case use the RBF kernel.
The RBF kernel is defined as K(x, x') = exp(-γ ||x - x'||^2). So if the distance between two points is very large, squaring it gives a huge number. γ is a positive number, so multiplying by -γ turns that huge value into a hugely negative one. exp(-10) is already on the order of 5*10^-5, so for faraway points the RBF kernel becomes essentially zero. If a sample is far away from all of your training data, then all of the kernel products will be nearly zero. That means the kernel expansion that replaces w^T x will be nearly zero, and what you are left with is essentially sign(0 - rho), i.e. a decision value equal to the negative of your bias term.
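Here is a small sketch of that argument in plain Python. The support vectors, coefficients, rho, and gamma are all made up for illustration (gamma is deliberately larger than the 2**-5 in the question so the effect is visible in two dimensions); they are not the values e1071 would fit on the Glass data.

```python
import math

def rbf(u, v, gamma):
    """RBF kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, coefs, rho, gamma):
    """sum_i coef_i * K(sv_i, x) - rho, as in a LibSVM-style model."""
    kernel_sum = sum(c * rbf(sv, x, gamma) for sv, c in zip(support_vectors, coefs))
    return kernel_sum - rho

# Hypothetical trained model (all numbers are assumptions).
support_vectors = [[0.2, 0.4], [0.7, 0.1], [0.5, 0.9]]
coefs = [1.0, -0.6, -0.4]
rho, gamma = 0.35, 2.0

near = decision_value([0.5, 0.5], support_vectors, coefs, rho, gamma)
far = decision_value([10.0, -9.0], support_vectors, coefs, rho, gamma)
print(near, far, -rho)
```

For the point inside the training range the kernel terms contribute noticeably, whereas for the faraway point every kernel term is vanishingly small and the decision value collapses to -rho. Setting rho to 0 in this sketch would make the faraway decision value tend to 0, matching the second observation in the question.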
I was trying to implement an XOR gate with TensorFlow. I succeeded, but I don't fully understand why it works. I got help from Stack Overflow posts here and here, covering versions both with and without one-hot true outputs. Here is the network as I understand it, to make things clear.
My Question #1:
Notice the ReLU function and the sigmoid function. Why do we need them (specifically the ReLU function)? You may say it is in order to achieve non-linearity. I understand how ReLU achieves non-linearity; I got the answer from here. Now, from what I understand, the difference between using ReLU and not using ReLU is this (see the picture). [I tested the tf.nn.relu function. The output is like this]
Now, if the first function works, why not the second function? From my perspective ReLU achieves non-linearity by combining multiple linear functions, so both functions (the upper two) are built from linear pieces. If the first one achieves non-linearity, the second one should too, shouldn't it? The question is: without the ReLU, why does the network get stuck?
XOR gate with one hot true outputs
import numpy as np
import tensorflow as tf

hidden1_neuron = 10

def Network(x, weights, bias):
    layer1 = tf.nn.relu(tf.matmul(x, weights['h1']) + bias['h1'])
    layer_final = tf.matmul(layer1, weights['out']) + bias['out']
    return layer_final

weight = {
    'h1' : tf.Variable(tf.random_normal([2, hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([hidden1_neuron, 2]))
}
bias = {
    'h1' : tf.Variable(tf.random_normal([hidden1_neuron])),
    'out': tf.Variable(tf.random_normal([2]))
}
x = tf.placeholder(tf.float32, [None, 2])
y = tf.placeholder(tf.float32, [None, 2])
net = Network(x, weight, bias)
cross_entropy = tf.nn.softmax_cross_entropy_with_logits(net, y)
loss = tf.reduce_mean(cross_entropy)
train_op = tf.train.AdamOptimizer(0.2).minimize(loss)
init_op = tf.initialize_all_variables()
xTrain = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
yTrain = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(5000):
        train_data = sess.run(train_op, feed_dict={x: xTrain, y: yTrain})
        loss_val = sess.run(loss, feed_dict={x: xTrain, y: yTrain})
        if(not(i%500)):
            print(loss_val)
    result = sess.run(net, feed_dict={x:xTrain})
    print(result)
The code you see above implements the XOR gate with one-hot true outputs. If I take out tf.nn.relu, the network gets stuck. Why?
My Question #2:
How can I tell whether a network is going to get stuck in some local minimum (or at some value)? Is it from the plot of the cost function (or loss function)? Say, for the network designed above, I used cross entropy as the loss function. I could not find a plot of the cross entropy function. (If you can provide this, it would be very helpful.)
My Question #3:
Notice that in the code there is a line hidden1_neuron = 10. It means that I have set the number of neurons in the hidden layer to 10. Reducing the number of neurons to 5 makes the network get stuck. So what should the number of neurons in the hidden layer be?
The output when the network works the way it is supposed to :
2.42076
0.000456363
0.000149548
7.40216e-05
4.34194e-05
2.78939e-05
1.8924e-05
1.33214e-05
9.62602e-06
7.06308e-06
[[ 7.5128479 -7.58900356]
[-5.65254211 5.28509617]
[-6.96340656 6.62380219]
[ 7.26610374 -5.9665451 ]]
The output when the network gets stuck:
1.45679
0.346579
0.346575
0.346575
0.346574
0.346574
0.346574
0.346574
0.346574
0.346574
[[ 15.70696926 -18.21559143]
[ -7.1562047 9.75774956]
[ -0.03214722 -0.03214724]
[ -0.03214722 -0.03214724]]
Question 1
Both the ReLU and the sigmoid function are non-linear. In contrast, the function drawn to the right of the ReLU function is linear. Stacking layers with linear activation functions still leaves the whole network linear, because a composition of linear maps is itself a linear map.
Therefore, the network gets stuck when trying to perform linear regression on a non-linear problem.
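The collapse of stacked linear layers can be sketched in a few lines of plain Python. The weights below are made up for a 2-3-2 network with no activation function, and the helper names are illustrative only. The point is that W2(W1 x + b1) + b2 equals (W2 W1) x + (W2 b1 + b2), so the hidden layer adds no expressive power.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Made-up weights; rows are output units, columns are input units.
W1 = [[1.0, -2.0], [0.5, 3.0], [-1.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[2.0, 0.0, -1.0], [1.0, 1.0, 1.0]]
b2 = [0.0, 1.0]

def two_layer(x):
    """Hidden layer followed by output layer, with no non-linearity."""
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The same function written as a single linear layer.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)

def one_layer(x):
    return add(matvec(W, x), b)

inputs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
outputs_two = [two_layer(x) for x in inputs]
outputs_one = [one_layer(x) for x in inputs]
print(outputs_two == outputs_one)
```

Since the two versions agree on every input, removing the ReLU leaves a model that can only draw a single linear decision boundary, which cannot separate the XOR classes.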
Question 2
Yes, you will have to pay attention to the progression of the error rate. In larger problem instances, you would typically pay attention to the development of the error function on your test set. This is done by measuring the accuracy of the network after a period of training.
Question 3
The XOR problem requires at least 2 input, 2 hidden, and 1 output node, that is: five nodes are required to correctly model the XOR problem with a simple neural network.
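As an illustration that two hidden units are enough to represent XOR, here is a sketch in plain Python with hand-picked weights and a ReLU hidden layer. The weights are an assumption chosen by hand, not something the training code in the question would necessarily find.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    """2 inputs -> 2 ReLU hidden units -> 1 linear output."""
    h1 = relu(x1 + x2)        # counts how many inputs are on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2      # cancels the (1, 1) case back to 0

table = {(a, b): xor_net(float(a), float(b)) for a in (0, 1) for b in (0, 1)}
print(table)
```

This only shows that a solution with two hidden units exists. Gradient-based training from a random start may still fail to reach it, which is one plausible reason a small hidden layer gets stuck more often than a larger one.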