I'm trying to implement the basic example from https://en.wikipedia.org/wiki/Variational_Bayesian_methods#A_basic_example in order to find the posterior distribution over the mean and variance for the input data(which I generate in the beginning).
It is my understanding that the approximated posterior should be given by the product of q(mu)*q(tau) so I thought I could get it by simple multiplying each distribution for each point in the grid and then plot it. Although I can't see any error with my distributions, the gamma distribution produces extremely small values while the Gaussian distributions only has one non-zero element. My guess is that there is something wrong in the end where I multiply the two distributions for each point in the grid produced by meshgrid but I just wrote both distributions straight from Wikipedia. Why are my posterior probabilities so small/nan and what can I do to fix it?
Here is my code:
# First Exact solution
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.special import factorial
# Produce N data points from known gaussian distribution
real_mu = 2
real_tau = 1
x = np.arange(-3,7,0.1)
N = len(x)
nrv = stats.norm.pdf(x, real_mu, real_tau)
## Approximate posterior distribution over mean and covariance ##
x = nrv # N data points from "unknown distribution"
N = len(x) #
# The Algorithm
#hyperparameters - can be set arbitrarily but smaller positive values indicates larger prior distributions over mu and tau
lambda_0 = 0.05
mu_0 = 0.05
a_0 = 0.05
b_0 = 0.05
##
x_sum = np.sum(x)
x_ave = np.sum(x)/N
x_squared = np.dot(x, x)
mu_n = (lambda_0*mu_0 + N*x_ave)/(lambda_0 + N)
a_N = a_0 + (N + 1)/2
lambda_N = 1 # Initialize lambda to some arbitrary value
difference = 9999999
while difference > 0.0000001:
b_N = b_0 + 0.5*((lambda_0 + N)*((1/lambda_N) + mu_n**2) - 2*(lambda_0*mu_0 + x_sum)*mu_n + (x_squared) + lambda_0*mu_0*mu_0)
new_lambda_N = (lambda_0 + N)*a_N/b_N
difference_1 = new_lambda_N - lambda_N
lambda_N = new_lambda_N
difference = np.absolute(difference_1)
# Calulate the approximated posterior from these parameters: q(mu, tau) = q(mu)q(tau)
t = np.arange(-2,2,0.01)
#qmu = stats.norm.pdf(t, mu_n, 1/lambda_N)
#qtau = gamma.pdf(t, a_N, loc=0, scale=b_N) #scale=1/b_N)
def gaussian(x):
return (1/(np.sqrt(2*np.pi*sigma*sigma)))*np.exp(-(x-mu_n)*(x-mu_n)/(2*sigma*sigma))
def gamma(x):
return ((b_N**a_N)*(x**(a_N-1))*np.exp(-x*b_N))/(factorial(a_N-1))
sigma = 1/lambda_N
xx, yy = np.meshgrid(t, t)
# First part in zz is from Gaussian distribution over mu and second a gamma distribution over tau
# Same as the two defined functions above
zz = ((1/(np.sqrt(2*np.pi*sigma*sigma)))*np.exp(-(xx-mu_n)*(xx-mu_n)/(2*sigma*sigma)))*((b_N**a_N)*(yy**(a_N-1))*np.exp(-yy*b_N))/(factorial(a_N-1))
plt.xlabel("mu")
plt.ylabel("tau")
plt.contourf(t,t,zz)
plt.show()
Related
I read the official doc for creating custom metric. It says:
Note that sample weighting is automatically supported for any such metric.
I wonder how sample weighting is supported for complicated metric. For example, a metric to compute weighted correlation between y_true and y_pred in Keras. Code below:
def customized_correlation(y_true, y_pred, sample_weights):
x = y_true
y = y_pred
mx = K.mean(x)
my = K.mean(y)
xm, ym = x - mx, y - my
r_num = K.sum(xm * ym * sample_weights)
r_den = K.sqrt(K.sum(K.square(xm) * sample_weights) * K.sum(K.square(ym) * sample_weights))
r = r_num / r_den
return r
If we remove the sample_weights variable in code, how does Keras know where sample_weights should be inserted to calculate the weighted correlation?
It does not, and it will not work. Using sample_weights simply means the resulting metric vector will be multiplied (element-wise) by weight vector at the very end
I am trying to develop a deep Markov Model using following tutorial:
https://pyro.ai/examples/dmm.html
This model parameterises transitions and emissions with using a neural network and for variational inference part they use RNN to map observable 'x' to latent space. And in order to ensure that their model is learning something they try to maximise ELBO or minimise negative ELBO. They refer to negative ELBO as NLL.
I am basically converting pyro code to pure pytorch. And I've put together pretty much every thing. However, now I want to get a reconstruct error on test sequences. My input is one hot encoding of my sequences and it is of shape (500,1900,4) where 4 refers to number of features and 1900 is sequence length, and 500 refers to total number of examples.
The way I am generating data is:
generation model p(x_{1:T} | z_{1:T}) p(z_{1:T})
batch_size, _, x_dim = x.size() # number of time steps we need to process in the mini-batch
T_max = x_lens.max()
z_prev = self.z_0.expand(batch_size, self.z_0.size(0)) # set z_prev=z_0 to setup the recursive conditioning in p(z_t|z_{t-1})
for t in range(1, T_max + 1):
# sample z_t ~ p(z_t | z_{t-1}) one time step at a time
z_t, z_mu, z_logvar = self.trans(z_prev) # p(z_t | z_{t-1})
p_x_t = F.softmax(self.emitter(z_t),dim=-1) # compute the probabilities that parameterize the bernoulli likelihood
safe_tensor = torch.where(torch.isnan(p_x_t), torch.zeros_like(p_x_t), p_x_t)
print('generate p_x_t : ',safe_tensor)
x_t = torch.bernoulli(safe_tensor) #sample observe x_t according to the bernoulli distribution p(x_t|z_t)
print('generate x_t : ',x_t)
z_prev = z_t
So my emissions are being defined by Bernoulli distribution. And I use softmax in order to compute probabilities that parameterise the Bernoulli likelihood. And then I sample my observables x_t according the Bernoulli distribution.
At first when I ran my model, I was getting nans at times and so I introduced the line (show below) in order to convert nans to zero:
safe_tensor = torch.where(torch.isnan(p_x_t), torch.zeros_like(p_x_t), p_x_t)
However, after 1 epoch or so 'x_t' that I sample, this tensor is just zero. Basically, what I want is that after applying softmax, I want my function to pick the highest probability and give me the corresponding label for it which is either of the 4 features. But I get nans and then when I convert all nans to zero, I end up getting all zeros in all tensors after 1 epoch or so.
Also, when I look at p_x_t tensor I get probabilities that add up to one. But when I look at x_t tensor it gives me 0's all the way. So for example:
p_x_t : tensor([[0.2168, 0.2309, 0.2555, 0.2967]
.....], device='cuda:0',
grad_fn=<SWhereBackward>)
generate x_t : tensor([[0., 0., 0., 0.],...
], device='cuda:0', grad_fn=<BernoulliBackward0>)
Here fourth label/feature is giving me highest probability. Shouldn't the x_t tensor give me 1 at least in that position so be like:
generate x_t : tensor([[0., 0., 0., 1.],...
], device='cuda:0', grad_fn=<BernoulliBackward0>)
How can I get rid of this problems?
EDIT
My transition (which is called as self.trans in generate function mentioned above):
class GatedTransition(nn.Module):
"""
Parameterizes the gaussian latent transition probability `p(z_t | z_{t-1})`
See section 5 in the reference for comparison.
"""
def __init__(self, z_dim, trans_dim):
super(GatedTransition, self).__init__()
self.gate = nn.Sequential(
nn.Linear(z_dim, trans_dim),
nn.ReLU(),
nn.Linear(trans_dim, z_dim),
nn.Softmax(dim=-1)
)
self.proposed_mean = nn.Sequential(
nn.Linear(z_dim, trans_dim),
nn.ReLU(),
nn.Linear(trans_dim, z_dim)
)
self.z_to_mu = nn.Linear(z_dim, z_dim)
# modify the default initialization of z_to_mu so that it starts out as the identity function
self.z_to_mu.weight.data = torch.eye(z_dim)
self.z_to_mu.bias.data = torch.zeros(z_dim)
self.z_to_logvar = nn.Linear(z_dim, z_dim)
self.relu = nn.ReLU()
def forward(self, z_t_1):
"""
Given the latent `z_{t-1}` corresponding to the time step t-1
we return the mean and scale vectors that parameterize the (diagonal) gaussian distribution `p(z_t | z_{t-1})`
"""
gate = self.gate(z_t_1) # compute the gating function
proposed_mean = self.proposed_mean(z_t_1) # compute the 'proposed mean'
mu = (1 - gate) * self.z_to_mu(z_t_1) + gate * proposed_mean # compute the scale used to sample z_t, using the proposed mean from
logvar = self.z_to_logvar(self.relu(proposed_mean))
epsilon = torch.randn(z_t_1.size(), device=z_t_1.device) # sampling z by re-parameterization
z_t = mu + epsilon * torch.exp(0.5 * logvar) # [batch_sz x z_sz]
if torch.isinf(z_t).any().item():
print('something is infinity')
print('z_t : ',z_t)
print('logvar : ',logvar)
print('epsilon : ',epsilon)
print('mu : ',mu)
return z_t, mu, logvar
I do not get any inf in z_t tensor when doing training and validation. Only during testing. This is how I am training, validating and testing my model:
for epoch in range(config['epochs']):
train_loader=torch.utils.data.DataLoader(dataset=train_set, batch_size=config['batch_size'], shuffle=True, num_workers=1)
train_data_iter=iter(train_loader)
n_iters=train_data_iter.__len__()
epoch_nll = 0.0 # accumulator for our estimate of the negative log likelihood (or rather -elbo) for this epoch
i_batch=1
n_slices=0
loss_records={}
while True:
try: x, x_rev, x_lens = train_data_iter.next()
except StopIteration: break # end of epoch
x, x_rev, x_lens = gVar(x), gVar(x_rev), gVar(x_lens)
if config['anneal_epochs'] > 0 and epoch < config['anneal_epochs']: # compute the KL annealing factor
min_af = config['min_anneal']
kl_anneal = min_af+(1.0-min_af)*(float(i_batch+epoch*n_iters+1)/float(config['anneal_epochs']*n_iters))
else:
kl_anneal = 1.0 # by default the KL annealing factor is unity
loss_AE = model.train_AE(x, x_rev, x_lens, kl_anneal)
epoch_nll += loss_AE['train_loss_AE']
i_batch=i_batch+1
n_slices=n_slices+x_lens.sum().item()
loss_records.update(loss_AE)
loss_records.update({'epo_nll':epoch_nll/n_slices})
times.append(time.time())
epoch_time = times[-1] - times[-2]
log("[Epoch %04d]\t\t(dt = %.3f sec)"%(epoch, epoch_time))
log(loss_records)
if args.visual:
for k, v in loss_records.items():
tb_writer.add_scalar(k, v, epoch)
# do evaluation on test and validation data and report results
if (epoch+1) % args.test_freq == 0:
save_model(model, epoch)
#test_loader=torch.utils.data.DataLoader(dataset=test_set, batch_size=config['batch_size'], shuffle=False, num_workers=1)
valid_loader=torch.utils.data.DataLoader(dataset=valid_set, batch_size=config['batch_size'], shuffle=False, num_workers=1)
for x, x_rev, x_lens in valid_loader:
x, x_rev, x_lens = gVar(x), gVar(x_rev), gVar(x_lens)
loss,kl_loss = model.valid(x, x_rev, x_lens)
#print('x_lens sum : ',x_lens.sum().item())
#print('loss : ',loss)
valid_nll = loss/x_lens.sum().item()
log("[Epoch %04d]\t\t(valid_loss = %.8f)\t\t(kl_loss = %.8f)\t\t(valid_nll = %.8f)"%(epoch, loss, kl_loss, valid_nll ))
test_loader=torch.utils.data.DataLoader(dataset=test_set, batch_size=config['batch_size'], shuffle=False, num_workers=1)
for x, x_rev, x_lens in test_loader:
x, x_rev, x_lens = gVar(x), gVar(x_rev), gVar(x_lens)
loss,kl_loss = model.valid(x, x_rev, x_lens)
test_nll = loss/x_lens.sum().item()
model.generate(x, x_rev, x_lens)
log("[test_nll epoch %08d] %.8f" % (epoch, test_nll))
The test is outside the for loop for epoch. Because I want to test my model on test sequences using generate function. Can you now help me understand if there's something wrong I am doing?
I just too an ML course and am trying to get better at tensorflow. To that end, I purchased the book by Nishant Shukhla (ML with tensorflow) and am trying to run the 2 feature example with a different data set.
With the fake dataset in the book, my code runs fine. However, with data I used in the ML course, the code refuses to converge. With a really small learning rate it does converge, but the learned weights are wrong.
Also attaching the plot of the feature data. It should not a feature scaling issue as values on both features vary between 30-100 units.
I am really struggling with how opaque tensorflow is- any help would be appreciated:
""" Solution for simple logistic regression model
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import numpy as np
import tensorflow as tf
import time
import matplotlib.pyplot as plt
# Define paramaters for the model
learning_rate = 0.0001
training_epochs = 300
data = np.loadtxt('ex2data1.txt', delimiter=',')
x1s = np.array(data[:,0]).astype(np.float32)
x2s = np.array(data[:,1]).astype(np.float32)
ys = np.array(data[:,2]).astype(np.float32)
print('Plotting data with + indicating (y = 1) examples and o \n indicating (y = 0) examples.\n')
color = ['red' if l == 0 else 'blue' for l in ys]
myplot = plt.scatter(x1s, x2s, color = color)
# Put some labels
plt.xlabel("Exam 1 score")
plt.ylabel("Exam 2 score")
# Specified in plot order
plt.show()
# Step 2: Create datasets
X1 = tf.placeholder(tf.float32, shape=(None,), name="x1")
X2 = tf.placeholder(tf.float32, shape=(None,), name="x2")
Y = tf.placeholder(tf.float32, shape=(None,), name="y")
w = tf.Variable(np.random.rand(3,1), name='w', dtype='float32',trainable=True)
y_model = tf.sigmoid(w[2]*X2 + w[1]*X1 + w[0])
cost = tf.reduce_mean(-tf.log(y_model*Y + (1-y_model)*(1-Y)))
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
writer = tf.summary.FileWriter('./graphs/logreg', tf.get_default_graph())
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
prev_error = 0.0;
for epoch in range(training_epochs):
error, loss = sess.run([cost, train_op], feed_dict={X1:x1s, X2:x2s, Y:ys})
print("epoch = ", epoch, "loss = ", loss)
if abs(prev_error - error) < 0.0001:
break
prev_error = error
w_val = sess.run(w, {X1:x1s, X2:x2s, Y:ys})
print("w learned = ", w_val)
writer.close()
sess.close()
Both X1 and X2 range from ~20-100. However, once I scaled them, the solution converged just fine.
I am building a neural network to learn to recognize handwritten digits from MNIST. I have confirmed that backpropagation calculates the gradients perfectly (gradient checking gives error < 10 ^ -10).
It appears that no matter how I train the weights, the cost function always tends towards around 3.24-3.25 (never below that, just approaching from above) and the training/test set accuracy is very low (around 11% for the test set). It appears that the h values in the end are all very close to 0.1 and to each other.
I cannot find why my program cannot produce better results. I was wondering if anyone could maybe take a look at my code and please tell me any reasons for this occurring. Thank you so much for all your help, I really appreciate it!
Here is my Python code:
import numpy as np
import math
from tensorflow.examples.tutorials.mnist import input_data
# Neural network has four layers
# The input layer has 784 nodes
# The two hidden layers each have 5 nodes
# The output layer has 10 nodes
num_layer = 4
num_node = [784,5,5,10]
num_output_node = 10
# 30000 training sets are used
# 10000 test sets are used
# Can be adjusted
Ntrain = 30000
Ntest = 10000
# Sigmoid Function
def g(X):
return 1/(1 + np.exp(-X))
# Forwardpropagation
def h(W,X):
a = X
for l in range(num_layer - 1):
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
return a
# Cost Function
def J(y, W, X, Lambda):
cost = 0
for i in range(Ntrain):
H = h(W,X[i])
for k in range(num_output_node):
cost = cost + y[i][k] * math.log(H[k]) + (1-y[i][k]) * math.log(1-H[k])
regularization = 0
for l in range(num_layer - 1):
for i in range(num_node[l]):
for j in range(num_node[l+1]):
regularization = regularization + W[l][i+1][j] ** 2
return (-1/Ntrain * cost + Lambda / (2*Ntrain) * regularization)
# Backpropagation - confirmed to be correct
# Algorithm based on https://www.coursera.org/learn/machine-learning/lecture/1z9WW/backpropagation-algorithm
# Returns D, the value of the gradient
def BackPropagation(y, W, X, Lambda):
delta = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
delta[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for i in range(Ntrain):
A = np.empty(num_layer-1, dtype = object)
a = X[i]
for l in range(num_layer - 1):
A[l] = a
a = np.insert(a,0,1)
z = np.dot(a,W[l])
a = g(z)
diff = a - y[i]
delta[num_layer-2] = delta[num_layer-2] + np.outer(np.insert(A[num_layer-2],0,1),diff)
for l in range(num_layer-2):
index = num_layer-2-l
diff = np.multiply(np.dot(np.array([W[index][k+1] for k in range(num_node[index])]), diff), np.multiply(A[index], 1-A[index]))
delta[index-1] = delta[index-1] + np.outer(np.insert(A[index-1],0,1),diff)
D = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
D[l] = np.zeros((num_node[l]+1,num_node[l+1]))
for l in range(num_layer-1):
for i in range(num_node[l]+1):
if i == 0:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * delta[l][i][j]
else:
for j in range(num_node[l+1]):
D[l][i][j] = 1/Ntrain * (delta[l][i][j] + Lambda * W[l][i][j])
return D
# Neural network - this is where the learning/adjusting of weights occur
# W is the weights
# learn is the learning rate
# iterations is the number of iterations we pass over the training set
# Lambda is the regularization parameter
def NeuralNetwork(y, X, learn, iterations, Lambda):
W = np.empty(num_layer-1, dtype = object)
for l in range(num_layer - 1):
W[l] = np.random.rand(num_node[l]+1,num_node[l+1])/100
for k in range(iterations):
print(J(y, W, X, Lambda))
D = BackPropagation(y, W, X, Lambda)
for l in range(num_layer-1):
W[l] = W[l] - learn * D[l]
print(J(y, W, X, Lambda))
return W
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
# Training data, read from MNIST
inputpix = []
output = []
for i in range(Ntrain):
inputpix.append(2 * np.array(mnist.train.images[i]) - 1)
output.append(np.array(mnist.train.labels[i]))
np.savetxt('input.txt', inputpix, delimiter=' ')
np.savetxt('output.txt', output, delimiter=' ')
# Train the weights
finalweights = NeuralNetwork(output, inputpix, 2, 5, 1)
# Test data
inputtestpix = []
outputtest = []
for i in range(Ntest):
inputtestpix.append(2 * np.array(mnist.test.images[i]) - 1)
outputtest.append(np.array(mnist.test.labels[i]))
np.savetxt('inputtest.txt', inputtestpix, delimiter=' ')
np.savetxt('outputtest.txt', outputtest, delimiter=' ')
# Determine the accuracy of the training data
count = 0
for i in range(Ntrain):
H = h(finalweights,inputpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and output[i][j] == 1:
count = count + 1
print(count/Ntrain)
# Determine the accuracy of the test data
count = 0
for i in range(Ntest):
H = h(finalweights,inputtestpix[i])
print(H)
for j in range(num_output_node):
if H[j] == np.amax(H) and outputtest[i][j] == 1:
count = count + 1
print(count/Ntest)
Your network is tiny, 5 neurons make it basically a linear model. Increase it to 256 per layer.
Notice, that trivial linear model has 768 * 10 + 10 (biases) parameters, adding up to 7690 floats. Your neural network on the other hand has 768 * 5 + 5 + 5 * 5 + 5 + 5 * 10 + 10 = 3845 + 30 + 60 = 3935. In other words despite being nonlinear neural network, it is actualy a simpler model than a trivial logistic regression applied to this problem. And logistic regression obtains around 11% error on its own, thus you cannot really expect to beat it. Of course this is not a strict argument, but should give you some intuition for why it should not work.
Second issue is related to other hyperparameters, you seem to be using:
huge learning rate (is it 2?) it should be more of order 0.0001
very little training iterations (are you just executing 5 epochs?)
your regularization parameter is huge (it is set to 1), so your network is heavily penalised for learning anything, again - change it to something order of magnitude smaller
The NN architecture is most likely under-fitting. Maybe, the learning rate is high/low. Or there are most issues with the regularization parameter.
I'm trying to learn theano and decided to implement linear regression (using their Logistic Regression from the tutorial as a template). I'm getting a wierd thing where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
(THIS DOESN'T WORK, RESULTS IN A W VECTOR FULL OF NANs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
(THIS DOES WORK, PERFECTLY):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
pred, err = train(D[0], D[1])
The only difference is in the cost = single_error.sum() vs single_error.mean(). What I don't understand is that the gradient should be the exact same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way to big. Using mean make it divided by the batch size, so this help. But I'm pretty sure you should make it much smaller. Not just dividing by the batch size (which is equivalent to using mean).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.