How can I adapt the multivariate normal for batch operations?

I am implementing the multivariate normal probability density function from scratch in Python. The formula for it is as follows:

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^k \det\boldsymbol{\Sigma}}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

I was able to code the version below, where $\mathbf{x}$ is an input vector (a single sample). However, I would like to use NumPy's matrix operations to extend it to a set of samples $\mathbf{X}$ and return all the samples' probabilities at once, the way SciPy's implementation does.
This is the code I have made:
import numpy as np

def multivariate_normal(X, center, cov):
    k = X.shape[0]  # dimensionality of the sample
    det_cov = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    o = 1 / np.sqrt((2 * np.pi) ** k * det_cov)  # normalization constant
    p = np.exp(-.5 * np.dot(np.dot((X - center).T, inv_cov), (X - center)))
    return o * p
Thanks in advance.
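A batched version can be sketched as follows (a sketch, not a drop-in replacement for SciPy): it assumes X now holds one sample per row, shape (n_samples, k), and uses np.einsum to evaluate the per-sample quadratic form in one call:

import numpy as np

def multivariate_normal_batch(X, center, cov):
    # X: (n_samples, k), one sample per row (assumed layout)
    k = X.shape[1]
    det_cov = np.linalg.det(cov)
    inv_cov = np.linalg.inv(cov)
    o = 1 / np.sqrt((2 * np.pi) ** k * det_cov)
    diff = X - center  # broadcasts center over the rows
    # per-sample Mahalanobis term diff_n^T @ inv_cov @ diff_n, all at once
    maha = np.einsum('ni,ij,nj->n', diff, inv_cov, diff)
    return o * np.exp(-.5 * maha)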

Related

Octave fminunc doesn't converge

I'm trying to use fminunc in Octave for a logistic regression problem, but it doesn't work. It says that I didn't define variables, but actually I did. If I define the variables directly in the cost function, and not in the main script, it doesn't complain, but the function still doesn't really work: the exitFlag is equal to -3 and it doesn't converge at all.
Here's my function:
function [jVal, gradient] = cost(theta, X, y)
  X = [1,0.14,0.09,0.58,0.39,0,0.55,0.23,0.64;1,-0.57,-0.54,-0.16,0.21,0,-0.11,-0.61,-0.35;1,0.42,0.45,-0.41,-0.6,0,-0.44,0.38,-0.29];
  y = [1;0;1];
  theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
  jVal = 0;
  jVal = costFunction2(X, y, theta); % this is another function that gives me jVal.
                                     % I'm quite sure it is correct because I use it
                                     % also with other algorithms and it works perfectly
  m = length(y);
  xSize = size(X, 2);
  gradient = zeros(xSize, 1);
  sig = X * theta;
  h = 1 ./ (1 + exp(-sig));
  for i = 1:m
    for j = 1:xSize
      gradient(j) = (1/m) * sum(h(i) - y(i)) .* X(i, j);
    end
  end
Here's my main:
theta = [0.8;0.2;0.6;0.3;0.4;0.5;0.6;0.2;0.4];
options = optimset('GradObj', 'on', 'MaxIter', 100);
[optTheta, functionVal, exitFlag] = fminunc(@cost, theta, options)
When I run it, I get:
optTheta =
0.80000
0.20000
0.60000
0.30000
0.40000
0.50000
0.60000
0.20000
0.40000
functionVal = 0.15967
exitFlag = -3
How can I resolve this problem?
You are not in fact using fminunc correctly. From the documentation:
-- fminunc (FCN, X0)
-- fminunc (FCN, X0, OPTIONS)
FCN should accept a vector (array) defining the unknown variables,
and return the objective function value, optionally with gradient.
'fminunc' attempts to determine a vector X such that 'FCN (X)' is a
local minimum.
What you are passing is not a handle to a function that accepts a single vector argument. Instead, what you are passing (i.e. @cost) is a handle to a function that takes three arguments.
You need to 'convert' this into a handle to a function that takes only one input, and does what you want under the hood. The easiest way to do this is by 'wrapping' your cost function into an anonymous function that only takes one argument, and calls the cost function in the appropriate way, e.g.
fminunc(@(t) cost(t, X, y), theta, options)
Note: This assumes X and y are defined in the scope where you do this 'wrapping' business

Need a vectorized solution in pytorch

I'm doing an experiment using face images in PyTorch framework. The input x is the given face image of size 5 * 5 (height * width) and there are 192 channels.
Objective: To obtain patches of x of patch_size(given as argument).
I have obtained the required result with the help of two for loops, but I want a better, vectorized solution so that the computational cost is much lower than with the two for loops.
Used: PyTorch 0.4.1, (12 GB) Nvidia TitanX GPU.
The following is my implementation using two for loops
def extractpatches(x, patch_size):  # x is bs x 192 x 5 x 5
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size()  # bs, 192, ...
    cnt = 0
    p = torch.empty((bs, pi*pj, c, patch_size, patch_size)).to(device)
    s = torch.empty((bs, pi*pj, c*patch_size*patch_size)).to(device)
    # Want a vectorized method instead of the two for loops below
    for i in range(pi):
        for j in range(pj):
            p[:, cnt, :, :, :] = patches[:, :, i, j, :, :]
            s[:, cnt, :] = p[:, cnt, :, :, :].view(-1, c*patch_size*patch_size)
            cnt = cnt + 1
    return s
Thanks for your help in advance.
I think you can try the following; I used some parts of your code for my experiment and it worked for me. Here l and f are the lists of tensor patches:
l = [patches[:,:,int(i/pi),i%pi,:,:] for i in range(pi * pi)]
f = [l[i].contiguous().view(-1,c*patch_size*patch_size) for i in range(pi * pi)]
You can verify the above code using toy input values.
Thanks.
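The l and f above still build Python lists. For a fully loop-free version, here is a minimal sketch using permute and a single reshape, under the same shape assumptions as the question (x being bs x 192 x 5 x 5):

import torch

def extractpatches_vectorized(x, patch_size):  # x is bs x c x h x w
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, ps, _ = patches.size()
    # bring the patch-grid dims next to the batch dim, then flatten them
    p = patches.permute(0, 2, 3, 1, 4, 5).contiguous()  # bs x pi x pj x c x ps x ps
    return p.view(bs, pi * pj, c * ps * ps)

This produces the same layout as the original loop (cnt = i*pj + j) without any per-patch copies in Python.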

How to do 2-layer nested FOR loop in PYTORCH?

I am learning to implement the Factorization Machine in Pytorch.
And there should be some feature crossing operations.
For example, I've got three features [A,B,C]; after embedding they are [vA,vB,vC], so the feature crossings are [vA·vB], [vA·vC], [vB·vC].
I know this operation can be simplified with matrix operations, via the standard factorization machine identity

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\right]$$

But this only gives a final result, say, a single value.
The question is: how can I get all cross_vec values in the following without a FOR loop?
Note: the size of feature_emb is [batch_size x feature_len x embedding_size].
g_feature = 0
for i in range(self.feature_len):
    for j in range(self.feature_len):
        if j <= i: continue
        cross_vec = feature_emb[:, i, :] * feature_emb[:, j, :]
        g_feature += torch.sum(cross_vec, dim=1)
You can compute all pairwise dot products at once with broadcasting:
cross_vec = (feature_emb[:, None, ...] * feature_emb[..., None, :]).sum(dim=-1)
This should give you cross_vec of shape (batch_size, feature_len, feature_len).
Alternatively, you can use torch.bmm
cross_vec = torch.bmm(feature_emb, feature_emb.transpose(1, 2))
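If what you ultimately need is the summed term g_feature rather than the full matrix, here is a minimal sketch of the factorization machine identity from the question, assuming feature_emb has shape [batch_size x feature_len x embedding_size]:

# sum of all pairwise interactions via 0.5 * ((sum v)^2 - sum v^2)
sum_sq = feature_emb.sum(dim=1) ** 2            # shape: (batch_size, embedding_size)
sq_sum = (feature_emb ** 2).sum(dim=1)          # shape: (batch_size, embedding_size)
g_feature = 0.5 * (sum_sq - sq_sum).sum(dim=1)  # shape: (batch_size,)

This matches the double loop's result without materializing the feature_len x feature_len matrix.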

Understanding code wrt Logistic Regression using gradient descent

I was following Siraj Raval's videos on logistic regression using gradient descent :
1) Link to longer video :
https://www.youtube.com/watch?v=XdM6ER7zTLk&t=2686s
2) Link to shorter video :
https://www.youtube.com/watch?v=xRJCOz3AfYY&list=PL2-dafEMk2A7mu0bSksCGMJEmeddU_H4D
In the videos he talks about using gradient descent to reduce the error for a set number of iterations so that the function converges (the slope becomes zero).
He also illustrates the process via code. The following are the two main functions from the code :
from numpy import array

def step_gradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        x = points[i, 0]
        y = points[i, 1]
        b_gradient += -(2/N) * (y - ((m_current * x) + b_current))
        m_gradient += -(2/N) * x * (y - ((m_current * x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

def gradient_descent_runner(points, starting_b, starting_m, learning_rate, num_iterations):
    b = starting_b
    m = starting_m
    for i in range(num_iterations):
        b, m = step_gradient(b, m, array(points), learning_rate)
    return [b, m]

# The above functions are called below:
learning_rate = 0.0001
initial_b = 0  # initial y-intercept guess
initial_m = 0  # initial slope guess
num_iterations = 1000
[b, m] = gradient_descent_runner(points, initial_b, initial_m, learning_rate, num_iterations)
# code taken from Siraj Raval's github page
Why do the values of b and m continue to update for all the iterations? After a certain number of iterations the function will have converged, i.e. we will have found the values of b and m that give slope = 0.
So why do we continue iterating after that point and keep updating b and m?
Aren't we losing the 'correct' b and m values this way? How does the learning rate help the convergence process if we keep updating values after converging? And why is there no check for convergence, so how is this actually working?
In practice, you will most likely not reach a slope of exactly 0. Think of your loss function as a bowl. If your learning rate is too high, it is possible to overshoot the lowest point of the bowl. Conversely, if the learning rate is too low, learning will be too slow and won't reach the lowest point of the bowl before all iterations are done.
That's why in machine learning, the learning rate is an important hyperparameter to tune.
Actually, once we reach a slope of 0, b_gradient and m_gradient become 0; thus, in:
new_b = b_current - (learningRate * b_gradient)
new_m = m_current - (learningRate * m_gradient)
new_b and new_m will keep the old, correct values, as nothing is subtracted from them.
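A toy check of that point (hypothetical numbers, just to make the update rule concrete):

# once the gradients are (numerically) zero, the update is a no-op
b, m, learning_rate = 2.0, 3.0, 0.0001
b_gradient, m_gradient = 0.0, 0.0        # slope has reached zero
new_b = b - learning_rate * b_gradient   # == b
new_m = m - learning_rate * m_gradient   # == m
assert (new_b, new_m) == (b, m)          # the converged values are not lost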

Theano gradient doesn't work with .sum(), only .mean()?

I'm trying to learn Theano and decided to implement linear regression (using their logistic regression example from the tutorial as a template). I'm getting a weird thing where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
(THIS DOESN'T WORK, RESULTS IN A W VECTOR FULL OF NANs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
(THIS DOES WORK, PERFECTLY):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
The only difference is cost = single_error.sum() vs. single_error.mean(). What I don't understand is why this matters: the gradient should be essentially the same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way too big. Using mean effectively divides it by the batch size, which helps. But I'm pretty sure you should make it much smaller still, not just divide by the batch size (which is equivalent to using mean).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.
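A minimal sketch of that adjustment applied to the .sum() version above (assuming D[0] holds one training example per row):

# with cost = single_error.sum(), the gradient is N times larger than with
# .mean(), so scaling the step by 1/N makes the two updates equivalent
N = D[0].shape[0]  # number of training examples
train = theano.function(inputs=[x, y], outputs=[h, single_error],
                        updates=((w, w - (.1 / N) * gw), (b, b - (.1 / N) * gb)))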
