What is the correct implementation of the dice coefficient?
This code
def dice_coef1(y_true, y_pred, smooth=1):
intersection = K.sum(y_true * y_pred, axis=[1,2,3])
union = K.sum(y_true, axis=[1,2,3]) + K.sum(y_pred, axis=[1,2,3])
dice = K.mean((2. * intersection + smooth)/(union + smooth), axis=0)
return dice
gives me 0.85, while this code
def dice_coef2(target, prediction, smooth=1):
numerator = 2.0 * K.sum(target * prediction) + smooth
denominator = K.sum(target) + K.sum(prediction) + smooth
coef = numerator / denominator
return coef
gives me 0.94.
I isolated the functions and for me they produce the same results. This means the overall mathematics are correct. So you might want to look into your data or the selection of your axis in dice_coef1. Without your dataset I cant reproduce your problem.
Related
The classical implementation of SVMs usually use SMO algorithm to train SVMs, which is a bit clumsy and lengthy. Occasionally I saw this line of code :
for ii in range(outer_loops * M):
i = int(np.random.rand() * M)
margin = Y[i] * np.dot(K[i, :], alpha)
grad = M * L * K[:, i] * alpha[i]
if (margin < 1):
grad -= Y[i] * K[:, i]
alpha -= grad / np.sqrt(ii + 1)
alpha_avg += alpha
to train the parameters.
It confused me for hours, I have no idea where the gradient pops out as well as the weird if block to "update" the gradient when the margin falls below 1. But the results turns out fairly good.
So I really want to know why this code is justified, why it's OK to use gradient descent rather than convex optimization techniques.
I was trying to implement Linear Regression in Octave 5.1.0 on a data set relating the GRE score to the probability of Admission.
The data set is of the sort,
337 0.92
324 0.76
316 0.72
322 0.8
. . .
My main Program.m file looks like,
% read the data
data = load('Admission_Predict.txt');
% initiate variables
x = data(:,1);
y = data(:,2);
m = length(y);
theta = zeros(2,1);
alpha = 0.01;
iters = 1500;
J_hist = zeros(iters,1);
% plot data
subplot(1,2,1);
plot(x,y,'rx','MarkerSize', 10);
title('training data');
% compute cost function
x = [ones(m,1), (data(:,1) ./ 300)]; % feature scaling
J = computeCost(x,y,theta);
% run gradient descent
[theta, J_hist] = gradientDescent(x,y,theta,alpha,iters);
hold on;
subplot(1,2,1);
plot((x(:,2) .* 300), (x*theta),'-');
xlabel('GRE score');
ylabel('Probability');
hold off;
subplot (1,2,2);
plot(1:iters, J_hist, '-b');
xlabel('no: of iteration');
ylabel('Cost function');
computeCost.m looks like,
function J = computeCost(x,y,theta)
m = length(y);
h = x * theta;
J = (1/(2*m))*sum((h-y) .^ 2);
endfunction
and gradientDescent.m looks like,
function [theta, J_hist] = gradientDescent(x,y,theta,alpha,iters)
m = length(y);
J_hist = zeros(iters,1);
for i=1:iters
diff = (x*theta - y);
theta = theta - (alpha * (1/(m))) * (x' * diff);
J_hist(i) = computeCost(x,y,theta);
endfor
endfunction
The graphs plotted then looks like this,
which you can see, doesn't feel right even though my Cost function seems to be minimized.
Can someone please tell me if this is right? If not, what am I doing wrong?
The easiest way to check whether your implementation is correct is to compare with a validated implementation of linear regression. I suggest using an alternative implementation approach like the one suggested here, and then comparing your results. If the fits match, then this is the best linear fit to your data and if they don't match, then there may be something wrong in your implementation.
I'm doing Andrew Ng's course on Machine Learning and I'm trying to wrap my head around the vectorised implementation of gradient descent for multiple variables which is an optional exercise in the course.
This is the algorithm in question (taken from here):
I just cannot do this in octave using sum though, I'm not sure how to multiply the sum of the hypothesis of x(i) - y(i) by the all variables xj(i). I tried different iterations of the following code but to no avail (either the dimensions are not right or the answer is wrong):
theta = theta - alpha/m * sum(X * theta - y) * X;
The correct answer, however, is entirely non-obvious (to a linear algebra beginner like me anyway, from here):
theta = theta - (alpha/m * (X * theta-y)' * X)';
Is there a rule of thumb for cases where sum is involved that governs transformations like the above?
And if so, is there the opposite version of the above (i.e. going from a sum based solution to a general multiplication one) as I was able to come up with a correct implementation using sum for gradient descent for a single variable (albeit not a very elegant one):
temp0 = theta(1) - (alpha/m * sum(X * theta - y));
temp1 = theta(2) - (alpha/m * sum((X * theta - y)' * X(:, 2)));
theta(1) = temp0;
theta(2) = temp1;
Please note that this only concerns vectorised implementations and although there are several questions on SO as to how this is done, my question is primarily concerned with the implementation of the algorithm in Octave using sum.
The general "rule of the thumb" is as follows, if you encounter something in the form of
SUM_i f(x_i, y_i, ...) g(a_i, b_i, ...)
then you can easily vectorize it (and this is what is done in the above) through
f(x, y, ...)' * g(a, b, ...)
As this is just a typical dot product, which in mathematics (in Euclidean space of finite dimension) looks like
<A, B> = SUM_i A_i B_i = A'B
thus
(X * theta-y)' * X)
is just
<X * theta-y), X> = <H_theta(X) - y, X> = SUM_i (H_theta(X_i) - y_i) X_i
as you can see this works both ways, as this is just a mathematical definition of dot product.
Referring to this part of your question specifically - "I'm not sure how to multiply the sum of the hypothesis of x(i) - y(i) by the all variables xj(i)."
In Octave you can multiply xj(i) to all the predictions using ".", so it can be written as:
m = size(X, 1);
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1 / (2*m) * sum(sqrErrors);
The vector multiplication automatically includes calculating the sum of the products. So you don't have to specify the sum() function. By using the sum() function, you are converting a vector into a scalar and that's bad.
You actually don't want to use summation here, because what you try to calculate are the single values for all thetas, and not the overall cost J. As you do this in one line of code, if you sum it up you end up with a single value (the sum of all thetas).
Summation was correct, though unnecessary, when you computed the values of theta one by one in the previous exercise. This works just the same:
temp0 = theta(1) - (alpha/m * (X * theta - y)' * X(:, 1));
temp1 = theta(2) - (alpha/m * (X * theta - y)' * X(:, 2));
theta(1) = temp0;
theta(2) = temp1;
I am trying to move on from simple linear single-variable gradient descent into something more advanced: best polynomial fit for a set of points. I created a simple octave test script which allows me to visually set the points in a 2D space, then start the gradient dsecent algorithm and see how it gradually approaches the best fit.
Unfortunately, it doesn't work as good as it did with the simple single-variable linear regression: the results I get ( when I get them ) are inconsistent with the polynome I expect!
Here is the code:
dim=5;
h = figure();
axis([-dim dim -dim dim]);
hold on
index = 1;
data = zeros(1,2);
while(1)
[x,y,b] = ginput(1);
if( length(b) == 0 )
break;
endif
plot(x, y, "b+");
data(index, :) = [x y];
index++;
endwhile
y = data(:, 2);
m = length(y);
X = data(:, 1);
X = [ones(m, 1), data(:,1), data(:,1).^2, data(:,1).^3 ];
theta = zeros(4, 1);
iterations = 100;
alpha = 0.001;
J = zeros(1,iterations);
for iter = 1:iterations
theta -= ( (1/m) * ((X * theta) - y)' * X)' * alpha;
plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "g-");
J(iter) = sum( (1/m) * ((X * theta) - y)' * X);
end
plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "r-");
figure()
plot(1:iter, J);
I continuously get wrong results, even though it would seem that J is minimized correctly. I checked the plotting function with the normal equation ( which works correctly of course, and although I believe the error lies somewhere in the theta equation, I cannot figure out what it.
i implemented your code and it seems to be just fine, the reason that you do not have the results that you want is that Linear regression or polynomial regression in your case suffers from local minimum when you try to minimize the objective function. The algorithm traps in local minimum during execution. i implement your code changing the step (alpha) and i saw that with smaller step it fits the data better but still you are trapping in local minimum.
Choosing random initialization point of thetas every time i am trapping in a different local minimum. If you are lucky you will find a better initial points for theta and fit the data better. I think that there are some algorithms that finds the best initial points.
Below i attach the results for random initial points and the results with Matlab's polyfit.
In the above plot replace "Linear Regression with Polynomial Regression", type error.
If you observe better the plot, you will see that by chance (using rand() ) i chose some initial points that leaded me to the best data fitting comparing the other initial points.... i am showing that with a pointer.
I implemented a gradient descent algorithm to minimize a cost function in order to gain a hypothesis for determining whether an image has a good quality. I did that in Octave. The idea is somehow based on the algorithm from the machine learning class by Andrew Ng
Therefore I have 880 values "y" that contains values from 0.5 to ~12. And I have 880 values from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail, after some iterations the value for theta is so small, that theta0 and theta1 become "NaN". And my linear regression curve has strange values...
here is the code for the gradient descent algorithm:
(theta = zeros(2, 1);, alpha= 0.01, iterations=1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
tmp_j1=0;
for i=1:m,
tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
end
tmp_j2=0;
for i=1:m,
tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
end
tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1))
tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2))
theta(1,1)=tmp1
theta(2,1)=tmp2
% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCost(X, y, theta);
end
end
And here is the computation for the costfunction:
function J = computeCost(X, y, theta) %
m = length(y); % number of training examples
J = 0;
tmp=0;
for i=1:m,
tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; %differenzberechnung
end
J= (1/(2*m)) * tmp
end
If you are wondering how the seemingly complex looking for loop can be vectorized and cramped into a single one line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation for how we arrive at this vectorized expression using gradient descent algorithm:
This is the gradient descent algorithm to fine tune the value of θ:
Assume that the following values of X, y and θ are given:
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features+1)
X = m x n matrix
y = m x 1 vector matrix
θ = n x 1 vector matrix
xi is the ith training example
xj is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)
whole objective of machine learning is to minimize Errors in predictions. Based on the above corollary, our Errors matrix is m x 1 vector matrix as follows:
To calculate new value of θj, we have to get a summation of all errors (m rows) multiplied by jth feature value of the training set X. That is, take all the values in E, individually multiply them with jth feature of the corresponding training example, and add them all together. This will help us in getting the new (and hopefully better) value of θj. Repeat this process for all j or the number of features. In matrix form, this can be written as:
This can be simplified as:
[E]' x [X] will give us a row vector matrix, since E' is 1 x m matrix and X is m x n matrix. But we are interested in getting a column matrix, hence we transpose the resultant matrix.
More succinctly, it can be written as:
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
i vectorized the theta thing...
may could help somebody
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended NG's class last year and I have the following implementation (vectorized):
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize them.
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you are setting the temporary thetas (here called theta_1 and theta_2) correctly back to the "real" theta.
Generally it is more useful to vectorize instead of loops, it is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
While not scalable like a vectorized version, a loop-based computation of a gradient descent should generate the same results. In the example above, the most probably case of the gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a set of data similar with the one described in the question, theta ends up with NaN values just after a few iterations if alpha = 0.01. However, when set as alpha = 0.000001, the gradient descent works as expected, even after 100 iterations.
Using only vectors here is the compact implementation of LR with Gradient Descent in Mathematica:
Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
Theta = Theta -
alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
Jhist[[k]] =
Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course one assumes that X is a n * 2 matrix, with X[[,1]] containing only 1s'
This should work:-
theta(1,1) = theta(1,1) - (alpha*(1/m))*((X*theta - y)'* X(:,1) );
theta(2,1) = theta(2,1) - (alpha*(1/m))*((X*theta - y)'* X(:,2) );
its cleaner this way, and vectorized also
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first Pdf file for Gradient Descent form machine Learning course, you would take care of learning rate. Here is the note from the mentioned pdf.
Implementation Note: If your learning rate is too large, J(theta) can di-
verge and blow up', resulting in values which are too large for computer
calculations. In these situations, Octave/MATLAB will tend to return
NaNs. NaN stands fornot a number' and is often caused by undened
operations that involve - infinity and +infinity.