Metal shader language: Why is there a difference in the result of similar instructions? - metal

In a Metal shader, why do these two lines not produce the same result:
float b = (color.rgb * float3(1,1.1,0.9)).x;
and
float b = dot(color.rgb, float3(1,1.1,0.9));

These are different operations. * is component-wise multiplication, while dot is the vector dot product.
Suppose color is defined as float3(0.6f, 0.7f, 0.8f).
Then the first expression, (color.rgb * float3(1.0f, 1.1f, 0.9f)).x, first multiplies the vectors together component-wise, producing the vector (0.6, 0.77, 0.72), then takes the first component (x), so the result is 0.6.
The second expression, dot(color.rgb, float3(1.0f, 1.1f, 0.9f)), is the sum of the component-wise products of the vectors (often called the dot product or inner product), so the result is (0.6 * 1.0 + 0.7 * 1.1 + 0.8 * 0.9), which happens to be 2.09.
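The same arithmetic can be checked outside of a shader. The following is a minimal plain-Python sketch that models the two Metal expressions with tuples; the variable names are only illustrative and no GPU or Metal API is involved.

```python
# Plain-Python model of the two Metal expressions (no GPU needed).
color = (0.6, 0.7, 0.8)
weights = (1.0, 1.1, 0.9)

# Component-wise product: another 3-vector, like `color.rgb * float3(...)`.
product = tuple(c * w for c, w in zip(color, weights))

# `.x` keeps only the first component of that vector.
first_component = product[0]

# dot() adds up all three component-wise products.
dot_product = sum(product)

print(first_component)        # 0.6
print(round(dot_product, 2))  # 2.09
```

So the two lines only agree when the second and third products happen to sum to zero.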

Related

Does H.264 encoded video with BT.709 matrix include any gamma adjustment?

I have read the BT.709 spec a number of times and the thing that is just not clear is should an encoded H.264 bitstream actually apply any gamma curve to the encoded data? Note the specific mention of a gamma like formula in the BT.709 spec. Apple provided examples of OpenGL or Metal shaders that read YUV data from CoreVideo provided buffers do not do any sort of gamma adjustment. YUV values are being read and processed as though they are simple linear values. I also examined the source code of ffmpeg and found no gamma adjustments being applied after the BT.709 scaling step. I then created a test video with just two linear grayscale colors 5 and 26 corresponding to 2% and 10% levels. When converted to H.264 with both ffmpeg and iMovie, the output BT.709 values are (YCbCr) (20 128 128) and (38 128 128) and these values exactly match the output of the BT.709 conversion matrix without any gamma adjustment.
A great piece of background on this topic can be found at Quicktime Gamma Bug. It seems that, historically, Quicktime and Adobe encoders applied different gamma adjustments improperly, and the results made video streams look awful on different players. This is really confusing because the sRGB spec, by comparison, clearly indicates how to apply a gamma encoding and then decode it to convert between sRGB and linear. Why does BT.709 go into so much detail about the same sort of gamma adjustment curve if no gamma adjustment is applied after the matrix step when creating an H.264 data stream? Are all the color steps in an H.264 stream meant to be coded as straight linear (gamma 1.0) values?
In case specific example input would make things more clear, I am attaching 3 color bar images, the exact values of different colors can be displayed in an image editor with these image files.
This first image is in the sRGB colorspace and is tagged as sRGB.
This second image has been converted to the linear RGB colorspace and is tagged with a linear RGB profile.
This third image has been converted to REC.709 profile levels with Rec709-elle-V4-rec709.icc from elles_icc_profiles. This seems to be what one would need to do to simulate "camera" gamma as described in BT.709.
Note how the sRGB value in the lower right corner (0x555555) becomes linear RGB (0x171717) and the BT.709 gamma encoded value becomes (0x464646). What is unclear is if I should be passing a linear RGB value into ffmpeg or if I should be passing an already BT.709 gamma encoded value which would then need to be decoded in the client before the linear conversion Matrix step to get back to RGB.
Update:
Based on the feedback, I have updated my C based implementation and Metal shader and uploaded to github as an iOS example project MetalBT709Decoder.
Encoding a normalized linear RGB value is implemented like this:
static inline
int BT709_convertLinearRGBToYCbCr(
float Rn,
float Gn,
float Bn,
int *YPtr,
int *CbPtr,
int *CrPtr,
int applyGammaMap)
{
// Gamma adjustment to non-linear value
if (applyGammaMap) {
Rn = BT709_linearNormToNonLinear(Rn);
Gn = BT709_linearNormToNonLinear(Gn);
Bn = BT709_linearNormToNonLinear(Bn);
}
// https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.709-6-201506-I!!PDF-E.pdf
float Ey = (Kr * Rn) + (Kg * Gn) + (Kb * Bn);
float Eb = (Bn - Ey) / Eb_minus_Ey_Range;
float Er = (Rn - Ey) / Er_minus_Ey_Range;
// Quantize Y to the range [16, 235] (a span of 219 steps)
// Quantize Eb, Er to the range [16, 240] (a span of 224 steps, centered at 128)
float AdjEy = (Ey * (YMax-YMin)) + 16;
float AdjEb = (Eb * (UVMax-UVMin)) + 128;
float AdjEr = (Er * (UVMax-UVMin)) + 128;
*YPtr = (int) round(AdjEy);
*CbPtr = (int) round(AdjEb);
*CrPtr = (int) round(AdjEr);
return 0;
}
Decoding from YCbCr to linear RGB is implemented like so:
static inline
int BT709_convertYCbCrToLinearRGB(
int Y,
int Cb,
int Cr,
float *RPtr,
float *GPtr,
float *BPtr,
int applyGammaMap)
{
// https://en.wikipedia.org/wiki/YCbCr#ITU-R_BT.709_conversion
// http://www.niwa.nu/2013/05/understanding-yuv-values/
// Normalize Y to range [0 255]
//
// Note that the matrix multiply will adjust
// this byte normalized range to account for
// the limited range [16 235]
float Yn = (Y - 16) * (1.0f / 255.0f);
// Normalize Cb and Cr with zero at 128 and range [0 255]
// Note that matrix will adjust to limited range [16 240]
float Cbn = (Cb - 128) * (1.0f / 255.0f);
float Crn = (Cr - 128) * (1.0f / 255.0f);
const float YScale = 255.0f / (YMax-YMin);
const float UVScale = 255.0f / (UVMax-UVMin);
const
float BT709Mat[] = {
YScale, 0.000f, (UVScale * Er_minus_Ey_Range),
YScale, (-1.0f * UVScale * Eb_minus_Ey_Range * Kb_over_Kg), (-1.0f * UVScale * Er_minus_Ey_Range * Kr_over_Kg),
YScale, (UVScale * Eb_minus_Ey_Range), 0.000f,
};
// Matrix multiply operation
//
// rgb = BT709Mat * YCbCr
// Convert input Y, Cb, Cr to normalized float values
float Rn = (Yn * BT709Mat[0]) + (Cbn * BT709Mat[1]) + (Crn * BT709Mat[2]);
float Gn = (Yn * BT709Mat[3]) + (Cbn * BT709Mat[4]) + (Crn * BT709Mat[5]);
float Bn = (Yn * BT709Mat[6]) + (Cbn * BT709Mat[7]) + (Crn * BT709Mat[8]);
// Saturate normalized linear (R G B) to range [0.0, 1.0]
Rn = saturatef(Rn);
Gn = saturatef(Gn);
Bn = saturatef(Bn);
// Gamma adjustment for RGB components after matrix transform
if (applyGammaMap) {
Rn = BT709_nonLinearNormToLinear(Rn);
Gn = BT709_nonLinearNormToLinear(Gn);
Bn = BT709_nonLinearNormToLinear(Bn);
}
*RPtr = Rn;
*GPtr = Gn;
*BPtr = Bn;
return 0;
}
I believe this logic is implemented correctly, but I am having a very difficult time validating the results. When I generate a .m4v file that contains gamma-adjusted color values (osxcolor_test_image_24bit_BT709.m4v), the results come out as expected. But a test case like (bars_709_Frame01.m4v) that I found here does not seem to work, as the color bar values seem to be encoded as linear (no gamma adjustment).
For a SMPTE test pattern, the 0.75 graylevel is linear RGB (191 191 191). Should this RGB be encoded with no gamma adjustment as (Y Cb Cr) (180 128 128), or should the value in the bitstream appear as the gamma-adjusted (Y Cb Cr) (206 128 128)?
(follow up)
After doing additional research into this gamma issue, it has become clear that what Apple is actually doing in AVFoundation is using a 1.961 gamma function. This is the case when encoding with AVAssetWriterInputPixelBufferAdaptor, when using vImage, or with CoreVideo APIs. This piecewise gamma function is defined as follows:
#define APPLE_GAMMA_196 (1.960938f)
static inline
float Apple196_nonLinearNormToLinear(float normV) {
const float xIntercept = 0.05583828f;
if (normV < xIntercept) {
normV *= (1.0f / 16.0f);
} else {
const float gamma = APPLE_GAMMA_196;
normV = pow(normV, gamma);
}
return normV;
}
static inline
float Apple196_linearNormToNonLinear(float normV) {
const float yIntercept = 0.00349f;
if (normV < yIntercept) {
normV *= 16.0f;
} else {
const float gamma = 1.0f / APPLE_GAMMA_196;
normV = pow(normV, gamma);
}
return normV;
}
Your original question: Does H.264 encoded video with BT.709 matrix include any gamma adjustment?
The encoded video only contains gamma adjustment if you feed the encoder gamma-adjusted values.
An H.264 encoder doesn't care about the transfer characteristics.
So if you compress linear values and then decompress, you'll get linear values.
And if you compress gamma-encoded values and then decompress, you'll get gamma-encoded values.
Likewise, if your bits are encoded with a Rec. 709 transfer function, the encoder won't change the gamma.
But you can specify the transfer characteristic in the H.264 stream as metadata (Rec. ITU-T H.264 (04/2017), E.1.1 VUI parameters syntax). So the encoded stream carries the color space information around, but it is not used in encoding or decoding.
I would assume that 8-bit video always contains a non-linear transfer function. Otherwise you would be using the 8 bits fairly unwisely.
If you convert to linear to do effects and composition, I'd recommend increasing the bit depth or linearizing into floats.
A color space consists of primaries, transfer function and matrix coefficients.
The gamma adjustment is encoded in the transfer function (and not in the matrix).
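To make the split between transfer function and matrix concrete, here is a minimal Python sketch of the encode direction. It assumes the standard BT.709 transfer function (4.5 * L below 0.018, otherwise 1.099 * L^0.45 - 0.099) and limited-range 8-bit quantization; the function names are illustrative and are not taken from the question's project. It reproduces the two candidate codes for the 0.75 gray level discussed in the question.

```python
import math

# BT.709 luma coefficients (Kg = 1 - Kr - Kb).
KR, KB = 0.2126, 0.0722
KG = 1.0 - KR - KB

def bt709_oetf(v):
    """BT.709 transfer function: linear light in [0, 1] -> non-linear signal."""
    if v < 0.018:
        return 4.5 * v
    return 1.099 * math.pow(v, 0.45) - 0.099

def round_half_up(x):
    # Mimics C's round() for the non-negative values used here.
    return int(math.floor(x + 0.5))

def rgb_to_ycbcr(r, g, b, apply_gamma):
    """Normalized RGB -> 8-bit limited-range (Y, Cb, Cr)."""
    if apply_gamma:
        r, g, b = bt709_oetf(r), bt709_oetf(g), bt709_oetf(b)
    ey = KR * r + KG * g + KB * b
    eb = (b - ey) / (2.0 * (1.0 - KB))  # divisor is 1.8556
    er = (r - ey) / (2.0 * (1.0 - KR))  # divisor is 1.5748
    return (round_half_up(219.0 * ey + 16.0),
            round_half_up(224.0 * eb + 128.0),
            round_half_up(224.0 * er + 128.0))

gray = 191.0 / 255.0
print(rgb_to_ycbcr(gray, gray, gray, apply_gamma=False))  # (180, 128, 128)
print(rgb_to_ycbcr(gray, gray, gray, apply_gamma=True))   # (206, 128, 128)
```

The matrix and quantization steps are identical in both calls; only the optional transfer function changes the result, which is why the encoder itself cannot tell the two inputs apart.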

Computing 3D coordinates of keypoints in multiple images

I have multiple images of an object taken by the same calibrated camera. Let's say calibrated means both intrinsic and extrinsic parameters (I can put a checkerboard next to the object, so all parameters can be retrieved). On these images I can find matching keypoints using SIFT or SURF, and some matching algorithm, this is basic OpenCV. But how do I do the 3D reconstruction of these points from multiple images? This is not a classic stereo arrangement, so there are more than 2 images with the same object points on them, and I want to use as many as possible for increased accuracy.
Are there any built-in OpenCV functions that do this?
(Note that this is done off-line, the solution does not need to be fast, but robust)
I guess you are looking for so-called structure from motion approaches. They use multiple images from different viewpoints and return a 3D reconstruction (e.g. a point cloud). It looks like OpenCV has an SfM module in the contrib package, but I have no experience with it.
However, I used to work with Bundler. It was quite uncomplicated and returns all of the information (camera calibration and point positions) as a text file, and you can view the point cloud with MeshLab. Please note that it uses SIFT keypoints and descriptors to establish correspondences.
I think I have found a solution for this. Structure from motion algorithms deal with the case where the cameras are not calibrated, but in this case all intrinsic and extrinsic parameters are known.
The problem reduces to a linear least squares problem:
We have to compute the coordinates for a single object point:
X = [x, y, z, 1]'
C = [x, y, z]'
X = [[C], [1]]
We are given n images, which have these transformation matrices:
Pi = Ki * [Ri|ti]
These matrices are already known. The object point is projected on the images at
Ui = [ui, vi]
We can write in homogeneous coordinates (the operator * stands for matrix multiplication, dot product, and scalar multiplication alike):
[ui * wi, vi * wi, wi]' = Pi * X
Pi = [[p11i, p12i, p13i, p14i],
[p21i, p22i, p23i, p24i],
[p31i, p32i, p33i, p34i]]
Let's define the following:
p1i = [p11i, p12i, p13i] (the first row of Pi missing the last element)
p2i = [p21i, p22i, p23i] (the second row of Pi missing the last element)
p3i = [p31i, p32i, p33i] (the third row of Pi missing the last element)
a1i = p14i
a2i = p24i
a3i = p34i
Then we can write:
Q = [x, y, z]'
wi = p3i * Q + a3i
ui = (p1i * Q + a1i) / wi =
= (p1i * Q + a1i) / (p3i * Q + a3i)
ui * p3i * Q + ui * a3i - p1i * Q - a1i = 0
(ui * p3i - p1i) * Q = a1i - ui * a3i
Similarly for vi:
(vi * p3i - p2i) * Q = a2i - vi * a3i
And this holds for i = 1..n. We can write this in matrix form:
G * Q = b
G = [[u1 * p31 - p11],
[v1 * p31 - p21],
[u2 * p32 - p12],
[v2 * p32 - p22],
...
[un * p3n - p1n],
[vn * p3n - p2n]]
b = [[a11 - a31 * u1],
[a21 - a31 * v1],
[a12 - a32 * u2],
[a22 - a32 * v2],
...
[a1n - a3n * un],
[a2n - a3n * vn]]
Since G and b are known from the Pi matrices, and the image points [ui, vi], we can compute the pseudoinverse of G (call it G_), and compute:
Q = G_ * b
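As a sanity check of the derivation, here is a small pure-Python sketch. It builds the rows of G and b exactly as defined above and solves the normal equations (G' * G) * Q = G' * b, which gives the same result as the pseudoinverse when G has full column rank. The camera matrices and the test point are made up for illustration (identity rotation, hypothetical focal length and principal point); with real data the Pi would come from calibration.

```python
def project(P, X):
    """Project 3D point X = (x, y, z) with a 3x4 matrix P; returns (u, v)."""
    h = [sum(P[r][c] * X[c] for c in range(3)) + P[r][3] for r in range(3)]
    return h[0] / h[2], h[1] / h[2]

def triangulate(Ps, uvs):
    """Least-squares solution of G * Q = b via the normal equations."""
    # Accumulate A = G' * G (3x3) and c = G' * b (3x1) row by row.
    A = [[0.0] * 3 for _ in range(3)]
    c = [0.0] * 3
    for P, (u, v) in zip(Ps, uvs):
        for coord, row in ((u, 0), (v, 1)):
            g = [coord * P[2][k] - P[row][k] for k in range(3)]
            rhs = P[row][3] - coord * P[2][3]
            for i in range(3):
                c[i] += g[i] * rhs
                for j in range(3):
                    A[i][j] += g[i] * g[j]
    # Solve the 3x3 system with Gauss-Jordan elimination and partial pivoting.
    M = [A[i] + [c[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def camera(tx, ty, tz, f=800.0, cx=320.0, cy=240.0):
    # P = K * [I | t] for a camera with identity rotation (hypothetical setup).
    return [[f, 0.0, cx, f * tx + cx * tz],
            [0.0, f, cy, f * ty + cy * tz],
            [0.0, 0.0, 1.0, tz]]

Ps = [camera(0.0, 0.0, 0.0), camera(-1.0, 0.0, 0.0), camera(0.0, -1.0, 0.5)]
point = (0.3, -0.2, 5.0)
uvs = [project(P, point) for P in Ps]
estimate = triangulate(Ps, uvs)
print(estimate)
```

With noise-free projections the estimate matches the original point up to rounding; with noisy keypoints the same code returns the algebraic least-squares fit over all views.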

Implementing gradient descent for multiple variables in Octave using "sum"

I'm doing Andrew Ng's course on Machine Learning and I'm trying to wrap my head around the vectorised implementation of gradient descent for multiple variables which is an optional exercise in the course.
This is the algorithm in question (taken from here):
I just cannot do this in octave using sum though, I'm not sure how to multiply the sum of the hypothesis of x(i) - y(i) by the all variables xj(i). I tried different iterations of the following code but to no avail (either the dimensions are not right or the answer is wrong):
theta = theta - alpha/m * sum(X * theta - y) * X;
The correct answer, however, is entirely non-obvious (to a linear algebra beginner like me anyway, from here):
theta = theta - (alpha/m * (X * theta-y)' * X)';
Is there a rule of thumb for cases where sum is involved that governs transformations like the above?
And if so, is there the opposite version of the above (i.e. going from a sum based solution to a general multiplication one) as I was able to come up with a correct implementation using sum for gradient descent for a single variable (albeit not a very elegant one):
temp0 = theta(1) - (alpha/m * sum(X * theta - y));
temp1 = theta(2) - (alpha/m * sum((X * theta - y)' * X(:, 2)));
theta(1) = temp0;
theta(2) = temp1;
Please note that this only concerns vectorised implementations and although there are several questions on SO as to how this is done, my question is primarily concerned with the implementation of the algorithm in Octave using sum.
The general "rule of thumb" is as follows: if you encounter something in the form of
SUM_i f(x_i, y_i, ...) g(a_i, b_i, ...)
then you can easily vectorize it (and this is what is done in the above) through
f(x, y, ...)' * g(a, b, ...)
As this is just a typical dot product, which in mathematics (in Euclidean space of finite dimension) looks like
<A, B> = SUM_i A_i B_i = A'B
thus
(X * theta - y)' * X
is just
<X * theta - y, X> = <H_theta(X) - y, X> = SUM_i (H_theta(X_i) - y_i) X_i
As you can see, this works both ways, as this is just the mathematical definition of the dot product.
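The equivalence can be checked numerically. The sketch below uses plain Python lists and a tiny made-up data set (the numbers are arbitrary) to compare the explicit SUM_i form with the "transpose times matrix" form, one parameter at a time.

```python
# Small made-up data set: 3 training examples, 2 columns (bias + one feature).
X = [[1.0, 2.0], [1.0, 3.0], [1.0, 5.0]]
y = [1.0, 2.0, 4.0]
theta = [0.5, 0.5]

# Errors h(x_i) - y_i, i.e. the vector X * theta - y.
errors = [sum(xij * tj for xij, tj in zip(row, theta)) - yi
          for row, yi in zip(X, y)]

# Explicit form: for each j, SUM_i (h(x_i) - y_i) * x_ij.
explicit = []
for j in range(len(theta)):
    total = 0.0
    for i in range(len(y)):
        total += errors[i] * X[i][j]
    explicit.append(total)

# Vectorized form: the row vector errors' times X, one dot product per column.
columns = list(zip(*X))
vectorized = [sum(e * x for e, x in zip(errors, col)) for col in columns]

print(explicit)    # [-0.5, -4.0]
print(vectorized)  # [-0.5, -4.0]
```

Each entry of the vectorized result is one of the per-parameter sums, which is why no separate call to sum is needed in the Octave one-liner.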
Referring to this part of your question specifically - "I'm not sure how to multiply the sum of the hypothesis of x(i) - y(i) by the all variables xj(i)."
In Octave you can apply an operator element-wise to every entry by prefixing it with "." (as in ".*" or ".^"), so it can be written as:
m = size(X, 1);
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1 / (2*m) * sum(sqrErrors);
Vector multiplication already computes the sum of the products, so you don't have to call sum() explicitly. Applying sum() first collapses the error vector into a scalar, which loses the per-example values that each parameter update needs.
You actually don't want to use summation here, because what you are trying to calculate are the individual values for all thetas, not the overall cost J. Since you do this in one line of code, summing it up leaves you with a single value (the sum of all thetas).
Summation was correct, though unnecessary, when you computed the values of theta one by one in the previous exercise. This works just the same:
temp0 = theta(1) - (alpha/m * (X * theta - y)' * X(:, 1));
temp1 = theta(2) - (alpha/m * (X * theta - y)' * X(:, 2));
theta(1) = temp0;
theta(2) = temp1;

iOS Accelerate low-pass FFT filter mirroring result

I am trying to port an existing FFT based low-pass filter to iOS using the Accelerate vDSP framework.
It seems like the FFT works as expected for about the first 1/4 of the sample. After that, however, the results seem wrong and, even more oddly, are mirrored (with the last half of the signal mirroring most of the first half).
You can see the results from a test application below. First is plotted the original sampled data, then an example of the expected filtered results (filtering out signal higher than 15Hz), then finally the results of my current FFT code (note that the desired results and example FFT result are at a different scale than the original data):
The actual code for my low-pass filter is as follows:
double *lowpassFilterVector(double *accell, uint32_t sampleCount, double lowPassFreq, double sampleRate )
{
double stride = 1;
int ln = log2f(sampleCount);
int n = 1 << ln;
// So that we get an FFT of the whole data set, we pad out the array to the next highest power of 2.
int fullPadN = n * 2;
double *padAccell = malloc(sizeof(double) * fullPadN);
memset(padAccell, 0, sizeof(double) * fullPadN);
memcpy(padAccell, accell, sizeof(double) * sampleCount);
ln = log2f(fullPadN);
n = 1 << ln;
int nOver2 = n/2;
DSPDoubleSplitComplex A;
A.realp = (double *)malloc(sizeof(double) * nOver2);
A.imagp = (double *)malloc(sizeof(double) * nOver2);
// This can be reused, just including it here for simplicity.
FFTSetupD setupReal = vDSP_create_fftsetupD(ln, FFT_RADIX2);
vDSP_ctozD((DSPDoubleComplex*)padAccell,2,&A,1,nOver2);
// Use the FFT to get frequency counts
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_FORWARD);
const double factor = 0.5f;
vDSP_vsmulD(A.realp, 1, &factor, A.realp, 1, nOver2);
vDSP_vsmulD(A.imagp, 1, &factor, A.imagp, 1, nOver2);
A.realp[nOver2] = A.imagp[0];
A.imagp[0] = 0.0f;
A.imagp[nOver2] = 0.0f;
// Set frequencies above target to 0.
// This tells us which bin the frequencies over the minimum desired correspond to
NSInteger binLocation = (lowPassFreq * n) / sampleRate;
// We add 2 because bin 0 holds special FFT meta data, so bins really start at "1" - and we want to filter out anything OVER the target frequency
for ( NSInteger i = binLocation+2; i < nOver2; i++ )
{
A.realp[i] = 0;
}
// Clear out all imaginary parts
bzero(A.imagp, (nOver2) * sizeof(double));
//A.imagp[0] = A.realp[nOver2];
// Now shift back all of the values
vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_INVERSE);
double *filteredAccell = (double *)malloc(sizeof(double) * fullPadN);
// Converts complex vector back into 2D array
vDSP_ztocD(&A, stride, (DSPDoubleComplex*)filteredAccell, 2, nOver2);
// Have to scale results to account for Apple's FFT library algorithm, see:
// http://developer.apple.com/library/ios/#documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html#//apple_ref/doc/uid/TP40005147-CH202-15952
double scale = (float)1.0f / fullPadN;//(2.0f * (float)n);
vDSP_vsmulD(filteredAccell, 1, &scale, filteredAccell, 1, fullPadN);
// Tracks results of conversion
printf("\nInput & output:\n");
for (int k = 0; k < sampleCount; k++)
{
printf("%3d\t%6.2f\t%6.2f\t%6.2f\n", k, accell[k], padAccell[k], filteredAccell[k]);
}
// Acceleration data will be replaced in-place.
return filteredAccell;
}
In the original code the library was handling non power-of-two sizes of input data; in my Accelerate code I am padding out the input to the nearest power of two. In the case of the sample test below the original sample data is 1000 samples so it's padded to 1024. I don't think that would affect results but I include that for the sake of possible differences.
If you want to experiment with a solution, you can download the sample project that generates the graphs here (in the FFTTest folder):
FFT Example Project code
Thanks for any insight. I've not worked with FFTs before, so I feel like I am missing something critical.
If you want a strictly real (not complex) result, then the data before the IFFT must be conjugate symmetric. If you don't want the result to be mirror symmetric, then don't zero the imaginary component before the IFFT. Merely zeroing bins before the IFFT creates a filter with a huge amount of ripple in the passband.
The Accelerate framework also supports more FFT lengths than just powers of 2.
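The effect of zeroing the imaginary parts can be reproduced without Accelerate. The following Python sketch uses a naive DFT on a made-up 16-sample signal (it does not use vDSP or its packed data layout) and compares two ways of low-pass filtering in the frequency domain: keeping the spectrum conjugate symmetric, versus additionally clearing all imaginary parts.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) DFT; the inverse includes the 1/n scaling."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

n = 16
# Test signal: a slow sine with a phase offset (bin 1) plus a fast sine (bin 5).
signal = [math.sin(2 * math.pi * t / n + 0.7)
          + 0.5 * math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
spectrum = dft(signal)

cutoff = 2
# Keep bins 0..cutoff and their mirror images n-cutoff..n-1, so that
# X[n-k] == conj(X[k]) still holds and the inverse transform is real.
symmetric = [X if (k <= cutoff or k >= n - cutoff) else 0j
             for k, X in enumerate(spectrum)]
filtered = dft(symmetric, inverse=True)

# Zeroing every imaginary part leaves a purely real spectrum, whose
# inverse transform is mirrored: sample t equals sample n - t.
real_only = [complex(X.real, 0.0) for X in symmetric]
mirrored = dft(real_only, inverse=True)

for t in (1, 2, 3):
    print(t, round(filtered[t].real, 4), round(mirrored[t].real, 4),
          round(mirrored[n - t].real, 4))
```

Here `filtered` recovers the slow sine, while `mirrored` is symmetric about sample 0, which matches the mirrored plot described in the question. Hard zeroing of bins is kept only to isolate the symmetry issue; as noted above, it is a poor filter design.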

gradient descent seems to fail

I implemented a gradient descent algorithm to minimize a cost function in order to obtain a hypothesis for determining whether an image is of good quality. I did that in Octave. The idea is loosely based on the algorithm from the machine learning class by Andrew Ng.
I have 880 values in "y" ranging from 0.5 to ~12, and 880 values ranging from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail: after some iterations the value for theta is so small that theta0 and theta1 become "NaN", and my linear regression curve has strange values...
Here is the code for the gradient descent algorithm:
(theta = zeros(2, 1);, alpha= 0.01, iterations=1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
m = length(y); % number of training examples
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
tmp_j1=0;
for i=1:m,
tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
end
tmp_j2=0;
for i=1:m,
tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
end
tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1))
tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2))
theta(1,1)=tmp1
theta(2,1)=tmp2
% ============================================================
% Save the cost J in every iteration
J_history(iter) = computeCost(X, y, theta);
end
end
And here is the computation for the costfunction:
function J = computeCost(X, y, theta) %
m = length(y); % number of training examples
J = 0;
tmp=0;
for i=1:m,
tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; %differenzberechnung
end
J= (1/(2*m)) * tmp
end
If you are wondering how the seemingly complex for loop can be vectorized and crammed into a single-line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation of how we arrive at this vectorized expression using the gradient descent algorithm:
This is the gradient descent algorithm to fine tune the value of θ:
Assume that the following values of X, y and θ are given:
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features+1)
X = m x n matrix
y = m x 1 vector matrix
θ = n x 1 vector matrix
xi is the ith training example
xj is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)
The whole objective of machine learning is to minimize errors in predictions. Based on the above, our errors matrix E is an m x 1 vector as follows:
To calculate new value of θj, we have to get a summation of all errors (m rows) multiplied by jth feature value of the training set X. That is, take all the values in E, individually multiply them with jth feature of the corresponding training example, and add them all together. This will help us in getting the new (and hopefully better) value of θj. Repeat this process for all j or the number of features. In matrix form, this can be written as:
This can be simplified as:
[E]' x [X] will give us a row vector matrix, since E' is 1 x m matrix and X is m x n matrix. But we are interested in getting a column matrix, hence we transpose the resultant matrix.
More succinctly, it can be written as:
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
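The formula images that accompanied this answer are not reproduced in the text. The chain of equations it describes can be written out as follows; this is a reconstruction from the definitions given above (E = Xθ - y, X of size m x n, θ of size n x 1), not a copy of the original figures.

```latex
\begin{aligned}
\theta_j &:= \theta_j - \frac{\alpha}{m}\sum_{i=1}^{m}\bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\,x_j^{(i)},
  \qquad j = 1,\dots,n \\
E &= X\theta - y \\
\theta &:= \theta - \frac{\alpha}{m}\,\bigl(E^{\top}X\bigr)^{\top} \\
       &= \theta - \frac{\alpha}{m}\,\bigl((X\theta - y)^{\top}X\bigr)^{\top} \\
       &= \theta - \frac{\alpha}{m}\,X^{\top}(X\theta - y)
\end{aligned}
```

The last step uses (AB)' = B'A' and A'' = A, exactly as stated in the text.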
I vectorized the theta update; maybe it could help somebody:
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended Ng's class last year and I have the following implementation (vectorized):
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize them.
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you correctly assign the temporary thetas (here called theta_1 and theta_2) back to the "real" theta.
Generally, vectorized code is preferable to loops: it is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
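For the two-parameter case in the question (intercept plus one feature), the normal equation is small enough to solve by hand. The following Python sketch does that with a closed-form 2x2 inverse; the function name and the sample data are made up for illustration, and unlike pinv it simply raises an error when X' * X is singular.

```python
def normal_equation_2(xs, ys):
    """Solve (X' * X) * theta = X' * y for X = [1, x] (intercept + one feature)."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X' * X = [[m, sx], [sx, sxx]] and X' * y = [sy, sxy].
    det = m * sxx - sx * sx
    if det == 0:
        raise ValueError("X' * X is singular; all x values are identical")
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1

# Made-up data lying exactly on the line y = 2 + 0.03 * x.
xs = [50.0, 100.0, 150.0, 200.0, 300.0]
ys = [2.0 + 0.03 * x for x in xs]
theta0, theta1 = normal_equation_2(xs, ys)
print(theta0, theta1)
```

Because there is no learning rate, this route sidesteps the divergence problem entirely, at the cost of solving a linear system whose size grows with the number of features.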
While not as scalable as a vectorized version, a loop-based computation of gradient descent should generate the same results. In the example above, the most probable cause of the gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a set of data similar to the one described in the question, theta ends up with NaN values after just a few iterations if alpha = 0.01. However, when alpha is set to 0.000001, the gradient descent works as expected, even after 100 iterations.
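The effect of the learning rate can be reproduced with a short Python sketch. The data below is synthetic and only mimics the ranges mentioned in the question (x between 50 and 300, y between 0.5 and 12); it is not the asker's data set, and the function names are illustrative.

```python
import math

def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

def cost(xs, ys, theta0, theta1):
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    # e * e instead of e ** 2: float ** raises OverflowError on huge values.
    return sum(e * e for e in errors) / (2 * len(xs))

# Made-up data in the ranges from the question: x in [50, 300], y in [0.5, 12].
xs = [50.0 + 250.0 * i / 99 for i in range(100)]
ys = [0.5 + 11.5 * i / 99 for i in range(100)]

diverged = gradient_descent(xs, ys, alpha=0.01, iterations=300)
converging = gradient_descent(xs, ys, alpha=0.000001, iterations=300)

print(diverged)                      # non-finite values (inf or nan)
print(converging, cost(xs, ys, *converging))
```

With unscaled features of this magnitude, the squared x values dominate the gradient, so the step size has to be tiny; scaling the feature first would be an alternative way to allow a larger alpha.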
Using only vectors, here is a compact implementation of linear regression with gradient descent in Mathematica:
Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
Theta = Theta -
alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
Jhist[[k]] =
Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course, one assumes that X is an n x 2 matrix, with X[[All, 1]] containing only 1s.
This should work:
theta(1,1) = theta(1,1) - (alpha*(1/m))*((X*theta - y)'* X(:,1) );
theta(2,1) = theta(2,1) - (alpha*(1/m))*((X*theta - y)'* X(:,2) );
It's cleaner this way, and also vectorized:
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first PDF file on gradient descent from the machine learning course, you will know to take care with the learning rate. Here is the note from that PDF:
Implementation Note: If your learning rate is too large, J(theta) can
diverge and 'blow up', resulting in values which are too large for computer
calculations. In these situations, Octave/MATLAB will tend to return
NaNs. NaN stands for 'not a number' and is often caused by undefined
operations that involve -infinity and +infinity.

Resources