I am trying to implement a convolutional neural network from scratch, and I cannot figure out how to perform vectorized operations on multi-channel images such as RGB, which have 3 dimensions. Following articles and tutorials such as this CS231n tutorial, it is pretty clear how to implement a network for a single input, since the input layer is then a 3-D matrix. But a dataset always contains multiple data points, so I cannot figure out how to implement these networks with vectorized operations over an entire dataset.
I have implemented a network that takes a 3-D matrix as input, but I have now realized that it will not work on an entire dataset; I would have to propagate one input at a time. I don't really know whether conv nets are vectorized over the entire dataset or not. If they are, how can I vectorize my convolutional network for multi-channel images?
If I got your question right, you're basically asking how to implement a convolutional layer for a mini-batch, which is a 4-D tensor.
To put it simply, you want to treat each input in a batch independently and apply convolution to each one. It's fairly straightforward to code without vectorization using a loop.
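As a reference point, the loop version could look roughly like the sketch below. The function name conv_forward_naive and the (N, C, H, W) / (F, C, HH, WW) layouts are assumptions chosen for illustration; it is slow, but handy for checking a vectorized version against.

```python
import numpy as np

def conv_forward_naive(x, w, b, stride=1, pad=0):
    """Loop-based convolution over a mini-batch (illustrative sketch).

    x: (N, C, H, W) inputs, w: (F, C, HH, WW) filters, b: (F,) biases.
    Returns an array of shape (N, F, out_h, out_w).
    """
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    out_h = (H + 2 * pad - HH) // stride + 1
    out_w = (W + 2 * pad - WW) // stride + 1
    # Zero-pad only the two spatial axes
    x_padded = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)), mode='constant')
    out = np.zeros((N, F, out_h, out_w))
    for n in range(N):              # each image in the batch, independently
        for f in range(F):          # each filter
            for i in range(out_h):
                for j in range(out_w):
                    hs, ws = i * stride, j * stride
                    # The patch spans all C channels at once
                    patch = x_padded[n, :, hs:hs + HH, ws:ws + WW]
                    out[n, f, i, j] = np.sum(patch * w[f]) + b[f]
    return out
```

Each output value is a sum over all channels of one patch, which is why a multi-channel image needs no special handling beyond the extra axis.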
A vectorized implementation is often based on the im2col technique, which transforms the 4-D input tensor into one large matrix so that the convolution becomes a single matrix multiplication. Here's an implementation of the forward pass using numpy.lib.stride_tricks in Python:
import numpy as np

def conv_forward(x, w, b, stride, pad):
    N, C, H, W = x.shape
    F, _, HH, WW = w.shape
    # Check dimensions
    assert (W + 2 * pad - WW) % stride == 0, 'width does not work'
    assert (H + 2 * pad - HH) % stride == 0, 'height does not work'
    # Pad the input
    p = pad
    x_padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), mode='constant')
    # Figure out output dimensions (H and W now refer to the padded input)
    H += 2 * pad
    W += 2 * pad
    # Integer division: the shapes below must be ints in Python 3
    out_h = (H - HH) // stride + 1
    out_w = (W - WW) // stride + 1
    # Perform an im2col operation by picking clever strides
    shape = (C, HH, WW, N, out_h, out_w)
    strides = (H * W, W, 1, C * H * W, stride * W, stride)
    strides = x.itemsize * np.array(strides)
    x_stride = np.lib.stride_tricks.as_strided(x_padded,
                                               shape=shape, strides=strides)
    x_cols = np.ascontiguousarray(x_stride)
    x_cols = x_cols.reshape(C * HH * WW, N * out_h * out_w)
    # Now all our convolutions are a big matrix multiply
    res = w.reshape(F, -1).dot(x_cols) + b.reshape(-1, 1)
    # Reshape the output to (N, F, out_h, out_w)
    res = res.reshape(F, N, out_h, out_w)
    out = res.transpose(1, 0, 2, 3)
    out = np.ascontiguousarray(out)
    return out
Note that it relies on some non-trivial array features (custom strides) that NumPy provides but that your own library may lack.
BTW, you generally don't want to push the entire data set as one batch - split it into several batches.
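Splitting into batches can be as simple as slicing along the first axis. A minimal sketch, where iterate_minibatches is a made-up helper name and the array is placeholder data:

```python
import numpy as np

def iterate_minibatches(x, batch_size):
    """Yield consecutive slices of x along the first (sample) axis."""
    for start in range(0, x.shape[0], batch_size):
        yield x[start:start + batch_size]

data = np.zeros((10, 3, 8, 8))   # placeholder: 10 RGB images of size 8x8
batch_shapes = [batch.shape for batch in iterate_minibatches(data, 4)]
```

The last batch is simply smaller when the dataset size is not a multiple of the batch size.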
In layer normalization, we compute the mean and variance across the input layer (instead of across the batch, which is what we do in batch normalization). We then normalize the input layer using that mean and variance, and return gamma times the normalized layer plus beta.
My question is: are gamma and beta scalars, with shapes (1, 1) and (1, 1) respectively, or are their shapes (1, number of hidden units) and (1, number of hidden units) respectively?
Here is how I have implemented layer normalization. Is this correct?
def layernorm(layer, gamma, beta):
    # Statistics are taken per sample, across the hidden units (axis 1)
    mean = np.mean(layer, axis=1, keepdims=True)
    variance = np.mean((layer - mean) ** 2, axis=1, keepdims=True)
    layer_hat = (layer - mean) * 1.0 / np.sqrt(variance + 1e-8)
    outputs = gamma * layer_hat + beta
    return outputs
where gamma and beta are defined as below:
gamma = np.random.normal(1, 128)
beta = np.random.normal(1, 128)
According to TensorFlow's implementation, if the input has shape [B, rest], then gamma and beta have shape rest. Here rest could be (h,) for a 2-dimensional input or (h, w, c) for a 4-dimensional input.
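For the 2-dimensional case, that would mean one gamma and one beta per hidden unit. A small sketch of what this could look like in NumPy; the initial values (ones and zeros) and the size of 128 are assumptions for illustration, not something stated in the question:

```python
import numpy as np

def layernorm(layer, gamma, beta, eps=1e-8):
    # Per-sample statistics across the hidden units
    mean = layer.mean(axis=1, keepdims=True)
    variance = layer.var(axis=1, keepdims=True)
    layer_hat = (layer - mean) / np.sqrt(variance + eps)
    # gamma and beta have shape (hidden,) and broadcast over the batch axis
    return gamma * layer_hat + beta

hidden = 128                 # assumed number of hidden units
gamma = np.ones(hidden)      # assumed initialization: scale of 1
beta = np.zeros(hidden)      # assumed initialization: shift of 0
layer = np.random.randn(4, hidden)
out = layernorm(layer, gamma, beta)
```

Note that np.random.normal(1, 128) returns a single scalar drawn with mean 1 and standard deviation 128, so it does not produce a vector of 128 parameters.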
I would like to generate a polynomial 'fit' to the cluster of colored pixels in the image here
(The point is that I would like to measure how closely that cluster approximates a horizontal line.)
I thought of using grabit or something similar and then treating this as a cloud of points in a graph. But is there a quicker function to do so directly on the image file?
thanks!
Here is a Python implementation. Basically, we find all (xi, yi) coordinates of the colored regions, then set up a regularized least squares system in which we want to find the vector of weights (w0, ..., wd) such that yi = w0 + w1 xi + w2 xi^2 + ... + wd xi^d holds "as closely as possible" in the least squares sense.
import numpy as np
import matplotlib.pyplot as plt
def rgb2gray(rgb):
    return np.dot(rgb[...,:3], [0.299, 0.587, 0.114])

def feature(x, order=3):
    """Generate polynomial feature of the form
    [1, x, x^2, ..., x^order] where x is the column of x-coordinates
    and 1 is the column of ones for the intercept.
    """
    x = x.reshape(-1, 1)
    return np.power(x, np.arange(order+1).reshape(1, -1))
I_orig = plt.imread("2Md7v.jpg")
# Convert to grayscale
I = rgb2gray(I_orig)
# Mask out region
mask = I > 20
# Get coordinates of pixels corresponding to marked region
X = np.argwhere(mask)
# Use the value as weights later
weights = I[mask] / float(I.max())
# Convert to diagonal matrix
W = np.diag(weights)
# Column indices
x = X[:, 1].reshape(-1, 1)
# Row indices to predict. Note origin is at top left corner
y = X[:, 0]
We want to find vector w that minimizes || Aw - y ||^2
so that we can use it to predict y = w . x
Here are 2 versions. One is a vanilla least squares with l2 regularization and the other is weighted least squares with l2 regularization.
# Ridge regression, i.e., least squares with l2 regularization.
# Should probably use a more numerically stable implementation,
# e.g., that in Scikit-Learn
# alpha is regularization parameter. Larger alpha => less flexible curve
alpha = 0.01
# Construct data matrix, A
order = 3
A = feature(x, order)
# w = inv (A^T A + alpha * I) A^T y
w_unweighted = np.linalg.pinv( A.T.dot(A) + alpha * np.eye(A.shape[1])).dot(A.T).dot(y)
# w = inv (A^T W A + alpha * I) A^T W y
w_weighted = np.linalg.pinv( A.T.dot(W).dot(A) + alpha * \
np.eye(A.shape[1])).dot(A.T).dot(W).dot(y)
The result
# Generate test points
n_samples = 50
x_test = np.linspace(0, I_orig.shape[1], n_samples)
X_test = feature(x_test, order)
# Predict y coordinates at test points
y_test_unweighted = X_test.dot(w_unweighted)
y_test_weighted = X_test.dot(w_weighted)
# Display
fig, ax = plt.subplots(1, 1, figsize=(10, 5))
ax.imshow(I_orig)
ax.plot(x_test, y_test_unweighted, color="green", marker='o', label="Unweighted")
ax.plot(x_test, y_test_weighted, color="blue", marker='x', label="Weighted")
fig.legend()
fig.savefig("curve.png")
For a simple straight-line fit, set the argument order of feature to 1. You can then use the gradient of the line to get a sense of how close it is to a horizontal line (e.g., by checking the angle corresponding to its slope).
It is also possible to set this to any polynomial degree you want. I find that degree 3 looks pretty good. In this case, 6 times the absolute value of the coefficient corresponding to x^3 (w_unweighted[3] or w_weighted[3]) is one measure of the curvature of the line.
See A measure for the curvature of a quadratic polynomial in Matlab for additional details.
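For the straight-line case, the angle check could be sketched as follows. The points here are synthetic stand-ins for the pixel coordinates, and np.polyfit is used instead of the ridge solve above purely to keep the example short:

```python
import numpy as np

# Synthetic stand-in for the (x, y) pixel coordinates of the cluster
x = np.arange(100, dtype=float)
y = 0.05 * x + 40.0

# Degree-1 fit: polyfit returns the highest power first
slope, intercept = np.polyfit(x, y, 1)

# Angle between the fitted line and the horizontal, in degrees
angle_deg = np.degrees(np.arctan(slope))
```

An angle near 0 means the cluster is close to horizontal. Since image rows increase downward, the sign of the angle is flipped relative to a conventional plot.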
The following piece of Python code works well for gradient descent:
import numpy as np

def gradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        gradient = np.dot(xTrans, loss) / m
        theta = theta - alpha * gradient
    return theta
Here, x is the m x n feature matrix (m = number of samples, n = total number of features).
However, if my features are non-numerical (say, director and genre) for 2 movies, then my feature matrix may look like:
[['Peter Jackson', 'Action'],
 ['Sergio Leone', 'Comedy']]
In such a case, how can I map these features to numerical values and apply gradient descent?
You can map your features to numerical values of your choice and then apply gradient descent the usual way.
In Python you can use pandas to do that easily:
import pandas as pd
# The column names must be passed as columns=; the second positional argument is the index
df = pd.DataFrame(X, columns=['director', 'genre'])
df.director = df.director.map({'Peter Jackson': 0, 'Sergio Leone': 1})
df.genre = df.genre.map({'Action': 0, 'Comedy': 1})
As you can see, this way can become pretty complicated and it might be better to write a piece of code doing that dynamically.
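One way to do the mapping dynamically is to let pandas build the codes itself. This is only a sketch on the two-movie example; the variable names encoded and one_hot are made up for illustration:

```python
import pandas as pd

X = [['Peter Jackson', 'Action'],
     ['Sergio Leone', 'Comedy']]
df = pd.DataFrame(X, columns=['director', 'genre'])

# Integer codes, assigned per column in order of first appearance
encoded = pd.DataFrame({col: pd.factorize(df[col])[0] for col in df.columns})

# One-hot alternative: one 0/1 column per category value
one_hot = pd.get_dummies(df, dtype=float)
```

Integer codes impose an arbitrary ordering on the categories (director 1 is not "greater" than director 0), so for a linear model the one-hot form is often the safer input.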
I is an mxn matrix and each element of I is a 1x3 vector (I is a 3-channel Mat image actually).
M is a 3x3 matrix.
J is a matrix with the same dimensions as I and is computed as follows: each element of J is the vector-matrix product of the corresponding (i.e. having the same coordinates) element of I and M.
I.e. if v1(r1,g1,b1) is an element of I and v2(r2,g2,b2) is its corresponding element of J, then v2 = v1 * M (this is a vector-matrix product, not a per-element product).
Question: How to compute J efficiently (in terms of speed)?
Thank you for your help.
As far as I know, the most efficient way to implement such an operation is as follows:
Reshape I from mxnx3 to (m·n)x3, let's call it I'
Calculate J' = I' * M
Reshape J' from (m·n)x3 to mxnx3, this is the J we wanted
The idea is to stack each pixel-wise operation pi'·M into one single operation P'·M, where P is the 3x(m·n) matrix containing one pixel per column (hence P' holds one pixel per row; it's just a convention, really).
Here is a code sample written in c++:
// read some image
cv::Mat I = cv::imread("image.png");       // rows x cols x 3, 8-bit
// convert to float so that I and M have the same element type
I.convertTo(I, CV_32F);
// some matrix M that modifies each pixel
cv::Mat M = (cv::Mat_<float>(3, 3) << 0, 0,  0,
                                      0, .5, 0,
                                      0, 0,  .5);   // 3 x 3
// remember old dimensions
int prevChannels = I.channels();
int prevRows = I.rows;
// reshape I
int newRows = I.rows * I.cols;
I = I.reshape(1, newRows);                 // (rows * cols) x 3
// compute J
cv::Mat J = I * M;                         // (rows * cols) x 3
// reshape to original dimensions
J = J.reshape(prevChannels, prevRows);     // rows x cols x 3
OpenCV provides an O(1) reshaping operation.
Thus performance depends solely on matrix multiplication, which I expect to be as efficient as possible in a computer vision library.
To further enhance performance, you might want to take a look at matrix multiplication using the ocl and gpu modules.
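The same reshape, multiply, reshape idea can be sketched in NumPy. The image below is random placeholder data rather than a loaded file:

```python
import numpy as np

rng = np.random.default_rng(0)
I = rng.random((4, 5, 3))            # stand-in for an m x n x 3 image
M = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5]])

m, n, c = I.shape
# (m*n) x 3 times 3 x 3, then back to m x n x 3
J = (I.reshape(m * n, c) @ M).reshape(m, n, c)
```

In NumPy the explicit reshape is optional, since I @ M already broadcasts the matrix product over the leading axes, but writing it out mirrors the three steps listed above.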
I implemented a gradient descent algorithm to minimize a cost function in order to obtain a hypothesis for determining whether an image has good quality. I did that in Octave. The idea is loosely based on the algorithm from the machine learning class by Andrew Ng.
I have 880 values in "y", ranging from 0.5 to ~12, and 880 values in "X", ranging from 50 to 300, that should predict the image's quality.
Sadly the algorithm seems to fail: after some iterations the value for theta becomes so small that theta0 and theta1 turn into "NaN", and my linear regression curve has strange values...
Here is the code for the gradient descent algorithm:
(theta = zeros(2, 1), alpha = 0.01, iterations = 1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        tmp_j1=0;
        for i=1:m,
            tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
        end
        tmp_j2=0;
        for i=1:m,
            tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
        end
        tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1))
        tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2))
        theta(1,1)=tmp1
        theta(2,1)=tmp2
        % ============================================================
        % Save the cost J in every iteration
        J_history(iter) = computeCost(X, y, theta);
    end
end
And here is the computation of the cost function:
function J = computeCost(X, y, theta)
    m = length(y); % number of training examples
    J = 0;
    tmp=0;
    for i=1:m,
        tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; % squared difference
    end
    J= (1/(2*m)) * tmp
end
If you are wondering how the seemingly complex for loop can be vectorized and crammed into a single one-line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
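Below is a detailed explanation of how we arrive at this vectorized expression from the gradient descent algorithm.
This is the gradient descent update used to fine-tune the value of θ (applied simultaneously for every j):
$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
Assume that the following values of X, y and θ are given: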
m = number of training examples
n = number of features + 1
Here
m = 5 (training examples)
n = 4 (features+1)
X = m x n matrix
y = m x 1 vector matrix
θ = n x 1 vector matrix
xi is the ith training example
xj is the jth feature in a given training example
Further,
h(x) = ([X] * [θ]) (m x 1 matrix of predicted values for our training set)
h(x)-y = ([X] * [θ] - [y]) (m x 1 matrix of Errors in our predictions)
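The whole objective of machine learning is to minimize the errors in predictions. Based on the definitions above, our error matrix E is an m x 1 vector:
$$E = X\theta - y = \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_m \end{bmatrix}$$
To calculate the new value of θj, we have to sum all errors (m rows), each multiplied by the jth feature value of the corresponding row of the training set X. That is, take all the values in E, multiply each one by the jth feature of the corresponding training example, and add them all together. This gives us the new (and hopefully better) value of θj. Repeat this process for all j, i.e. for every feature. In matrix form, this can be written as:
$$\begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} := \begin{bmatrix} \theta_1 \\ \vdots \\ \theta_n \end{bmatrix} - \frac{\alpha}{m} \begin{bmatrix} \sum_{i=1}^{m} e_i \, x_1^{(i)} \\ \vdots \\ \sum_{i=1}^{m} e_i \, x_n^{(i)} \end{bmatrix}$$
This can be simplified as:
$$\theta := \theta - \frac{\alpha}{m} \, (E' X)'$$
[E]' x [X] gives us a row vector, since E' is a 1 x m matrix and X is an m x n matrix. But we are interested in getting a column vector, hence we transpose the result.
More succinctly, it can be written as:
$$\theta := \theta - \frac{\alpha}{m} \, \big( (X\theta - y)' X \big)'$$
Since (A * B)' = (B' * A'), and A'' = A, we can also write the above as
$$\theta := \theta - \frac{\alpha}{m} \, X' (X\theta - y)$$
This is the original expression we started out with: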
theta = theta - (alpha/m) * (X' * (X * theta - y))
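As a sanity check, the element-wise update and the two vectorized forms can be compared numerically. The sketch below does this in NumPy on random placeholder data with m = 5 and n = 4; the variable names are chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 4
X = np.hstack([np.ones((m, 1)), rng.random((m, n - 1))])  # first column of ones
y = rng.random((m, 1))
theta = rng.random((n, 1))
alpha = 0.1

# Element-wise form: for each j, sum the errors times the jth feature
E = X @ theta - y
theta_loop = theta.copy()
for j in range(n):
    total = 0.0
    for i in range(m):
        total += E[i, 0] * X[i, j]
    theta_loop[j, 0] = theta[j, 0] - (alpha / m) * total

# Vectorized form: theta - (alpha/m) * X' * (X * theta - y)
theta_vec = theta - (alpha / m) * (X.T @ (X @ theta - y))

# Transposed variant: theta - ((alpha/m) * (X * theta - y)' * X)'
theta_alt = theta - ((alpha / m) * (X @ theta - y).T @ X).T
```

All three results agree up to floating-point rounding, which is what the transpose identity predicts.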
I vectorized the theta update; maybe this helps somebody:
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended Ng's class last year and I have the following (vectorized) implementation:
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize it:
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards you correctly assign the temporary thetas (here called theta_1 and theta_2) back to the "real" theta.
Generally it is better to vectorize than to use loops: the code is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler, only one line, and computationally faster when the number of features is small.
Here is the normal equation:
http://mathworld.wolfram.com/NormalEquation.html
And in octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
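For comparison, here is a small NumPy sketch of the same one-liner on made-up data generated from y = 2 + 3x, so the expected theta is known in advance. The call to np.linalg.lstsq is an alternative I am adding, not part of the Octave line above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x                           # noiseless toy data
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept

# Normal equation: theta = pinv(X' X) X' y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y

# Least-squares solver that avoids forming X' X explicitly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both recover an intercept of 2 and a slope of 3 here. No learning rate or iteration count is involved, which is why the NaN problem cannot occur with this approach.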
While not as scalable as a vectorized version, a loop-based gradient descent should produce the same results. In the example above, the most probable cause of gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a data set similar to the one described in the question, theta ends up with NaN values after just a few iterations if alpha = 0.01. However, with alpha = 0.000001, gradient descent works as expected, even after 100 iterations.
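The effect is easy to reproduce. The sketch below uses synthetic data that only imitates the ranges given in the question (880 samples, X between 50 and 300); the generating coefficients and the function names are made up for illustration:

```python
import numpy as np

def gradient_descent(X, y, alpha, num_iters):
    """Vectorized batch gradient descent starting from theta = 0."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))
    return theta

def cost(X, y, theta):
    return np.sum((X @ theta - y) ** 2) / (2 * len(y))

rng = np.random.default_rng(0)
x = rng.uniform(50, 300, size=880)
y = 0.5 + 0.04 * x + rng.normal(0, 0.5, size=880)   # assumed relationship
X = np.column_stack([np.ones_like(x), x])

# Too large a step: the iterates overflow and end up as inf/NaN
with np.errstate(over='ignore', invalid='ignore'):
    theta_large = gradient_descent(X, y, alpha=0.01, num_iters=1500)

# Small step: stays finite and lowers the cost
theta_small = gradient_descent(X, y, alpha=1e-6, num_iters=100)
```

With unscaled features of this size, the step has to be tiny for the update to stay stable, which matches the observation that 0.000001 works while 0.01 does not.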
Using only vectors, here is a compact implementation of linear regression with gradient descent in Mathematica:
Theta = {0, 0}
alpha = 0.0001;
iteration = 1500;
Jhist = Table[0, {i, iteration}];
Table[
Theta = Theta -
alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
Jhist[[k]] =
Total[ (Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course, one assumes that X is an n x 2 matrix, with X[[All, 1]] containing only 1s.
This should work:
err = X*theta - y; % compute the error once so both updates use the old theta
theta(1,1) = theta(1,1) - (alpha*(1/m))*(err' * X(:,1));
theta(2,1) = theta(2,1) - (alpha*(1/m))*(err' * X(:,2));
It's cleaner this way, and also vectorized:
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first PDF file on gradient descent from the Machine Learning course, you would take care with the learning rate. Here is the note from that PDF:
Implementation Note: If your learning rate is too large, J(theta) can diverge and 'blow up', resulting in values which are too large for computer calculations. In these situations, Octave/MATLAB will tend to return NaNs. NaN stands for 'not a number' and is often caused by undefined operations that involve -infinity and +infinity.