Understanding a passage in the Paper about VGGNet - machine-learning

I don't understand a passage in the article about the VGGNet. Maybe someone can help.
In my opinion, the number of weights in a convolutional layer is
p=w*h*d*n+n
where w is the width of the filters, h the height of the filters, d the depth of the filters and n the num of the filters.
In the article the following is written:
assuming that both the input and the output of a three-layer 3 × 3 onvolution stack has C channels, the stack is parametrised by 3*(3^2*C^2) = 27C^2
weights; at the same time, a single 7 × 7 conv. layer would require 7^2*C^2 = 49C^2 parameters.
I do not understand, what is meant by channels here, and why this formula is used.
Can someone explain this to me?
Thanks in advance.

Your intuition is correct; we just need to unpack their explanation a bit. For the first case:
w = 3 # filter width
h = 3 # filter height
d = C # filter depth (number of channels is same as number of input filters; eg RGB is C=3)
n = C # number of output filters/channels
This then makes whdn = 9C^2 parameters. Then, they also say there are three of these stacked, so thats 27C^2.
For a single 7x7 filter, then it's all the same 7x7xCxCx1.
The final difference is that you add n once more at the end in your original post; that is the bias terms, which in VGG they skip (many people skip bias terms; their value is debatable in some settings).

Related

Implementing convolution from scratch in Julia

I am trying to implement convolution by hand in Julia. I'm not too familiar with image processing or Julia, so maybe I'm biting more than I can chew.
Anyway, when I apply this method with a 3*3 edge filter edge = [0 -1 0; -1 4 -1; 0 -1 0] as convolve(img, edge), I am getting an error saying that my values are exceeding the allowed values for the RGBA type.
Code
function convolve(img::Matrix{<:Any}, kernel)
(half_kernel_w, half_kernel_h) = size(kernel) .÷ 2
(width, height) = size(img)
cpy_im = copy(img)
for row ∈ 1+half_kernel_h:height-half_kernel_h
for col ∈ 1+half_kernel_w:width-half_kernel_w
from_row, to_row = row .+ (-half_kernel_h, half_kernel_h)
from_col, to_col = col .+ (-half_kernel_h, half_kernel_h)
cpy_im[row, col] = sum((kernel .* RGB.(img[from_row:to_row, from_col:to_col])))
end
end
cpy_im
end
Error (original)
ArgumentError: element type FixedPointNumbers.N0f8 is an 8-bit type representing 256 values from 0.0 to 1.0, but the values (-0.0039215684f0, -0.007843137f0, -0.007843137f0, 1.0f0) do not lie within this range.
See the READMEs for FixedPointNumbers and ColorTypes for more information.
I am able to identify a simple case where such error may occur (a white pixel surrounded by all black pixels or vice-versa). I tried "fixing" this by attempting to follow the advice here from another stackoverflow question, but I get more errors to the effect of Math on colors is deliberately undefined in ColorTypes, but see the ColorVectorSpace package..
Code attempting to apply solution from the other SO question
function convolve(img::Matrix{<:Any}, kernel)
(half_kernel_w, half_kernel_h) = size(kernel) .÷ 2
(width, height) = size(img)
cpy_im = copy(img)
for row ∈ 1+half_kernel_h:height-half_kernel_h
for col ∈ 1+half_kernel_w:width-half_kernel_w
from_row, to_row = row .+ [-half_kernel_h, half_kernel_h]
from_col, to_col = col .+ [-half_kernel_h, half_kernel_h]
cpy_im[row, col] = sum((kernel .* RGB.(img[from_row:to_row, from_col:to_col] ./ 2 .+ 128)))
end
end
cpy_im
end
Corresponding error
MethodError: no method matching +(::ColorTypes.RGBA{Float32}, ::Int64)
Math on colors is deliberately undefined in ColorTypes, but see the ColorVectorSpace package.
Closest candidates are:
+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:591
+(!Matched::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
+(!Matched::ChainRulesCore.AbstractThunk, ::Any) at ~/.julia/packages/ChainRulesCore/a4mIA/src/tangent_arithmetic.jl:122
Now, I can try using convert etc., but when I look at the big picture, I start to wonder what the idiomatic way of solving this problem in Julia is. And that is my question. If you had to implement convolution by hand from scratch, what would be a good way to do so?
EDIT:
Here is an implementation that works, though it may not be idiomatic
function convolve(img::Matrix{<:Any}, kernel)
(half_kernel_h, half_kernel_w) = size(kernel) .÷ 2
(height, width) = size(img)
cpy_im = copy(img)
# println(Dict("width" => width, "height" => height, "half_kernel_w" => half_kernel_w, "half_kernel_h" => half_kernel_h, "row range" => 1+half_kernel_h:(height-half_kernel_h), "col range" => 1+half_kernel_w:(width-half_kernel_w)))
for row ∈ 1+half_kernel_h:(height-half_kernel_h)
for col ∈ 1+half_kernel_w:(width-half_kernel_w)
from_row, to_row = row .+ (-half_kernel_h, half_kernel_h)
from_col, to_col = col .+ (-half_kernel_w, half_kernel_w)
vals = Dict()
for method ∈ [red, green, blue, alpha]
x = sum((kernel .* method.(img[from_row:to_row, from_col:to_col])))
if x > 1
x = 1
elseif x < 0
x = 0
end
vals[method] = x
end
cpy_im[row, col] = RGBA(vals[red], vals[green], vals[blue], vals[alpha])
end
end
cpy_im
end
First of all, the error
Math on colors is deliberately undefined in ColorTypes, but see the ColorVectorSpace package.
should direct you to read the docs of the ColorVectorSpace package, where you will learn that using ColorVectorSpace will now enable math on RGB types. (The absence of default support it deliberate, because the way the image-processing community treats RGB is colorimetrically wrong. But everyone has agreed not to care, hence the ColorVectorSpace package.)
Second,
ArgumentError: element type FixedPointNumbers.N0f8 is an 8-bit type representing 256 values from 0.0 to 1.0, but the values (-0.0039215684f0, -0.007843137f0, -0.007843137f0, 1.0f0) do not lie within this range.
indicates that you're trying to write negative entries with an element type, N0f8, that can't support such values. Instead of cpy_im = copy(img), consider something like cpy_im = [float(c) for c in img] which will guarantee a floating-point representation that can support negative values.
Third, I would recommend avoiding steps like RGB.(img...) when nothing about your function otherwise addresses whether images are numeric, grayscale, or color. Fundamentally the only operations you need are scalar multiplication and addition, and it's better to write your algorithm generically leveraging only those two properties.
Tim Holy's answer above is correct - keep things simple and avoid relying on third-party packages when you don't need to.
I might point out that another option you may not have considered is to use a different algorithm. What you are implementing is the naive method, whereas many convolution routines using different algorithms for different sizes, such as im2col and Winograd (you can look these two up, I have a website that covers the idea behind both here).
The im2col routine might be worth doing as essentially you can break the routine in several pieces:
Unroll all 'regions' of the image to do a dot-product with the filter/kernel on, and stack them together into a single matrix.
Do a matrix-multiply with the unrolled input and filter/kernel.
Roll the output back into the correct shape.
It might be more complicated overall, but each part is simpler, so you may find this easier to do. A matrix multiply routine is definitely quite easy to implement. For 1x1 (single-pixel) convolutions where the image and filter have the same ordering (i.e. NCHW images and FCHW filter) the first and last steps are trivial as essentially no rolling/unrolling is necessary.
A final word of advice - start simpler and add in the code to handle edge-cases, convolutions are definitely fiddly to work with.
Hope this helps!

arbitrarily weighted moving average (low- and high-pass filters)

Given input signal x (e.g. a voltage, sampled thousand times per second couple of minutes long), I'd like to calculate e.g.
/ this is not q
y[3] = -3*x[0] - x[1] + x[2] + 3*x[3]
y[4] = -3*x[1] - x[2] + x[3] + 3*x[4]
. . .
I'm aiming for variable window length and weight coefficients. How can I do it in q? I'm aware of mavg and signal processing in q and moving sum qidiom
In the DSP world it's called applying filter kernel by doing convolution. Weight coefficients define the kernel, which makes a high- or low-pass filter. The example above calculates the slope from last four points, placing the straight line via least squares method.
Something like this would work for parameterisable coefficients:
q)x:10+sums -1+1000?2f
q)f:{sum x*til[count x]xprev\:y}
q)f[3 1 -1 -3] x
0n 0n 0n -2.385585 1.423811 2.771659 2.065391 -0.951051 -1.323334 -0.8614857 ..
Specific cases can be made a bit faster (running 0 xprev is not the best thing)
q)g:{prev[deltas x]+3*x-3 xprev x}
q)g[x]~f[3 1 -1 -3]x
1b
q)\t:100000 f[3 1 1 -3] x
4612
q)\t:100000 g x
1791
There's a kx white paper of signal processing in q if this area interests you: https://code.kx.com/q/wp/signal-processing/
This may be a bit old but I thought I'd weigh in. There is a paper I wrote last year on signal processing that may be of some value. Working purely within KDB, dependent on the signal sizes you are using, you will see much better performance with a FFT based convolution between the kernel/window and the signal.
However, I've only written up a simple radix-2 FFT, although in my github repo I do have the untested work for a more flexible Bluestein algorithm which will allow for more variable signal length. https://github.com/callumjbiggs/q-signals/blob/master/signal.q
If you wish to go down the path of performing a full manual convolution by a moving sum, then the best method would be to break it up into blocks equal to the kernel/window size (which was based on some work Arthur W did many years ago)
q)vec:10000?100.0
q)weights:30?1.0
q)wsize:count weights
q)(weights$(((wsize-1)#0.0),vec)til[wsize]+) each til count v
32.5931 75.54583 100.4159 124.0514 105.3138 117.532 179.2236 200.5387 232.168.
If your input list not big then you could use the technique mentioned here:
https://code.kx.com/q/cookbook/programming-idioms/#how-do-i-apply-a-function-to-a-sequence-sliding-window
That uses 'scan' adverb. As that process creates multiple lists which might be inefficient for big lists.
Other solution using scan is:
q)f:{sum y*next\[z;x]} / x-input list, y-weights, z-window size-1
q)f[x;-3 -1 1 3;3]
This function also creates multiple lists so again might not be very efficient for big lists.
Other option is to use indices to fetch target items from the input list and perform the calculation. This will operate only on input list.
q) f:{[l;w;i]sum w*l i+til 4} / w- weight, l- input list, i-current index
q) f[x;-3 -1 1 3]#'til count x
This is a very basic function. You can add more variables to it as per your requirements.

How to apply different cost functions to different output channels of a convolutional network?

I have a convolutional neural network whose output is a 4-channel 2D image. I want to apply sigmoid activation function to the first two channels and then use BCECriterion to computer the loss of the produced images with the ground truth ones. I want to apply squared loss function to the last two channels and finally computer the gradients and do backprop. I would also like to multiply the cost of the squared loss for each of the two last channels by a desired scalar.
So the cost has the following form:
cost = crossEntropyCh[{1, 2}] + l1 * squaredLossCh_3 + l2 * squaredLossCh_4
The way I'm thinking about doing this is as follow:
criterion1 = nn.BCECriterion()
criterion2 = nn.MSECriterion()
error = criterion1:forward(model.output[{{}, {1, 2}}], groundTruth1) + l1 * criterion2:forward(model.output[{{}, {3}}], groundTruth2) + l2 * criterion2:forward(model.output[{{}, {4}}], groundTruth3)
However, I don't think this is the correct way of doing it since I will have to do 3 separate backprop steps, one for each of the cost terms. So I wonder, can anyone give me a better solution to do this in Torch?
SplitTable and ParallelCriterion might be helpful for your problem.
Your current output layer is followed by nn.SplitTable that splits your output channels and converts your output tensor into a table. You can also combine different functions by using ParallelCriterion so that each criterion is applied on the corresponding entry of output table.
For details, I suggest you read documentation of Torch about tables.
After comments, I added the following code segment solving the original question.
M = 100
C = 4
H = 64
W = 64
dataIn = torch.rand(M, C, H, W)
layerOfTables = nn.Sequential()
-- Because SplitTable discards the dimension it is applied on, we insert
-- an additional dimension.
layerOfTables:add(nn.Reshape(M,C,1,H,W))
-- We want to split over the second dimension (i.e. channels).
layerOfTables:add(nn.SplitTable(2, 5))
-- We use ConcatTable in order to create paths accessing to the data for
-- numereous number of criterions. Each branch from the ConcatTable will
-- have access to the data (i.e. the output table).
criterionPath = nn.ConcatTable()
-- Starting from offset 1, NarrowTable will select 2 elements. Since you
-- want to use this portion as a 2 dimensional channel, we need to combine
-- then by using JoinTable. Without JoinTable, the output will be again a
-- table with 2 elements.
criterionPath:add(nn.Sequential():add(nn.NarrowTable(1, 2)):add(nn.JoinTable(2)))
-- SelectTable is simplified version of NarrowTable, and it fetches the desired element.
criterionPath:add(nn.SelectTable(3))
criterionPath:add(nn.SelectTable(4))
layerOfTables:add(criterionPath)
-- Here goes the criterion container. You can use this as if it is a regular
-- criterion function (Please see the examples on documentation page).
criterionContainer = nn.ParallelCriterion()
criterionContainer:add(nn.BCECriterion())
criterionContainer:add(nn.MSECriterion())
criterionContainer:add(nn.MSECriterion())
Since I used almost every possible table operation, it looks a little bit nasty. However, this is the only way I could solve this problem. I hope that it helps you and others suffering from the same problem. This is how the result looks like:
dataOut = layerOfTables:forward(dataIn)
print(dataOut)
{
1 : DoubleTensor - size: 100x2x64x64
2 : DoubleTensor - size: 100x1x64x64
3 : DoubleTensor - size: 100x1x64x64
}

Histogram calculation in julia-lang

refer to julia-lang documentations :
hist(v[, n]) → e, counts
Compute the histogram of v, optionally using approximately n bins. The return values are a range e, which correspond to the edges of the bins, and counts containing the number of elements of v in each bin. Note: Julia does not ignore NaN values in the computation.
I choose a sample range of data
testdata=0:1:10;
then use hist function to calculate histogram for 1 to 5 bins
hist(testdata,1) # => (-10.0:10.0:10.0,[1,10])
hist(testdata,2) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,3) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,4) # => (-5.0:5.0:10.0,[1,5,5])
hist(testdata,5) # => (-2.0:2.0:10.0,[1,2,2,2,2,2])
as you see when I want 1 bin it calculates 2 bins, and when I want 2 bins it calculates 3.
why does this happen?
As the person who wrote the underlying function: the aim is to get bin widths that are "nice" in terms of a base-10 counting system (i.e. 10k, 2×10k, 5×10k). If you want more control you can also specify the exact bin edges.
The key word in the doc is approximate. You can check what hist is actually doing for yourself in Julia's base module here.
When you do hist(test,3), you're actually calling
hist(v::AbstractVector, n::Integer) = hist(v,histrange(v,n))
That is, in a first step the n argument is converted into a FloatRange by the histrange function, the code of which can be found here. As you can see, the calculation of these steps is not entirely straightforward, so you should play around with this function a bit to figure out how it is constructing the range that forms the basis of the histogram.

Normalize a feature in this table

This has become quite a frustrating question, but I've asked in the Coursera discussions and they won't help. Below is the question:
I've gotten it wrong 6 times now. How do I normalize the feature? Hints are all I'm asking for.
I'm assuming x_2^(2) is the value 5184, unless I am adding the x_0 column of 1's, which they don't mention but he certainly mentions in the lectures when talking about creating the design matrix X. In which case x_2^(2) would be the value 72. Assuming one or the other is right (I'm playing a guessing game), what should I use to normalize it? He talks about 3 different ways to normalize in the lectures: one using the maximum value, another with the range/difference between max and mins, and another the standard deviation -- they want an answer correct to the hundredths. Which one am I to use? This is so confusing.
...use both feature scaling (dividing by the
"max-min", or range, of a feature) and mean normalization.
So for any individual feature f:
f_norm = (f - f_mean) / (f_max - f_min)
e.g. for x2,(midterm exam)^2 = {7921, 5184, 8836, 4761}
> x2 <- c(7921, 5184, 8836, 4761)
> mean(x2)
6676
> max(x2) - min(x2)
4075
> (x2 - mean(x2)) / (max(x2) - min(x2))
0.306 -0.366 0.530 -0.470
Hence norm(5184) = 0.366
(using R language, which is great at vectorizing expressions like this)
I agree it's confusing they used the notation x2 (2) to mean x2 (norm) or x2'
EDIT: in practice everyone calls the builtin scale(...) function, which does the same thing.
It's asking to normalize the second feature under second column using both feature scaling and mean normalization. Therefore,
(5184 - 6675.5) / 4075 = -0.366
Usually we normalize all of them to have zero mean and go between [-1, 1].
You can do that easily by dividing by the maximum of the absolute value and then remove the mean of the samples.
"I'm assuming x_2^(2) is the value 5184" is this because it's the second item in the list and using the subscript _2? x_2 is just a variable identity in maths, it applies to all rows in the list. Note that the highest raw mid-term exam result (i.e. that which is not squared) goes down on the final test and the lowest raw mid-term result increases the most for the final exam result. Theta is a fixed value, a coefficient, so somewhere your normalisation of x_1 and x_2 values must become (EDIT: not negative, less than 1) in order to allow for this behaviour. That should hopefully give you a starting basis, by identifying where the pivot point is.
I had the same problem, in my case the thing was that I was using as average the maximum x2 value (8836) minus minimum x2 value (4761) divided by two, instead of the sum of each x2 value divided by the number of examples.
For the same training set, I got the question as
Q. What is the normalized feature x^(3)_1?
Thus, 3rd training ex and 1st feature makes out to 94 in above table.
Now, normalized form is
x = (x - mean(x's)) / range(x)
Values are :
x = 94
mean(89+72+94+69) / 4 = 81
range = 94 - 69 = 25
Normalized x = (94 - 81) / 25 = 0.52
I'm taking this course at the moment and a really trivial mistake I made first time I answered this question was using comma instead of dot in the answer, since I did by hand and in my country we use comma to denote decimals. Ex:(0,52 instead of 0.52)
So in the second time I tried I used dot and works fine.

Resources