Suppose I have an RGB image of size 125 * 125 and I apply 10 filters of size 5 * 5 with stride = 3. What is the feature map for this layer, and what is the total number of parameters?
My guess for the feature map: ((125 - 5) / 3) + 1 = 41, so 41 * 41 * 10 (number of filters). But what is the difference between an RGB image and a greyscale image? For an RGB image should it be 41 * 41 * 30 (number of filters * number of channels of the input image)?
And for the total number of parameters: 5 * 5 * 3 * 10 = 750?
The number of feature maps = the number of filters, and the size of each feature map is 41 * 41 (you calculated it correctly, assuming zero padding). So, in the case above, you would have 10 feature maps of size 41 * 41, independent of RGB or greyscale, if you have 10 filters.
For RGB vs greyscale, think of the channels as the feature maps of the input layer: a filter is applied across all input feature maps at once. If you have an RGB input image and a 5 * 5 filter size, each filter is actually of size 5 * 5 * 3 (number of channels). So, independent of RGB or greyscale, if you apply 10 filters of size 5 * 5 to an image of size 125 * 125 with stride 3, you will always get 10 feature maps of size 41 * 41.
Whenever you define a filter, it is defined as x * y, but for 2D convolutions the z dimension is always equal to the number of channels of the input on which the convolution is applied. That way the total number of parameters for each filter is x * y * n_channels + 1 (the additional 1 is for the bias term).
In your case of 10 filters, for an RGB image each filter is of size 5 * 5 * 3, so the total number of parameters for RGB = 10 * (5 * 5 * 3 + 1). For greyscale each filter is 5 * 5 * 1, so the total for greyscale = 10 * (5 * 5 * 1 + 1).
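To make the arithmetic concrete, here is a minimal Python sketch of these two formulas (the function names are just for illustration):

def conv_output_size(input_size, filter_size, stride, padding=0):
    # Spatial size of the output feature map: floor((W - F + 2P) / S) + 1
    return (input_size - filter_size + 2 * padding) // stride + 1

def conv_param_count(filter_size, in_channels, num_filters):
    # Weights plus one bias per filter: (F * F * C + 1) * K
    return (filter_size * filter_size * in_channels + 1) * num_filters

print(conv_output_size(125, 5, 3))   # 41
print(conv_param_count(5, 3, 10))    # 760 (RGB)
print(conv_param_count(5, 1, 10))    # 260 (greyscale)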
Note: I have a well-received post on the intuition of how CNNs work on Data Science Stack Exchange. Do check it out as well.
You are right about the shape of the feature maps, with the given stride.
The total number of parameters in a CNN layer is given by:
Params = (n*m*l+1)*k
where, l = number of input feature maps / image channels,
k = number of output feature maps / # of filters,
n*m = filter dimensions
Important! The +1 in the formula is there because each output feature map has a bias term added, which is also a trainable parameter. So don't forget to add that!
So for your case, the number of trainable parameters is: ((5*5*3 + 1) * 10) = 760
And for a greyscale image it is: ((5*5*1 + 1) * 10) = 260
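If you have PyTorch available, you can sanity-check both numbers directly; a quick sketch (PyTorch is my choice here, not something the question requires):

import torch
import torch.nn as nn

# 10 filters of size 5x5 over a 3-channel (RGB) input, stride 3, no padding.
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=3)
print(sum(p.numel() for p in conv.parameters()))   # 760 = (5*5*3 + 1) * 10

# A dummy 125x125 RGB image (batch of 1) to check the feature-map shape.
x = torch.randn(1, 3, 125, 125)
print(conv(x).shape)                               # torch.Size([1, 10, 41, 41])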
A very good visualization that I find intuitive for showing how filters create feature maps is this.
The number of input feature maps equals the number of channels of the input image if it's the first CNN layer. Since CNN layers are usually stacked, the output feature maps of the previous layer become the input feature maps of the current layer.
I have an image, a 2D array of uint8_ts. I want to resize the image using a separable filter. Consider shrinking the width first. Because original & target sizes are unrelated, we'll use a different set of coefficients for every destination pixel. For a particular in & out size, for all y, we might have:
out(500, y) = in(673, y) * 12 + in(674, y) * 63 + in(675, y) * 25
out(501, y) = in(674, y) * 27 + in(675, y) * 58 + in(676, y) * 15
How can I use Eigen to speed this up, e.g. vectorize it for me?
This can be expressed as a matrix multiply with a sparse matrix, of dimension in_width * out_width, where in each row, only 3 out of the in_width values are non-zero. In the actual use case, 4 to 8 will typically be non-zero. But those non-zero values will be contiguous, and it would be nice to use SSE or whatever to make it fast.
Note that the source matrix has 8 bit scalars. The final result, after scaling both width & height will be 8 bits as well. It might be nice for the intermediate matrix (image) and filter to be higher precision, say 16 bits. But even if they're 8 bits, when multiplying, we'll need to take the most significant bits of the product, not the least significant.
Is this too far outside of what Eigen can do? This sort of convolution, with a kernel that's different at every pixel (only because the output size isn't an integral multiple of the input size), seems common enough.
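This isn't an Eigen answer, but to illustrate the sparse-matrix formulation described above, here is a minimal NumPy/SciPy sketch; the sizes, the source-index rule, and the 3-tap coefficients are all made up for the example:

import numpy as np
from scipy.sparse import csr_matrix

in_w, out_w, h = 1000, 744, 64          # made-up sizes for the example

# Build a sparse (out_w x in_w) resampling matrix: each row holds the few
# contiguous non-zero filter coefficients for one destination column.
rows, cols, vals = [], [], []
for dst in range(out_w):
    src = min(int(dst * in_w / out_w), in_w - 3)   # leftmost source tap (placeholder rule)
    for k, c in enumerate((0.25, 0.5, 0.25)):      # placeholder 3-tap kernel
        rows.append(dst); cols.append(src + k); vals.append(c)
A = csr_matrix((vals, (rows, cols)), shape=(out_w, in_w))

img = np.random.randint(0, 256, size=(h, in_w)).astype(np.float32)
shrunk = (A @ img.T).T                  # shape (h, out_w); width resampled
print(shrunk.shape)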
The dimension of the image is 64 x 128. That is 8192 pixels, each with a gradient magnitude and direction. After the binning stage, we are left with 1152 values, since each 8x8 cell of 64 pixels is converted into 9 bins based on orientation. Can you please explain how, after L2 normalization, we end up with a 3780-dimensional vector?
Assumption: You have the gradients of the 64 x 128 patch.
Calculate Histogram of Gradients in 8x8 cells
This is where it starts to get interesting. The image is divided into 8x8 cells and a HOG is calculated for each 8x8 cells. One of the reasons why we use 8x8 cells is that it provides a compact representation. An 8x8 image patch contains 8x8x3 = 192 pixel values (color image). The gradient of this patch contains 2 values (magnitude and direction) per pixel which adds up to 8x8x2 = 128 values. These 128 numbers are represented using a 9-bin histogram which can be stored as an array of 9 numbers. This makes it more compact and calculating histograms over a patch makes this representation more robust to noise.
The histogram is essentially a vector of 9 bins corresponding to angles 0, 20, 40, 60 ... 180 corresponding to unsigned gradients.
16 x 16 Block Normalization
After creating the histogram based on the gradients of the image, we want our descriptor to be independent of lighting variations, so we normalize the histogram. The vector norm of an RGB colour [128, 64, 32] is sqrt(128*128 + 64*64 + 32*32) = 146.64, the familiar L2 norm. Dividing each element of this vector by 146.64 gives the normalized vector [0.87, 0.43, 0.22]. If we were to multiply each element of the original vector by 2, the normalized vector would remain the same, which is exactly the lighting invariance we want.
Although simply normalizing the 9x1 histogram is tempting, normalizing a bigger 16 x 16 block works better. A 16 x 16 block contains 4 histograms, which can be concatenated to form a 36 x 1 vector and normalized the same way as the 3 x 1 vector in the example. The window is then moved by 8 pixels, a normalized 36 x 1 vector is calculated over the new window, and the process is repeated (see the animation).
Calculate the HOG feature vector
This is where your question comes in.
To calculate the final feature vector for the entire image patch, the 36 x 1 vectors are concatenated into one giant vector. Let us calculate the size:
How many positions of the 16 x 16 block are there? There are 7 horizontal and 15 vertical positions, which gives 7 x 15 = 105 positions.
Each 16 x 16 block is represented by a 36 x 1 vector. So when we concatenate them all into one giant vector, we obtain a 36 x 105 = 3780-dimensional vector.
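If you want to verify the count programmatically, scikit-image's HOG implementation (used here as a convenient stand-in, not necessarily what the tutorial uses) reproduces it:

import numpy as np
from skimage.feature import hog

# A dummy 128x64 (rows x cols) patch, i.e. the standard 64x128 detection window.
patch = np.random.rand(128, 64)

features = hog(
    patch,
    orientations=9,          # 9-bin histograms
    pixels_per_cell=(8, 8),  # 8x8 cells
    cells_per_block=(2, 2),  # 16x16 blocks (2x2 cells), stepped by one cell
    block_norm='L2',
)
print(features.shape)        # (3780,) = 15 * 7 * 36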
For more details, have a look at the tutorial I learned from.
Hope it helps!
I don't understand part of this (quora: How does the last layer of a ConvNet connects to the first fully connected layer):
Make an one hot representation of feature maps. So we would have 64 * 7 * 7 = 3136 input features which is again processed by a 3136 neurons reducing it to 1024 features. The matrix multiplication this layer would be (1x3136) * (3136x1024) => 1x1024
I mean, what is the process to reduce 3136 inputs using 3136 neurons to 1024 features?
I'll explain it in layman's terms, as I understand it.
"One hot representation of feature maps" here really just means flattening the 64 feature maps of size 7 * 7 into a single row of numbers that the machine can process with matrix algebra: 64 * 7 * 7 = 3136 input values representing your image.
The computation is then a multiplication of that matrix of 1 row and 3136 columns by a weight matrix of 3136 rows and 1024 columns. When you multiply these two matrices, the resulting matrix has 1 row and 1024 columns. This is the reduced 1024-feature representation of your image.
Hope I got your question right.
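A tiny NumPy sketch of that flatten-and-multiply step (random numbers stand in for the real feature maps and learned weights):

import numpy as np

feature_maps = np.random.rand(64, 7, 7)   # output of the last conv layer
x = feature_maps.reshape(1, -1)           # flatten: shape (1, 3136)

W = np.random.rand(3136, 1024)            # fully connected weight matrix
b = np.random.rand(1024)                  # bias, one per output neuron

out = x @ W + b                           # (1, 3136) @ (3136, 1024) -> (1, 1024)
print(out.shape)                          # (1, 1024)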
You need to understand matrix multiplication. (1x3136) * (3136x1024) is an example of matrix multiplication: the first factor's number of columns (3136) must equal the second factor's number of rows (3136). The result is (1x1024) because the first factor's row count becomes the result's row count, while the second factor's column count becomes the result's column count.
Also, check this :
https://www.khanacademy.org/math/precalculus/precalc-matrices/multiplying-matrices-by-matrices/v/multiplying-a-matrix-by-a-matrix
We have proposed and simulated an LSB insertion method that makes 10% fewer bit changes than regular LSB. However, to our surprise, the PSNR value of the proposed method is very close to that of regular LSB (around a ~1% increase).
It is really confusing, as our proposed method makes fewer LSB changes but the PSNR value is still about the same. We would appreciate any help.
We are using Matlab and testing on an RGB image (embedding in the LSB of all three RGB channels).
I think your results are not unexpected and that you're probably overestimating the effect because PSNR is a logarithmic function. I will plug in some numbers to show you that effect, but I will make some assumptions about details you haven't mentioned.
I assume that you embed only 1 bit per pixel and that is in the least significant bit and not in the 2nd, 3rd or any other.
What this means is that if a bit flips, the square error between the original and the modified pixel is 1. If we assume a stream of random bits, we expect to change half of the pixels where we embed our information. All of this can be encapsulated in the following two functions.
mse = @(bits, x, y) (0.5 * bits) / (x * y)
psnr = @(signal, noise) 10 * log10(signal^2 / noise)
Where bits is the total number of embedding bits, x and y the size of the image, signal is the maximum intensity value in the original image (I used 255) and noise is the mse.
Let's assume a 512x512 image. For simplicity, let's also calculate the PSNR for only one colour component.
5,000 bits: 68.3368 dB
4,500 bits: 68.7944 dB
PSNR increase: 0.6696%

100,000 bits: 55.3265 dB
90,000 bits: 55.7841 dB
PSNR increase: 0.8271%
In fact, you can derive a general symbolic form.
mse_orig = (bits / 2) / (x * y) = bits / (2 * x * y)
mse_steg = (0.9 * bits / 2) / (x * y) = 0.9 * mse_orig
psnr = 10 * log10(signal^2 / mse)
percent_change = 100 * (psnr_steg - psnr_orig) / psnr_orig
By using the property log(a / b) = log(a) - log(b), the above can be simplified to:
percent_change = 100 * (-log10(0.9) / log10(2* x * y * signal^2 / bits))
Where bits is the number of bits you embed with regular lsb embedding (i.e., 5,000 and NOT 4,500).
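Here is a short Python translation of the same calculation, reproducing both the numeric examples and the closed form above (it assumes the same 512x512 single-channel setup with a peak signal of 255):

import math

def mse(bits, x, y):
    # Half of the embedded bits are expected to flip a pixel's LSB by 1.
    return (0.5 * bits) / (x * y)

def psnr(signal, noise):
    return 10 * math.log10(signal**2 / noise)

x = y = 512
signal = 255

for bits in (5_000, 100_000):
    p_orig = psnr(signal, mse(bits, x, y))          # regular LSB
    p_steg = psnr(signal, mse(0.9 * bits, x, y))    # 10% fewer changes
    closed = 100 * (-math.log10(0.9) / math.log10(2 * x * y * signal**2 / bits))
    print(f"{bits} bits: {p_orig:.4f} dB -> {p_steg:.4f} dB "
          f"(+{100 * (p_steg - p_orig) / p_orig:.4f}%, closed form {closed:.4f}%)")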
Suppose that a given 3-bit image (L = 8) of size 64 * 64 pixels (M * N = 4096) has the intensity distribution shown below. How do we obtain the histogram equalization transformation function
and then compute the equalized histogram of the image?
Rk nk
0 800
1 520
2 970
3 660
4 330
5 450
6 260
7 106
"Histogram Equalization is the process of obtaining transformation function automatically. So you need not have to worry about shape and nature of transformation function"
So in Histogram equalization, transformation function is calculated using cumulative frequency approach and this process is automatic. From the histogram of the image, we determine the cumulative histogram, c, rescaling the values as we go so that they occupy an 8-bit range. In this way, c becomes a look-up table that can be subsequently applied to the image in order to carry out equalization.
rk   nk    c      sk = c/MN   (L-1)sk   rounded value
0    800   800    0.195       1.365     1
1    520   1320   0.322       2.254     2
2    970   2290   0.559       3.913     4
3    660   2950   0.720       5.04      5
4    330   3280   0.801       5.607     6
5    450   3730   0.911       6.377     6
6    260   3990   0.974       6.818     7
7    106   4096   1.000       7.0       7
The equalized histogram is therefore
rk nk
0 0
1 800
2 520
3 0
4 970
5 660
6 330 + 450 = 780
7 260 + 106 = 366
The algorithm for equalization can be given as
Compute a scaling factor, α = 255 / number of pixels (in general, α = (L-1) / number of pixels; 255 corresponds to an 8-bit image)
Calculate histogram of the image
Create a look up table c with
c[0] = α * histogram[0]
for all remaining grey levels, i, do
c[i] = c[i-1] + α * histogram[i]
end for
for all pixel coordinates, x and y, do
g(x, y) = c[f(x, y)]
end for
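Here is a minimal NumPy sketch of the same procedure, using the histogram from the question, which reproduces the mapping and the equalized histogram above:

import numpy as np

# Histogram of the 3-bit, 64x64 image from the question.
L = 8                      # number of grey levels (3-bit image)
MN = 64 * 64               # total number of pixels
nk = np.array([800, 520, 970, 660, 330, 450, 260, 106])

c = np.cumsum(nk)          # cumulative histogram
sk = c / MN                # rescaled cumulative histogram (CDF)
mapping = np.round((L - 1) * sk).astype(int)   # look-up table rk -> sk
print(mapping)             # [1 2 4 5 6 6 7 7]

# Equalized histogram: accumulate counts at the mapped output levels.
equalized = np.bincount(mapping, weights=nk, minlength=L).astype(int)
print(equalized)           # [  0 800 520   0 970 660 780 366]

# To equalize an actual image f, apply the look-up table: g = mapping[f]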
But there is a problem with histogram equalization, mainly because it is a completely automatic technique with no parameters to set. At times it can improve our ability to interpret an image dramatically. However, it is difficult to predict how beneficial equalization will be for any given image; in fact, it may not be of any use at all. This is because the improvement in contrast is optimal statistically, rather than perceptually. In images with narrow histograms and relatively few grey levels, a massive increase in contrast due to histogram equalisation can have the adverse effect of reducing perceived image quality. In particular, sampling or quantisation artefacts and image noise may become more prominent.
The alternative to obtaining the transformation (mapping) function automatically is Histogram Specification. In histogram specification instead of requiring a flat histogram, we specify a particular shape explicitly. We might wish to do this in cases where it is desirable for a set of related images to have the same histogram- in order, perhaps, that a particular operation produces the same results for all images.
Histogram specification can be visualised as a two-stage process. First, we transform the input image by equalisation into a temporary image with a flat histogram. Then we transform this equalised, temporary image into an output image possessing the desired histogram. The mapping function for the second stage is easily obtained. Since a rescaled version of the cumulative histogram can be used to transform a histogram of any shape into a flat histogram, it follows that the inverse of the cumulative histogram will perform the inverse transformation from a flat histogram to one with a specified shape.
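As an illustration of that two-stage idea, here is a hypothetical NumPy sketch for discrete grey levels; the function name and the searchsorted-based inversion of the target CDF are my own choices, not taken from the text above:

import numpy as np

def specify_histogram(source_hist, target_hist, L=8):
    # Stage 1: rescaled cumulative histogram (CDF) of the source image.
    cdf_src = np.cumsum(source_hist) / np.sum(source_hist)
    # Stage 2: invert the target's CDF: for each source level, pick the
    # smallest target level whose CDF is at least as large.
    cdf_tgt = np.cumsum(target_hist) / np.sum(target_hist)
    mapping = np.searchsorted(cdf_tgt, cdf_src)
    return np.clip(mapping, 0, L - 1)   # look-up table: source level -> output level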
For more details about histogram equalization and mapping functions, with C and C++ code, see:
https://programming-technique.blogspot.com/2013/01/histogram-equalization-using-c-image.html