What is being normalized by Keras/TensorFlow BatchNormalization - machine-learning

My question is what is being normalized by BatchNormalization (BN).
I am asking: does BN normalize the channels for each pixel separately, or for all the pixels together? And does it do this on a per-image basis, or across the entire batch?
Specifically, BN is operating on X. Say X.shape = [m,h,w,c]. So with axis=3, it is operating on the "c" dimension, which is the number of channels (for RGB) or the number of feature maps.
So let's say X is an RGB image and thus has 3 channels. Does BN do the following? (This is a simplified version of BN to discuss the dimensional aspects; I understand that gamma and beta are learned, but I am not concerned with that here.)
For each image X in m:
    For each pixel (h, w), take the mean of the associated r, g & b values.
    For each pixel (h, w), take the variance of the associated r, g & b values.
    Do r = (r - mean)/var, g = (g - mean)/var & b = (b - mean)/var, where r, g & b are the red, green & blue channels of X respectively.
Then repeat this process for the next image in m.
In Keras, the docs for BatchNormalization say:
axis: Integer, the axis that should be normalized (typically the features axis).
For instance, after a Conv2D layer with data_format="channels_first",
set axis=1 in BatchNormalization.
But what is it exactly doing along each dimension?

First up, there are several ways to apply batch normalization, which are even mentioned in the original paper specifically for convolutional neural networks. See the discussion in this question, which outlines the difference between a usual and convolutional BN, and also the reason why both approaches make sense.
In particular, keras.layers.BatchNormalization implements the convolutional BN, which means that for an input of shape [m,h,w,c] it computes c means and standard deviations, each across m*h*w values. The shapes of the running mean, running standard deviation, gamma and beta variables are all just (c,). The statistics are shared across the spatial dimensions (pixels) as well as across the batch.
So a more accurate algorithm would be: for each of the R, G and B channels, compute the mean/variance across all pixels and all images in the batch for that channel, and apply the normalization.
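For the dimensional question specifically, here is a minimal NumPy sketch (not the actual Keras implementation, and ignoring the learned gamma/beta and the running averages) of what this per-channel normalization amounts to for a channels-last input [m,h,w,c]:

import numpy as np

x = np.random.rand(8, 32, 32, 3)           # m=8 images, 32x32 pixels, c=3 channels

mean = x.mean(axis=(0, 1, 2))              # one mean per channel, shape (3,)
var = x.var(axis=(0, 1, 2))                # one variance per channel, shape (3,)

x_hat = (x - mean) / np.sqrt(var + 1e-3)   # broadcasts over batch and spatial dims

print(mean.shape, var.shape, x_hat.shape)  # (3,) (3,) (8, 32, 32, 3)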

Related

Understanding convolutional layers shapes

I've been reading about convolutional nets and I've programmed a few models myself. When I see visual diagrams of other models, they show each layer being smaller and deeper than the last one. Layers have three dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
TLDR; 256x256x32 refers to the layer's output shape rather than the layer itself.
There are many articles and posts out there explaining how convolution layers work. I'll try to answer your question without going into too many details, just focusing on shapes.
Assuming you are working with 2D convolution layers, your input and output will both be three-dimensional, that is, without considering the batch, which would correspond to a 4th axis. Therefore, the shape of a convolution layer input will be (c, h, w) (or (h, w, c) depending on the framework), where c is the number of channels, h the height of the input and w its width. You can see it as a c-channel hxw image.
The most intuitive example of such input is the input of the first convolution layer of your convolutional neural network: most likely an image of size hxw with c channels for example c=1 for greyscale or c=3 for RGB...
What's important is that for all pixels of that input, the values on each channel give additional information about that pixel. Having three channels gives each pixel ('pixel' as in a position in the 2D input space) richer content than having a single one, since each pixel is encoded with three values (three channels) instead of one. This kind of intuition about what channels represent can be extrapolated to a higher number of channels. As we said, an input can have c channels.
Now, going back to convolution layers, here is a good visualization. Imagine having a 5x5 single-channel input and a convolution layer consisting of a single 3x3 filter (i.e. kernel_size=3):
               input        filter     output
shape          (1, 5, 5)    (3, 3)     (3, 3)
(The original table also contained a "convolution" column and a "representation" row with illustrations of the input, the filter, the sliding convolution and the output.)
Now keep in mind that the dimensions of the output depend on the stride and padding of the convolution layer. Here the shape of the output happens to be the same as the shape of the filter, but it does not have to be! Take an input of shape (1, 6, 6) with the same convolution settings: you would end up with an output of shape (4, 4), which is different from the filter shape (3, 3).
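If it helps, here is a small sketch of the usual output-size formula (my own addition, assuming the standard stride/padding conventions, not something stated in the original answer):

def conv_output_size(in_size, kernel, stride=1, padding=0):
    # Spatial output size of a 2D convolution along one dimension.
    return (in_size + 2 * padding - kernel) // stride + 1

print(conv_output_size(5, 3))  # -> 3, the 5x5 example above
print(conv_output_size(6, 3))  # -> 4, the 6x6 example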
Also, something to note is that if the input has more than one channel, say shape (c, h, w), the filter must have the same number of channels. Each channel of the input convolves with the corresponding channel of the filter and the results are summed into a single 2D feature map. So you would have an intermediate result of shape (c, 3, 3) which, after summing over the channels, leaves you with (1, 3, 3), i.e. (3, 3). As a result, for a convolution with a single filter, however many input channels there are, the output will always have a single channel.
From there, what you can do is assemble multiple filters in the same layer. This means you define your layer as having k 3x3 filters, so the layer consists of k filters. For the computation of the output the idea is simple: one filter gives a (3, 3) feature map, so k filters give k (3, 3) feature maps. These maps are then stacked along what will be the channel dimension. Ultimately, you're left with an output shape of (k, 3, 3).
Let k_h and k_w be the kernel height and kernel width respectively, and h', w' the height and width of one output feature map:
              input                        layer                               output
shape         (c, h, w)                    (k, c, k_h, k_w)                    (k, h', w')
description   c-channel hxw feature map    k filters of shape (c, k_h, k_w)    k-channel h'xw' feature map
Back to your question:
Layers have 3 dimensions like 256x256x32. What is this third number? I assume the first two numbers are the number of nodes but I don't know what the depth is.
Convolution layers have four dimensions, but one of them is imposed by your input channel count. You can choose the size of your convolution kernel and the number of filters; this number of filters determines the number of channels of the output.
256x256 seems extremely high and most likely corresponds to the spatial shape of the output feature map. On the other hand, 32 would be the number of channels of the output, which, as I tried to explain, is the number of filters in that layer. Generally speaking, the dimensions shown in visual diagrams of convolutional networks correspond to the intermediate output shapes, not to the layer shapes.
As an example, take the VGG neural network:
Very Deep Convolutional Networks for Large-Scale Image Recognition
The input shape for VGG is (3, 224, 224); knowing that the result of the first convolution has shape (64, 224, 224), you can determine that there is a total of 64 filters in that layer.
As it turns out the kernel size in VGG is 3x3. So, here is a question for you: knowing there is a single bias parameter per filter, how many total parameters are in VGG's first convolution layer?
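If you want to check your arithmetic afterwards, here is a quick sketch using the standard parameter count (one set of (c, k_h, k_w) weights plus one bias per filter; this is my own addition, not part of the original answer):

in_channels, out_channels, k = 3, 64, 3
weights = out_channels * in_channels * k * k   # 1728
biases = out_channels                          # 64
print(weights + biases)                        # -> 1792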
Sorry for the short answer, but when you have a digital image, you have 2 spatial dimensions and then often a third one for the colors. The convolutional filters look at parts of the picture with smaller height/width but many more depth channels (in your case 32) to extract more information, which is then fed into the rest of the network to learn from.
I created the example in PyTorch to demonstrate the output you had:
import torch
import torch.nn as nn

bs = 16
x = torch.randn(bs, 3, 256, 256)                          # batch of 16 RGB 256x256 inputs (NCHW)
c = nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2)  # 32 filters; padding=2 keeps 256x256
out = c(x)
print(out.shape, out.shape[1])
Out:
torch.Size([16, 32, 256, 256]) 32
The output is a real tensor you can inspect; it may help.
You can play with a lot of convolution parameters.

What is a mathematical relation of diameter and sigma arguments in bilateral filter function?

While learning an image denoising technique based on the bilateral filter, I encountered this tutorial, which provides the full list of arguments used to run OpenCV's bilateralFilter function. What I see is slightly confusing, because there is no explanation of a mathematical rule for altering the diameter value by manipulating both sigma arguments. So, when picking specific arguments to pass into that function, I can hardly tell which diameter corresponds to a particular pair of sigma values.
Does there exist a dependency between the two deviations and the diameter? If my inference is correct, what equation (perhaps one introduced in the OpenCV documentation) should be referred to when applying the bilateral filter in a program?
According to the documentation, the bilateralFilter function in OpenCV takes a parameter d, the neighborhood diameter, as well as a parameter sigmaSpace, the spatial sigma. They can be selected separately, but if d "is non-positive, it is computed from sigmaSpace." For more details we need to look at the source code:
if( d <= 0 )
    radius = cvRound(sigma_space*1.5);
else
    radius = d/2;
radius = MAX(radius, 1);
d = radius*2 + 1;
That is, if d is not positive, then it is computed as roughly 3 times sigmaSpace (twice a radius of 1.5*sigmaSpace, plus one). d is also always forced to be odd, so that there is a central pixel in the neighborhood.
Note that the other sigma, sigmaColor, is unrelated to the spatial size of the filter.
In general, if one chooses a sigmaSpace that is too large for the given d, then the Gaussian kernel will be cut off in a way that makes it not appear like a Gaussian, and it will lose its nice filtering properties (see for example here for an explanation). If it is taken too small for the given d, then many pixels in the neighborhood will always have a near-zero weight, meaning that computational work is wasted. The default value is rather small (one typically uses a radius of 3 times sigma for Gaussian filtering), but it is still quite reasonable given the computational cost of the bilateral filter (a smaller neighborhood is cheaper).
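A small Python re-statement of the C++ logic above may help in predicting what diameter the function will actually use (a sketch, not OpenCV itself; note that cvRound and Python's round may differ on exact .5 ties):

def effective_diameter(d, sigma_space):
    radius = round(sigma_space * 1.5) if d <= 0 else d // 2
    radius = max(radius, 1)
    return radius * 2 + 1

print(effective_diameter(0, 4))   # -> 13, roughly 3 * sigmaSpace
print(effective_diameter(8, 4))   # -> 9, d is only forced to be odd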
These two values (d and sigma) are totally unrelated to each other: sigma determines the values of the pixels of the kernel, while d determines the size of the kernel.
For example consider this Gaussian filter with sigma=1:
It's a filter kernel, and as you can see the pixel values of the kernel only depend on sigma (the 3*3 matrix in the middle is equal in both kernels), but reducing the size of the kernel (or reducing the diameter) will make the outer pixels ineffective without affecting the values of the middle pixels.
And now if you change the sigma (with k=3), the kernel is still 3*3, but the pixel values will be different.
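To make that concrete, here is a small NumPy sketch (my own illustration, not from the original answer) that builds a normalized Gaussian kernel of a given size and sigma; the values are governed by sigma alone, while the size only decides how much of the Gaussian you keep:

import numpy as np

def gaussian_kernel(size, sigma):
    r = size // 2
    y, x = np.mgrid[-r:r + 1, -r:r + 1]
    k = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return k / k.sum()

k3 = gaussian_kernel(3, 1.0)
k5 = gaussian_kernel(5, 1.0)
# The central 3x3 of k5 matches k3 up to renormalization (same sigma, larger extent).
print(np.allclose(k3, k5[1:4, 1:4] / k5[1:4, 1:4].sum()))  # True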

superpixels extracted via energy-driven sampling (SEEDS)

I am interested in superpixels extracted via energy-driven sampling (SEEDS), which is a method of image segmentation using superpixels. This is also what OpenCV uses to create superpixels. I am having trouble finding documentation on the SEEDS algorithm. OpenCV gives a very general description, which can be found here.
I am looking for a more in-depth description of how SEEDS functions (either a general walk-through or a mathematical explanation). Any links or thoughts concerning the algorithm would be much appreciated! I can't seem to find any good material. Thanks!
I will first go through some general links and resources and then try to describe the general idea of the algorithm.
SEEDS implementations:
You obviously already saw the documentation here. A usage example for OpenCV's SEEDS implementation can be found here: Itseez/opencv_contrib/modules/ximgproc/samples/seeds.cpp; it lets you adapt the number of superpixels, the number of levels and other parameters live, so after reading up on the idea behind SEEDS you should definitely try the example. The original implementation, as well as a revised implementation (part of my bachelor thesis), can be found on GitHub: davidstutz/superpixels-revisited/lib_seeds and davidstutz/seeds-revised. The implementations should be pretty comparable, though.
Publication and other resources:
The paper was released on arXiv: arxiv.org/abs/1309.3848. A somewhat shorter description (which may be easier to follow) is available on my website: davidstutz.de/efficient-high-quality-superpixels-seeds-revised. The provided algorithm description should be easy to follow and, in the best case, allow you to implement SEEDS (see the "Algorithm" section of the article). A more precise description can also be found in my bachelor thesis, in particular in section 3.1.
General description:
Note that this description is based on both the above mentioned article and my bachelor thesis. Both offer a mathematically concise description.
Given an image with width W and height H, SEEDS starts by grouping pixels into blocks of size w x h. These blocks are further arranged into groups of 2 x 2. This scheme is repeated for L levels (this is the number of levels parameter). So at level l, you have blocks of size
w*2^(l - 1) x h*2^(l - 1).
The number of superpixels is determined by the blocks at level L, i.e. letting w_L and h_L denote the width and height of the blocks at level L, the number of superpixels is
S = W/w_L * H/h_L
where we use integer divisions.
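A quick numeric check of these formulas (the numbers here are made up for illustration):

W, H = 480, 320          # image size
w, h, L = 2, 2, 4        # base block size and number of levels
w_L = w * 2 ** (L - 1)   # 16
h_L = h * 2 ** (L - 1)   # 16
S = (W // w_L) * (H // h_L)
print(w_L, h_L, S)       # 16 16 600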
The blocks at level L give the initial superpixel segmentation, which is now iteratively refined by exchanging blocks of pixels and individual pixels between neighboring superpixels. To this end, color histograms of the superpixels and of all blocks are computed (the histograms are determined by the number of bins parameter in the implementation). This can be done efficiently by noting that the histogram of a superpixel is just the sum of the histograms of the 2 x 2 blocks it consists of, and the histogram of one of these blocks is the sum of the histograms of the 2 x 2 underlying blocks (and so on).
So let h_i be the histogram of a block of pixels belonging to superpixel j, and h_j the histogram of this superpixel. Then, the similarity of block i to superpixel j is computed by the histogram intersection of h_i and h_j (see one of the above resources for the equation). Similarly, the similarity of a pixel and a superpixel is either the Euclidean distance of the pixel color to the superpixel mean color (this is the better performing option), or the probability of the pixel's color belonging to the superpixel (which is simply the normalized entry of the superpixel's histogram at the pixel's color). With this background, the algorithm can be summarized as follows:
initialize the block hierarchy and the initial superpixel segmentation

for l = L - 1 to 1                       // go through all levels
    // for level l = L these are the initial superpixels
    for each block in level l
        initialize the color histogram of this block
        // as described, this is done using the histograms of the level below

// now we start exchanging blocks between superpixels
for l = L - 1 to 1
    for each block at level l
        if the block lies at the border of a superpixel it does not belong to
            compute the histogram intersection with both superpixels
            assign the block to the superpixel with the highest intersection

// now we exchange individual pixels between superpixels
for all pixels
    if the pixel lies at the border of a superpixel it does not belong to
        compute the Euclidean distance of the pixel to both superpixels' mean colors
        assign the pixel to the closest superpixel
In practice, the block updates and pixel updates are iterated more than once (this is the number of iterations parameter), and often twice as many iterations per level are done (this is the double step parameter). In the original implementation, the number of superpixels is computed from w, h, L and the image size. In OpenCV, using the above equations, w and h are computed from the desired number of superpixels and the number of levels (which are determined by the corresponding parameters).
One parameter remains unclear: the prior, which tries to enforce smooth boundaries. In practice this is done by considering the 3 x 3 neighborhood around a pixel which is about to be updated. If most of the pixels in this neighborhood belong to superpixel j, the pixel to be updated is also more likely to belong to superpixel j (and vice versa). OpenCV's implementation, as well as my implementation (SEEDS Revised), allows considering larger k x k neighborhoods, with k in {0,...,5} in the case of OpenCV.
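To tie these parameters back to OpenCV, here is a minimal usage sketch (assuming the ximgproc contrib module is available; double-check the argument order against the documentation of your OpenCV version):

import cv2

img = cv2.imread("image.jpg")   # any BGR image
h, w, c = img.shape

seeds = cv2.ximgproc.createSuperpixelSEEDS(
    w, h, c,   # image width, height, channels
    400,       # desired number of superpixels
    4,         # number of block levels L
    2,         # smoothing prior (0 disables it, larger values give smoother boundaries)
    5,         # number of histogram bins
    False)     # double step: twice as many iterations per level if True

seeds.iterate(img, 10)                  # number of block/pixel update iterations
labels = seeds.getLabels()              # per-pixel superpixel index
mask = seeds.getLabelContourMask()      # superpixel boundaries, e.g. for visualization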

normalization in image processing

What is the correct meaning of normalization in image processing? I googled it but found different definitions. I'll try to explain each definition in detail.
Normalization of a kernel matrix
If normalization refers to a matrix (such as a kernel matrix for a convolution filter), usually each value of the matrix is divided by the sum of the values of the matrix, so that the sum of the values equals one (if all values are greater than zero). This is useful because a convolution between an image matrix and such a kernel matrix gives an output image with values between 0 and the max value of the original image. But if we use a Sobel kernel (which has some negative values) this is not true anymore, and we have to stretch the output image in order to have all values between 0 and the max value.
Normalization of an image
I basically found two definitions of normalization. The first one is to "cut" values that are too high or too low: i.e. if the image matrix has negative values one sets them to zero, and if the image matrix has values higher than the max value one sets them to the max value. The second one is to linearly stretch all the values in order to fit them into the interval [0, max value].
I will extend the answer from @metsburg a bit. There are several ways of normalizing an image (in general, a data vector), which are used at convenience for different cases:
Data normalization or data (re-)scaling: the data is projected into a predefined range (usually [0, 1] or [-1, 1]). This is useful when you have data from different formats (or datasets) and you want to normalize all of them so you can apply the same algorithms over them. It is usually performed as follows:
Inew = (I - I.min) * (newmax - newmin)/(I.max - I.min) + newmin
Data standardization is another way of normalizing the data (used a lot in machine learning), where the mean is subtracted from the image and the result is divided by its standard deviation. It is especially useful if you are going to use the image as an input for some machine learning algorithm, as many of them assume the features to have a Gaussian form with mean=0, std=1 and perform better on such input. It can be performed easily as:
Inew = (I - I.mean) / I.std
Data stretching (or histogram stretching when you work with images) is what your option 2 refers to. Usually the image is clamped to minimum and maximum values, setting:
Inew = I
Inew[I < a] = a
Inew[I > b] = b
Here, image values lower than a are set to a, and the same happens inversely with b. Usually, the values of a and b are calculated as percentile thresholds: a = the threshold that separates the bottom 1% of the data and b = the threshold that separates the top 1% of the data. By doing this, you remove outliers (noise) from the image.
This is similar (simpler) to histogram equalization, which is another commonly used preprocessing step.
Data normalization can also refer to normalizing a vector with respect to a norm (the l1 norm or the l2/Euclidean norm). This, in practice, translates to:
Inew = I / ||I||
where ||I|| refers to a norm of I.
If the norm is chosen to be the l1 norm, the image is divided by the sum of its absolute values, making the sum of the whole image equal to 1. If the norm is chosen to be the l2 (Euclidean) norm, the image is divided by the square root of the sum of its squared values, making the sum of the squared values of the result equal to 1.
The first 3 are widely used with images (not all 3 together, as scaling and standardization are incompatible, but one of them, or scaling + stretching, or standardization + stretching); the last one is not that useful. It is usually applied as a preprocessing step for some statistical tools, but not if you plan to work with a single image.
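A compact NumPy sketch of the four variants above, applied to a float image I (my own illustration; names follow the formulas in this answer):

import numpy as np

I = np.random.rand(64, 64) * 255

# 1) Rescaling to [newmin, newmax]
newmin, newmax = 0.0, 1.0
I_scaled = (I - I.min()) * (newmax - newmin) / (I.max() - I.min()) + newmin

# 2) Standardization: zero mean, unit standard deviation
I_std = (I - I.mean()) / I.std()

# 3) Stretching/clipping: clamp the bottom and top 1%
a, b = np.percentile(I, 1), np.percentile(I, 99)
I_clipped = np.clip(I, a, b)

# 4) Norm-based normalization (here the l2/Euclidean norm)
I_l2 = I / np.linalg.norm(I)

print(I_scaled.min(), I_scaled.max())                  # 0.0 1.0
print(round(I_std.mean(), 6), round(I_std.std(), 6))   # ~0.0 ~1.0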
The answer by @Imanol is great, I just want to add some examples:
Normalize the input either pixel wise or dataset wise. Three normalization schemes are often seen:
Normalizing the pixel values between 0 and 1:
img /= 255.0
Normalizing the pixel values between -1 and 1 (as Tensorflow does):
img /= 127.5
img -= 1.0
Normalizing according to the dataset mean & standard deviation (as Torch does):
img /= 255.0
mean = [0.485, 0.456, 0.406]  # ImageNet statistics
std = [0.229, 0.224, 0.225]
for i in range(3):  # assuming a CHW ordering (channel, height, width)
    img[i, :, :] -= mean[i]
    img[i, :, :] /= std[i]
In data science, there are two broadly used types of normalization:
1) Scaling the data so that its sum is a particular value, usually 1 (https://stats.stackexchange.com/questions/62353/what-does-it-mean-to-use-a-normalizing-factor-to-sum-to-unity)
2) Normalizing data to fit it within a certain range (usually 0 to 1): https://stats.stackexchange.com/questions/70801/how-to-normalize-data-to-0-1-range

Geometric representation of Perceptrons (Artificial neural networks)

I am taking this course on Neural networks in Coursera by Geoffrey Hinton (not current).
I have a very basic doubt on weight spaces.
https://d396qusza40orc.cloudfront.net/neuralnets/lecture_slides%2Flec2.pdf
Page 18.
If I have a weight vector (bias is 0) of [w1=1, w2=2] and training cases {1,2,-1} and {2,1,1},
where I guess {1,2} and {2,1} are the input vectors, how can this be represented geometrically?
I am unable to visualize it. Why does a training case give a plane which divides the weight space into two? Could somebody explain this in a 3-dimensional coordinate system?
The following is the text from the ppt:
1. Weight-space has one dimension per weight.
2. A point in the space corresponds to a particular setting of all the weights.
3. Assuming that we have eliminated the threshold, each training case can be represented as a hyperplane through the origin.
My doubt is in the third point above. Kindly help me understand.
It's probably easier to explain if you look deeper into the math. Basically, a single layer of a neural net performs some function on your input vector, transforming it into a different vector space.
You don't want to jump right into thinking of this in 3 dimensions. Start smaller: it's easy to make diagrams in 1-2 dimensions, and nearly impossible to draw anything worthwhile in 3 dimensions (unless you're a brilliant artist), and being able to sketch this stuff out is invaluable.
Let's take the simplest case, where you're taking in an input vector of length 2 and have a weight vector of dimension 2x1, which implies an output vector of length one (effectively a scalar).
In this case it's pretty easy to imagine that you've got something of the form:
input = [x, y]
weight = [a, b]
output = ax + by
If we assume that weight = [1, 3], we can see, and hopefully intuit that the response of our perceptron will be something like this:
With the behavior being largely unchanged for different values of the weight vector.
It's easy to imagine then, that if you're constraining your output to a binary space, there is a plane, maybe 0.5 units above the one shown above that constitutes your "decision boundary".
As you move into higher dimensions this becomes harder and harder to visualize, but if you imagine that that plane shown isn't merely a 2-d plane, but an n-d plane or a hyperplane, you can imagine that this same process happens.
Since actually creating the hyperplane requires either the input or the output to be fixed, you can think of giving your perceptron a single training value as creating a "fixed" [x,y] value. This can be used to create a hyperplane. Sadly, this cannot effectively be visualized, as 4-d drawings are not really feasible in a browser.
Hope that clears things up, let me know if you have more questions.
I have encountered this question on SO while preparing a large article on linear combinations (it's in Russian, https://habrahabr.ru/post/324736/). It has a section on the weight space and I would like to share some thoughts from it.
Let's take a simple case of linearly separable dataset with two classes, red and green:
The illustration above is in the data space X, where samples are represented by points and the weight coefficients constitute a line. It can be conveyed by the following formula:
w^T * x + b = 0
But we can rewrite it the other way around, making the x component a vector of coefficients and w a vector variable:
x^T * w + b = 0
because dot product is symmetrical. Now it could be visualized in the weight space the following way:
where red and green lines are the samples and blue point is the weight.
Moreover, the possible weights are limited to the area below (shown in magenta):
which could be visualized in dataspace X as:
Hope it clarifies dataspace/weightspace correlation a bit. Feel free to ask questions, will be glad to explain in more detail.
The "decision boundary" for a single layer perceptron is a plane (hyper plane)
where n in the image is the weight vector w, in your case w={w1=1,w2=2}=(1,2) and the direction specifies which side is the right side. n is orthogonal (90 degrees) to the plane)
A plane always splits a space into 2 naturally (extend the plane to infinity in each direction)
You can also try to input different values into the perceptron and find where the response is zero (only on the decision boundary).
Recommend you read up on linear algebra to understand it better:
https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces
For a perceptron with 1 input & 1 output layer, there can only be 1 LINEAR hyperplane. And since there is no bias, the hyperplane won't be able to shift along an axis, so it will always pass through the origin (all such hyperplanes share that point). However, if there is a bias, they may no longer share a common point.
I think the reason why a training case can be represented as a hyperplane is because...
Let's say
[j,k] is the weight vector and
[m,n] is the training-input
training-output = jm + kn
Given that a training case in this perspective is fixed and the weights vary, the training input (m, n) becomes the coefficients and the weights (j, k) become the variables.
Just as in any text book where z = ax + by is a plane,
training-output = jm + kn is also a plane defined by training-output, m, and n.
The equation of a plane passing through the origin is written in the form:
ax + by + cz = 0
If a=1, b=2, c=3, the equation of the plane can be written as:
x + 2y + 3z = 0
So, in XYZ space, the equation is x + 2y + 3z = 0.
Now, in the weight space, every dimension represents a weight. So, if the perceptron has 10 weights, the weight space will be 10-dimensional.
The equation of the perceptron is: ax + by + cz <= 0 ==> Class 0
ax + by + cz > 0 ==> Class 1
In this case, a, b & c are the weights and x, y & z are the input features.
In the weight space, a, b & c are the variables (axes).
So, for every training example, e.g. (x, y, z) = (2, 3, 4), a hyperplane would be formed in the weight space, whose equation would be:
2a + 3b + 4c = 0
passing through the origin.
I hope you understand it now.
Consider we have 2 weights. So w = [w1, w2]. Suppose we have input x = [x1, x2] = [1, 2]. If you use the weight to do a prediction, you have z = w1*x1 + w2*x2 and prediction y = z > 0 ? 1 : 0.
Suppose the label for the input x is 1. Thus, we hope y = 1, and thus we want z = w1*x1 + w2*x2 > 0. Considering vector multiplication, z = (w^T)x, so we want (w^T)x > 0. The geometric interpretation of this expression is that the angle between w and x is less than 90 degrees. For example, the green vector is a candidate for w that would give the correct prediction of 1 in this case. In fact, any vector that lies on the same side of the line w1 + 2*w2 = 0 as the green vector would give the correct solution. However, if it lies on the other side, as the red vector does, then it would give the wrong answer.
However, suppose the label is 0. Then the case would just be the reverse.
The above case gives the intuition and illustrates the three points in the lecture slide. The training case x determines the plane, and depending on the label, the weight vector must lie on one particular side of that plane to give the correct answer.
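As a small check of this picture, here is a NumPy sketch (my own illustration, using the numbers from the question) that tests on which side of each training-case hyperplane a candidate weight vector lies:

import numpy as np

def correct_side(w, x, label):
    # A bias-free perceptron predicts +1 if w . x > 0, else -1; the weight vector
    # is correct for a training case exactly when it lies on the right side of
    # the hyperplane w . x = 0 that this case defines in weight space.
    prediction = 1 if np.dot(w, x) > 0 else -1
    return prediction == label

cases = [(np.array([1, 2]), -1), (np.array([2, 1]), 1)]   # inputs and labels from the question
w = np.array([1, 2])                                      # the weight vector from the question

for x, label in cases:
    print(x, label, correct_side(w, x, label))   # the first case fails, the second succeeds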
