BatchNorm2d PyTorch - Why pass the number of channels to batchnorm?

Why do I need to pass the previous layer's number of channels to the batchnorm? The batchnorm should normalize over each datapoint in the batch, so why does it need the number of channels?

Batch normalisation has learnable parameters, because it includes an affine transformation.
From the documentation of nn.BatchNorm2d:
The mean and standard-deviation are calculated per-dimension over the mini-batches and γ and β are learnable parameter vectors of size C (where C is the input size). By default, the elements of γ are set to 1 and the elements of β are set to 0.
Since the norm is calculated per channel, the parameters γ and β are vectors of size num_channels (one element per channel), which results in an individual scale and shift per channel. As with any other learnable parameter in PyTorch, they need to be created with a fixed size, hence you need to specify the number of channels.
batch_norm = nn.BatchNorm2d(10)
# γ
batch_norm.weight.size()
# => torch.Size([10])
# β
batch_norm.bias.size()
# => torch.Size([10])
Note: Setting affine=False creates no learnable parameters, so the number of channels wouldn't strictly be needed, but it is still required in order to keep a consistent interface.
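As a quick sanity check (a minimal sketch, assuming a recent PyTorch version), no γ/β tensors are created when affine=False:
import torch.nn as nn
bn_affine = nn.BatchNorm2d(10)                 # learnable γ and β, one element per channel
bn_plain = nn.BatchNorm2d(10, affine=False)    # no learnable parameters at all
print(bn_affine.weight.shape)                  # torch.Size([10])
print(bn_plain.weight)                         # None -- γ/β are simply not created
print(sum(p.numel() for p in bn_plain.parameters()))   # 0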

Related

Fourier Transform Scaling the magnitude

I have a set of signals S1, S2, ..., SN, for which I am numerically computing the Fourier transforms F1, F2, ..., FN, where the Si's and Fi's are C++ vectors (my computations are in C++).
My Computation objective is as follows:
compute product: F = F1*F2*...*FN
inverse Fourier transform F to get S.
What I numerically observe is that when I try to compute the product, the numbers either become too small (underflow) or too big (overflow).
I thought of scaling F1 by say a1 so that the overflow or underflow is avoided.
With the scaling my step 1 above will become
F' = (F1/a1)*(F2/a2)*...*(FN/aN) = F'1*F'2*...*F'N
And when I inverse transform, the final S' will differ from S only by a scale factor; the structural form of S will not change. By this I mean only the normalization of S is different.
My question is :
Is my rationale correct?
If it is, then given a C++ vector Fi, how can I choose a good scale ai to scale the Fi's up or down?
Many thanks in advance.
You can map the range of the Fi vector onto a new domain, say [a, b], so that the minimum value of Fi becomes a and the maximum becomes b. You can scale the vector using this equation:
Fnew[i] = (b-a)/(max-min) * (F[i] - min) + a,
where min = the minimum value of F and max = the maximum value of F.
The only problem now is choosing a and b. I would choose [1,2], because the values are small (so you won't get overflow easily), but not smaller than 0.
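For illustration, here is a minimal sketch of that rescaling in Python/NumPy (the vector contents and the target range [1, 2] are made up; in C++ the same arithmetic applies element by element):
import numpy as np
def rescale(F, a=1.0, b=2.0):
    """Map the values of F linearly so that min(F) -> a and max(F) -> b."""
    lo, hi = F.min(), F.max()
    return (b - a) / (hi - lo) * (F - lo) + a
F1 = np.array([1e-12, 3e-11, 7e-13, 5e-12])   # made-up magnitudes prone to underflow
F1_scaled = rescale(F1)                        # values now lie in [1, 2]
print(F1_scaled.min(), F1_scaled.max())        # 1.0 2.0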

Batch Normalization in Convolutional Neural Network

I am a newbie in convolutional neural networks and only have an idea of feature maps and how convolution is applied to images to extract features. I would be glad to know some details about applying batch normalisation in a CNN.
I read this paper https://arxiv.org/pdf/1502.03167v3.pdf and could understand the BN algorithm applied to the data, but at the end they mention that a slight modification is required when applying it to a CNN:
For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way. To achieve this, we jointly normalize all the activations in a mini-batch, over all locations. In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations – so for a mini-batch of size m and feature maps of size p × q, we use the effective mini-batch of size m′ = |B| = m · pq. We learn a pair of parameters γ(k) and β(k) per feature map, rather than per activation. Alg. 2 is modified similarly, so that during inference the BN transform applies the same linear transformation to each activation in a given feature map.
I am totally confused when they say
"so that different elements of the same feature map, at different locations, are normalized in the same way"
I know what feature maps mean and different elements are the weights in every feature map. But I could not understand what location or spatial location means.
I could not understand the below sentence at all
"In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations"
I would be glad if someone could elaborate and explain this to me in much simpler terms.
Let's start with the terms. Remember that the output of a convolutional layer is a rank-4 tensor [B, H, W, C], where B is the batch size, (H, W) is the feature map size, and C is the number of channels. An index (x, y) with 0 <= x < H and 0 <= y < W is a spatial location.
Usual batchnorm
Now, here's how the batchnorm is applied in a usual way (in pseudo-code):
# t is the incoming tensor of shape [B, H, W, C]
# mean and stddev are computed along 0 axis and have shape [H, W, C]
mean = mean(t, axis=0)
stddev = stddev(t, axis=0)
for i in 0..B-1:
    out[i,:,:,:] = norm(t[i,:,:,:], mean, stddev)
Basically, it computes H*W*C means and H*W*C standard deviations, each across B elements. You may notice that different elements at different spatial locations have their own mean and variance, each gathered over only B values.
Batchnorm in conv layer
This way is totally possible. But the convolutional layer has a special property: filter weights are shared across the input image (you can read it in detail in this post). That's why it's reasonable to normalize the output in the same way, so that each output value takes the mean and variance of B*H*W values, at different locations.
Here's how the code looks like in this case (again pseudo-code):
# t is still the incoming tensor of shape [B, H, W, C]
# but mean and stddev are computed along (0, 1, 2) axes and have just [C] shape
mean = mean(t, axis=(0, 1, 2))
stddev = stddev(t, axis=(0, 1, 2))
for i in 0..B-1, x in 0..H-1, y in 0..W-1:
    out[i,x,y,:] = norm(t[i,x,y,:], mean, stddev)
In total, there are only C means and standard deviations and each one of them is computed over B*H*W values. That's what they mean when they say "effective mini-batch": the difference between the two is only in axis selection (or equivalently "mini-batch selection").
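To make the difference concrete, here is a small runnable NumPy sketch (the shapes and the random tensor are made up) that computes both sets of statistics and then normalizes per channel, as the convolutional batchnorm does:
import numpy as np
B, H, W, C = 4, 5, 5, 3
t = np.random.randn(B, H, W, C)
# "usual" variant: one statistic per (H, W, C) position, gathered over B values
mean_hwc = t.mean(axis=0)            # shape (H, W, C)
# convolutional variant: one statistic per channel, gathered over B*H*W values
mean_c = t.mean(axis=(0, 1, 2))      # shape (C,)
std_c = t.std(axis=(0, 1, 2))        # shape (C,)
out = (t - mean_c) / (std_c + 1e-5)  # broadcasts over batch and spatial locations
print(mean_hwc.shape, mean_c.shape, out.shape)   # (5, 5, 3) (3,) (4, 5, 5, 3)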
Some clarification on Maxim's answer.
I was puzzled by seeing in Keras that the axis you specify is the channels axis, as it doesn't make sense to normalize over the channels - every channel in a conv-net is considered a different "feature". I.e. normalizing over all channels is equivalent to normalizing number of bedrooms together with size in square feet (the multivariate regression example from Andrew's ML course). This is usually not what you want - what you do is normalize every feature by itself. I.e. you normalize the number of bedrooms across all examples to have mu=0 and std=1, and you normalize the square feet across all examples to have mu=0 and std=1.
This is why you want C means and stds, because you want a mean and std per channel/feature.
After checking and testing it myself I realized the issue: there's a bit of a confusion/misconception here. The axis you specify in Keras is actually the axis which is excluded from the calculations, i.e. you average over every axis except the one specified by this argument. This is confusing, as it is exactly the opposite of how NumPy works, where the specified axis is the one you perform the operation on (e.g. np.mean, np.std, etc.).
I actually built a toy model with only BN and then calculated the BN manually: I took the mean and std across the first 3 dimensions [m, n_W, n_H], got n_C results, computed (X - mu)/std (using broadcasting), and got results identical to the Keras output.
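Here is a minimal sketch of that manual check (assuming TensorFlow 2.x Keras; the shapes and epsilon are illustrative, and the layer still has its freshly initialized gamma=1, beta=0):
import numpy as np
import tensorflow as tf
x = np.random.randn(8, 6, 6, 3).astype("float32")   # [m, n_H, n_W, n_C]
bn = tf.keras.layers.BatchNormalization(axis=-1, epsilon=1e-3)
y_keras = bn(x, training=True).numpy()              # training=True -> normalize with batch statistics
mu = x.mean(axis=(0, 1, 2), keepdims=True)          # n_C means
var = x.var(axis=(0, 1, 2), keepdims=True)          # n_C variances
y_manual = (x - mu) / np.sqrt(var + 1e-3)           # same epsilon as the layer
print(np.allclose(y_keras, y_manual, atol=1e-4))    # expected: True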
Hope this helps anyone who was confused as I was.
I'm only 70% sure of what I say, so if it does not make sense, please edit or mention it before downvoting.
About location or spatial location: they mean the position of a pixel in an image or feature map. A feature map is comparable to a sparse, modified version of the image in which concepts are represented.
About so that different elements of the same feature map, at different locations, are normalized in the same way:
some normalisation algorithms are local, so they depend on their close surroundings (location) and not on things far apart in the image. They probably mean that every pixel, regardless of its location, is treated just like an element of a set, independently of its direct spatial surroundings.
About In Alg. 1, we let B be the set of all values in a feature map across both the elements of a mini-batch and spatial locations: they take a flat list of all the values of every training example in the mini-batch, and this list combines values regardless of their location on the feature map.
Firstly, we need to make it clear that the depth of a kernel is determined by the previous feature map's channel count, and the number of kernels in this layer determines the channel count of the next feature map (the next layer).
Then we should make it clear that each kernel (usually three-dimensional) generates just one channel of the feature map in the next layer.
Thirdly, we should try to accept the idea that every point in the generated feature map (regardless of its position) is generated by the same kernel sliding over the previous layer. So these points can be seen as a distribution generated by this kernel, i.e. as samples of one stochastic variable, and they should be averaged together to obtain the mean and then the variance. (This is not rigorous, it only helps with understanding.)
This is what they mean by "so that different elements of the same feature map, at different locations, are normalized in the same way".

What should be taken as m in m estimate of probability in Naive Bayes

What should be taken as m in m estimate of probability in Naive Bayes?
So for this example
What m value should I take? Can I take it to be 1?
Here p = prior probability = 0.5.
So can I take P(a_i|selected) = (n_c + 0.5) / (3 + 1)?
For Naive Bayes text classification the given formula is P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|).
In the book it says that this is adopted from the m-estimate by letting uniform priors and with m equal to the size of the vocabulary.
But if we have only 2 classes then p = 0.5. So how can m·p be 1? Shouldn't it be |Vocabulary|*0.5? How is this equation obtained from the m-estimate?
In calculating the probabilities for the attribute profession, with the prior probabilities being 0.5 and taking m = 1:
P(teacher|selected)=(2+0.5)/(3+1)=5/8
P(farmer|selected)=(1+0.5)/(3+1)=3/8
P(Business|Selected)=(0+0.5)/(3+1)= 1/8
But shouldn't the class probabilities add up to 1? In this case they do not.
Yes, you can use m=1. According to Wikipedia, choosing m=1 is called Laplace smoothing. m is generally chosen to be small (I have read that m=2 is also used), especially if you don't have that many samples in total, because a higher m distorts your data more.
Background information: The parameter m is also known as the pseudocount (virtual examples) and is used for additive smoothing. It prevents the probabilities from being 0. A zero probability is very problematic, since it turns any product it appears in into 0. I found a nice example illustrating the problem in this book preview here (search for pseudocount).
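For reference, here is a minimal sketch of the m-estimate P(a_i|c) = (n_c + m·p) / (n + m) using the numbers from the question (the helper name is just for illustration):
def m_estimate(n_c, n, m, p):
    """m-estimate of probability: (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)
# the profession example: 3 selected people, of whom 2 are teachers, 1 a farmer, 0 in business
counts = {"teacher": 2, "farmer": 1, "business": 0}
probs = {k: m_estimate(v, n=3, m=1, p=0.5) for k, v in counts.items()}
print(probs)                 # {'teacher': 0.625, 'farmer': 0.375, 'business': 0.125}
print(sum(probs.values()))   # 1.125 -- with p = 0.5 the three estimates do not sum to 1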
"m estimate of probability" is confusing.
In the given examples, m and p should be like this.
m = 3 (this could be any value; you specify it yourself)
p = 1/3 = 1/|v| (where |v| is the number of unique values of the feature)
If you use m=|v| then m*p=1, so it is called Laplace smoothing. "m estimate of probability" is the generalized version of Laplace smoothing.
In the above example you may think m=3 is too much; in that case you can reduce m, for example to 0.2.
I believe the uniform prior should be 1/3, not 1/2. This is because you have 3 professions, so you're assigning equal prior probability to each one. Like this, mp=1, and the probabilities you listed sum to 1.
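A quick numeric check of that claim (using the same illustrative m_estimate helper as in the sketch above, redefined here so the snippet is self-contained):
def m_estimate(n_c, n, m, p):   # same hypothetical helper as above
    return (n_c + m * p) / (n + m)
probs = [m_estimate(n_c, n=3, m=1, p=1/3) for n_c in (2, 1, 0)]   # teacher, farmer, business
print(probs)        # [0.5833..., 0.3333..., 0.0833...]
print(sum(probs))   # 1.0 (up to floating-point rounding)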
Starting from the m-estimate (n_c + m·p)/(n + m), with uniform priors p = 1/|Vocabulary| and m equal to the size of the vocabulary (so that m·p = 1), you get:
P(w_k|v_j) = (n_k + 1) / (n + |Vocabulary|)

Is likelihood calculated over the whole training set or a single example?

Suppose I have a training set of (x, y) pairs, where x is the input example and y is the corresponding target, taking a value in {1, ..., k} (k is the number of classes).
When calculating the likelihood of the training set, should it be calculated for the whole training set (all of the examples), that is:
L = P(y | x) = p(y1 | x1) * p(y2 | x2) * ...
Or is the likelihood computed for a specific training example (x, y)?
I'm asking because I saw these lecture notes (page 2), where he seems to calculate L_i, i.e. the likelihood of each training example separately.
The likelihood function describes the probability of generating a set of training data given some parameters, and can be used to find those parameters which generate the training data with maximum probability. You can create the likelihood function for a subset of the training data, but that wouldn't represent the likelihood of the whole data set. What you can do however (and what is apparently silently done in the lecture notes) is to assume that your data is independent and identically distributed (iid). Therefore, you can split the joint probability into smaller pieces, i.e. p(x|theta) = p(x1|theta) * p(x2|theta) * ... (based on the independence assumption), and you can use the same function with the same parameters (theta) for each of these pieces, e.g. a normal distribution (based on the identical-distribution assumption). You can then take the logarithm to turn the product into a sum, i.e. log p(x|theta) = log p(x1|theta) + log p(x2|theta) + ..., which can be maximized by setting its derivative to zero. The resulting maximum is the theta which generates your x with maximum probability, i.e. your maximum likelihood estimator.
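As a small illustration (a sketch only, with made-up iid data and a Gaussian model), the log-likelihood of the whole set is just the sum of the per-example log densities:
import numpy as np
from scipy.stats import norm
x = np.array([1.2, 0.7, 1.9, 1.4, 0.9])          # made-up iid observations
def log_likelihood(x, mu, sigma):
    """Sum of per-example log densities under N(mu, sigma^2)."""
    return np.sum(norm.logpdf(x, loc=mu, scale=sigma))
# for a Gaussian, the sample mean maximizes this sum (it is the MLE of mu)
print(log_likelihood(x, mu=x.mean(), sigma=1.0))
print(log_likelihood(x, mu=0.0, sigma=1.0))      # a worse parameter gives a lower log-likelihood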

Genetic algorithms: fitness function for feature selection algorithm

I have a data set n x m, where there are n observations and each observation consists of m values for m attributes. Each observation also has an observed result assigned to it. m is big, too big for my task. I am trying to find the best and smallest subset of the m attributes that still represents the whole dataset quite well, so that I could use only these attributes for training a neural network.
I want to use a genetic algorithm for this. The problem is the fitness function. It should tell how well the generated model (subset of attributes) still reflects the original data, and I don't know how to evaluate a given subset of attributes against the whole set.
Of course I could use the neural network (which will later use the selected data anyway) to check how good the subset is - the smaller the error, the better the subset. But this takes a lot of time in my case and I do not want to use that solution. I am looking for some other way that would preferably operate only on the data set.
What I thought about was: having a subset S (found by the genetic algorithm), trim the data set so that it contains values only for subset S and check how many observations in this data set are no longer distinguishable (have the same values for the same attributes) while having different result values. The bigger that number is, the worse the subset. But this seems a bit too computationally expensive to me.
Are there any other ways to evaluate how well a subset of attributes still represents the whole data set?
This cost function should do what you want: sum the factor loadings that correspond to the features comprising each subset.
The higher that sum, the greater the share of variability in the response variable that is explained with just those features. If I understand the OP correctly, this cost function is a faithful translation of "represents the whole set quite well" from the OP.
Reducing to code is straightforward:
Calculate the correlation matrix of your dataset, i.e. the normalized covariance matrix (first remove the column that holds the response variable, probably the last one). If your dataset has n feature columns, this matrix will be n x n, with "1"s down the main diagonal.
Next, perform an eigenvalue decomposition of this matrix; this gives you the proportion of the total variability in the response variable contributed by each eigenvalue (each eigenvalue corresponds to a feature, or column). [Note: singular-value decomposition (SVD) is often used for this step, but it's unnecessary - an eigenvalue decomposition is much simpler and always does the job as long as your matrix is square, which these matrices always are.]
Your genetic algorithm will, at each iteration, return a set of candidate solutions (feature subsets, in your case). The next task in a GA, or any combinatorial optimization, is to rank those candidate solutions by their cost-function score. In your case, the cost function is a simple summation of the eigenvalue proportions for each feature in that subset. (I guess you would want to scale/normalize that calculation so that the higher numbers are the least fit, though.)
A sample calculation (using Python + NumPy):
>>> # there are many ways to do an eigenvalue decomposition; this is just one
>>> import numpy as NP
>>> import numpy.linalg as LA
>>> # correlation (normalized covariance) matrix of the data set d3,
>>> # leaving out the response-variable column
>>> C = NP.corrcoef(d3, rowvar=0)
>>> C.shape
(4, 4)
>>> C
array([[ 1.  , -0.11,  0.87,  0.82],
       [-0.11,  1.  , -0.42, -0.36],
       [ 0.87, -0.42,  1.  ,  0.96],
       [ 0.82, -0.36,  0.96,  1.  ]])
>>> # now calculate the eigenvalues & eigenvectors of this matrix:
>>> eva, evc = LA.eig(C)
>>> # sort the eigenvalues, highest to lowest:
>>> eva1 = NP.sort(eva)[::-1]
>>> # cumulative value proportion of each eigenvalue:
>>> eva2 = NP.cumsum(eva1/NP.sum(eva1))   # "cumsum" is just cumulative sum
>>> print("ev  value  proportion")
ev  value  proportion
>>> for i, (val, prop) in enumerate(zip(eva1, eva2), 1):
...     print("{0:2d}  {1:5.2f}  {2:10.3f}".format(i, val, prop))
...
 1   2.91       0.727
 2   0.92       0.953
 3   0.14       0.995
 4   0.02       1.000
So it's the third column of values just above (one for each feature) that is summed - selectively, depending on which features are present in the subset you are evaluating with the cost function.
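For completeness, here is a minimal sketch of such a cost function (the helper name is illustrative; it sums the per-eigenvalue, non-cumulative proportions eva1/NP.sum(eva1) rather than the cumulative column, under the answer's assumption that each eigenvalue maps to one feature):
>>> # continuing the session above: per-eigenvalue proportions, one per feature
>>> props = eva1 / NP.sum(eva1)
>>> def subset_score(feature_idx):
...     """fitness of a candidate feature subset: summed variability proportions"""
...     return NP.sum(props[list(feature_idx)])
...
>>> subset_score([0, 2])   # e.g. the candidate subset {feature 0, feature 2}; roughly 0.76 here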

Resources