Instance Normalization with batch size 1 - machine-learning

I am really confused with the meaning of Instance Norm and whether I can use it with a batch size of 1. I am using PyTorch and nothing in the documentation says that batch size should be greater than 1.
I know that for BatchNorm the performance is adversely affected when batch size is less than 8 and hence it puts a sort of soft bound on the batch size. However, I did not see any such analysis on Instance Norm and am a bit confused now. Should I remove the norm layer if my batch size is 1 then?

A good overview of the different norms is shown in the Group Normalization paper.
Instance normalisation is summarised as:
[...] IN computes µ and σ along the (H, W) axes for each sample and each channel.
The mean and standard deviation are computed on the spatial dimensions (H, W) only and are independent of the batch size and channels (there are N x C different norms). Hence, you can use it with a batch size of 1.
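To make the independence from batch size concrete, here is a minimal NumPy sketch of the computation described above. It is not PyTorch's implementation: the helper name `instance_norm`, the `eps` value, and the omission of learnable affine parameters are my own assumptions for illustration.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each (sample, channel) plane over its spatial axes (H, W).

    x has shape (N, C, H, W). Hypothetical helper, no affine parameters.
    """
    mean = x.mean(axis=(2, 3), keepdims=True)  # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)    # biased variance over H*W
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=3.0, scale=2.0, size=(4, 3, 8, 8))  # N=4, C=3

out_full = instance_norm(batch)        # normalized as part of a batch of 4
out_single = instance_norm(batch[:1])  # the first sample alone, batch size 1

# The first sample gets the same result either way.
same = np.allclose(out_full[:1], out_single)
```

Because the statistics are taken per sample and per channel, the first sample is normalized identically whether it arrives alone or inside a larger batch.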

Related

Post-process multi-class predictions for image segmentation?

My FCN is trained to detect 10 different classes and produces an output of 500x500x10 with each of the final dimensions being the prediction probabilities for a different class.
I've usually seen a uniform threshold, for instance 0.5, used to binarize the probability matrices. However, in my case this doesn't quite cut it, because the IoU for some classes is highest when the threshold is 0.3 and for other classes when it is 0.8.
Hence, I don't want to pick the threshold for each class arbitrarily, but would rather use a more probabilistic approach to finalizing the threshold values. I thought of using CRFs, but this also requires the thresholding to have already been done. Any ideas on how to proceed?
Example: consider an image of a forest with 5 different birds. Now I'm trying to output an image that has segmented the forest and the five birds: 6 classes, each with a separate label. The network outputs 6 probability maps indicating the confidence that a pixel falls into a particular class. Now, the correct answer for a pixel isn't always the class with the highest confidence value. Therefore, a one-size-fits-all method or a max-value method won't work.
CRF Postprocessing Approach
You don't need to set thresholds to use a CRF. I'm not familiar with any python libraries for CRFs, but in principle, what you need to define is:
A probability distribution of the 10 classes for each of the nodes (pixels), which is simply the output of your network.
Pairwise potentials: 10*10 matrix, where element Aij denotes the "strength" of the configuration that one pixel is of class i and the other of class j. If you set the potentials to have a value alpha (alpha >> 1) in the diagonal and 1 elsewhere, then alpha is the regularization force that gives you consistency of the predictions (if pixel X is of class Y, then the neighboring pixels of X are more likely to be of the same class).
This is just one example of how you can define your CRF.
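As a rough illustration of the two ingredients listed above, here is a NumPy sketch that builds the alpha-on-the-diagonal potential matrix and runs a few mean-field style updates on a 4-connected pixel grid. The choice of mean-field inference, the 4-neighbourhood, and all function names are my assumptions; a real CRF library will differ in the details.

```python
import numpy as np

def potts_compatibility(n_classes, alpha):
    """n_classes x n_classes matrix: alpha on the diagonal, 1 elsewhere."""
    A = np.ones((n_classes, n_classes))
    np.fill_diagonal(A, alpha)
    return A

def neighbor_sum(q):
    """Sum the class distributions of the 4 neighbours of every pixel."""
    s = np.zeros_like(q)
    s[1:, :] += q[:-1, :]   # neighbour above
    s[:-1, :] += q[1:, :]   # neighbour below
    s[:, 1:] += q[:, :-1]   # neighbour to the left
    s[:, :-1] += q[:, 1:]   # neighbour to the right
    return s

def mean_field(probs, A, n_iters=5):
    """probs: (H, W, K) network output. Returns refined (H, W, K) beliefs."""
    log_unary = np.log(probs + 1e-12)
    log_A = np.log(A)
    q = probs.copy()
    for _ in range(n_iters):
        # Expected log-compatibility with the neighbours' current beliefs.
        message = neighbor_sum(q) @ log_A.T
        logits = log_unary + message
        logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=-1, keepdims=True)
    return q

# Toy example: a 5x5 image that is confidently class 0 everywhere,
# except for a centre pixel that leans slightly towards class 1.
probs = np.tile(np.array([0.8, 0.1, 0.1]), (5, 5, 1))
probs[2, 2] = [0.4, 0.5, 0.1]
refined = mean_field(probs, potts_compatibility(3, alpha=3.0))
labels = refined.argmax(axis=-1)
```

No threshold appears anywhere: the network probabilities go in as they are, and alpha controls how strongly neighbouring pixels pull each other towards the same label.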
End to End NN Approach
Add a loss to your network that will penalize pixels that have neighbors of a different class. Please note that you will still end up with a tune-able parameter for the weight of the new regularization loss.
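One possible form of such a penalty, sketched in NumPy purely to show the arithmetic: the absolute-difference measure and the names `neighbor_disagreement` and `total_loss` are my assumptions. Inside a real network you would write the same expression with your framework's tensor operations so that it is differentiable.

```python
import numpy as np

def neighbor_disagreement(probs):
    """Mean absolute difference between class probabilities of adjacent pixels.

    probs: (H, W, K) softmax output. The result is 0 when all neighbouring
    pixels carry identical predictions and grows as they disagree.
    """
    dh = np.abs(probs[1:, :, :] - probs[:-1, :, :]).mean()  # vertical pairs
    dw = np.abs(probs[:, 1:, :] - probs[:, :-1, :]).mean()  # horizontal pairs
    return dh + dw

def total_loss(task_loss, probs, weight):
    """weight is the tunable strength of the regularization term."""
    return task_loss + weight * neighbor_disagreement(probs)

# Smooth prediction versus a striped one (class flips every column).
smooth = np.tile(np.array([0.9, 0.1]), (4, 4, 1))
striped = smooth.copy()
striped[:, ::2] = [0.1, 0.9]
```

Here `weight` plays the role of the tunable parameter mentioned above.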

How to enable a Convolutional NN to take variable size input?

So, I've seen that many of the first CNN examples in Machine Learning use the MNIST dataset. Each image there is 28x28, so we know the shape of the input beforehand. How would this be done for variable-size input, say when some images are 56x56 and some are 28x28?
I'm looking for a language- and framework-agnostic answer if possible, or otherwise preferably one in TensorFlow terms.
In some cases, resizing the images appropriately (for example, keeping the aspect ratio) will be sufficient. But this can introduce distortion, and in case this is harmful, another solution is to use Spatial Pyramid Pooling (SPP). The problem with different image sizes is that they produce layers of different sizes. For example, taking the features of the n-th layer of some network, you can end up with a feature map of size 128*fw*fh, where fw and fh vary depending on the size of the input example. What SPP does to alleviate this problem is to turn this variable-size feature map into a fixed-length vector of features. It operates on different scales, dividing the feature map into equal patches and performing max pooling on them. I think this paper does a great job at explaining it. An example application can be seen here.
As a quick explanation, imagine you have a feature map of size k*fw*fh. You can consider it as k maps of the form
X Y
Z T
where each of the blocks is of size (fw/2)*(fh/2). Now, performing max pooling on each of those blocks separately gives you a vector of size 4 per map, and therefore you can roughly describe the k*fw*fh map as a fixed-size feature vector of length k*4.
Now, call this fixed-size vector w and set it aside. This time, consider the k*fw*fh feature map as k feature planes written as
A B C D
E F G H
I J K L
M N O P
and again perform max pooling separately on each block. This gives you a more fine-grained representation, as a vector v of length k*16.
Now, concatenating the two vectors, u=[v;w], gives you a fixed-size representation of length k*20. This is exactly what a 2-scale SPP does (of course, you can change the number and sizes of the divisions).
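The two-scale pooling described above can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the function name `spp`, the (k, fh, fw) layout, and the way bin edges are rounded when the map size is not divisible by the grid size are not taken from the paper, and the map must be at least as large as the finest grid.

```python
import numpy as np

def spp(feature_map, levels=(4, 2)):
    """Pool a (k, fh, fw) feature map into a vector of fixed length.

    For each n in levels the map is split into an n x n grid and each cell
    is max-pooled per channel, so the output has k * sum(n * n) entries.
    Assumes fh and fw are both >= max(levels), so no cell is empty.
    """
    k, fh, fw = feature_map.shape
    pooled = []
    for n in levels:
        # Integer bin edges covering the whole map, even if fh % n != 0.
        rows = np.linspace(0, fh, n + 1).astype(int)
        cols = np.linspace(0, fw, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                block = feature_map[:, rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                pooled.append(block.max(axis=(1, 2)))  # one max per channel
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.random((8, 7, 7))    # e.g. features of a 28x28 input
large = rng.random((8, 14, 14))  # e.g. features of a 56x56 input
u_small, u_large = spp(small), spp(large)
```

With `levels=(4, 2)` the output is the 4x4 vector v followed by the 2x2 vector w, so both inputs yield k*16 + k*4 = k*20 features regardless of fw and fh.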
Hope this helps.
When you use a CNN for a classification task, your network has two parts:
Feature generator. This part generates a feature map of size WF x HF with CF channels from an image of size WI x HI with CI channels. The relation between the image size and the feature map size depends on the structure of your network (for example, on the number of pooling layers and their strides).
Classifier. This part solves the task of classifying vectors with WF*HF*CF components into classes.
You can feed images of different sizes into the feature generator and get feature maps of different sizes. But the classifier can only be trained on vectors of one fixed length. Therefore you usually train your network for one fixed image size. If you have images of different sizes, you resize them to the input size of the network, or crop a part of the image.
Another way is described in the article
K. He, X. Zhang, S. Ren, J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," arXiv:1406.4729 2014
The authors proposed spatial pyramid pooling, which solves the problem of differently sized images at the input of a CNN. However, I am not sure whether a spatial pyramid pooling layer exists in TensorFlow.

How to use principal component analysis (PCA) to speed up detection?

I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an nxp matrix X. I perform mean normalization and I get the normalized matrix B. I calculate the eigenvalues and eigenvectors of the pxp covariance matrix C=(1/(n-1))B*.B where * denotes the conjugate transpose.
The eigenvectors corresponding to the descendingly ordered eigenvalues are in a pxp matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new=B.E_reduced where E_reduced is produced by choosing the first k columns of E. Here are my questions:
1) Should it be X_new=B.E_reduced or X_new=X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If testing phase is similar to training phase, then no speed-up is gained because I have to calculate all the p features for each instance in the testing phase and PCA makes the algorithm slower because of eigenvector calculation overhead.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k=p/2) or the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose the number k? I read that I can find the ratio of summation of k eigenvalues over the summation of all eigenvalues and make a decision based on this ratio.
1) You usually apply the multiplication to the centered data, i.e. X_new=B.E_reduced, so your projected data is also centered.
2) Never re-run PCA during testing. Only fit it on the training data, and keep the shift vector and the projection matrix. You need to apply exactly the same projection as during training, not compute a new one.
3) Decreased performance can have many reasons. E.g. did you also apply scaling using the roots of the eigenvalues? And what method did you use in the first place?
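A minimal NumPy sketch of this fit-once, reuse-at-test-time workflow, using the question's notation (B, C, E_reduced). The function names and the synthetic data are my own assumptions; the cumulative eigenvalue ratio returned as `explained` is the quantity the question mentions for choosing k.

```python
import numpy as np

def pca_fit(X_train, k):
    """Return the shift vector, the p x k projection matrix and the
    cumulative ratio of explained variance (for choosing k)."""
    mean = X_train.mean(axis=0)
    B = X_train - mean                     # centered training data
    C = B.T @ B / (X_train.shape[0] - 1)   # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]      # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    return mean, eigvecs[:, :k], explained

def pca_transform(X, mean, E_reduced):
    """Center with the TRAINING mean, then project onto the kept components."""
    return (X - mean) @ E_reduced

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
X_test = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))

mean, E_reduced, explained = pca_fit(X_train, k=3)    # training phase only
Z_train = pca_transform(X_train, mean, E_reduced)
Z_test = pca_transform(X_test, mean, E_reduced)       # no eigendecomposition here
```

At test time the only cost is a subtraction and one matrix multiplication per batch; any speed-up then comes from the downstream model working with k instead of p inputs.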

Bayes Classification with Multivariate Parzen Window using Spherical Kernel

I'm having a problem implementing a Bayes Classifier with the Parzen window algorithm using a spherical (or isotropic) kernel.
I am running the algorithm with test data containing 2 dimensions and 3 different classes (For each class, I have 10 test points, and 40 training points, all in 2 dimensions). When I change the value of my hyper-parameter (sigma_sq for the spherical Gaussian kernel), I find that there is no effect on how the points are classified.
This is my density estimator. My self.sigma_sq is the same across all the dimensions of my data (2 dimensions)
    for i in range(test_data.shape[0]):
        log_prob_intermediate = 0
        for j in range(n):  # n is size of training set
            c = -self.n_dims * np.log(2*np.pi)/2.0 - self.n_dims*np.log(self.sigma_sq)/2.0
            log_prob_intermediate += (c - np.sum((test_data[i,:] - self.train_data[j,:])**2.0) / (2.0 * self.sigma_sq))
        log_prob.append(log_prob_intermediate / n)
How I implemented my Bayes Classifier:
There are 3 classes that my Bayes Classifier must distinguish. I created 3 training sets and 3 test sets (one training and test set per class). For each point in my test set, I run the density estimator for each class on the point. This gives me a vector of 3 values: the log probability that my new point is in class1, class2, or class3. I then choose the maximum value and assign the new point to that class.
Since I am using a spherical Gaussian kernel, I am of the understanding that my sigma_sq must be common for each density estimator (one density estimator for each class). Is this correct? If I had a different sigma_sq for each dimension pair, wouldn't this give me somewhat of a diagonal Gaussian kernel?
For my list of 30 test points (10 for each class), I find that running the bayes classifier on these points continues to give me the exact same classification for each point, regardless of what sigma I use. Is this normal? Since it's a spherical Gaussian kernel, and all my dimensions use the same kernel, is increasing or decreasing my sigma_sq just having a proportional effect on my log probability with no change in the classification? OR do I have some sort of problem with my density estimator that I can't figure out.
Let's address each point separately.
Using the same sigma for each dimension makes your kernel radial, this is true. However, you can (and should!) use a different sigma for each class, as each distribution usually requires a different density estimator. For simple heuristics, read for example about Scott's rule of thumb for kernel width selection in the Gaussian case, or the later work by Silverman.
It is hard to tell whether in your particular case the choice of sigma should change the classification. In general it should, but each dataset has its own properties. However, your data is just 2D, which makes it perfect for visualization. Plot your data, then plot each KDE, and simply investigate visually what is going on.
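One more thing that may be worth checking in the estimator from the question: it averages the log kernel values, whereas the Parzen estimate is the log of the averaged kernel values. The mean of the logs equals c minus the mean squared distance divided by 2*sigma_sq, so with a shared sigma_sq the ranking of the classes cannot depend on sigma at all, which would match the behaviour described. Below is a sketch of the log-of-the-mean version with one sigma per class. The function names are mine, and equal class priors are assumed (as with 40 training points per class).

```python
import numpy as np

def parzen_log_density(test_data, train_data, sigma_sq):
    """Log of the Parzen estimate with an isotropic Gaussian kernel.

    test_data: (m, d), train_data: (n, d). Returns an array of length m.
    """
    n, d = train_data.shape
    # Squared distances between every test and training point: shape (m, n).
    diff = test_data[:, None, :] - train_data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    c = -0.5 * d * np.log(2 * np.pi) - 0.5 * d * np.log(sigma_sq)
    log_kernels = c - sq_dist / (2.0 * sigma_sq)
    # log((1/n) * sum_j exp(log_kernels_j)), computed stably (log-sum-exp).
    top = log_kernels.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(log_kernels - top).sum(axis=1)) - np.log(n)

def classify(test_data, train_sets, sigma_sqs):
    """train_sets: one (n_c, d) array per class; sigma_sqs: one value per class.

    Assumes equal class priors, so the largest log density wins.
    """
    scores = [parzen_log_density(test_data, tr, s)
              for tr, s in zip(train_sets, sigma_sqs)]
    return np.argmax(np.stack(scores, axis=1), axis=1)

rng = np.random.default_rng(0)
train_sets = [rng.normal(loc=0.0, size=(40, 2)), rng.normal(loc=10.0, size=(40, 2))]
predicted = classify(np.array([[0.1, 0.0], [9.9, 10.0]]), train_sets, [1.0, 2.0])
```

With this form, a small sigma makes each class density peak around its training points and a large sigma flattens it, so the decision boundary can move as sigma changes.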

Why is image segmentation needed for object detection?

What is the point of image segmentation algorithms like SLIC? Most object detection algorithms work over the entire set of (square) sub-images anyway.
The only conceivable benefit to segmenting the image is that now, the classifier has shape information available to it. Is that right?
Most classifiers I know of take rectangular input images. What classifiers allow you to pass variable sized image segments to them?
First, SLIC, and the kind of algorithms I'm guessing you refer to, are not segmentation algorithms; they are oversegmentation algorithms. There is a difference between the two terms: segmentation methods split the image into objects, while oversegmentation methods split the image into small clusters (spatially adjacent groups of pixels with similar characteristics). These clusters are usually called superpixels. See the image** of superpixels below:
Now, answering parts of your original question:
Why use superpixels?
They reduce the dimensionality of your data and the complexity of the problem: from N = w x h pixels to M superpixels, with M << N. That is, for an image composed of N = 320x480 = 153600 pixels, M = 400 or M = 800 superpixels seem to be enough to oversegment it. Leaving aside for now how to classify them, just consider how much easier your problem becomes when you go from N on the order of 100K pixels to M = 800 superpixels as examples to train on or classify. The superpixels still represent your data properly, as they adhere to image boundaries.
They allow you to compute more complex features. With pixels, you can only compute some simple statistics on them, or use a filter bank or feature extractor to extract features in their vicinity. This, however, describes a pixel's neighbourhood very locally, without taking the context into consideration. With superpixels, you can compute a superpixel descriptor from all the pixels that belong to it. That is, features are usually computed at pixel level as before, but are then merged into a superpixel descriptor by one of several methods, such as the simple mean of all pixels inside a superpixel, histograms, bag-of-words, or correlation. As a simple example, imagine you only consider grayscale intensity as a feature for your classifier. If you use single pixels, all you have is the pixel's intensity, which is very local and noisy. If you use superpixels, you can compute a histogram of the intensities of all the pixels inside, which describes the region much better than a single local intensity value.
They allow you to compute new features. Over superpixels you can compute regional statistics (first order, such as mean or variance, or second order, such as covariance). You can also extract information that was not available before (e.g. shape, length, diameter, area...).
How to use them?
1. Compute pixel features
2. Merge pixel features into superpixel descriptors
3. Classify/Optimize superpixel descriptors
In step 2, whether by averaging, using a histogram, or using a bag-of-words model, the superpixel descriptor is computed with a fixed size (e.g. 100 bins for a histogram). Therefore, at the end you have reduced the X = N x K training data (N = 100K pixels times K features) to X = M x D (with M = 1K superpixels and D the length of the superpixel descriptor). Usually D > K but M << N, so you end up with regional, more robust features that represent your data better with lower data dimensionality. This reduces the complexity of your problem (classification, optimization) by 2-3 orders of magnitude on average!
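To illustrate the merge step with the grayscale-histogram example from above, here is a NumPy sketch that turns an N-pixel image into an M x D descriptor matrix. The square-grid labelling is only a stand-in for a real oversegmentation such as SLIC, and all names and sizes are my own assumptions.

```python
import numpy as np

def grid_labels(h, w, cell):
    """Stand-in for a real oversegmentation: label pixels by square grid cell."""
    rows = np.arange(h) // cell
    cols = np.arange(w) // cell
    n_cols = (w + cell - 1) // cell
    return rows[:, None] * n_cols + cols[None, :]

def superpixel_histograms(gray, labels, n_bins=16):
    """gray: (h, w) intensities in [0, 1]; labels: (h, w) ints in 0..M-1.

    Returns an (M, n_bins) matrix of normalized intensity histograms.
    Assumes every label in 0..M-1 occurs at least once.
    """
    m = labels.max() + 1
    bins = np.minimum((gray * n_bins).astype(int), n_bins - 1)
    desc = np.zeros((m, n_bins))
    # Count pixels per (superpixel, intensity bin) pair.
    np.add.at(desc, (labels.ravel(), bins.ravel()), 1.0)
    return desc / desc.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
gray = rng.random((32, 48))               # N = 1536 pixels, K = 1 feature
labels = grid_labels(32, 48, cell=8)      # M = 24 "superpixels"
X = superpixel_histograms(gray, labels)   # M x D with D = 16
```

The classifier then sees 24 rows of 16 values instead of 1536 single intensities, and each row has the same length no matter how many pixels its superpixel contains.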
Conclusions
You can compute more complex (robust) features, but you have to be careful about how, when, and for what purpose you use superpixels as your data representation. You might lose some information (e.g. you lose your 2D grid lattice), or, if you don't have enough training examples, you might make the problem more difficult, since the features are more complex and you could turn a linearly separable problem into a non-linear one.
** Image from SLIC superpixels: http://ivrl.epfl.ch/research/superpixels
