Normalization of a sigmoid would constrain them to the linear regime of the nonlinearity - normalization

In the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", the authors have stated that:
"Note that simply normalizing each input of a layer may change what the layer can represent. For instance, normalizing the inputs of a sigmoid would constrain them to the linear regime of the nonlinearity. To address this, we make sure that the transformation inserted in the network can represent the identity transform. To accomplish this, we introduce, for each activation, a pair of parameters γ and β, which scale and shift the normalized value."
What does "the linear regime of the nonlinearity" mean and how do scaling and shifting help?
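As a rough illustration of the quoted passage (a sketch, not code from the paper): near zero the sigmoid is close to its first-order Taylor line 0.5 + x/4, so inputs normalized to zero mean and unit variance mostly land where the sigmoid behaves almost linearly. The names `sigmoid`, `linear_approx` and `scale_and_shift` are made up for this example, and the small epsilon used in practice for numerical stability is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear_approx(x):
    # First-order Taylor expansion of the sigmoid around x = 0.
    return 0.5 + x / 4.0

# Close to 0 the two agree well; far from 0 the sigmoid saturates.
for x in (-0.5, 0.0, 0.5, 3.0):
    print(x, sigmoid(x), linear_approx(x))

def scale_and_shift(x_hat, gamma, beta):
    # The learnable step applied after normalization: y = gamma * x_hat + beta.
    return gamma * x_hat + beta

# Hypothetical pre-activations for one unit over a small batch.
values = [2.0, 4.0, 9.0]
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
normalized = [(v - mean) / std for v in values]

# With gamma = std and beta = mean the normalization is undone, so the
# inserted transformation can represent the identity.
restored = [scale_and_shift(x_hat, gamma=std, beta=mean) for x_hat in normalized]
print(restored)
```

Since γ and β are learned, the network can keep the normalized values in the near-linear region, move them back toward the saturating region, or settle anywhere in between.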

Related

Why are we using gaussian/normal density to represent probability in Maximum Likelihood Estimation for Linear Regression

I was reading the MLE derivations for linear regression when I noticed that we were directly assigning the normal density to be the probability:
p(y|x;theta) = probability density of normal distribution
Shouldn't there be some sort of integral involved? Is there a proof that shows that the density of the gaussian is an equivalent representation of the probability?
For other regressions like logistic regression, we use the sigmoid function, which essentially represents the CDF of a probability distribution, so it makes sense to take its value at a single point and call it a probability.
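One way to probe the question numerically (a sketch under assumed values, not a proof): the probability that y falls in a small interval around the observed value is an integral of the density, and for a narrow interval it is approximately density times interval width. The width does not depend on the parameters, so the same parameter value maximizes both quantities. The observation, sigma, interval half-width and candidate grid below are all made up for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def normal_cdf(y, mu, sigma):
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))

def interval_probability(y, mu, sigma, half_width):
    # P(y - h <= Y <= y + h): the integral of the density over a small interval.
    return normal_cdf(y + half_width, mu, sigma) - normal_cdf(y - half_width, mu, sigma)

y_observed = 2.0   # hypothetical single observation
sigma = 1.0        # assumed known noise level
half_width = 0.01  # assumed small interval half-width
candidate_means = [i / 10.0 for i in range(41)]  # 0.0, 0.1, ..., 4.0

best_by_density = max(
    candidate_means, key=lambda mu: normal_pdf(y_observed, mu, sigma))
best_by_probability = max(
    candidate_means,
    key=lambda mu: interval_probability(y_observed, mu, sigma, half_width))

# For a narrow interval, probability is roughly density * interval width.
ratio = (interval_probability(y_observed, 2.0, sigma, half_width)
         / normal_pdf(y_observed, 2.0, sigma))
print(best_by_density, best_by_probability, ratio)
```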

Does the loss function become non-convex when we add polynomial features?

When we use polynomial features in polynomial regression, logistic regression, or SVM, does the loss function become non-convex?
If a loss function is convex for any choice of X -> y you're trying to estimate then adding a fixed set of polynomial features won't change that. You're simply trading your initial problem with the estimation problem X' -> y, where X' has the additional features.
If you're additionally trying to estimate the parameters for the new feature(s) then it's pretty easy to get a non-convex loss in those dimensions (assuming there are parameters to choose -- if you're just talking about adding a polynomial basis then this doesn't apply).
As some measure of proof, take the example of a 1D estimation problem and choose the feature f(x) = (x-a)^3. Assume your dataset has the single point (0, 0). With a little work you can show that the loss even for linear regression over the new feature is non-convex in places with respect to the parameter a. Note that the loss IS still convex with respect to the new features -- standard linear regression always satisfies that property -- it's the fact that we used linear regression along with a choice of polynomial to build a new non-convex regressor that causes this behavior.
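The example above can be checked numerically with a midpoint test: a convex function never rises above the chord between two of its points. This is a minimal sketch assuming the model w * f(x) + b with fixed, hypothetical values w = 1 and b = 1; the intercept and the specific numbers are assumptions made here, not part of the original example.

```python
def loss(a, w=1.0, b=1.0, x=0.0, y=0.0):
    # Squared error of the model w * (x - a)**3 + b on the single point (x, y).
    prediction = w * (x - a) ** 3 + b
    return (prediction - y) ** 2

# Midpoint test with respect to a.
a_left, a_right = 0.0, 0.6
a_mid = 0.5 * (a_left + a_right)
chord_height = 0.5 * (loss(a_left) + loss(a_right))
violates_convexity_in_a = loss(a_mid) > chord_height

# The same test with respect to the linear weight w, for a fixed a.
w_left, w_right = -2.0, 3.0
w_mid = 0.5 * (w_left + w_right)
chord_height_w = 0.5 * (loss(0.5, w=w_left) + loss(0.5, w=w_right))
violates_convexity_in_w = loss(0.5, w=w_mid) > chord_height_w

print(violates_convexity_in_a, violates_convexity_in_w)
```

A single pair of points can only demonstrate non-convexity in a; the check in w merely fails to find a violation, which is consistent with the loss being quadratic in w.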

ReLU weight initialization?

I have read that the "He weight initialization" (He et al., 2015) built on the LeCun weight initialization and suggested a zero-mean Gaussian distribution where the standard deviation is
[image: formula for the standard deviation]
and that this initialization should be used with ReLU to solve the vanishing/exploding gradient problem. For me, it does make sense, because the way ReLU was built makes it unaffected by the vanishing/exploding gradient problem: if the input is less than 0 the derivative is zero, otherwise the derivative is one. So, whatever the variance is, the gradient would be zero or one. Therefore, the He weight initialization is useless. I know that I am missing something, which is why I am asking whether anyone can explain the usefulness of this weight initialization.
Weight initialization is applied, in general terms, to the weights of layers that have learnable / trainable parameters, such as dense layers, convolutional layers, and others. ReLU is an activation function, fully deterministic, and has no initialization.
Regarding the vanishing gradient problem, the backpropagation step is founded on computing the gradients by the chain rule (partial derivatives) for each weight (see here):
(...) each of the neural network's weights receive an update
proportional to the partial derivative of the error function with
respect to the current weight in each iteration of training.
The deeper a network is, the smaller these gradients get, and when a network becomes deep enough, the backprop step is less effective (in the worst case, the network stops learning) and this becomes a problem:
This has the effect of multiplying n of these small numbers to compute
gradients of the "front" layers in an n-layer network, meaning that
the gradient (error signal) decreases exponentially with n while the
front layers train very slowly.
Choosing a proper activation function, like ReLU, helps avoid this, as you mentioned in the OP, by keeping the partial derivatives of the activation from becoming too small:
Rectifiers such as ReLU suffer less from the vanishing gradient
problem, because they only saturate in one direction.
Hope this helps!
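To see where the weight scale enters even though the ReLU derivative is only 0 or 1, here is a small sketch. It assumes the He et al. (2015) choice of standard deviation sqrt(2 / fan_in) and compares it with two arbitrary alternatives (0.01 and 1.0). It tracks forward activations only; backpropagated gradients are multiplied by the same weight matrices, so a similar scaling argument applies to them. The width, depth, seed and function names are hypothetical.

```python
import math
import random

def relu_layer(inputs, weight_std, rng):
    # One fully connected layer with zero-mean Gaussian weights, then ReLU.
    outputs = []
    for _ in range(len(inputs)):
        pre_activation = sum(rng.gauss(0.0, weight_std) * x for x in inputs)
        outputs.append(max(0.0, pre_activation))
    return outputs

def mean_square(values):
    return sum(v * v for v in values) / len(values)

def signal_ratio(weight_std, width=128, depth=10, seed=0):
    # Ratio of the mean squared activation after `depth` layers to the input's.
    rng = random.Random(seed)
    activations = [rng.gauss(0.0, 1.0) for _ in range(width)]
    start = mean_square(activations)
    for _ in range(depth):
        activations = relu_layer(activations, weight_std, rng)
    return mean_square(activations) / start

width = 128
he_std = math.sqrt(2.0 / width)          # He initialization: sqrt(2 / fan_in)
ratio_he = signal_ratio(he_std, width)   # stays within a modest factor of 1
ratio_small = signal_ratio(0.01, width)  # shrinks layer after layer
ratio_large = signal_ratio(1.0, width)   # grows layer after layer
print(ratio_he, ratio_small, ratio_large)
```

In this sketch each layer multiplies the expected mean squared activation by width * std^2 / 2, which equals 1 only for the He choice; this is why the initialization still matters with ReLU.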

Which layers in neural networks have weights/biases and which don't?

I've heard several different varieties about setting up weights and biases in a neural network, and it's left me with a few questions:
Which layers use weights? (I've been told the input layer doesn't, are there others?)
Does each layer get a global bias (1 per layer)? Or does each individual neuron get its own bias?
In common textbook networks such as a multilayer perceptron, each hidden layer and the output layer of a regressor (or, in a classifier, every layer up to the softmax-normalized output layer) has weights. Every node has its own single bias.
Here's a post that I find particularly helpful in explaining the conceptual function of this arrangement:
http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Essentially, the combination of weights and biases allows the network to form intermediate representations that are arbitrary rotations, scales, and distortions (thanks to nonlinear activation functions) of previous layers, ultimately linearizing the relationship between input and output.
This arrangement can also be expressed by the simple linear-algebraic expression L2 = sigma(W L1 + B) where L1 and L2 are activation vectors of two adjacent layers, W is a weight matrix, B is a bias vector, and sigma is an activation function, which is somewhat mathematically and computationally appealing.
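The expression L2 = sigma(W L1 + B) can be written out directly. This is a minimal sketch with made-up numbers and a sigmoid standing in for sigma; the function name `dense_layer` and all values are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(W, B, L1):
    # L2 = sigma(W L1 + B): one row of W and one entry of B per output node.
    return [
        sigmoid(sum(w * x for w, x in zip(row, L1)) + b)
        for row, b in zip(W, B)
    ]

L1 = [1.0, 2.0, 3.0]        # activations of a layer with 3 nodes
W = [[0.1, 0.2, 0.3],       # 2 x 3 weight matrix: 2 output nodes, 3 inputs
     [-0.3, 0.0, 0.1]]
B = [0.0, 0.5]              # one bias per output node
L2 = dense_layer(W, B, L1)

n_weights = sum(len(row) for row in W)
n_biases = len(B)
print(L2, n_weights, n_biases)
```

The input layer only supplies L1 and owns no parameters; each subsequent layer carries one weight per (input, node) pair and one bias per node.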

Logistic Regression and Naive Bayes for this dataset

Can both Naive Bayes and logistic regression classify both of these datasets perfectly? My understanding is that Naive Bayes can, and that logistic regression with complex terms can classify these datasets. Please correct me if I am wrong.
Image of datasets is here:
Let's run both algorithms on two datasets similar to the ones you posted and see what happens...
EDIT The previous answer I posted was incorrect. I forgot to account for the variance in Gaussian Naive Bayes. (The previous solution was for naive bayes using Gaussians with fixed, identity covariance, which gives a linear decision boundary).
It turns out that LR fails at the circular dataset while NB could succeed.
Both methods succeed at the rectangular dataset.
The LR decision boundary is linear while the NB boundary is quadratic (the boundary between two axis-aligned Gaussians with different covariances).
Applying NB to the circular dataset gives two means in roughly the same position, but with different variances, leading to a roughly circular decision boundary - as the radius increases, the probability of the higher variance Gaussian increases compared to that of the lower variance Gaussian. In this case, many of the inner points on the inner circle are incorrectly classified.
The two plots below show a gaussian NB solution with fixed variance.
In the plots below, the contours represent probability contours of the NB solution.
This gaussian NB solution also learns the variances of individual parameters, leading to an axis-aligned covariance in the solution.
Naive Bayes/Logistic Regression can get the second (right) of these two pictures, in principle, because there's a linear decision boundary that perfectly separates.
If you used a continuous version of Naive Bayes with class-conditional Normal distributions on the features, you could separate because the variance of the red class is greater than that of the blue, so your decision boundary would be circular. You'd end up with distributions for the two classes which had the same mean (the centre point of the two rings) but where the variance of the features conditioned on the red class would be greater than that of the features conditioned on the blue class, leading to a circular decision boundary somewhere in the margin. This is a non-linear classifier, though.
You could get the same effect with histogram binning of the feature spaces, so long as the histograms' widths were narrow enough. In this case both logistic regression and Naive Bayes will work, based on histogram-like feature vectors.
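The class-conditional Normal version described above can be tried on an idealized pair of rings. This is a sketch with hypothetical data (noise-free rings of radius 1 and 3, 100 evenly spaced points each) and a hand-written Gaussian Naive Bayes with per-feature variances; equal class sizes are assumed, so the priors cancel and are omitted.

```python
import math

def ring(radius, n_points):
    # Evenly spaced, noise-free points on a circle centred at the origin.
    return [
        (radius * math.cos(2.0 * math.pi * k / n_points),
         radius * math.sin(2.0 * math.pi * k / n_points))
        for k in range(n_points)
    ]

def fit_gaussian(points):
    # Per-feature mean and variance (the "naive" independence assumption).
    n = len(points)
    params = []
    for d in range(2):
        column = [p[d] for p in points]
        mean = sum(column) / n
        variance = sum((v - mean) ** 2 for v in column) / n
        params.append((mean, variance))
    return params

def log_likelihood(point, params):
    total = 0.0
    for value, (mean, variance) in zip(point, params):
        total += (-0.5 * math.log(2.0 * math.pi * variance)
                  - (value - mean) ** 2 / (2.0 * variance))
    return total

def predict(point, params_by_class):
    return max(params_by_class,
               key=lambda label: log_likelihood(point, params_by_class[label]))

blue = ring(1.0, 100)  # inner ring
red = ring(3.0, 100)   # outer ring
params_by_class = {"blue": fit_gaussian(blue), "red": fit_gaussian(red)}

labelled = [(p, "blue") for p in blue] + [(p, "red") for p in red]
correct = sum(predict(p, params_by_class) == label for p, label in labelled)
accuracy = correct / len(labelled)
print(accuracy, params_by_class)
```

Both fitted classes end up with means near the origin and differ only in variance, so the decision boundary is a circle between the two rings. With noisy or overlapping rings the accuracy would depend on the data.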
How would you use Naive Bayes on these data sets?
In the usual form, Naive Bayes needs binary / categorical data.
