Reducing (Versus Delaying) Overfitting in Neural Network - machine-learning

In neural nets, regularization (e.g. L2, dropout) is commonly used to reduce overfitting. For example, the plot below shows typical loss vs epoch, with and without dropout. Solid lines = train, dashed = validation, blue = baseline (no dropout), orange = with dropout. Plot courtesy of the TensorFlow tutorials.
Weight regularization behaves similarly.
Regularization delays the epoch at which validation loss starts to increase, but regularization apparently does not decrease the minimum value of validation loss (at least in my models and the tutorial from which the above plot is taken).
If we use early stopping to stop training when validation loss is minimum (to avoid overfitting) and if regularization is only delaying the minimum validation loss point (vs. decreasing the minimum validation loss value) then it seems that regularization does not result in a network with greater generalization but rather just slows down training.
How can regularization be used to reduce the minimum validation loss (to improve model generalization) as opposed to just delaying it? If regularization is only delaying minimum validation loss and not reducing it, then why use it?
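To make the scenario in the question concrete, here is a minimal sketch of early stopping on validation loss. It is not any framework's API: `best_epoch` and `patience` are hypothetical names, and the two loss curves are made-up numbers chosen only to mimic "same minimum, reached later".

```python
def best_epoch(val_losses, patience=2):
    """Return (epoch, loss) of the checkpoint early stopping would keep.

    Training is treated as stopped once validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_idx, best_loss


# Made-up validation losses, for illustration only (not measured results).
baseline = [0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
regularized = [0.65, 0.52, 0.46, 0.42, 0.40, 0.41, 0.43]

print(best_epoch(baseline))     # (2, 0.4)
print(best_epoch(regularized))  # (4, 0.4)
```

With these invented curves both runs keep a checkpoint with the same validation loss, and the regularized run simply reaches it two epochs later, which is exactly the situation being asked about.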

Over-generalizing from a single tutorial plot is arguably not a good idea; here is a relevant plot from the original dropout paper:
Clearly, if the only effect of dropout were to delay convergence, it would not be of much use. But of course it does not always work (as your plot clearly suggests), hence it should not be used by default (which is arguably the lesson here).


Meaning of a constant, very low training loss in a learning curve

I have trained a fastText model on a binary text classification problem and generated the learning curve over increasing training set size.
Very quickly I get a very low training loss, close to 0, which stays constant.
I interpret this as the model overfitting on the data.
But the validation loss curve looks good to me, slowly decreasing.
Cross-validation on unseen data also produces accuracies with little variation, about 90%.
So I am wondering whether I indeed have an overfitting model, as the learning curve suggests.
Is there any other check I can do on my model?
As the fastText model also uses epochs, I am even wondering whether a learning curve should vary the epochs (and keep the training size constant), or slowly increase the training set size while keeping the epochs constant (or both).

Why would a neural network's validation loss and accuracy fluctuate at first?

I am training a neural network, and at the beginning of training my network's loss and accuracy on the validation data fluctuate a lot, but towards the end of training they stabilize. I am reducing the learning rate on plateau for this network. Could it be that the network starts with a high learning rate and, as the learning rate decreases, both accuracy and loss stabilize?
For SGD, the change in the parameters at each step is the product of the learning rate and the gradient of the loss with respect to the parameters.
θ = θ − α ∇θ E[J(θ)]
Every step it takes will be in a sub-optimal direction (i.e. slightly wrong), as the optimiser has usually only seen some of the data. At the start of training you are relatively far from the optimal solution, so the gradient ∇θ E[J(θ)] is large; therefore each sub-optimal step has a large effect on your loss and accuracy.
Over time, as you (hopefully) get closer to the optimal solution, the gradient becomes smaller, so the steps become smaller, meaning that the effects of being slightly wrong are diminished. Smaller errors on each step make your loss decrease more smoothly, which reduces fluctuations.
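A rough way to see this effect is to run single-sample SGD on a toy problem and compare the size of the updates early and late in training. The sketch below assumes a hypothetical, noiseless dataset (y = 2x) and a fixed learning rate, so it isolates the "smaller gradient, smaller step" part of the argument; with noisy labels the late updates would shrink but not vanish.

```python
import random

random.seed(0)

# Hypothetical noiseless data: y = 2 * x, so the optimal weight is w = 2.
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x) for x in xs]

w, alpha = 0.0, 0.1
step_sizes = []
for step in range(300):
    x, y = random.choice(data)      # one-sample "mini-batch"
    grad = 2.0 * (w * x - y) * x    # d/dw of (w * x - y) ** 2
    update = alpha * grad           # theta <- theta - alpha * grad
    w -= update
    step_sizes.append(abs(update))

early = sum(step_sizes[:20]) / 20
late = sum(step_sizes[-20:]) / 20
print(f"mean |update| over first 20 steps: {early:.6f}")
print(f"mean |update| over last 20 steps:  {late:.6f}")
print(f"final w: {w:.6f}")
```

The learning rate never changes here, yet the updates still shrink as w approaches 2, because the gradient itself shrinks. A reduce-on-plateau schedule would add to this effect rather than being its only cause.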

Neural Network with Input - Relu - SoftMax - Cross Entropy Weights and Activations grow unbounded

I have implemented a neural network with 3 layers: an input layer, a hidden layer with 30 neurons (ReLU activation), and a softmax output layer. I am using the cross-entropy cost function. No outside libraries are being used. This is working on the MNIST dataset, so there are 784 input neurons and 10 output neurons.
I have got about 96% accuracy with hyperbolic tangent as my hidden layer activation.
When I try to switch to ReLU activation, my activations grow very fast, which causes my weights to grow unbounded as well until the network blows up!
Is this a common problem to have when using ReLU activation?
I have tried L2 regularization with minimal success. I end up having to set the learning rate lower by a factor of ten compared to the tanh activation, and I have tried adjusting the weight decay rate accordingly, and still the best accuracy I have gotten is about 90%. The rate of weight decay is still outpaced in the end by the updating of certain weights in the network, which leads to an explosion.
It seems everyone is just replacing their activation functions with ReLU and experiencing better results, so I keep looking for bugs and validating my implementation.
Is there more that goes into using ReLU as an activation function? Maybe I have problems in my implementation; can someone validate the accuracy with the same neural net structure?
As you can see, the ReLU function is unbounded for positive values, which allows the activations, and with them the weights, to grow.
In fact, that is why hyperbolic tangent and similar functions are used in those cases: they bound the output to a fixed range (-1 to 1 or 0 to 1 in most cases).
Another approach to dealing with this phenomenon is called weight decay.
The basic motivation is to get a more generalized model (avoid overfitting) and to make sure the weights won't blow up: when updating the weights, you add a regularization term that depends on the weight itself,
meaning that bigger weights get a bigger penalty.
You can read further about it here.
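As a minimal sketch of the weight-decay idea (assuming plain SGD with an L2 penalty; `sgd_step`, `lr` and `weight_decay` are illustrative names, not taken from the question's implementation):

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD step with L2 weight decay.

    The penalty (weight_decay / 2) * w**2 adds weight_decay * w to each
    gradient, so larger weights are pulled toward zero more strongly.
    """
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]


# With a zero data gradient, only the decay term acts on the weights.
weights = [0.5, 5.0, -50.0]
decayed = sgd_step(weights, grads=[0.0, 0.0, 0.0])
shrinkage = [abs(w) - abs(d) for w, d in zip(weights, decayed)]
print(decayed)
print(shrinkage)
```

Each weight shrinks by the same fraction per step, so the absolute pull toward zero grows with the size of the weight. Whether that is enough to stop an explosion still depends on how large the data gradient is compared with the decay term.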

Logistic Regression is sensitive to outliers? Using on synthetic 2D dataset

I am currently using sklearn's LogisticRegression class to work on a synthetic 2D problem. The dataset is shown below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines: model = LogisticRegression(); model.fit(…, tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (The decision regions it found perform indeed better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
With stronger regularization you prevent that, and the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.
model = LogisticRegression(C=0.1).fit(…, tr_labels)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
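To illustrate how C controls the size of the weights without depending on the original data, here is a small stdlib-only sketch. It is not sklearn's solver: `fit_logreg` is a hypothetical helper that runs plain gradient descent on an L2-penalized log-loss, and the 1D dataset is invented.

```python
import math

def fit_logreg(xs, ys, C, lr=0.1, steps=5000):
    """Gradient descent on mean log-loss + w**2 / (2 * C * n).

    This is scikit-learn's L2-penalized objective divided by C * n, so a
    smaller C means a stronger penalty. The bias b is not penalized.
    """
    n = len(xs)
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = w / (C * n), 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # predicted P(y = 1)
            grad_w += (p - y) * x / n
            grad_b += (p - y) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical, linearly separable 1D data.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

for C in (0.1, 1.0, 100.0):
    w, _ = fit_logreg(xs, ys, C)
    print(f"C={C:>5}: w = {w:.3f}")
```

On separable data like this, the weight keeps growing as C increases, which is the "very sharp decision region" behaviour described above; a small C keeps the weight small, so the distances of the points to the boundary matter more.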
Let us further qualify why logistic regression overfits here. After all, there are just a few outliers, but hundreds of other data points. To see why, it helps to note that logistic loss is a kind of smoothed version of hinge loss (used in SVMs).
SVM does not 'care' about samples on the correct side of the margin at all - as long as they do not cross the margin they inflict zero cost. Since logistic regression is a smoothed version of SVM, the far-away samples do inflict a cost but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
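The comparison between the two losses can be checked directly. The sketch below uses the standard definitions in terms of the signed margin y·f(x) with y in {-1, +1}; the particular margin values are arbitrary.

```python
import math

def hinge_loss(margin):
    """SVM hinge loss for a signed margin y * f(x), with y in {-1, +1}."""
    return max(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Logistic loss for the same signed margin."""
    return math.log(1.0 + math.exp(-margin))

for m in (-1.0, 0.0, 1.0, 3.0, 6.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  "
          f"logistic={logistic_loss(m):.4f}")
```

For large positive margins the hinge loss is exactly zero, while the logistic loss is small but never zero; for points near or on the wrong side of the boundary both losses are large. That is why a few samples close to the boundary can dominate the fit.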

Backpropagation in Gradient Descent for Neural Networks vs. Linear Regression

I'm trying to understand "Back Propagation" as it is used in Neural Nets that are optimized using Gradient Descent. Reading through the literature it seems to do a few things.
Use random weights to start with and get error values
Perform Gradient Descent on the loss function using these weights to arrive at new weights.
Update the weights with these new weights until the loss function is minimized.
The steps above seem to be the EXACT process used to solve for linear models (regression, for example). Andrew Ng's excellent Machine Learning course on Coursera does exactly that for linear regression.
So, I'm trying to understand whether backpropagation does anything more than gradient descent on the loss function, and if not, why it is only referenced in the case of neural nets and not for GLMs (Generalized Linear Models). They all seem to be doing the same thing; what might I be missing?
The main division happens to be hiding in plain sight: linearity. In fact, extend the question to continuity of the first derivative, and you'll encapsulate most of the difference.
First of all, take note of one basic principle of neural nets (NN): a NN with linear weights and linear dependencies is a GLM. Also, in that linear case, having multiple hidden layers is equivalent to having a single hidden layer: it's still linear combinations from input to output.
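The claim about stacked linear layers can be checked numerically. A minimal sketch, assuming a network with no activation functions and hypothetical weight matrices W1 and W2:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(a, v):
    """Multiply matrix a by vector v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in a]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output, no activation.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
W2 = [[2.0, 1.0, -1.0]]

x = [4.0, 5.0]
two_layer = matvec(W2, matvec(W1, x))   # apply the layers one after another
collapsed = matvec(matmul(W2, W1), x)   # single equivalent weight matrix

print(two_layer, collapsed)  # [6.0] [6.0]
```

Because W2(W1 x) = (W2 W1) x, the hidden layer adds no expressive power until a non-linearity is placed between the two matrices.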
A "modern" NN has non-linear layers: ReLUs (change negative values to 0), pooling (max, min, or mean of several values), dropout (randomly remove some values), and other methods destroy our ability to smoothly apply Gradient Descent (GD) to the model. Instead, we take many of the principles and work backward, applying limited corrections layer by layer, all the way back to the weights at layer 1.
Lather, rinse, repeat until convergence.
Does that clear up the problem for you?
You got it!
A typical ReLU is
f(x) = x if x > 0,
0 otherwise
A typical pooling layer reduces the input length and width by a factor of 2; in each 2x2 square, only the maximum value is passed through. Dropout simply kills off random values to make the model retrain those weights from "primary sources". Each of these is a headache for GD, so we have to do it layer by layer.
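For concreteness, here is a small sketch of the ReLU and the 2x2 max pooling just described, in plain Python on a made-up 4x4 grid. Dropout is left out so that the example stays deterministic.

```python
def relu(x):
    """f(x) = x if x > 0, else 0."""
    return x if x > 0 else 0.0

def max_pool_2x2(grid):
    """Keep the maximum of each non-overlapping 2x2 square.

    Assumes the grid has an even number of rows and columns.
    """
    return [[max(grid[r][c], grid[r][c + 1],
                 grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

# Made-up 4x4 input, for illustration only.
grid = [[1.0, -2.0, 3.0, 0.0],
        [4.0, 5.0, -6.0, 7.0],
        [-1.0, -1.0, 2.0, 2.0],
        [0.0, -3.0, 1.0, 8.0]]

activated = [[relu(v) for v in row] for row in grid]
pooled = max_pool_2x2(activated)
print(pooled)  # [[5.0, 7.0], [0.0, 8.0]]
```

The 4x4 input becomes a 2x2 output, and only one value per square contributes to it, so only that value receives a gradient on the backward pass.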
"So, I'm trying to understand if BackPropagation does anything more than gradient descent on the loss function.. and if not, why is it only referenced in the case of Neural Nets"
I think (at least originally) backpropagation of errors meant less than what you describe: the term "backpropagation of errors" only referred to the method of calculating derivatives of the loss function, as opposed to e.g. automatic differentiation, symbolic differentiation, or numerical differentiation, regardless of what the gradient was then used for (e.g. gradient descent, or maybe Levenberg-Marquardt).
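To make the distinction concrete, here is a minimal sketch that computes the same gradient in two ways: by propagating the error backward with the chain rule, and by numerical differentiation. The tiny network (one tanh hidden unit, squared-error loss) and all values are hypothetical.

```python
import math

def loss(w1, w2, x, t):
    """Squared error of a tiny network: x -> tanh(w1 * x) -> w2 * h."""
    h = math.tanh(w1 * x)
    return 0.5 * (w2 * h - t) ** 2

def backprop_grads(w1, w2, x, t):
    """Gradients via the chain rule, propagating the error backward."""
    h = math.tanh(w1 * x)            # forward pass
    y = w2 * h
    d_y = y - t                      # dL/dy
    d_w2 = d_y * h                   # dL/dw2
    d_h = d_y * w2                   # error propagated to the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x   # dL/dw1, using tanh'(z) = 1 - tanh(z)^2
    return d_w1, d_w2

def numerical_grads(w1, w2, x, t, eps=1e-6):
    """Central finite differences, for comparison."""
    d_w1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
    d_w2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
    return d_w1, d_w2

args = (0.5, -1.5, 2.0, 1.0)  # hypothetical w1, w2, x, t
print(backprop_grads(*args))
print(numerical_grads(*args))
```

Both functions return (approximately) the same numbers. Either result could then be handed to gradient descent or to another optimizer, which is the sense in which backpropagation is about obtaining the gradient rather than about what is done with it.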
"They all seem to be doing the same thing- what might I be missing?"
They're using different models. If your neural network used linear neurons, it would be equivalent to linear regression.
