Should I avoid using L2 regularization in conjunction with RMSProp? - machine-learning

Should I avoid using L2 regularization in conjunction with RMSprop and NAG?
Does the L2 regularization term interfere with the gradient algorithm (RMSprop)?
Best regards,

It seems that someone sorted out (2018) this question (2017).
Vanilla adaptive gradient methods (RMSProp, Adagrad, Adam, etc.) do not mix well with L2 regularization. Link to the paper [https://arxiv.org/pdf/1711.05101.pdf] and an excerpt from the introduction:
In this paper, we show that a major factor of the poor generalization of the most popular adaptive gradient method, Adam, is due to the fact that L2 regularization is not nearly as effective for it as for SGD.
L2 regularization and weight decay are not identical. Contrary to common belief, the two techniques are not equivalent. For SGD, they can be made equivalent by a reparameterization of the weight decay factor based on the learning rate; this is not the case for Adam. In particular, when combined with adaptive gradients, L2 regularization leads to weights with large gradients being regularized less than they would be when using weight decay.
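As a minimal illustration (not the paper's reference code; the learning rate, decay values, and variable names below are all made up), here is a single RMSProp-style step done both ways, showing where the decay term enters:

import numpy as np

lr, wd, rho, eps = 0.01, 0.01, 0.9, 1e-8
w = np.array([1.0, -2.0])        # parameters
g = np.array([0.5, 3.0])         # gradient of the *unregularized* loss
v = np.zeros_like(w)             # running average of squared gradients

# (a) L2 regularization: the decay term is folded into the gradient,
#     so it also gets divided by the adaptive denominator below.
g_l2 = g + wd * w
v_a = rho * v + (1 - rho) * g_l2**2
w_a = w - lr * g_l2 / (np.sqrt(v_a) + eps)

# (b) Decoupled weight decay (AdamW-style): the decay is applied
#     directly to the weights and bypasses the adaptive scaling,
#     so weights with large gradients are decayed just as strongly.
v_b = rho * v + (1 - rho) * g**2
w_b = w - lr * g / (np.sqrt(v_b) + eps) - lr * wd * w

print(w_a, w_b)  # the two updates differ whenever the gradients differ in scale

In (a) the decay on a weight with a large gradient is divided by a large adaptive denominator, which is exactly the "regularized less" effect the paper describes.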

Related

Which model performance metric is used for SVR model by sklearn?

I noticed the math for SVR states that SVR uses an L1 penalty, or epsilon-insensitive loss function, but the sklearn SVR documentation mentions an L2 penalty. I don't have much experience with SVR, so I thought the community who has experience could shed some light on this.
Here is the snippet from the documentation:
C : float, default=1.0
    Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
Check out this link: https://scikit-learn.org/stable/modules/svm.html#svm-regression. Quote: "Here, we are penalizing samples whose prediction is at least ε away from their true target." So the ε-insensitive loss is the data-fit term, while the squared-L2 penalty (controlled by C) regularizes the coefficients; the two are not in conflict.
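For a concrete picture, here is a small usage sketch with synthetic data (the data and parameter values are illustrative, not from the question): C scales the squared-L2 penalty on the coefficients, while epsilon sets the width of the insensitive tube around the targets.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)

# Smaller C => stronger regularization; larger epsilon => wider tube
# within which errors incur no loss.
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
print(model.score(X, y))  # R^2 on the training data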

Does L1 or L2 regularization give the most sparse weights for the same loss function and optimizer?

Given a dataset, which regularization technique (L1 or L2) will output the sparsest weights for the same loss function and the same optimizer?
By definition, L1 regularization (lasso) forces some weights to zero, thus leading to sparser solutions; according to the Wikipedia entry on regularization:
It can be shown that the L1 norm induces sparsity
See also the L1 and L2 Regularization Methods post at Towards Data Science:
The key difference between these techniques is that Lasso shrinks the less important feature’s coefficient to zero thus, removing some feature altogether. So, this works well for feature selection in case we have a huge number of features.
For more details, see the following threads at Cross Validated:
Sparsity in Lasso and advantage over ridge
Why does the Lasso provide Variable Selection?
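As a quick empirical check of the sparsity claim (a sketch with synthetic data and arbitrary alpha values, not taken from any of the linked threads), scikit-learn's Lasso (L1) typically drives many coefficients to exactly zero while Ridge (L2) only shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=50,
                       n_informative=5, noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("zero coefs, Lasso:", np.sum(lasso.coef_ == 0))  # many exact zeros
print("zero coefs, Ridge:", np.sum(ridge.coef_ == 0))  # typically none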

Feature normalization - advantage of L2 normalization

Features are usually normalized prior to classification.
L1 and L2 normalization are usually used in the literature.
Could anybody comment on the advantages of the L2 norm over the L1 norm, and vice versa?
Advantages of L2 over L1 norm
As already stated by aleju in the comments, the derivative of the L2 norm is easy to compute, so it is also easy to use with gradient-based learning methods.
L2 regularization optimizes the mean cost (whereas L1 optimizes the median; see the linked explanation), and the mean is often used as a performance measure. This is especially good if you know you don't have any outliers and you want to keep the overall error small.
The solution is more likely to be unique. This ties in with the previous point: While the mean is a single value, the median might be located in an interval between two points and is therefore not unique.
While L1 regularization can give you a sparse coefficient vector, the non-sparseness of L2 can improve your prediction performance (since you leverage more features instead of simply ignoring them).
L2 is invariant under rotation. If you have a dataset consisting of points in a space and you apply a rotation, you still get the same results (i.e. the distances between points remain the same).
Advantages of L1 over L2 norm
The L1 norm prefers sparse coefficient vectors. (explanation on Quora) This means the L1 norm performs feature selection and you can delete all features where the coefficient is 0. A reduction of the dimensions is useful in almost all cases.
The L1 norm optimizes the median. Therefore the L1 norm is not sensitive to outliers.
More sources:
The same question on Quora
Another one
If you are working with inverse problems, L1 will return a more sparse matrix and L2 will return a more correlated matrix.
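As a short sketch of what per-sample L1 vs L2 feature normalization looks like in practice (using scikit-learn's normalize; the input matrix here is made up), each row is rescaled to unit norm under the chosen norm:

import numpy as np
from sklearn.preprocessing import normalize

X = np.array([[3.0, 4.0],
              [1.0, -1.0]])

X_l2 = normalize(X, norm="l2")  # rows have unit Euclidean length
X_l1 = normalize(X, norm="l1")  # row entries' absolute values sum to 1

print(X_l2)                      # first row becomes [0.6, 0.8]
print(np.abs(X_l1).sum(axis=1))  # [1., 1.]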

What is `weight_decay` meta parameter in Caffe?

Looking at an example 'solver.prototxt' posted on the BVLC/caffe git repository, there is a training meta parameter:
weight_decay: 0.04
What does this meta parameter mean? And what value should I assign to it?
The weight_decay meta parameter governs the regularization term of the neural net.
During training a regularization term is added to the network's loss to compute the backprop gradient. The weight_decay value determines how dominant this regularization term will be in the gradient computation.
As a rule of thumb, the more training examples you have, the weaker this term should be. The more parameters you have (e.g., a deeper net, larger filters, larger InnerProduct layers, etc.), the higher this term should be.
Caffe also allows you to choose between L2 regularization (default) and L1 regularization, by setting
regularization_type: "L1"
However, since in most cases the weights are small numbers (e.g., -1 < w < 1), the L2 norm of the weights is significantly smaller than their L1 norm. Thus, if you choose to use regularization_type: "L1", you might need to tune weight_decay to a significantly smaller value.
While learning rate may (and usually does) change during training, the regularization weight is fixed throughout.
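As a minimal Python sketch (not Caffe's actual C++ implementation; the names and values are illustrative) of how a solver can fold weight_decay into the backprop gradient before a plain SGD step:

import numpy as np

def sgd_step(w, grad, lr=0.01, weight_decay=0.04, reg_type="L2"):
    """One plain-SGD update with an L1 or L2 regularization term."""
    if reg_type == "L2":
        grad = grad + weight_decay * w           # d/dw of (wd/2)*||w||^2
    elif reg_type == "L1":
        grad = grad + weight_decay * np.sign(w)  # d/dw of wd*||w||_1
    return w - lr * grad

w = np.array([0.5, -0.3])
g = np.array([0.1, 0.2])
print(sgd_step(w, g))

This also makes the L1/L2 tuning remark above concrete: for small weights, weight_decay * np.sign(w) is a much larger correction than weight_decay * w.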
Weight decay is a regularization term that penalizes big weights.
When the weight decay coefficient is big the penalty for big weights is also big, when it is small weights can freely grow.
Look at this answer (not specific to caffe) for a better explanation:
Difference between neural net "weight decay" and "learning rate".

What is the difference between K-means clustering and vector quantization?

What is the difference between K-means clustering and vector quantization?
They seem to be very similar.
I'm dealing with Hidden Markov Models and I need to extract symbols from feature vectors.
In order to extract symbols, do I do vector quantization or k-means clustering?
The way I understand it, K-means is one type of vector quantization.
The K-means algorithm is the specialization of the celebrated "Lloyd I" quantization algorithm to the case of empirical distributions (cf. Lloyd).
The Lloyd I algorithm is proved to yield a sequence of quantizers with decreasing quadratic distortion. However, except in the special case of one-dimensional log-concave distributions, it does not always converge to a quadratically optimal quantizer. (There are local minima of the quantization error, especially when dealing with empirical distributions, i.e., with the clustering problem.)
A method that always converges toward an optimal quantizer is the so-called CLVQ algorithm, which also generalizes to the problem of more general L^p quantization. It is a kind of stochastic gradient method (cf. Pagès).
There are also some approaches based on genetic algorithms (cf. Hamida et al.), and/or classical optimization procedures for the one-dimensional case that converge faster (Pagès, Printems).
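As a bare-bones sketch of the Lloyd iteration on an empirical distribution, i.e., plain K-means (the initialization and stopping rule here are simplistic, and this is not any of the cited authors' code; in practice use a library implementation such as sklearn's KMeans):

import numpy as np

def lloyd(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center under squared Euclidean distance.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its cell
        # (empty cells keep their old center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the quadratic distortion can no longer decrease
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = lloyd(X, k=2)
print(centers)

For the HMM use case in the question, the resulting labels array is exactly the sequence of discrete symbols extracted from the feature vectors.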
