decay_rate = 0.99 # decay factor for RMSProp leaky sum of grad^2
I'm perplexed by the wording of comments like the one above that talk about a "leaky" sum of squares for the RMSProp optimizer. So far I've been able to uncover that this particular line is copy-pasted from Andrej Karpathy's Deep Reinforcement Learning: Pong from Pixels, and that RMSProp is an unpublished optimizer proposed by Hinton in one of his Coursera classes. Looking at the math for RMSProp from link 2, it's hard to figure out how any of this is "leaky."
Would anyone happen to know why RMSProp is described this way?
RMSProp keeps an exponentially decaying average of squared gradients. The (admittedly unfortunate) wording "leaky" refers to how much of the previous estimate "leaks" into the current one, since
E[g^2]_t := 0.99 * E[g^2]_{t-1} + 0.01 * g^2_t
where the first term is the portion of the previous estimate that "leaks" through, and the second term is the new data.
Related
Quick question:
Is the RMSProp optimizer compatible with online learning (stochastic, updating the weights after every sample)? Everything I can find describes RMSProp being used with mini-batch or full-batch updates, but nothing explicitly states whether online stochastic learning is out of the question.
Very short answer: it is. You can use it with SGD. Example: http://www.erogol.com/comparison-sgd-vs-momentum-vs-rmsprop-vs-momentumrmsprop/
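For a concrete sanity check, here is a minimal Keras sketch (a toy setup of my own, not taken from the linked post) that trains with RMSProp and a batch size of 1, i.e. one weight update per sample:

import numpy as np
import tensorflow as tf

# toy data: 100 samples, 5 features, binary labels
x = np.random.randn(100, 5).astype("float32")
y = (np.random.rand(100) > 0.5).astype("float32")

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
              loss="binary_crossentropy")

# batch_size=1 means one gradient step per sample, i.e. online/stochastic learning
model.fit(x, y, batch_size=1, epochs=1, verbose=0)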
I'm using the LogisticRegression class from scikit-learn on a version of the flight delay dataset.
I use pandas to select some columns:
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
I fill in NaN values with 0:
df = df.fillna({'ARR_DEL15': 0})
I make sure the categorical columns are marked with the 'category' data type:
df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')
Then call get_dummies() from pandas:
df = pd.get_dummies(df)
Now I train and test my data set:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

lr = LogisticRegression()
# train_test_split returns (train, test), so assign in that order
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)
train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]
test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]
lr.fit(train_set_x, train_set_y)
Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number, around 0.583:
from sklearn.metrics import roc_auc_score

probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])
Is there any reason why the ROC AUC is much lower than what the score method provides?
To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.
[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
In my experience at least, most ML practitioners think that the AUC measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other the-higher-the-better metric, such as accuracy, which naturally leads to puzzles like the one you describe.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).
Now, this threshold, in methods like scikit-learn's predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
The point to take home is that:
when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
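To make this concrete with your own objects (lr, test_set_x, test_set_y from the code above), here is a short sketch showing that accuracy changes with the chosen threshold while the AUC never uses one:

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

probs = lr.predict_proba(test_set_x)[:, 1]   # probabilities for class 1

# accuracy at several decision thresholds - each one gives a different number
for threshold in (0.3, 0.5, 0.7):
    preds = (probs >= threshold).astype(int)
    print(threshold, accuracy_score(test_set_y, preds))

# lr.score(...) is simply the accuracy at the default threshold of 0.5
print(lr.score(test_set_x, test_set_y))

# the AUC uses the probabilities directly - no threshold is ever chosen
print(roc_auc_score(test_set_y, probs))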
Given these clarifications, your particular example provides a very interesting case in point:
I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?
Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical purposes, what you care about is a classifier deployed with a specific threshold; what that classifier does in the purely theoretical and abstract situation of being averaged across all possible thresholds should be of very little interest to a practitioner (it is of interest to a researcher devising a new algorithm, but I assume that is not your case).
(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system.
Emphasis mine - see also On the dangers of AUC...
I don't know what exactly ARR_DEL15 is, which you use as your label (it is not in the original data). My guess is that it is an imbalanced feature, i.e. there are many more 0's than 1's; in such a case, accuracy as a metric is not meaningful, and you should use precision, recall, and the confusion matrix instead - see also this thread.
Just as an extreme example, if 87% of your labels are 0's, you can have an 87% accuracy "classifier" simply (and naively) by classifying all samples as 0; in such a case, you would also have a low AUC (fairly close to 0.5, as in your case).
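This extreme case is easy to reproduce with scikit-learn's DummyClassifier (purely hypothetical data, not the flight delay set):

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

# hypothetical labels: 87% zeros, 13% ones; the features are irrelevant here
y = np.array([0] * 870 + [1] * 130)
X = np.random.randn(len(y), 3)

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
print(dummy.score(X, y))                               # 0.87 "accuracy"
print(roc_auc_score(y, dummy.predict_proba(X)[:, 1]))  # 0.5 - no discrimination at all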
For a more general (and much needed, in my opinion) discussion of what exactly AUC is, see my other answer.
Should I avoid using L2 regularization in conjunction with RMSprop and NAG?
Does the L2 regularization term interfere with the adaptive gradient algorithm (RMSprop)?
It seems that someone has since sorted out (in 2018) this question (asked in 2017).
Vanilla adaptive gradients (RMSProp, Adagrad, Adam, etc) do not match well with L2 regularization.
Link to the paper [https://arxiv.org/pdf/1711.05101.pdf] and some excerpts:
In this paper, we show that a major factor of the poor generalization of the most popular adaptive gradient method, Adam, is due to the fact that L2 regularization is not nearly as effective for it as for SGD.

L2 regularization and weight decay are not identical. Contrary to common belief, the two techniques are not equivalent. For SGD, they can be made equivalent by a reparameterization of the weight decay factor based on the learning rate; this is not the case for Adam. In particular, when combined with adaptive gradients, L2 regularization leads to weights with large gradients being regularized less than they would be when using weight decay.
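A schematic NumPy sketch of the distinction the paper describes (my own illustration, using an RMSProp-style update rather than the paper's exact Adam formulation; lam is the regularization/decay coefficient):

import numpy as np

lr, decay, eps, lam = 1e-3, 0.99, 1e-8, 1e-2

def l2_regularized_step(w, grad, v):
    # L2 regularization: lam * w is added to the gradient BEFORE the adaptive
    # rescaling, so weights with large gradients end up being decayed less
    g = grad + lam * w
    v = decay * v + (1 - decay) * g ** 2
    return w - lr * g / (np.sqrt(v) + eps), v

def decoupled_weight_decay_step(w, grad, v):
    # decoupled weight decay (AdamW-style): the decay is applied AFTER the
    # adaptive step, so every weight shrinks by the same relative amount
    v = decay * v + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w - lr * lam * w, v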
I have to deal with a class imbalance problem and do binary classification of an input test data-set, where the majority class label is 1 (the other class label is 0) in the training data-set.
For example, the following is part of the training data:
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class label - 0 or 1. The actual data-set is very skewed, with roughly a 10:1 class ratio: around 700 samples have 0 as their class label, while the remaining 6800 have 1.
The rows above are only a few of all the samples in the given data-set, but the actual data-set contains about 90% of samples with class label 1 and the rest with class label 0, despite the fact that more or less all the samples are very similar.
Which classifier would be best for handling this kind of data-set?
I have already tried logistic regression as well as SVM with the class-weight parameter set to "balanced", but got no significant improvement in accuracy.
but got no significant improvement in accuracy.
Accuracy isn't the way to go (e.g. see the Accuracy paradox). With a 10:1 class ratio you can easily get around 90% accuracy just by always predicting the majority class label (1 in your case).
Some good starting points are:
try a different performance metric, e.g. the F1-score or the Matthews correlation coefficient
"resample" the dataset: add examples from the under-represented class (over-sampling) or delete instances from the over-represented class (under-sampling, if you have a lot of data) - see the sketch after this list
a different point of view: anomaly detection is worth a try for an imbalanced dataset
a different algorithm is another possibility, but it is not a silver bullet; you should probably start with decision trees, which often perform well on imbalanced datasets
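As a rough sketch of the resampling idea, here is a minimal over-sampling example in pandas (the data and the column name "label" are toy stand-ins of mine, not your actual set):

import numpy as np
import pandas as pd

# toy stand-in for the skewed data set: 6800 rows of class 1, 700 rows of class 0
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(94.5, 0.5, size=(7500, 6)))
df["label"] = [1] * 6800 + [0] * 700

minority = df[df["label"] == 0]
majority = df[df["label"] == 1]

# over-sampling: draw minority rows with replacement until the classes are balanced
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
df_balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=42)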
EDIT (now knowing you're using scikit-learn)
The weights from the class_weight (scikit-learn) parameter are used to train the classifier (so balanced is ok) but accuracy is a poor choice to know how well it's performing.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
Have you tried plotting the ROC curve and computing the AUC to check your parameters and different thresholds? If not, that should give you a good starting point.
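For instance, here is a minimal sketch with scikit-learn's metrics on a toy imbalanced set (make_classification is only there to keep the example self-contained):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score,
                             matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import train_test_split

# toy imbalanced data set (roughly 10:1), purely to illustrate the metrics
X, y = make_classification(n_samples=7500, weights=[0.1, 0.9], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(confusion_matrix(y_te, y_pred))      # how the errors are distributed
print(f1_score(y_te, y_pred))              # balances precision and recall
print(matthews_corrcoef(y_te, y_pred))     # robust to class imbalance
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # threshold-free ranking quality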
I am currently reproducing the code for the char-RNN described in http://karpathy.github.io/2015/05/21/rnn-effectiveness/. There are implementations already available in TensorFlow, and the code I am referring to is https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py. I have a question about the learning rate decay. In that code the optimizer is defined as an AdamOptimizer. When I went through the code, I saw the following lines:
for e in range(args.num_epochs):
    sess.run(tf.assign(model.lr, args.learning_rate * (args.decay_rate ** e)))
which adjusts the learning rate by a decay constant.
My question is: doesn't the Adam optimizer already adapt the learning rate for us? Why do we still apply an explicit learning rate decay here?
I think you mean RMSProp, not Adam; both of the code bases you linked use RMSProp. RMSProp only rescales each gradient so that its per-parameter magnitude is neither too large nor too small; it does not shrink the step size over time. So it is still important to decay the learning rate when training needs to slow down after several epochs.
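For reference, in current TensorFlow/Keras the same per-epoch exponential decay can be expressed with a schedule instead of reassigning model.lr by hand (a sketch, not the linked repository's code; steps_per_epoch is an assumed value):

import tensorflow as tf

initial_lr = 0.002
decay_rate = 0.97
steps_per_epoch = 1000   # assumed; depends on data size and batch size

# multiplies the learning rate by decay_rate once per epoch, like the loop above
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=initial_lr,
    decay_steps=steps_per_epoch,
    decay_rate=decay_rate,
    staircase=True)

optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)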