L1-norm vs l2-norm as cost function when standardizing - machine-learning

I have some data where both the input and the output values are standardized, so the difference between Y and Y_pred is always going to be very small.
I feel that the l2-norm will penalize the model less than the l1-norm, since squaring a number between 0 and 1 always results in a smaller number.
So my question is, is it ok to use the l2-norm when both the input and the output are standardized?

It does not matter.
The basic idea/motivation is how to penalize deviations. The L1-norm does not care much about outliers, while the L2-norm penalizes them heavily. This is the basic difference, and you will find a lot of pros and cons, even on Wikipedia.
So, regarding your question of whether it makes sense when the expected deviations are small: sure, it behaves the same way.
Let's look at an example, with y_real = 1.0 and two predictions, y_pred = 0.8 and y_pred = 0.6:
l1: |0.2| = 0.2 vs |0.4| = 0.4 => 2x more error!
l2: 0.2^2 = 0.04 vs 0.4^2 = 0.16 => 4x more error!
You see, the basic idea still applies!
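The same comparison as a small numpy sketch (the values are just the illustrative ones from above):
import numpy as np

y_real = 1.0
y_pred = np.array([0.8, 0.6])

l1 = np.abs(y_real - y_pred)      # [0.2, 0.4]
l2 = (y_real - y_pred) ** 2       # [0.04, 0.16]
print(l1[1] / l1[0])              # 2.0: the larger deviation costs 2x more under L1
print(l2[1] / l2[0])              # 4.0: and 4x more under L2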

Related

How to convert TS-SS result to similarity measure between 0 - 1?

I'm currently developing a question plugin for an LMS that auto-grades answers based on the similarity between the answer and the answer key, using cosine similarity. Lately I found an algorithm called TS-SS that promises to be more accurate. However, the result of its calculation ranges from 0 to infinity. Not being a machine learning guy, I assumed the result might be a distance, just like Euclidean distance, but I'm not sure. It could be something geometric, because the algorithm calculates a triangle and a sector, so I'm assuming it is some kind of geometric similarity, though I'm not sure.
So I have some examples in my notes, and I tried to convert the distance with what people suggest, S = 1 / (1 + D), but the result was not what I was looking for. With cosine similarity I got 0.77, but with TS-SS plus the previous equation I got 0.4. Then I found an SO answer that uses S = 1 / (1.1 ** D). When I tried that equation, sure enough it gave me a "relevant" result, 0.81. That is not far from cosine similarity, and in my opinion it is better suited for auto-grading than the 0.77 based on the answer key.
Unfortunately, I don't know where that equation comes from, and I tried to google it with no luck, which is why I'm asking this question.
How do I convert the TS-SS result to a similarity measure the right way? Is S = 1 / (1.1 ** D) enough, or...?
Edit:
When calculating TS-SS, cosine similarity is actually used as part of the calculation. So, if the cosine similarity is 1, the TS-SS will be 0. But if the cosine similarity is 0, the TS-SS is not infinity. So I think it is reasonable to compare the results of the two to decide which conversion formula to use.
TS-SS Cosine Similarity
38.19 0
7.065 0.45
3.001 0.66
1.455 0.77
0.857 0.81
0.006 0.80
0 1
Another random comparison, from multiple answer keys:
36.89 0
9.818 0.42
7.581 0.45
3.910 0.63
2.278 0.77
2.935 0.75
1.329 0.81
0.494 0.84
0.053 0.75
0.011 0.80
0.003 0.98
0 1
Comparison from the same answer key:
38.11 0.71
4.293 0.33
1.448 0
1.203 0.17
0.527 0.62
Thank you in advance
With these new figures, the answer is simply that we can't give you an answer. The two functions give you a distance measure based on metrics that appear to be different enough that we can't simply transform between TS-SS and CS. In fact, if the two functions are continuous (which they're supposed to be for comfortable use), then the transformation between them isn't a bijection (two-way function).
For a smooth translation between the two, we need at least that both functions be continuous and differentiable over the entire interval of application, so that a small change in the document results in a small change in the metric. We also need them to be monotonic over the interval, such that a rise in TS-SS always corresponds to a drop in CS.
Your data tables show that we can't even craft such a transformation function for a single document, let alone for the metrics in general.
The cited question was a much simpler problem: there, the OP already had a transformation with all of the desired properties; they needed only to alter the slopes of change and ensure the boundary properties.
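As a quick sanity check, using the TS-SS / cosine-similarity pairs from the "same answer key" table above as illustrative data, sorting by TS-SS shows that the cosine similarities are not monotone, so no single transformation S = f(D) can reproduce them:
# (TS-SS, cosine similarity) pairs copied from the "same answer key" table above
pairs = [(38.11, 0.71), (4.293, 0.33), (1.448, 0.0), (1.203, 0.17), (0.527, 0.62)]

# A usable transform S = f(D) would need CS to be non-increasing as the distance D grows
pairs.sort(key=lambda p: p[0])
cs_sorted = [cs for _, cs in pairs]
print(cs_sorted)                                              # [0.62, 0.17, 0.0, 0.33, 0.71]
print(all(a >= b for a, b in zip(cs_sorted, cs_sorted[1:])))  # False: no monotone mapping exists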

High AUC and 100% recall, but precision and F1 are low

I have an imbalanced dataset with 43323 rows; 9 of them belong to the 'failure' class and the rest belong to the 'normal' class. I trained a classifier that achieves 100% recall and 94.89% AUC on the test data (0.75/0.25 split with stratify = y). However, the classifier has 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). It also seems weird to me, because usually when dealing with an imbalanced dataset it is hard to get high recall. The goal is to get a better F1 score. What can I do for the next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in the training dataset.)
Getting 100% recall is trivial in fact: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
import numpy as np
import sklearn.metrics

# Probabilities of the positive class only
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = sklearn.metrics.precision_recall_curve(y_test, probabilities)
# Drop the final (precision=1, recall=0) point, which has no corresponding threshold
f1_scores = 2 * recall[:-1] * precision[:-1] / (recall[:-1] + precision[:-1] + 1e-12)
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]
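Once best_thresh is found, it can be applied directly to the probabilities (continuing from the snippet above) to get class predictions at that operating point:
# Apply the selected threshold instead of the default 0.5 cut-off
y_pred = (probabilities >= best_thresh).astype(int)
print(sklearn.metrics.f1_score(y_test, y_pred))   # should be (approximately) the same value as best_f1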

How L1 norm select features?

It is often said that L1 regularization helps with feature selection. How does the L1 norm do that?
And why is L2 regularization not able to do that?
To start, please note that L1 and L2 regularization may not always work exactly like this; there are various quirks, and the behaviour depends on the applied regularization strength and other factors.
First of all, we will consider linear regression as the simplest case.
Secondly, it's easiest to consider only two weights for this problem to get some intuition.
Now, let's introduce a simple constraint: sum of both weights has to be equal to 1.0 (e.g. first weight w1=0.2 and second w2=0.8 or any other combination).
And the last assumptions:
x1 feature has perfect positive correlation with target (e.g. 1.0 * x1 = y, where y is our target)
x2 has almost perfect positive correlation with target (e.g. 0.99 * x2 = y)
(alpha (meaning the regularization strength) will be set to 1.0 for both L1 and L2 in order not to clutter the picture further).
L2
Weights values
For two variables (weights) and L2 regularization we would have the following formula:
alpha * (w1^2 + w2^2)/2 (mean of their squares)
Now, we would like to minimize the above equation as it's part of the cost function.
One can easily see that both have to be set to 0.5 (remember, their sum has to equal 1.0!), because 0.5^2 + 0.5^2 = 0.5. For any other two values summing to 1 we would get a greater value (e.g. 0.8^2 + 0.2^2 = 0.64 + 0.04 = 0.68), hence 0.5 and 0.5 is the optimal solution.
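A quick numeric check of this claim (a small numpy sketch; the grid of candidate weights is just for illustration):
import numpy as np

# All weight pairs (w1, w2) with w1 + w2 = 1.0, on a coarse grid
w1 = np.linspace(0.0, 1.0, 101)
w2 = 1.0 - w1
l2_penalty = (w1 ** 2 + w2 ** 2) / 2   # alpha = 1.0
print(w1[np.argmin(l2_penalty)])       # 0.5: the equal split minimizes the L2 penalty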
Target predictions
In this case we are pretty close for all data points, because:
0.5 * 1.0 + 0.5 * 0.99 = 0.995 (of `y`)
So we are "off" only by 0.005 for each sample. What this means is that the regularization on the weights has a greater effect on the cost function than this small difference (that's why w1 wasn't chosen as the only variable and the values were "split").
BTW. Exact values above will differ slightly (e.g. w1 ~0.49 but it's easier to follow along this way I think).
Final insight
With L2 regularization, two similar weights tend to be "split" in half, as this minimizes the regularization penalty.
L1
Weights values
This time it will be even easier: for two variables (weights) and L1 regularization we would have the following formula:
alpha * (|w1| + |w2|)/2 (mean of their absolute values)
This time it doesn't matter what w1 or w2 are set to (as long as their sum equals 1.0): |0.5| + |0.5| = |0.2| + |0.8| = |1.0| + |0.0| ... (and so on).
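The same kind of check as before (a small numpy sketch) shows that every split gives exactly the same L1 penalty:
import numpy as np

w1 = np.linspace(0.0, 1.0, 101)
w2 = 1.0 - w1
l1_penalty = (np.abs(w1) + np.abs(w2)) / 2    # alpha = 1.0
print(np.unique(np.round(l1_penalty, 12)))    # [0.5]: identical for every split summing to 1.0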
In this case L1 regularization will prefer putting everything on one weight (1.0 and 0.0), for the reason below.
Target predictions
As the distribution of the weights does not change the penalty in this case, it's the prediction loss we are after (under the sum-to-1.0 constraint). For perfect predictions it would be:
1.0 * 1.0 + 0.0 * 0.99 = 1.0
This time we are not "off" at all, and it's "best" to choose just w1; there is no need for w2 in this case.
Final insight
With L1 regularization, similar weights tend to be zeroed out in favor of the one connected to the feature that predicts the target with the lowest coefficient.
BTW. If we had an x3 which was once again positively correlated with the target and described by the equation
0.1 * x3 = y
then only x3 would be chosen, with weight equal to 0.1.
Reality
In reality there is almost never a "perfect correlation" between variables; there are many features interacting with each other, and there are hyperparameters and imperfect optimizers, among many other factors.
This simplified view should give you an intuition for the "why", though.
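As a rough, hedged illustration of both insights with scikit-learn (synthetic data built to mimic the setup above; the exact numbers depend on the random draw and the chosen alphas):
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
y = rng.normal(size=200)
x1 = y.copy()                 # 1.00 * x1 = y (perfect predictor, coefficient 1.0)
x2 = y / 0.99                 # 0.99 * x2 = y (perfect predictor, smaller coefficient)
X = np.column_stack([x1, x2])

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(lasso.coef_)            # roughly [0.0, 0.89]: one weight is zeroed out, and the
                              # feature needing the smaller coefficient survives
print(ridge.coef_)            # roughly [0.48, 0.49]: the weight is split between
                              # the two nearly identical features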
A common application of your question is in different types of regression. Here is a link that explains the difference between Ridge (L2) and Lasso (L1) regression:
https://stats.stackexchange.com/questions/866/when-should-i-use-lasso-vs-ridge

Can intercept and regression coefficients (Beta values) be very high?

I have 38 variables, like oxygen, temperature, pressure, etc., and the task is to determine the total yield produced every day from these variables. When I calculate the regression coefficients and the intercept, they seem abnormal and very high (impractical). For example, the 'temperature' coefficient was found to be +375.456, and I cannot give it a meaning like "an increase of one unit in temperature increases yield by 375.456 g"; that's impractical in my scenario. However, the prediction accuracy seems right. I would like to know how to interpret the huge intercept (-5341.27355) and the huge beta values shown below. One other important point is that I removed multicollinear columns, and I am not scaling/normalizing the variables because I need the beta coefficients to keep a meaning, so that I can say an increase in temperature by one unit increases yield by 10 g or so. Your inputs are highly appreciated!
modl.intercept_
Out[375]: -5341.27354961415
modl.coef_
Out[376]:
array([ 1.38096017e+00, -7.62388829e+00, 5.64611255e+00, 2.26124164e-01,
4.21908571e-01, 4.50695302e-01, -8.15167717e-01, 1.82390184e+00,
-3.32849969e+02, 3.31942553e+02, 3.58830763e+02, -2.05076898e-01,
-3.06404757e+02, 7.86012402e+00, 3.21339318e+02, -7.00817205e-01,
-1.09676321e+04, 1.91481734e+00, 6.02929848e+01, 8.33731416e+00,
-6.23433431e+01, -1.88442804e+00, 6.86526274e+00, -6.76103795e+01,
-1.11406021e+02, 2.48270706e+02, 2.94836048e+01, 1.00279016e+02,
1.42906659e-02, -2.13019683e-03, -6.71427100e+02, -2.03158515e+02,
9.32094007e-03, 5.56457014e+01, -2.91724945e+00, 4.78691176e-01,
8.78121854e+00, -4.93696073e+00])
It's very unlikely that all of these variables are linearly correlated with the target, so I would suggest having a look at simple non-linear regression techniques, such as decision trees or kernel ridge regression. These are, however, more difficult to interpret.
Going back to your issue, these high weights might well be due to a high amount of correlation between the variables, or to you simply not having very much training data.
If you use Lasso regression instead of plain linear regression, the solution is biased away from high regression coefficients, and the fit will likely improve as well.
A small example of how to do this in scikit-learn, including cross-validation of the regularization hyperparameter:
import numpy as np
from sklearn.linear_model import LassoCV

# Make up some data
n_samples = 100
n_features = 5
X = np.random.random((n_samples, n_features))
# Make y linearly dependent on the features
y = np.sum(np.random.random((1, n_features)) * X, axis=1)

model = LassoCV(cv=5, n_alphas=100, fit_intercept=True)
model.fit(X, y)
print(model.intercept_)
print(model.coef_)
If you have a linear regression, the formula looks like this (y = target, x = feature inputs):
y = x1*b1 + x2*b2 + x3*b3 + x4*b4 + ... + c
where b1, b2, b3, b4, ... are your modl.coef_ and c is the intercept. As you already realized, one of your biggest coefficients is 3.319e+02 ≈ 332, and the intercept is also quite big at -5341.
As you already mentioned, a coefficient tells you how much the target variable changes if that feature changes by 1 unit while all other features are held constant.
So, for your interpretation: the higher the absolute coefficient, the higher the influence of that feature. But it is important to note that the model is using a lot of high coefficients, which means your model does not depend on only one variable.
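To tie the formula back to the fitted model, here is a minimal sketch (hypothetical variable names; it assumes the training data is available as an array X) showing that the prediction is just this weighted sum plus the intercept:
import numpy as np

x_row = X[0]    # one row of the 38 process variables (assumed to exist)
manual = np.dot(x_row, modl.coef_) + modl.intercept_
print(np.isclose(manual, modl.predict(x_row.reshape(1, -1))[0]))   # True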

PCA method after StandardScaler produces poor results

I tried a classification problem for fun with the scikit-learn library. I have data of dimension 10000x10, and I found a very weird (to me) phenomenon.
pca = PCA(n_components = 2)
ss = StandardScaler()
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.8
X = ss.fit_transform(X)
In this case, I got a wonderful explained_variance_ratio_ of almost 99%. But when I apply scaling first, PCA's performance suddenly drops drastically and the explained_variance_ratio_ decreases to 20%.
pca = PCA(n_components = 2)
ss = StandardScaler()
X = ss.fit_transform(X)
X = pca.fit_transform(X) # explained_variance_ratio_ = 0.2
What makes this difference? StandardScaler is just a rescaling process, so I suppose there is no information loss. Can I apply PCA first for visualization convenience, or must I standardize first for mathematical soundness?
Suppose you have two features A and B that both measure distance in metres. Feature A has a greater range of values (say, 1-1000) compared to feature B, which has a smaller range (say, 1-10).
Then feature A will capture greater variance in the data than B, and hence it is not a good idea to scale the features in this case.
But if the features have two different units (say, kg and metres), then it is wise to scale the features.
P.S.: PCA preserves the components along which there is maximum variance.
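A hedged sketch with synthetic data (shapes chosen to mimic the 10000x10 case above) that reproduces the effect: one large-scale feature dominates the unscaled variance, and standardizing spreads the variance across all features:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(10000, 10))     # 10 independent features
X[:, 0] *= 100                       # one feature on a much larger scale

print(PCA(n_components=2).fit(X).explained_variance_ratio_.sum())
# close to 1.0: the first component mostly just tracks the large-scale feature

X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=2).fit(X_std).explained_variance_ratio_.sum())
# close to 0.2: after scaling, the 10 independent features contribute about equally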

Resources