Why can't regularization accept Φ as the penalizing term? [closed] - machine-learning

Why can't regularization accept Φ itself as the penalizing term, but does accept Φ^2 (L2) and |Φ| (L1)?
Are there any other forms of penalizing terms?

The regularization term is usually a vector norm because it outputs a scalar value that represents the length of the vector in a certain space. What's important here is that it's a scalar value, not a vector. You cannot use the vector itself as the regularization term, because the regularization value is added to the loss function, which is also a scalar. Thus, you need to compute some scalar from this vector, and that's exactly what a vector norm does. As you can imagine, having a scalar value as the regularization term is pretty intuitive: the higher it is, the more regularization.
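As a minimal sketch of this idea (plain NumPy, with a made-up weight vector and data-loss value), here is how the L1 and L2 norms collapse the weight vector into a single scalar that can be added to the loss:

import numpy as np

w = np.array([0.5, -1.2, 3.0])   # hypothetical weight vector (the Φ from the question)
data_loss = 0.42                 # hypothetical data term of the loss (a scalar)
lam = 0.1                        # regularization strength

l1_penalty = lam * np.sum(np.abs(w))   # L1: sum of |w_i|  -> a scalar
l2_penalty = lam * np.sum(w ** 2)      # L2: sum of w_i^2  -> a scalar

total_loss_l1 = data_loss + l1_penalty   # scalar + scalar: well defined
total_loss_l2 = data_loss + l2_penalty
# data_loss + w would not make sense: you cannot add a whole vector to a scalar loss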
And yes, there are other regularization methods. For instance, the elastic net method combines the L1 and L2 norms. But the most popular ones are plain L1 (lasso regression) and plain L2 (ridge regression). I also encourage you to look into the Bayesian assumptions behind the regularization terms; they are pretty interesting and give a completely different point of view ;)
In other families of algorithms, such as neural networks, regularization can also be done by early stopping, or, in variational autoencoders, by the prior belief of an isotropic Gaussian on the latent variable distribution.

Related

Is Random Forest a linear or non linear regression model [closed]

Since decision trees are nonlinear models, Random Forest should also be a nonlinear method in my opinion. But in some articles I have read otherwise. Can anyone explain in what sense they are nonlinear or not?
Or, in other words, is Random Forest for linear or nonlinear data?
If I have a dependent variable A and independent variables B, C, and so on, how would RF fit a regression on these variables in the data?
What RF does is divide your data into rectangular (axis-aligned) boxes.
When you then get a new data point, it follows the yes/no answers down each tree and ends up in a box.
In classification, it counts how many samples of each class are in that box, and the majority class is the prediction.
When doing regression, it takes the mean of the target values in that box.
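A small sketch of this in scikit-learn (with made-up nonlinear data, so purely illustrative): the forest's regression prediction is a piecewise-constant average over the boxes found by the trees.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up data: the target A depends nonlinearly on the features B and C
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 2))        # columns: B and C
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2     # A

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The prediction for a new point is the mean of the training targets that fall
# into the same leaf ("box") in each tree, averaged over all trees.
print(rf.predict([[0.5, -1.0]]))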
In a regression setting you have the following equation
y = b0 + x1*b1 + x2*b2 + ... + xn*bn
where xi is your feature "i" and bi is the coefficient to xi.
A linear regression is linear in the coefficients but say we have the following regression
y = x0 + x1*b1 + x2*cos(b2)
that is not a linear regression since it is not linear in the coefficient b2.
To check whether it is linear, the derivative of y with respect to bi should be independent of bi for all bi, i.e. take the first example (the linear one):
dy/db1 = x1
which is independent of b1 (and the same holds for every dy/dbi), but for the second example
# y = x0 + x1*b1 + x2*cos(b2)
dy/db2 = x2*(-sin(b2))
which is not independent of b2, thus not a linear regression.
As you can see, RF and linear regression are two different things, and the linearity of a regression has nothing to do with an RF (or the other way round, for that matter).
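The derivative check above can also be done symbolically; here is a small sketch with SymPy, using the two example models from this answer:

import sympy as sp

x0, x1, x2, b0, b1, b2 = sp.symbols('x0 x1 x2 b0 b1 b2')

linear_model = b0 + x1 * b1 + x2 * b2              # first example
nonlinear_model = x0 + x1 * b1 + x2 * sp.cos(b2)   # second example

print(sp.diff(linear_model, b1))      # x1           -> no b1 left: linear in the coefficients
print(sp.diff(nonlinear_model, b2))   # -x2*sin(b2)  -> still depends on b2: not linear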

Why should we use Temperature in softmax? [closed]

I'm currently working on CNNs and I want to know: what is the function of the temperature in the softmax formula? And why should we use high temperatures to get a softer probability distribution?
One reason to use the temperature is to change the output distribution computed by your neural net. It is applied to the logits vector according to this equation:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
where T is the temperature parameter.
You see, what this will do is change the final probabilities. You can choose T to be anything (the higher the T, the 'softer' the distribution will be; if it is 1, the output distribution will be the same as your normal softmax outputs). What I mean by 'softer' is that the model will basically be less confident about its prediction. As T gets closer to 0, the 'harder' the distribution gets.
a) Sample 'hard' softmax probs : [0.01,0.01,0.98]
b) Sample 'soft' softmax probs : [0.2,0.2,0.6]
'a' is a 'harder' distribution. Your model is very confident about its predictions. However, in many cases, you don't want your model to do that. For example, if you are using an RNN to generate text, you are basically sampling from your output distribution and choosing the sampled word as your output token (and next input). If your model is extremely confident, it may produce very repetitive and uninteresting text. You want it to produce more diverse text, which it will not, because when sampling most of the probability mass will be concentrated in a few tokens and your model will keep selecting the same small set of words over and over again. To give other words a chance of being sampled as well, you can plug in the temperature variable and produce more diverse text.
As to why higher temperatures lead to softer distributions, that has to do with the exponential function. The temperature divides the logits before they are exponentiated, and because the exponential is an increasing function this penalizes bigger logits more than smaller ones: if a logit is already big, dividing it by T reduces its exponential by a larger percentage than it reduces the exponential of a smaller logit.
Here's what I mean,
exp(6) ~ 403
exp(3) ~ 20
Now let's 'penalize' these terms with a temperature of, let's say, 1.5:
exp(6/1.5) ~ 54
exp(3/1.5) ~ 7.4
You can see that in % terms, the bigger the term is, the more it shrinks when the temperature is used to penalize it. When the bigger logits shrink more than your smaller logits, more probability mass (to be computed by the softmax) will be assigned to the smaller logits.
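Here is a minimal NumPy sketch of softmax with temperature, showing how raising T spreads the probability mass and lowering it concentrates the mass:

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = (logits - np.max(logits)) / T   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.0, 1.0, 5.0])
print(softmax_with_temperature(logits, T=1.0))   # sharp: most of the mass on the last class
print(softmax_with_temperature(logits, T=3.0))   # softer: mass spread more evenly
print(softmax_with_temperature(logits, T=0.5))   # harder: even more concentrated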

Ridge regression vs Lasso Regression [closed]

Is Lasso regression or Elastic-net regression always better than the ridge regression?
I've run these regressions on a few data sets and I always get the same result: the mean squared error is lowest for lasso regression. Is this a mere coincidence, or is this true in any case?
On this topic, James, Witten, Hastie and Tibshirani write in their book "An Introduction to Statistical Learning":
These two examples illustrate that neither ridge regression nor the lasso will universally dominate the other. In general, one might expect the lasso to perform better in a setting where a relatively small number of predictors have substantial coefficients, and the remaining predictors have coefficients that are very small or that equal zero. Ridge regression will perform better when the response is a function of many predictors, all with coefficients of roughly equal size. However, the number of predictors that is related to the response is never known a priori for real data sets. A technique such as cross-validation can be used in order to determine which approach is better on a particular data set. (chapter 6.2)
It's different for each problem. In lasso regression, the algorithm tries to shrink the coefficients of useless features exactly to zero, which effectively removes them from the model; this can work very well when only a few features matter, although the optimization is a bit harder (there is no closed-form solution). In ridge regression, the algorithm only makes those extra features less influential by shrinking their coefficients towards zero without ever removing them completely, which is easier to compute.
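As the quoted passage suggests, cross-validation is the practical way to decide between the two; a small scikit-learn sketch (on synthetic data, so the result is only illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import cross_val_score

# Synthetic data where only a few predictors matter (a setting that tends to favour the lasso)
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

models = {
    "lasso": LassoCV(cv=5, random_state=0),
    "ridge": RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5),
}
for name, model in models.items():
    mse = -cross_val_score(model, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    print(name, "cross-validated MSE:", round(mse, 2))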

Support Vector Machine : What are C & Gamma? [closed]

I am new to Machine Learning and I have started following Udacity's Intro to Machine Learning.
I was following Support Vector Machines when this concept of C and Gamma came along. I did some digging around and found the following:
C - A high C tries to minimize the misclassification of training data
and a low value tries to maintain a smooth classification. This makes sense to me.
Gamma - I am unable to understand this one.
Can someone explain this to me in layman terms?
When you are using an SVM, you are necessarily using one of the kernels: linear, polynomial, RBF (Radial Basis Function, also called the Gaussian kernel), or some other kernel. The latter is
K(x,x') = exp(-gamma * ||x-x'||^2)
which explicitly contains your gamma. The larger the gamma, the narrower the Gaussian "bell" is.
I believe, as you go with the course, you will learn more about such "kernel trick".
Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors.
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
- C parameter: C determines how heavily training samples that end up on the wrong side of the boundary are penalized. If the value of C is set low, more outliers are tolerated and a more general decision boundary is found. If the value of C is set high, the decision boundary is fitted to the training data more carefully.
C is used in the soft-margin classifier, which requires an understanding of slack variables: the slack variables determine how much each training sample is allowed to violate the margin, and C controls how heavily these violations are penalized.
- gamma parameter: gamma determines the distance over which a single data sample exerts influence. That is, the gamma parameter can be said to adjust the curvature of the decision boundary.
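A small scikit-learn sketch (on toy data) of where C and gamma enter an RBF-kernel SVM, and how you would typically tune them with cross-validation:

from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

# Low C / low gamma -> smoother, more general boundary;
# high C / high gamma -> boundary that hugs the training points (risk of overfitting).
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print(search.best_params_, search.best_score_)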

Understanding the probabilistic interpretation of logistic regression [closed]

I am having trouble developing intuition about the probabilistic interpretation of logistic regression. Specifically, why is it valid to consider the output of the logistic regression function as a probability?
Any type of classification can be seen as a probabilistic generative model by modeling the class-conditional densities p(x|C_k) (i.e. given the class C_k, what's the probability of x belonging to that class), and the class priors p(C_k) (i.e. what's the probability of class C_k), so that we can apply Bayes' theorem to obtain the posterior probabilities p(C_k|x) (i.e. given x, what's the probability that it belongs to class C_k). It is called generative because, as Bishop says in his book, you could use the model to generate synthetic data by drawing values of x from the marginal distribution p(x).
This all just means that every time you want to classify something into a specific class (e.g. classifying a tumor as malignant or benign based on its size), there will be a probability of that being right or wrong.
Logistic regression uses a sigmoid function (or logistic function) in order to classify the data. Since this type of function ranges from 0 to 1, you can naturally interpret its output as a probability. Ultimately, you're looking for p(C_k|x) (in the example, x could be the size of the tumor, C_0 the class that represents benign, and C_1 malignant), and in the case of logistic regression, this is modeled by:
p(C_k|x) = sigma( w^t x )
where sigma is the sigmoid function, w^t is the transposed set of weights w, and x is your feature vector.
I highly recommend you read Chapter 4 of Bishop's book.
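A tiny NumPy sketch of that model, with made-up weights and features: the posterior p(C_1|x) is just the sigmoid of the linear score w^t x.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.array([0.8, -0.3, 1.5])   # hypothetical learned weights
x = np.array([1.0, 2.0, 0.5])    # feature vector (the first entry could act as a bias term)

p_c1 = sigmoid(w @ x)            # p(C_1 | x), e.g. probability of "malignant"
p_c0 = 1.0 - p_c1                # p(C_0 | x), probability of "benign"
print(p_c1, p_c0)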
• The probabilistic interpretation of logistic regression is based on the 3 assumptions below:
Features are real-valued and Gaussian distributed.
Response variable is a Bernoulli random variable. For example, in a binary class problem, yi = 0 or 1.
For all i and j!=i, xi and xj are conditionally independent given y. (Naive Bayes assumption)
So essentially,
Logistic-Reg = Gaussian Naive Bayes + Bernoulli class labels
• With the sigmoid model above, P(y=1|X) = sigma(w^t x) and P(y=0|X) = 1 - sigma(w^t x), and the optimization objective is to maximize the likelihood of the observed labels under this model.
• If we do a little math, we can see that the geometric and probabilistic interpretations of logistic regression boil down to the same thing.
