In Scikit-Learn (for some reason I am still on 0.18.x), the sklearn.linear_model.LassoLars class, after fitting, exposes the coefficient LARS path as the coef_path_ attribute and the coefficients as the coef_ attribute. I am wondering why the coef_ values are not the same as the values at the last step of coef_path_. Did I misunderstand something about LARS? From the source code of the _fit() method of the Lars class in scikit-learn, coef_ should be coef_path_[:, -1], but somehow I get different numbers for these two attributes.
Answering my own question: it is all about normalization. If normalize is set to True, then coef_path_ holds the coefficients for the normalized features, while coef_ is the last slice of coef_path_ with the normalization factor divided back out.
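As a sanity check, here is a minimal sketch (made-up data, assuming an older scikit-learn release where LassoLars still exposes the normalize parameter) that rebuilds the per-column scaling and recovers coef_ from the last step of coef_path_; the exact preprocessing details may vary slightly between versions:

```python
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.RandomState(0)
X = rng.randn(50, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.randn(50)

model = LassoLars(alpha=0.01, normalize=True).fit(X, y)

# With normalize=True the path is computed on centered columns rescaled to
# unit L2 norm, so undoing that scaling on the last path step gives coef_.
X_scale = np.sqrt(((X - X.mean(axis=0)) ** 2).sum(axis=0))
print(np.allclose(model.coef_, model.coef_path_[:, -1] / X_scale))  # expect True
```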
In Sklearn, roc_curve requires (y_true, y_scores). Generally, for y_scores, I feed in the probabilities output by a classifier's predict_proba method. But in the sklearn example, I see both predict_proba and decision_function being used.
I wonder what the difference is in terms of real-life model evaluation?
The functional form of logistic regression is
$$f(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}.$$
This is what is returned by predict_proba.
The term inside the exponential, i.e.
$$d(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$
is what is returned by decision_function. The "hyperplane" referred to in the documentation is
$$\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = 0.$$
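For scikit-learn's LogisticRegression in the binary case, the two outputs are tied together exactly as above; a minimal check on synthetic data (the data here is purely illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

d = clf.decision_function(X)    # beta_0 + beta_1*x_1 + ... + beta_k*x_k
p = clf.predict_proba(X)[:, 1]  # probability of the positive class

# predict_proba is the logistic sigmoid applied to decision_function
print(np.allclose(p, 1.0 / (1.0 + np.exp(-d))))  # expect True
```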
My understanding after reading a few resources:
Decision Function: Gives the signed distances from the hyperplane. These are therefore unbounded and cannot be interpreted as probabilities. To obtain probabilities there are two common solutions: Platt scaling, and Multi-Attribute Spaces, which calibrate outputs using Extreme Value Theory.
Predict Proba: Gives actual probabilities (0 to 1); however, the probability attribute has to be set to True when fitting the model itself. It uses Platt scaling, which is known to have theoretical issues.
Refer to this in the documentation.
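As for the ROC question itself: roc_curve and roc_auc_score only need a score that ranks positive examples above negative ones, so either output can be fed in. A rough sketch with an SVC on synthetic data (since Platt scaling is fitted on internal cross-validation folds, the two curves are usually close but not identical):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True enables Platt scaling so that predict_proba is available
clf = SVC(probability=True, random_state=0).fit(X_tr, y_tr)

auc_decision = roc_auc_score(y_te, clf.decision_function(X_te))
auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(auc_decision, auc_proba)  # typically very similar, not guaranteed equal
```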
Keras model.fit supports per-sample weights. What is the range of acceptable values for these weights? Must they sum to 1 across all training samples? Or does Keras accept any weight values and then perform some sort of normalization? The Keras source includes, e.g., training_utils.standardize_weights, but that does not appear to be doing statistical standardization.
After looking at the source here, I've found that you should be able to pass any acceptable numerical values (within overflow bounds) for both sample weights and class weights. They do not need to sum to 1 across all training samples, and each weight may be greater than one. The only sort of normalization that appears to happen is taking the max of 2D class-weight inputs.
If both class weights and sample weights are provided, Keras uses the product of the two.
I think the unspoken component here is that the activation function should be dealing with normalization.
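A minimal sketch of passing raw, unnormalized per-sample weights to model.fit (made-up data and architecture, assuming the tf.keras API):

```python
import numpy as np
from tensorflow import keras

X = np.random.randn(100, 4).astype("float32")
y = (X[:, 0] > 0).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Arbitrary positive weights: they need not sum to 1 and may exceed 1.
# Each sample's loss term is simply multiplied by its weight.
sample_weight = np.where(y == 1, 5.0, 1.0)
model.fit(X, y, sample_weight=sample_weight, epochs=2, verbose=0)
```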
In the logistic regression algorithm in Python's scikit-learn library, there is a class_weight argument. I wish to know the mathematical principle behind setting class_weight during model fitting. Is it related to modifying the objective function:
https://drive.google.com/open?id=16TKZFCwkMXRKx_fMnn3d1rvBWwsLbgAU
And what is the specific modification?
Thank you in advance! I would appreciate any help.
Yes, it affects the loss function, and it is very commonly used when your labels are imbalanced. Mathematically, the loss function simply becomes a weighted average of per-sample losses, where each weight depends on the class of the given sample. If no class_weight is used, then all samples are weighted uniformly (as in the picture you attached).
The idea is to punish mistakes on predictions of underrepresented classes more than mistakes on the overrepresented classes.
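One way to see this concretely in scikit-learn (synthetic data; the equivalence holds because class_weight is applied as a per-sample multiplier on the loss terms):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# class_weight={0: 1, 1: 9} scales each sample's loss term by its class weight,
# i.e. it minimizes sum_i w_{y_i} * loss_i plus the regularization term.
clf_cw = LogisticRegression(class_weight={0: 1, 1: 9}).fit(X, y)

# Equivalent formulation: pass the same weights explicitly per sample.
sample_weight = np.where(y == 1, 9.0, 1.0)
clf_sw = LogisticRegression().fit(X, y, sample_weight=sample_weight)

# Same objective, so the fitted coefficients should agree up to solver tolerance.
print(np.allclose(clf_cw.coef_, clf_sw.coef_, atol=1e-4))
```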
See a more detailed discussion here.
The output of PCA is the set of eigenvectors and eigenvalues of the covariance (or correlation) matrix of the original data. Let's say there are $x_1,\dots,x_n$ columns; then there are $z_1,\dots,z_n$ eigenvalues and $\tilde{z}_1,\dots,\tilde{z}_n$ eigenvectors. My questions are:
Can I use the values $\tilde{z}_1^{(1)},\dots,\tilde{z}_1^{(n)}$ of the first (or also another) eigenvector as weights in my model? For example, as the weights of the columns $x_1,\dots,x_n$, a kind of unsupervised method.
I understand the weights $\tilde{z}_1^{(1)},\dots,\tilde{z}_1^{(n)}$ as the contribution of each column. Is that correct?
Can I use Spearman or Kendall correlation instead of covariance? Is it going to change the results?
I know that it is not a conventional way to use PCA, but I would like to know if it makes sense.
First of all, you can do whatever you want, my friend. It's your model and you can use it in the way you want.
However, I wouldn't do that. PCA is a way to reduce dimensionality: you lose some information in exchange for training your model faster. PCA can be seen as reducing a sphere in 3 dimensions to a circle in 2 dimensions. In some situations that can be useful.
With the output of PCA you can train your model and see what you get. However, have you tried training your model without PCA? Maybe you don't need it and you are losing information that would improve your model.
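If you do want to try it anyway, here is a minimal sketch (iris data purely for illustration) of taking the entries of the first eigenvector from scikit-learn's PCA as per-column weights; standardizing the columns first corresponds to working with the correlation matrix:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # standardize -> correlation matrix

pca = PCA().fit(X_std)

# Loadings of the first eigenvector: one entry per original column. Treating
# their absolute values as per-column "importance" weights is the unsupervised
# weighting described in the question; note the signs of loadings are arbitrary.
first_eigenvector = pca.components_[0]
weights = np.abs(first_eigenvector)
print(np.round(weights, 3))
```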
Okay, I have a lot of confusion about the way likelihood functions are defined in the context of different machine learning algorithms. For this discussion, I will reference Andrew Ng's CS229 lecture notes.
Here is my understanding thus far.
In the context of classification, we have two different types of algorithms: discriminative and generative. The goal in both cases is to determine the posterior probability p(C_k|x;w), where w is the parameter vector, x is the feature vector, and C_k is the kth class. The approaches differ in that a discriminative model tries to model the posterior probability directly given x, whereas a generative model estimates the class-conditional distributions p(x|C_k) and the class priors p(C_k) and uses Bayes' theorem to obtain p(C_k|x;w).
From my understanding, Bayes' theorem takes the form p(parameters|data) = p(data|parameters) p(parameters) / p(data), where the likelihood function is p(data|parameters), the posterior is p(parameters|data), and the prior is p(parameters).
Now in the context of linear regression, we have the likelihood function:
p(y|X;w), where y is the vector of target values and X is the design matrix.
This makes sense according to how we defined the likelihood function above.
Now moving over to classification, the likelihood is still defined as p(y|X;w). Will the likelihood always be defined this way?
The posterior probability we want is p(y_i|x;w) for each class, which is very strange, since this is apparently the likelihood function as well.
Reading through different texts, the likelihood always seems to be defined in different ways, which confuses me profusely. Is there a difference in how the likelihood function should be interpreted for regression vs. classification, or for generative vs. discriminative models? For example, the way the likelihood is defined in Gaussian discriminant analysis looks very different.
If anyone can recommend resources that go over this in detail, I would appreciate it.
A quick answer is that the likelihood function is a function proportional to the probability of seeing the data conditional on all the parameters in your model. As you said in linear regression it is p(y|X,w) where w is your vector of regression coefficients and X is your design matrix.
In a classification context, your likelihood would be proportional to P(y|X,w) where y is your vector of observed class labels. You do not have a y_i for each class, because your training data was observed to be in one particular class. Given your model specification and your model parameters, for each observed data point you should be able to calculate the probability of seeing the observed class. This is your likelihood.
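One compact way to see that both cases fit the same template $p(y|X;w)$, written in the question's notation (a summary under standard modeling assumptions, not a quote from the lecture notes):

$$
p(y \mid X; w) = \prod_{i=1}^{m} \mathcal{N}\!\left(y_i \mid w^\top x_i,\ \sigma^2\right)
\qquad \text{(linear regression with Gaussian noise)}
$$

$$
p(y \mid X; w) = \prod_{i=1}^{m} h_w(x_i)^{y_i}\,\bigl(1 - h_w(x_i)\bigr)^{1-y_i},
\quad h_w(x) = \frac{1}{1 + e^{-w^\top x}}
\qquad \text{(logistic regression, } y_i \in \{0,1\}\text{)}
$$

In both cases the likelihood is the probability of the observed targets y given X and the parameters w; only the conditional distribution assumed for each $y_i$ changes.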
The posterior predictive distribution, p(y_new_i|X,y), is the probability you want in paragraph 4. This is distinct from the likelihood because it is the probability for some unobserved case, rather than the likelihood, which relates to your training data. Note that I removed w because typically you would want to marginalize over it rather than condition on it because there is still uncertainty in the estimate after training your model and you would want your predictions to marginalize over that rather than condition on one particular value.
As an aside, the goal of all classification methods is not to find a posterior distribution, only Bayesian methods are really concerned with a posterior and those methods are necessarily generative. There are plenty of non-Bayesian methods and plenty of non-probabilistic discriminative models out there.
Any function proportional to p(a|b) where a is fixed is a likelihood function for b. Note that p(a|b) might be called something else, depending on what's interesting at the moment. For example, p(a|b) can also be called the posterior for a given b. The names don't really matter.