h2o GLM with regularization

I am using the h2o package in R to fit a GLM via the h2o.glm() function. One reasonable way to assess feature importance in a GLM with an L1 regularization penalty is to monitor the order in which parameters enter the linear predictor (i.e. the model) as the L1 penalty weight decreases over each successive lambda. I cannot find in the h2o documentation whether it is possible to extract this information from a returned model object.
Does anyone know if it is possible to view the fitted model form after each successive lambda?
Thanks,

You can get the full regularization path with h2o.getGLMFullRegularizationPath(my_glm), where my_glm is the GLM you trained. Just remember to set lambda_search to TRUE when fitting, i.e. my_glm = h2o.glm(x, y, training_frame, lambda_search = TRUE).
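The question is about the R API, but here is a minimal sketch of the same workflow in h2o's Python interface, assuming a frame with a response column named "y"; the getGLMRegularizationPath method and the returned dictionary keys are how I remember the Python docs, so verify them against your h2o version:

```python
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
train = h2o.import_file("train.csv")             # hypothetical dataset
x = [c for c in train.columns if c != "y"]

# alpha=1.0 gives a pure L1 (lasso) penalty; lambda_search=True fits the
# model over a whole sequence of decreasing lambda values
glm = H2OGeneralizedLinearEstimator(alpha=1.0, lambda_search=True)
glm.train(x=x, y="y", training_frame=train)

# the regularization path: one coefficient vector per lambda
path = H2OGeneralizedLinearEstimator.getGLMRegularizationPath(glm)
for lam, coefs in zip(path["lambdas"], path["coefficients"]):
    active = [name for name, v in coefs.items() if v != 0.0]
    print(lam, active)   # watch the order in which features enter
```

Printing the nonzero coefficients at each lambda along the path reveals the order in which features enter the model, which is exactly the importance ordering the question is after.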

Related

Overfitting in Polynomial Regression: comparing the training set's and the validation set's root-mean-squared error
[Figure: root-mean-squared error vs. ln λ for the M = 9 polynomial]
I didn't understand this graph properly. While training the model to learn the parameters, we have to set λ = 0, since it doesn't make sense to already select a value of λ and then proceed with the training. So how is the training error varying as we vary the value of λ? We divided the dataset into a training set and a validation set so that we train the model on the training set and then verify performance on the validation set.
You may be confusing some concepts here.
Why the loss function?
You apply a shrinkage penalty to your loss function. This nudges your model towards weights closer to zero, which can be helpful for regularization by trading variance for bias (see e.g. ridge regression in "An Introduction to Statistical Learning" by Witten et al.).
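For concreteness, this is the regularized sum-of-squares error being described, written in the notation of the M = 9 polynomial example from the question (the constant factors of 1/2 differ between textbooks):

$$\widetilde{E}(\mathbf{w}) \;=\; \frac{1}{2}\sum_{n=1}^{N}\bigl(y(x_n,\mathbf{w})-t_n\bigr)^2 \;+\; \frac{\lambda}{2}\,\lVert\mathbf{w}\rVert^2$$

The larger λ is, the more heavily large weights are penalized, so the fitted weights shrink towards zero.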
How to train?
Setting the term λ after training does not affect your trained model. Think about it in terms of linear regression: as soon as you have fitted the regression function, the error function does not matter anymore; you just apply the fitted function.
Thus, you have to set the λ parameter during training.
Otherwise, your model will optimize the parameters without the regularization term and will thus not shrink the sum of weights. Hence, you are actually training your model without regularization.
How to find a good value for λ?
You have to distinguish multiple steps:
Training: You train a model with λ set to a fixed value on your training data. The λ does not change during training, but it must not be zero (if you want to have regularization).
This training will yield a model; let’s call it Model λ1.
Single Validation Run: You validate how well Model λ1 performs on the validation set. This tells you whether and how much this λ improved your model, but not whether there is a better λ.
Cross-Validation Idea: Train multiple models and use validation runs to evaluate their performance in order to find the best λ. In other words, you train multiple models (e.g. Model λ2 .. λ10) with different λ values. Then you can compare their validation-set performance to see which value of λ is best (on the validation set).
Why have a three-set split (train / validation / test): If you pick a final model with this procedure (e.g. Model λ3), you still don't know how well your model generalizes, because you have been using the validation set to find a good value for λ. Thus, you would expect your model to perform rather well on the validation set.
Hence, you evaluate your final model on the test set, which your model has never seen and on which you never performed any kind of parameter optimization. The performance you measure there is the final performance of your model. It is important that you do not evaluate multiple models on the test set and then select the best, because then you would again be optimizing performance on the test set.
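A minimal sketch of this whole procedure with scikit-learn's ridge regression; the synthetic data, split ratios, and candidate λ grid are placeholders of my choosing:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# three-way split: train / validation / test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# train one model per candidate lambda; lambda stays fixed during each fit
# (lambda = 0 would mean no regularization at all)
results = {}
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=lam).fit(X_train, y_train)
    val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    results[lam] = (model, val_rmse)

# pick the lambda with the best validation performance
best_lam = min(results, key=lambda lam: results[lam][1])
best_model = results[best_lam][0]

# the test set is touched exactly once, by the final chosen model
test_rmse = np.sqrt(mean_squared_error(y_test, best_model.predict(X_test)))
print(best_lam, test_rmse)
```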
How to interpret the plot?
This is actually hard without some more knowledge about the problem you are tackling.
At first glance, it seems like your model is overfitting for small values of λ and improves its performance on the validation set for larger values due to regularization.

Should I find regularization parameter or degree first for Support Vector Regression algorithm?

I am working on a problem to predict the revenue generated by a film. I am using sklearn's support vector regression algorithm with a polynomial kernel. I tried to find the degree that gives the best accuracy using the default value of the regularization parameter, but I got error percentages in the seven-digit range. So I decided to increase the variance by tuning the regularization parameter.
So should I first assume a degree and find the regularization parameter which gives best result or vice versa? Or is there something else that I should consider?
Generally, doing a grid search over the degree and the regularization parameter together is common practice. There is some information about this in sklearn here:
https://scikit-learn.org/stable/modules/grid_search.html
This allows you to make a dictionary of the kernels you want to try (rbf, poly, etc.), their respective hyperparameters, and the regularization parameter, and attempt to find the best combination.
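As a sketch of what that looks like for SVR; the grid values and the synthetic data are illustrative, not a recommendation:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# placeholder data standing in for the film-revenue features
X_train, y_train = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

param_grid = [
    # degree only matters for the polynomial kernel; C is the regularization parameter
    {"kernel": ["poly"], "degree": [2, 3, 4], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "gamma": ["scale", 0.01, 0.1], "C": [0.1, 1, 10, 100]},
]

search = GridSearchCV(SVR(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)

print(search.best_params_)   # jointly chosen kernel, degree/gamma, and C
```

Searching the degree and the regularization parameter C jointly sidesteps the chicken-and-egg problem in the question: the best degree depends on how strongly you regularize, and vice versa.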

What determines the capacity of a machine learning model?

For a KNN model, is the value of K inversely proportional to its capacity/complexity? How can one determine whether a curve is a training curve or a validation curve based on its loss-vs-capacity graph? Thanks! ML newbie here.
K in KNN is a hyperparameter that determines how many (K) votes from the 'nearest neighbors' are considered to make a prediction for a sample. What do you mean by the capacity of the model? Also, the smaller K is, the more complex the model: K = 1 can memorize the training set perfectly, while a large K averages over many neighbors and smooths the decision boundary, so yes, capacity is inversely related to K. On a loss-vs-capacity plot, the training curve keeps decreasing as capacity grows, while the validation curve typically falls and then rises again once the model starts overfitting.
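A quick sketch on synthetic data that makes the direction of the relationship visible:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 3, 9, 27, 81]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_val, y_val))

# Typical pattern: K=1 scores ~1.0 on the training set (highest capacity,
# it memorizes the data), and training accuracy falls as K grows, while
# validation accuracy usually peaks at some intermediate K.
```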

Can a machine learning model provide information about mean and standard deviation of data on which it was trained?

Consider a parametric binary classifier (such as logistic regression, SVM, etc.) trained on a dataset (say containing two features, e.g. blood pressure and cholesterol level). The dataset is thrown away, and the trained model can only be used as a black box (no tweaking, and no inside information can be gathered from the trained model). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean and/or standard deviation and/or range of the features of the dataset on which this model was trained? If yes, how so? And if not, why can't we?
Thank you for your response! :)
SVM does not provide any information about the data statistics: it is a maximum-margin classifier, and it finds the best separating hyperplane between the two classes in the feature space as a linear combination of "support vectors". If you use kernel functions, then this combination lives in the kernel space; it is not even in the original feature space. SVM does not have a straightforward probabilistic interpretation whatsoever.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data, and w are the weights. After maximum-likelihood training you are left with w, which is again a discriminator (a hyperplane) in the feature space, so you don't get the data statistics back here either.
The following can be considered: use a Gaussian classifier. Assume that your class label is produced by the prior class probability p(y), and then a class-conditional density p(x|y,w) produces your data. By Bayes' rule, you will have p(y|x,w) = p(y) p(x|y,w) / p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming x is produced by class y. But remember that this works only under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector would be E[x|w], the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
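For what it's worth, the missing derivation is the law of total expectation/variance applied to the mixture model above; this completion is mine, not the original answerer's:

$$E[x] \;=\; \sum_{y} p(y)\, m_y, \qquad \operatorname{Cov}(x) \;=\; \sum_{y} p(y)\,\bigl(C_y + m_y m_y^{\top}\bigr) \;-\; E[x]\,E[x]^{\top}$$

so the overall covariance combines the within-class covariances C_y with the spread of the class means around the overall mean.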

Output of the logit function as additional input to a Neural Network

This is w.r.t. a hybrid of an ANN and logistic regression in a binary classification problem. For example, in one of the papers I came across, they state that "A hybrid model type is constructed by using the logistic regression model to calculate the probability of failure and then adding that value as an additional input variable into the ANN. This type of model is defined as a Plogit-ANN model".
So, for n input variables, I'm trying to understand how the additional input n+1 to an ANN is treated by the activation function (e.g. a logit function) and in the summation of weights multiplied by inputs. Do we treat this probability variable n+1 as a standalone term, like a special type of b0 that we add to the sum of weights multiplied by inputs, e.g. summation for each neuron = (Sum(Wi*Xi)) + additional variable?
Thank you for your assistance.
According to the description provided, the easiest way is to treat this as an additional feature of your data. You have a model that predicts something about your original dataset (the probability of some event), so you get x -> f(x). You simply concatenate it to your feature vector, so x' = [x1 x2 ... xk f(x)], and push it through the network. The network then learns an ordinary weight for this extra input, just as for any other feature; it is not a special bias term (see the sketch below).
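A minimal sketch of this "probability as feature k+1" construction with scikit-learn; the dataset and network size are placeholders:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# step 1: logistic regression gives f(x) = P(y=1 | x)
logit = LogisticRegression().fit(X_train, y_train)
p_train = logit.predict_proba(X_train)[:, [1]]
p_test = logit.predict_proba(X_test)[:, [1]]

# step 2: concatenate that probability as one more input column and train
# the ANN; inside each neuron it is just another x_i in sum(W_i * x_i) + b,
# with its own learned weight -- not a special bias term
X_train_aug = np.hstack([X_train, p_train])
X_test_aug = np.hstack([X_test, p_test])

ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(X_train_aug, y_train)
print(ann.score(X_test_aug, y_test))
```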
However, the described approach is quite naive, since you are doing these two things (training f and training the neural net) completely independently. What might be more beneficial is to instead treat fitting f as an auxiliary loss and train your model jointly.
