Feature Significance Test for Regression Models e.g. Lasso - machine-learning

I'm not talking about feature selection here. Imagine you have already selected your model (any regression model) and your features, you are happy with what you have, and you fit the regression. After fitting, you want to run a significance test on your features to see which ones have been most important overall in your regression model.
To my knowledge, there is no command in Scikit-Learn/TensorFlow for doing that for any of the regression models; please correct me if you know of one. Only for random forest can you get it, as below:
from sklearn.ensemble import RandomForestRegressor

# max_depth was left undefined in the original snippet; 4 is a placeholder value
regr_rf = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=10)
regr_rf.fit(X_train, Y_train)
importances = list(regr_rf.feature_importances_)
print("feature importances are", importances)
So what would be the best way to do that manually for other regression models? Would removing features one by one be a way to go?
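One generic option that does not depend on the model class is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. Below is a minimal sketch using scikit-learn's permutation_importance with a Lasso model; the synthetic dataset and the alpha value are assumptions, not part of the original question.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Lasso(alpha=0.1).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in R^2;
# a large drop means the model relies heavily on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=30,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")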

Related

Application and Deployment of K-Fold Cross-Validation

K-fold cross-validation is a technique for splitting the data into K folds for training and testing. The goal is to estimate the generalizability of a machine learning model. The model is trained K times, once on each training fold, and then tested on the corresponding test fold.
Suppose I want to compare a decision tree and a logistic regression model on some arbitrary dataset with 10 folds. Suppose that, after training each model across the 10 folds and obtaining the corresponding test accuracies, logistic regression has the higher mean accuracy across the test folds, indicating that it is the better model for the dataset.
Now, for application and deployment. Do I retrain the Logistic Regression model on all the data, or do I create an ensemble from the 10 Logistic Regression models that were trained on the K-Folds?
The main goal of CV is to validate that we did not get the numbers by chance. So, I believe you can just use a single model for deployment.
If you are already satisfied with the hyper-parameters and model performance, one option is to train on all the data you have and deploy that model.
The other, obvious option is to deploy one of the CV models.
About the ensemble option, I believe it should not give significantly better results than a model trained on all the data: each model trains for the same amount of time with similar parameters and the same architecture, and only the training data differs slightly, so they shouldn't show different performance. In my experience, an ensemble helps when the outputs of the models differ due to architecture or input data (like different image sizes).
The models trained during k-fold CV should never be reused. CV is only used for reliably estimating the performance of a model.
As a consequence, the standard approach is to re-train the final model on the full training data after CV.
Note that evaluating different models is akin to hyper-parameter tuning, so in theory the performance of the selected best model should be reevaluated on a fresh test set. But with only two models tested I don't think this is important in your case.
You can find more details about k-fold cross-validation here and there.
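As an illustration of that standard approach, here is a minimal sketch (scikit-learn, synthetic placeholder data): cross-validation is used only to estimate performance, and the deployed model is refit once on all available data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 10-fold CV only estimates how well this model class generalizes.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(f"estimated accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# The deployed model is then retrained once on all available data.
final_model = LogisticRegression(max_iter=1000).fit(X, y)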

Regression Model Comparison

I'm looking for metrics to compare various regressions models (e.g. SVM, Decision Tree, Neural Network etc), to decide the merits of each for solving a specific problem.
For my problem I have just over 80,000 training samples with 12 variables, all of which are independent and identically distributed.
I've done most of my research into neural networks but I'm drawing a blank when trying to compare them against other models.
Any input (including reading suggestions) would be greatly appreciated, thanks!
You can compare regression models by calculating the mean squared error for each model over a test set. The best model will simply be the one with the least error.
Sadly, there is nothing like ROC curves for regression models, unless your output is a binary variable, as with logistic regression.
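As a minimal sketch of that comparison (synthetic placeholder data standing in for the real 80,000 x 12 training set; the three model choices and their settings are illustrative, not prescriptive):

from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=12, noise=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVR and the neural network, so those get a pipeline.
models = {
    "SVM": make_pipeline(StandardScaler(), SVR()),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Neural Network": make_pipeline(StandardScaler(),
                                    MLPRegressor(max_iter=2000, random_state=0)),
}

# The best model is simply the one with the lowest error on unseen data.
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.2f}")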

In stacking for machine learning which order should you train the models in?

I am currently learning to do stacking in a machine learning problem. I am going to get the outputs of the first model and use these outputs as features for the second model.
My question is: does the order matter? I am using a lasso regression model and a boosted tree. In my problem the lasso model outperforms the boosted tree, so I am thinking that I should use the lasso model second and the boosted tree first.
What are the factors I need to think about when making this decision?
Why don't you try feature engineering to create more features?
Don't try to use predictions from one model as features for another model.
You can try using K-means to cluster similar training samples.
For stacking, just use different models and then average the results (assuming that you have a continuous y variable).
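A minimal sketch of that averaging approach, using scikit-learn's VotingRegressor with the two model types from the question; the synthetic dataset and the hyper-parameters are placeholders:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, VotingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=1000, n_features=15, noise=10, random_state=0)

# VotingRegressor fits each model independently and averages their
# predictions, so there is no "first" or "second" model to worry about.
ensemble = VotingRegressor([("lasso", Lasso(alpha=0.1)),
                            ("gbr", GradientBoostingRegressor(random_state=0))])
print(f"mean CV R^2: {cross_val_score(ensemble, X, y, cv=5).mean():.3f}")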

How to evaluate the performance of different models on one dataset?

I want to evaluate the performance of different models such as SVM, random forest, CNN, etc., but I only have one dataset. So I split the dataset into a training set and a testing set, train the different models on the training data, and test each one on the testing data.
My question: can I get the real performance of the different models from only one dataset? For example, if I find that the SVM model gets the best result, should I select the SVM as my final classification model?
It's probably a better idea to validate your models with different test samples through cross-validation, to avoid bias. Also check your models against different evaluation metrics, depending on your application. For instance, use recall, accuracy, and AUC for each model if it's a classification problem.
Evaluation results can be pretty deceptive and require extensive validation.
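A minimal sketch of that multi-metric cross-validation, assuming scikit-learn and a synthetic binary classification problem (the SVM here is just one of the candidate models mentioned):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Score the same model on several metrics across the CV folds in one call.
results = cross_validate(SVC(), X, y, cv=5,
                         scoring=["accuracy", "recall", "roc_auc"])
for metric in ("test_accuracy", "test_recall", "test_roc_auc"):
    print(f"{metric}: {results[metric].mean():.3f}")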
You can plot the ROC curve for all the models. The model with the highest AUC will be the best model.
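For instance, a hedged sketch of such a comparison plot; the two models and the synthetic data are placeholders:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One ROC curve per model; the legend shows each model's AUC.
for name, model in [("LogReg", LogisticRegression(max_iter=1000)),
                    ("RandForest", RandomForestClassifier(random_state=0))]:
    proba = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, proba):.3f})")

plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()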

Multinomial logistic regression steps in SPSS

I have data suited to multinomial logistic regression, but I don't know how to formulate the model to predict my Y.
How do I perform Multinomial Logistic Regression using SPSS?
How does the stepwise method work?
There are plenty of examples of annotated output for SPSS multinomial logistic regression:
UCLA example
My own list of links and resources
The stepwise method provides a data-driven approach to selecting your predictor variables. In general, the decision between data-driven, direct entry, and hierarchical approaches comes down to whether you want to test theory (i.e., direct entry or hierarchical) or simply optimise prediction (i.e., stepwise and related methods).
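SPSS handles stepwise selection through its own dialogs, but purely to illustrate the idea, here is a sketch of forward stepwise selection in scikit-learn terms (the synthetic dataset and the number of features to keep are assumptions, and this is not the SPSS procedure itself):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)

# Forward stepwise selection: greedily add the predictor that most improves
# the cross-validated fit, stopping at the requested number of features.
model = LogisticRegression(max_iter=1000)
selector = SequentialFeatureSelector(model, n_features_to_select=4,
                                     direction="forward", cv=5).fit(X, y)
print("selected predictors:", selector.get_support(indices=True))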
