Build one more model after dropping low-importance features

We build a data science model and look at the feature importances. If we drop the unimportant features and build a new model, will there be any improvement in accuracy? The only advantage I see is that the consumer of the model has to pass fewer parameters to get a prediction. Are there any other advantages?

Yes!
Fewer features (and therefore fewer parameters) mean faster training and faster prediction. Done correctly, it also means a smaller chance of overfitting your data.

Non-relevant features act as noise and therefore reduce the accuracy of the model.
They also make convergence towards the global minimum harder during training, because the optimizer has to minimize a more complex function.
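To make this concrete, here is a minimal sketch of the workflow; the synthetic data, the RandomForestClassifier, and the 0.01 importance threshold are all illustrative choices, not part of the original question:

    # Sketch: drop low-importance features and retrain; the threshold is arbitrary.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=30, n_informative=8, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    full_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    print("all features:    ", full_model.score(X_test, y_test))

    # Keep only features whose importance exceeds an (arbitrary) threshold.
    keep = full_model.feature_importances_ > 0.01
    reduced_model = RandomForestClassifier(random_state=0).fit(X_train[:, keep], y_train)
    print("reduced features:", reduced_model.score(X_test, y_test))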

Related

What is meant by stability in relation to neural networks

I hear the terms stability/instability thrown around a lot when reading up on Deep Q Networks. I understand that stability is improved with the addition of a target network and a replay buffer, but I fail to understand exactly what it is referring to.
What would the loss graph look like for an unstable vs. a stable neural network?
What does it mean when a neural network converges/diverges?
Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm is perturbed by small changes to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly.
Here, stability means the following: suppose you have 1000 training samples, you train a model on them, and it performs well. The model is stable if, after training the same model on only 900 of those samples, it still performs comparably well; that is why it is also called algorithmic stability.
As for the loss graph: if the model is stable, the loss curve should look roughly the same for both training-set sizes (1000 and 900), whereas for an unstable model the two curves will differ noticeably.
In machine learning we want to minimize a loss, so when we say a model converges, we mean that its loss has settled within an acceptable margin and the model has reached the point where additional training would not improve it.
Conversely, a model diverges when its loss fails to settle, or grows, during training. (Separately, a divergence in the statistical sense, e.g. the KL divergence, is a non-symmetric measure of the difference between two continuous distributions; to compare two distributions you would use a divergence rather than a traditional symmetric metric such as a distance.)
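To make the 1000-versus-900 idea above concrete, here is a rough sketch (the synthetic data and the logistic regression are placeholder choices) that trains the same model on the full and the slightly reduced training set and compares held-out accuracy:

    # Sketch: a crude stability check -- same model, 1000 vs. 900 training samples.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=0)

    for n in (1000, 900):
        model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
        print(n, "training samples -> test accuracy:", model.score(X_test, y_test))

    # A stable algorithm should give similar test accuracy (and a similar loss curve)
    # for both training-set sizes.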

Train/Test Datasets in Machine Learning

I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we used in all models (e.g., state, year, sex). We used all of our data to build these models, meaning we never split the data into training and test sets.
If I were to go back and split the data into training and test sets, what would the advantages of that approach be, besides assessing the prediction accuracy of our models? And what is an argument for not splitting the data before fitting the model? I never really thought about it much until now and am curious why we didn't take that approach.
Thanks!
The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.
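For example, the "better than random guessing" check above becomes a one-line comparison once you have a held-out test set. A sketch using scikit-learn's DummyRegressor as the naive baseline; the dataset and models here are placeholders:

    # Sketch: benchmark a model against a trivial baseline on a held-out test set.
    from sklearn.datasets import make_regression
    from sklearn.dummy import DummyRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    print("baseline MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))
    print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))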

Do we need to care about target variable distribution in train and validation set in regression problem?

In a classification problem, we care about the distribution of the labels in the train and validation sets. In sklearn, there is a stratify option in train_test_split to ensure that the distributions of the labels in the train and validation sets are similar.
In a regression problem, let's say we want to predict the housing price based on a bunch of features. Do we need to care about the distribution of the housing price in train and validation set?
If yes, how do we achieve this in sklearn?
Forcing features to have similar distributions in your training and validation sets assumes that you highly trust the data you have to be representative of the data you will encounter in real life (i.e. in a production environment), which is often not the case.
Also, doing so may artificially inflate your validation score relative to your test score.
Instead of adjusting the distributions in the train and validation sets, I would suggest performing cross-validation (available in sklearn), which may be more representative of a testing situation.
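A sketch of that cross-validation suggestion; the California housing data and the gradient boosting model are placeholder choices:

    # Sketch: K-fold cross-validation instead of a single train/validation split.
    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    X, y = fetch_california_housing(return_X_y=True)
    scores = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                             cv=5, scoring="neg_mean_absolute_error")
    print("MAE per fold:", -scores)
    print("mean MAE:    ", -scores.mean())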
This book (A. Géron, Hands-On Machine Learning with Scikit-Learn & TensorFlow, O'Reilly, 2017) provides an excellent introductory discussion of this in chapter 2. To paraphrase:
Generally, for large datasets you don't need to perform stratified sampling: your training set should be a fair representation of the range of observed instances (there are, of course, exceptions to this). For smaller datasets, random sampling could introduce sampling bias (i.e., disproportionately recording data from only a particular region of the expected range of the target attribute), and stratified sampling is probably required.
Practically, you will need to create a new categorical feature by binning this continuous variable. You can then perform stratified sampling on this categorical feature. Make sure to remove the new categorical feature before training your model!
However, to do this you need a good understanding of your data; I doubt there is much point in performing stratified sampling on features with weak predictive power, and it could even do harm if you introduce unintentional bias into the data through non-random sampling.
Take-home message:
My instinct is that stratified sampling of a continuous variable should always be led by information and understanding, i.e., if you know a feature is a strong predictor of the target variable and you also know that the sampling across its values is not uniform, you probably want to perform stratified sampling to make sure the range of values is properly represented in both the training and validation sets.
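In scikit-learn terms, the binning trick described above might look like the following sketch; the five quantile bins are an arbitrary choice, and since the binned variable here is the target, it never enters the feature matrix:

    # Sketch: stratify a regression split on a binned copy of the continuous target.
    import pandas as pd
    from sklearn.datasets import fetch_california_housing
    from sklearn.model_selection import train_test_split

    X, y = fetch_california_housing(return_X_y=True, as_frame=True)

    # Bin the target into quantile-based categories (5 bins is arbitrary).
    y_binned = pd.qcut(y, q=5, labels=False)

    # The bins are used only for stratification; they are never added to X.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y_binned, random_state=0)

    print(y_train.describe())
    print(y_val.describe())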

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
Adaboost
I have read a lot about the Information Gain technique, and it seems to be independent of the machine learning algorithm used; it is more like a preprocessing technique.
My question is: is it best practice to perform feature importance separately for each algorithm, or to just use Information Gain? If the former, what are the techniques used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (e.g. here), independent of the classifier (a sketch of this follows after this list).
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (e.g. p-value < 0.05). (The same holds for two-class Linear Discriminant Analysis.)
Random Forest: can return a variable importance index that ranks the variables from most to least important.
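As an illustration of the mutual-information option above, combined with the "training data only" caveat at the top of this answer, here is a sketch; k=10 and the synthetic data are arbitrary choices:

    # Sketch: mutual-information feature selection fitted on the training split only.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=30, n_informative=6, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    selector = SelectKBest(mutual_info_classif, k=10).fit(X_train, y_train)
    X_train_sel = selector.transform(X_train)
    X_test_sel = selector.transform(X_test)   # the same features are selected for the test set

    print("selected feature indices:", selector.get_support(indices=True))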
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you have used a validation and/or a test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After having built the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the prediction of the classifier) and can be used for both purposes.
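A minimal sketch of that SHAP workflow for a tree model follows; the dataset is a placeholder, and the exact shape of the returned SHAP values (and the plotting calls) can differ between shap versions:

    # Sketch: per-feature contributions for a random forest with the shap library.
    import shap
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=30, n_informative=6, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_val)   # for classifiers this may be one array per class

    # Global view: which features drive predictions across the whole validation set.
    shap.summary_plot(shap_values, X_val)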
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 cover this subject.
Hope it helps!

How can I know whether my training data is enough for machine learning

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method to measure this?
It is not easy to know how many samples you need to collect. However, you can follow these steps for a typical ML problem:
1. Build a dataset with a few samples. How many? It will depend on the kind of problem you have; don't spend a lot of time on this now.
2. Split your dataset into training, cross-validation, and test sets, and build your model.
3. Now that you've built the model, evaluate how good it is: calculate your test error.
4. If your test error is worse than you can accept, collect more data and repeat steps 1-3 until you reach a test error you are comfortable with.
This method will work if your model is not suffering from high bias.
This video from Coursera's Machine Learning course explains it.
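A common companion to these steps (not part of the answer above, but it helps with the high-bias question) is a learning curve: if the validation score is still improving as the training set grows, more data is likely to help. A sketch with scikit-learn, using an SVM on synthetic data as a placeholder:

    # Sketch: learning curve to judge whether collecting more data would help.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    sizes, train_scores, val_scores = learning_curve(
        SVC(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print(f"{n:5d} samples: train={tr:.3f}  validation={va:.3f}")
    # If the validation score is still climbing at the largest size, more data may help;
    # if both curves have plateaued close together, the model is likely limited by bias.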
Unfortunately, there is no simple method for this.
The rule of thumb is "the bigger, the better", but in practice you have to gather a sufficient amount of data, where sufficient means covering as large a part of the modelled space as you consider acceptable.
Also, the amount is not everything. The quality of the training samples is very important too; for example, the training set should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. Then, if the classifier's quality is not acceptable, I gather more data, and so on.
Here is a piece of research about estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, while training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see
https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
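As a tiny worked example of the heuristic (it is only a rule of thumb, not a guarantee): a logistic regression with N features has N + 1 parameters (the weights plus an intercept), so the rule suggests roughly 10 * (N + 1) training instances.

    # Sketch: the rule-of-10 heuristic for a logistic regression with n_features inputs.
    def rule_of_ten(n_features: int) -> int:
        n_parameters = n_features + 1      # weights plus intercept
        return 10 * n_parameters

    print(rule_of_ten(20))                 # -> 210 training instances for 20 features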
