Many features in one model or in several models to ensemble? - machine-learning

Suppose I have 1000 features (or more) with pairwise correlation below 0.7, and I plan to build neural networks for prediction. Should I build one model that incorporates all of the features, or two models with 500 features each and then ensemble them? That is:
Option 1: A model with all features. The model structure may need to change in the future as more features are generated; for example, 100 features might require 3 hidden layers while 1000 features require 6 hidden layers.
Option 2: A model with a fixed number of features (e.g. 500). For every 500 new features I get in the future, I just feed the data into another copy of the model without modifying the model structure.
From my perspective, if I choose option 2, I can build a model with the right capacity to handle 500 features, so whenever I generate new features I can reuse the existing network structure, and even the same hyperparameters, for ensembling. However, I have not seen this approach in practice. I am not sure whether my idea is valid, and I am unsure which option would be better.

From my past experience, and from many high-ranking solutions on Kaggle, you usually get the best results by training multiple models, each with all of the features, and ensembling them.
But if we have to choose between the two options, option 1 is better.
Models learn better when more features are provided.
What if features a and b together are the most useful for the final answer, but feature a is only used to train model 1 and feature b is only used to train model 2? Neither model would be able to learn that interaction.
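For illustration, here is a minimal sketch of the two options on synthetic data, with scikit-learn's MLPClassifier standing in for your networks; the layer sizes, iteration limits and the 500/500 split are arbitrary placeholders, not recommendations:

```python
# Rough sketch: one model on all 1000 features vs. an averaged ensemble of
# two models on 500 features each. Synthetic data; all sizes are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=5000, n_features=1000,
                           n_informative=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: one network sees all 1000 features.
full_model = MLPClassifier(hidden_layer_sizes=(256, 128, 64),
                           max_iter=50, random_state=0)
full_model.fit(X_train, y_train)
acc_full = accuracy_score(y_test, full_model.predict(X_test))

# Option 2: two networks with a fixed 500-feature input, predictions averaged.
sub_models = []
for cols in (slice(0, 500), slice(500, 1000)):
    m = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=50, random_state=0)
    m.fit(X_train[:, cols], y_train)
    sub_models.append((m, cols))

avg_proba = np.mean([m.predict_proba(X_test[:, cols]) for m, cols in sub_models], axis=0)
acc_ensemble = accuracy_score(y_test, avg_proba.argmax(axis=1))

print(f"all features: {acc_full:.3f}   ensemble of halves: {acc_ensemble:.3f}")
```

Running something like this on your own data is the most direct way to settle which option wins for your problem.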

In my opinion, go for option 2. Option 1 may sometimes overfit. It is true that models learn better with more features, but more features also make the model more complex. Models built under option 2 can be more accurate as well, even though each one is based only on its selected subset of features.

Related

How to adjust feature importance in Azure AutoML

I am hoping to build a low-code model using Azure AutoML, which really just means going to the AutoML tab, running a classification experiment with my dataset, and, after it's done, deploying the best selected model.
The model kind of works (meaning I publish the endpoint, do some manual validation, and it seems accurate). However, I am not confident enough, because when I look at the explanation I can see something like this:
The 4 top features are not the ones that matter most to me. The most "important" one is really not the one I would prefer it to use; I was hoping it would use the Title feature more.
Is there a way to adjust the importance of individual features, such as ranking all features before the experiment starts?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer there seems to be about how to measure whether a feature is important.
Hence, does it mean that if I want to customize the experiment, such as selecting which features to "focus" on, I should learn how to use the "designer" part of Azure ML? Or is that something I can't do even with the designer? I guess my confusion is that, with ML being such a big topic, I am looking for a direction of learning, given what I have, so I can improve my current model.
Here is a link to the documentation for feature customization.
Using the SDK you can specify featurization as 'auto' / 'off' / a FeaturizationConfig object in your AutoMLConfig. Learn more about enabling featurization.
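As a rough sketch of what that looks like with the v1 Python SDK (class and parameter names may differ between SDK versions; the 'Title' column comes from the question, everything else here is a placeholder):

```python
# Sketch only, assuming the azureml v1 SDK; names may differ by version.
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Tell featurization to treat the 'Title' column as free text so the
# text transformers are applied to it.
featurization_config.add_column_purpose('Title', 'Text')
# Optionally drop columns you do not want the model to rely on at all
# ('SomeNoisyColumn' is a hypothetical name).
featurization_config.drop_columns(['SomeNoisyColumn'])

automl_config = AutoMLConfig(
    task='classification',
    training_data=train_dataset,        # an azureml TabularDataset, assumed to exist
    label_column_name='Label',          # placeholder label column
    featurization=featurization_config,
    primary_metric='accuracy',
)
```

This does not let you directly re-rank features, but controlling how individual columns are featurized, or dropping them, is what the featurization settings are for.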
Automated ML tries out different ML models with different settings that control overfitting, and it picks the configuration that gets the best score (e.g. accuracy) on hold-out data. The kinds of overfitting controls these models have include:
Explicitly penalizing overly-complex models in the loss function that the ML model is optimizing
Limiting model complexity before training, for example by limiting the size of trees in an ensemble tree learning model (e.g. gradient boosting trees or random forest)
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls
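To make those two levers concrete outside of AutoML, here is a small sketch on a hand-built gradient boosting model (xgboost on synthetic data; the specific values are arbitrary examples, not recommendations):

```python
# Sketch: the two overfitting controls listed above, on a hand-built model.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=300,
    max_depth=3,       # limit complexity up front: shallow trees
    reg_lambda=1.0,    # explicit L2 penalty on leaf weights in the training loss
    reg_alpha=0.1,     # explicit L1 penalty on leaf weights in the training loss
    learning_rate=0.05,
)
model.fit(X_train, y_train)
print('hold-out accuracy:', model.score(X_hold, y_hold))
```

Fitting several such configurations and keeping the one that scores best on hold-out data is essentially what automated ML automates for you.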

Train/Test Datasets in Machine Learning

I just have a general question:
In a previous job, I was tasked with building a series of non-linear models to quantify the impact of certain factors on the number of medical claims filed. We had a set of variables we would use in all models (e.g. state, year, sex, etc.). We used all of our data to build these models, meaning we never split the data into training and test data sets.
If I were to go back in time to this job and split the data into training and test data sets, what would the advantages of that approach be besides assessing the prediction accuracy of our models? What is an argument for not splitting the data before fitting the model? I never really thought about it much until now, and I'm curious why we didn't take that approach.
Thanks!
The sole purpose of setting aside a test set is to assess prediction accuracy. However, there is more to this than just checking the number and thinking "huh, that's how my model performs"!
Knowing how your model performs at a given moment gives you an important benchmark for potential improvements of the model. How will you know otherwise whether adding a feature increases model performance? Moreover, how do you know otherwise whether your model is at all better than mere random guessing? Sometimes, extremely simple models outperform the more complex ones.
Another thing is removal of features or observations. This depends a bit on the kind of models you use, but some models (e.g., k-Nearest-Neighbors) perform significantly better if you remove unimportant features from the data. Similarly, suppose you add more training data and suddenly your model's test performance drops significantly. Perhaps there is something wrong with the new observations? You should be aware of these things.
The only argument I can think of for not using a test set is that otherwise you'd have too little training data for the model to perform optimally.
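To make the benchmark point above concrete, here is a minimal sketch (scikit-learn on synthetic data standing in for the claims data) comparing a model against a trivial baseline on a held-out test set:

```python
# Minimal sketch of why a held-out test set matters: it lets you compare your
# model against a trivial baseline on data it has never seen.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print('baseline:', accuracy_score(y_test, baseline.predict(X_test)))
print('model:   ', accuracy_score(y_test, model.predict(X_test)))
```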

Will correlation impact feature importance of ML models?

I am building an xgboost model with hundreds of features. For features that are highly correlated with each other (Pearson correlation), I am thinking of using feature importance (measured by gain) to drop the one with lower importance.
My questions:
1: Will correlation impact/bias feature importance (measured by gain)?
2: Is there a good way to remove highly correlated features for ML models?
Example: a's importance = 120, b's importance = 14, corr(a, b) = 0.8. I am thinking of dropping b because its importance is only 14. Is that correct?
Thank you.
Correlation definitely impacts feature importance. If features are highly correlated, there is a high level of redundancy in keeping them all: when two features are correlated, a change in one is reflected in the other, so they are largely representative of one another, and using only a few of them you can hopefully still classify your data well.
So in order to remove highly correlated features you can:
Use PCA to reduce dimensionality, or
Use a decision tree to find the important features, or
Manually choose, based on your domain knowledge (if possible), the features that are most promising for classifying your data, or
Manually combine some features into a new feature, so that a single feature removes the need for a set of other features that can likely be inferred from it.
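Here is a small sketch of the pairwise approach from the question, i.e. among each pair of highly correlated features, keep the one with the higher gain importance. The synthetic data and the 0.8 threshold are placeholders:

```python
# Sketch: drop the lower-gain feature from each highly correlated pair.
import pandas as pd
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(X.shape[1])])

model = XGBClassifier(importance_type='gain', n_estimators=200, max_depth=3)
model.fit(X, y)
gain = pd.Series(model.feature_importances_, index=X.columns)

corr = X.corr(method='pearson').abs()
to_drop = set()
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.8:                        # e.g. corr(a, b) = 0.8 as in the question
            to_drop.add(a if gain[a] < gain[b] else b)  # drop the lower-gain feature

X_reduced = X.drop(columns=sorted(to_drop))
print('dropped:', sorted(to_drop))
```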

Feature selection using correlation

I'm doing feature selection using correlation to train my machine learning (ML) models. First I trained each model (SVM, NN, RF) with all features and did 10-fold cross-validation to obtain a mean accuracy score.
Then I removed the features that have a zero correlation coefficient (which implies there is no relationship between the feature and the class), trained each model (SVM, NN, RF) on the remaining features, and again did 10-fold cross-validation to obtain a mean accuracy score.
Basically my objective is to do feature selection based on the accuracy scores I get in the above two scenarios, but I'm not sure whether this is a good approach to feature selection.
I also want to do a grid search to identify the best model parameters, but I'm getting confused by GridSearchCV in the scikit-learn API. Since it also does cross-validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?
Please advise me on this, or suggest a good reference to read.
Thanks in advance
As page 51 of this thesis says,
In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant.
The report goes on to say that not only should you remove the features that are not correlated with the targets, you should also watch out for features that correlate heavily with each other. Also see this.
In other words, it seems to be a good thing to look at correlation of features with the classes (targets) and remove the features that have little to no correlation.
Basically my objective is to do feature selection based on the accuracy scores I get in the above two scenarios, but I'm not sure whether this is a good approach to feature selection.
Yes, you can totally run experiments with different feature sets and look at the test accuracy to select the features that perform the best. It's really important that you only look at the test accuracy i.e. performance of the model on unseen data.
I also want to do a grid search to identify the best model parameters.
Grid search is performed for finding the best hyper-parameters. Model parameters are learned during training.
Since it also does cross-validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?
If the grid of hyper-parameters is the same in both runs, best_score_ will be affected only by the feature set, and can thus be used to compare the effectiveness of the two feature sets.
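A minimal sketch of that workflow, assuming scikit-learn and synthetic data (the near-zero threshold, the grid and the choice of SVM are placeholders):

```python
# Sketch: drop features with ~zero correlation to the target, then compare
# the two feature sets with the same grid search and cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=25, n_informative=8, random_state=0)

# Pearson correlation of each feature with the class labels.
corr_with_target = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
keep = np.abs(corr_with_target) > 0.05        # "close to zero" threshold, chosen arbitrarily

grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.01]}

def best_cv_score(features):
    search = GridSearchCV(SVC(), grid, cv=10, scoring='accuracy')
    search.fit(features, y)
    return search.best_score_

print('all features:     ', best_cv_score(X))
print('filtered features:', best_cv_score(X[:, keep]))
```

Because the grid and the cross-validation folds are identical in both calls, any difference in best_score_ comes from the feature sets.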

How do you add new categories and training to a pretrained Inception v3 model in TensorFlow?

I'm trying to utilize a pre-trained model like Inception v3 (trained on the 2012 ImageNet data set) and expand it with several missing categories.
I have TensorFlow built from source with CUDA on Ubuntu 14.04, and the examples like transfer learning on flowers are working great. However, the flowers example strips away the final layer and removes all 1,000 existing categories, which means it can now identify 5 species of flowers, but can no longer identify pandas, for example. https://www.tensorflow.org/versions/r0.8/how_tos/image_retraining/index.html
How can I add the 5 flower categories to the existing 1,000 categories from ImageNet (and add training for those 5 new flower categories) so that I have 1,005 categories a test image can be classified into? In other words, how can the model identify both pandas and sunflowers?
I understand one option would be to download the entire ImageNet training set and the flowers example set and to train from scratch, but given my current computing power, it would take a very long time, and wouldn't allow me to add, say, 100 more categories down the line.
One idea I had was to set the parameter fine_tune to false when retraining with the 5 flower categories so that the final layer is not stripped: https://github.com/tensorflow/models/blob/master/inception/README.md#how-to-retrain-a-trained-model-on-the-flowers-data , but I'm not sure how to proceed, and not sure if that would even result in a valid model with 1,005 categories. Thanks for your thoughts.
After much learning and working in deep learning professionally for a few years now, here is a more complete answer:
The best way to add categories to an existing model (e.g. Inception trained on the ImageNet LSVRC 1000-class dataset) is to perform transfer learning on the pre-trained model.
If you are just trying to adapt the model to your own data set (e.g. 100 different kinds of automobiles), simply perform retraining/fine-tuning by following the myriad online tutorials for transfer learning, including the official one for TensorFlow.
While the resulting model can potentially perform well, keep in mind that the tutorial classifier code is highly un-optimized (perhaps intentionally), and you can improve performance several times over by deploying it properly to production or just improving their code.
However, if you're trying to build a general-purpose classifier that includes the default LSVRC data set (1000 categories of everyday images) and expands it with your own additional categories, you'll need access to the existing 1000-class LSVRC images and will have to append your own data set to them. You can download the ImageNet dataset online, but access is getting spottier as time rolls on. In many cases, the images are also highly outdated (check out the images for computers or phones for a trip down memory lane).
Once you have that LSVRC dataset, perform transfer learning as above but including the 1000 default categories along with your own images. For your own images, a minimum of 100 appropriate images per category is generally recommended (the more the better), and you can get better results if you enable distortions (but this will dramatically increase retraining time, especially if you don't have a GPU enabled as the bottleneck files cannot be reused for each distortion; personally I think this is pretty lame and there's no reason why distortions couldn't also be cached as a bottleneck file, but that's a different discussion and can be added to your code manually).
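As a rough sketch of the head-replacement step in today's tf.keras API (the original 1000-way softmax weights are discarded, so the new 1,005-way head has to be trained on the combined data; NUM_CLASSES and the data pipeline are placeholders):

```python
# Sketch: replace the 1000-way head with a 1005-way head for transfer learning
# on the combined (ImageNet + flowers) dataset.
import tensorflow as tf

NUM_CLASSES = 1005  # 1000 ImageNet categories + 5 flower categories

base = tf.keras.applications.InceptionV3(weights='imagenet',
                                         include_top=False, pooling='avg')
base.trainable = False  # transfer learning: freeze the convolutional features

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(combined_dataset, ...)  # combined_dataset is assumed to exist
```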
Using these methods and incorporating error analysis, we've trained general purpose classifiers on 4000+ categories to state-of-the-art accuracy and deployed them on tens of millions of images. We've since moved on to proprietary model design to overcome existing model limitations, but transfer learning is a highly legitimate way to get good results and has even made its way to natural language processing via BERT and other designs.
Hopefully, this helps.
Unfortunately, you cannot add categories to an existing graph; you'll basically have to save a checkpoint and train that graph from that checkpoint onward.
