Will correlation impact feature importance of ML models?

I am building an xgboost model with hundreds of features. For features that are highly correlated (Pearson correlation) with each other, I am thinking of using feature importance (measured by Gain) to drop the ones with low importance.
My questions:
1. Will correlation impact/bias feature importance (measured by Gain)?
2. Is there any good way to remove highly correlated features for ML models?
Example: a's importance = 120, b's importance = 14, corr(a, b) = 0.8. I am thinking of dropping b because its importance is 14. But is that correct?
Thank you.

Correlation definitely impacts feature importance. If features are highly correlated, there is a high level of redundancy in keeping them all: when two features are correlated, a change in one tends to come with a change in the other, so they are largely representative of one another, and a subset of them can often classify your data just as well. Note that this redundancy also distorts the importance scores themselves: correlated features split Gain between them, so each can look less important than it would on its own.
So, in order to remove highly correlated features, you can:
- Use PCA to reduce dimensionality, or
- Use a decision tree to find the important features, or
- Manually choose the features that, based on your domain knowledge (if possible), are most promising for classifying your data, or
- Manually combine some features into a new feature, so that a single feature can make another set of features unnecessary because their values can be inferred from it.
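For the specific situation in the question (keeping the higher-Gain member of each highly correlated pair), here is a minimal sketch using pandas and xgboost. Mind the caveat from above: correlated features split Gain between them, so importances computed before any dropping are themselves biased, which is exactly the effect asked about in question 1. The function name, the 0.8 threshold, and the model settings are illustrative assumptions.

```python
import pandas as pd
import xgboost as xgb

def drop_correlated(X: pd.DataFrame, y, threshold: float = 0.8) -> pd.DataFrame:
    """Drop the lower-Gain member of each highly correlated feature pair."""
    model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)
    # Gain importance, keyed by column name when X is a DataFrame
    gain = model.get_booster().get_score(importance_type="gain")
    corr = X.corr(method="pearson").abs()

    to_drop = set()
    cols = list(X.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > threshold:
                # drop whichever feature of the pair has the lower Gain
                to_drop.add(a if gain.get(a, 0.0) < gain.get(b, 0.0) else b)
    return X.drop(columns=sorted(to_drop))
```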

Related

How to adjust feature importance in Azure AutoML

I am hoping to build a low-code model using Azure AutoML, which really just means going to the AutoML tab, running a classification experiment with my dataset, and, after it's done, deploying the best selected model.
The model kind of works (meaning, I publish the endpoint, do some manual validation, and it seems accurate), but I am not confident enough, because when I look at the explanation I can see something like this:
The top 4 features shown are not really the important ones. The most "important" one is really not the one I'd prefer it to use. I am hoping it will use the Title feature more.
Is there a way to adjust the importance of individual features, like ranking all features before the experiment starts?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer there seems to be about how to measure whether a feature is important.
Hence, does this mean that if I want to customize the experiment, such as selecting which features to "focus" on, I should learn how to use the "designer" part of Azure ML? Or is it something I can't do even with the designer? I guess my confusion is: with ML being such a big topic, I am looking for a direction of learning for my situation, so I can improve my current model.
Here is a link to the documentation for feature customization.
Using the SDK, you can specify featurization = 'auto' / 'off' / FeaturizationConfig in your AutoMLConfig object. Learn more about enabling featurization.
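As a hedged sketch of what that looks like in the v1 Python SDK (the dataset variable and the 'Title'/'label' column names are assumptions taken from the question, not a verified setup):

```python
from azureml.train.automl import AutoMLConfig
from azureml.automl.core.featurization import FeaturizationConfig

featurization_config = FeaturizationConfig()
# Hint that 'Title' should be treated as free text, so AutoML builds
# text features from it; this steers featurization rather than directly
# adjusting the feature's importance.
featurization_config.add_column_purpose('Title', 'Text')

automl_config = AutoMLConfig(
    task='classification',
    training_data=training_dataset,  # an azureml TabularDataset (assumed to exist)
    label_column_name='label',
    featurization=featurization_config,
)
```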
Automated ML tries out different ML models with different settings that control for overfitting, and picks the configuration that gets the best score (e.g. accuracy) on hold-out data. The kinds of overfitting controls these models use include:
- Explicitly penalizing overly complex models in the loss function the ML model optimizes
- Limiting model complexity before training, for example by limiting the size of trees in a tree-ensemble learner (e.g. gradient-boosted trees or random forest)
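For intuition, these are the kinds of knobs automated ML tunes behind the scenes, shown here set by hand on xgboost; the values are illustrative assumptions, not what AutoML would pick:

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    max_depth=4,        # limit tree size before training (lower complexity)
    reg_lambda=1.0,     # L2 penalty on leaf weights in the training loss
    reg_alpha=0.0,      # L1 penalty on leaf weights
    n_estimators=200,
    learning_rate=0.1,
)
```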
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls

Feature selection & significant features in kNN

I am developing a recommendation engine with the help of kNN. The data, though, is sparse: I have around 1,500 samples and around 200 features. The target is binary, with values 0 or 1.
What techniques should I use for feature selection here? I assume that if I choose a random forest for feature selection, the selected features may differ from what kNN would consider important.
Also, is there any restriction on the number of features, given that I have so few samples?
Feature selection techniques aim to exclude irrelevant features and/or redundant ones. One proven technique is to use supervised discretization based on entropy (a more generic explanation can be found here) to meaningfully reduce the size of your data, and then use information gain to pick the top k features most correlated with the target variable. There are at least five different methods you can try; it also depends on the ML library/framework you are using to implement your app.
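A minimal sketch of the "top k by information gain" idea, using scikit-learn's mutual-information estimator as the scoring function; the synthetic data and k=20 are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# stand-in for the sparse 1500 x 200 dataset from the question
X, y = make_classification(n_samples=1500, n_features=200,
                           n_informative=15, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                          # (1500, 20)
print(np.argsort(selector.scores_)[::-1][:20])   # indices of the top features
```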
I would also try the Relief algorithm, since its core component is a nearest-neighbour search.

Working with inaccurate (incorrect) dataset

This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."
Unreliable data means that households tell lies to government. These households misstate their income and wealth in order to unfairly get more governmental services. Therefore, these fraudulent statements in original data will lead to incorrect results and patterns.
Now, I have below questions:
How should we deal with unreliable data in data science?
Is there any way to figure out these misstatements and then report the top 10% rich people with better accuracy using Machine Learning algorithms?
-How can we evaluate our errors in this study? Since we have unlabeled dataset, should I look for labeling techniques? Or, should I use unsupervised methods? Or, should I work with semi-supervised learning methods?
Is there any idea or application in Machine Learning which tries to improve the quality of collected data?
Please introduce me any ideas or references which can help me in this issue.
Thanks in advance.
Q: How should we deal with unreliable data in data science?
A: Use feature engineering to fix unreliable data (apply transformations that make it reliable), or drop it entirely: bad features can significantly decrease the quality of the model.
Q: Is there any way to figure out these misstatements and then report the top 10% richest households with better accuracy using machine learning algorithms?
A: ML algorithms are not magic wands; they can't figure anything out unless you tell them what you are looking for. Can you describe what 'unreliable' means here? If yes, you can, as I mentioned, use feature engineering or write code that fixes the data. Otherwise no ML algorithm will be able to help you without a description of what exactly you want to achieve.
Q: Is there any idea or application in machine learning that tries to improve the quality of collected data?
A: I don't think so, simply because the question itself is too open-ended. What does 'the quality of the data' mean?
Generally, here are a couple of things for you to consider:
1) Spend some time googling feature engineering guides. They cover how to prepare, refine, and fix your data for your ML algorithms. Good data with good features dramatically improves results.
2) You don't need to use all of the features from the original data; some of them are meaningless. Try running a gradient boosting machine or a random forest classifier from scikit-learn on your dataset to perform classification (or regression, if that's your task). These algorithms also evaluate the importance of each feature of the original dataset. Some of your features will have extremely low importance, so you may wish to drop them entirely, or try to combine unimportant features into something more important; a sketch follows below.
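A minimal sketch of the screening described in (2), on synthetic data; the model settings and what counts as "unimportant" are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=8, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# rank features by impurity-based importance; the lowest-ranked ones
# are candidates for dropping or combining
ranked = sorted(enumerate(clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
print(ranked[:5])    # most important (index, importance) pairs
print(ranked[-5:])   # least important
```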

How to best deal with a feature relating to what type of expert labelled the data that becomes unavailable at point of classification?

Essentially I have a data set with a feature vector and a label indicating whether each item is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used, each evaluating the items with a different approach; the type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data achieved a high degree of accuracy using a Random Forest algorithm.
However, it is now clear that the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
- Remove the feature from the set, then retrain and test
- Split the data into 2 distinct sets based on the feature, then train and test 2 separate classifiers
- For the test data, set the feature in question to the same value everywhere
With all 3 approaches, the classifiers dropped from being highly accurate to being virtually useless.
So I am looking for any advice or intuitions as to why this occurred and how I might resolve it, so as to regain some of the accuracy I was previously seeing.
To be clear, I have no background in machine learning or statistics and am simply using a third-party C# library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
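A minimal sketch of the agreement check suggested above, on made-up labels; in practice you would have both expert types annotate the same sample of items:

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical spam/non-spam labels from the two expert types
# on the same ten items
expert_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
expert_b = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa: {kappa:.2f}")  # below ~0.5 suggests weak agreement
```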

Machine Learning: Should I choose classification or recommendation?

I don't know how I should approach this problem:
I have a data set. A user may or may not be part of a funded scheme.
I want to use machine learning to deduce that users who are not part of the scheme were susceptible to certain conditions, e.g. 1, 2, 3 and 4, while those in the scheme were susceptible to 1, 2 and 4. It could then be deduced that if you are part of the scheme you won't be susceptible to condition 3.
I have a second, related problem as well. Within the funded scheme a user can be on one of two plans (costing different amounts). I would like to see whether those on the cheaper plan were susceptible to more conditions than those on the more expensive plan.
Can anyone tell me whether this is a recommendation or a classification problem, and which specific algorithms I should look at?
Thanks.
Neither. It's a statistics problem. Your dataset is complete and you don't mention any need to predict attributes of future subjects or schemes, so training a classifier or a recommender wouldn't serve its usual goals.
You could use a person's conditions as features and their scheme status as the target, classify them with an SVM, and then use the classification performance/accuracy as a measure of the separability of the classes. You could also consider clustering. However, a t-test would do the same thing and is a much more accepted tool for justifying the validity of claims like this; a sketch follows below.
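A minimal sketch of that t-test on made-up data (the condition counts and group sizes are assumptions): compare the number of conditions per user between scheme members and non-members.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
in_scheme = rng.poisson(3.0, size=100)       # hypothetical condition counts
not_in_scheme = rng.poisson(3.8, size=100)   # hypothetical condition counts

t_stat, p_value = stats.ttest_ind(in_scheme, not_in_scheme)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# a small p-value indicates the mean number of conditions differs
# between the two groups
```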
It looks like you are trying to build a system that classifies a user as funded or not funded and, if not funded, gives the reason why.
If this is the case, what you need is an interpretable machine learning classifier, i.e., one where the reasoning behind a decision can be conveyed to users. You may want to look at decision trees and (to a lesser extent) Random Forests and gradient-boosted trees.
