Machine Learning: Should I choose classification or recommendation? - machine-learning

I don't know how I should approach this problem:
I have a data set. A user may or may not be part of a funded scheme.
I want to use machine learning to deduce that users that are not part of the scheme were susceptible to certain conditions e.g. 1,2,3 and 4. Those in the scheme were susceptible to 1,2 and 4. Therefore it can be deduced that if you are part of the scheme you won't be susceptible to condition 3.
I have a second related problem as well. Within the funded scheme the user can have two plans (cost different amounts). I would like to see whether those on the cheaper plan were susceptible to more conditions than those on the more expensive plan.
Can anyone help me work out whether this is a recommendation or a classification problem, and which specific algorithms I should look at?
Thanks.

Neither. It's a statistics problem. Your dataset is complete and you don't mention any need to predict attributes of future subjects or schemes, so training a classifier or a recommender wouldn't seem to serve its usual goals.
You could use a person's conditions as features and their scheme status as the target, classify them with an SVM, and then use the classification performance/accuracy as a measure of the separability of the classes. You could also consider clustering. However, a t-test would do the same thing and is a much more accepted tool for justifying the validity of claims like this.
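As a rough illustration of the statistical route, the sketch below computes Welch's t statistic for the number of conditions per user in two groups (for example, the cheaper plan versus the more expensive plan). The data are made up and the function name `welch_t` is hypothetical. A p-value would normally come from a statistics package, and for a single yes/no condition a test on proportions may be a more natural fit than a t-test.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator), divided by the group size.
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t, df


# Made-up data: number of conditions recorded per user in each group.
cheap_plan = [3, 4, 2, 4, 3, 5, 3, 4]
expensive_plan = [2, 1, 3, 2, 2, 1, 3, 2]

t_stat, dof = welch_t(cheap_plan, expensive_plan)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large positive t here would suggest that users on the cheaper plan have more conditions on average. Whether that difference is significant depends on the p-value for the given degrees of freedom.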

It looks like you are trying to build a system that would classify a user as funded or not funded, and if not funded, reason why they were not funded.
If this is the case, what you need is a machine learning classifier that is interpretable, i.e., one whose reasoning for a given decision can be conveyed to users. You may want to look at decision trees and (to a lesser extent) Random Forests and Gradient Boosted Trees.

Related

Is there something like an ad-hoc approach in machine learning?

When we use a machine learning approach, we divide the data set into test and training data and, in effect, we always use a post hoc approach by using all the data and then calculating the y-value for a new query.
But is there such a thing as an ad hoc approach where we can go through feature by feature for a new query and see how our prediction changes?
The advantage of this would be that we know exactly which feature has changed the predictions and how.
I would be grateful for any advice, including literature references, as I don't really know how to google it. It is also possible that the term ad-hoc approach is not chosen correctly.
Very vague question. Also, why would you want to know how the prediction changes? You usually want to know which feature contributes most towards finding the 'best' prediction/correct classification. That is approached by looking at Feature Importance, which comes in different flavors for different algorithms.
In case that is kind of what you were looking for take a look at Permutation Feature Importance, Boruta Algorithm, SHAP Feature Importance, Feature Importance for tree-based algorithms, ...
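To make the first of these concrete, here is a minimal sketch of permutation feature importance: shuffle one feature column at a time and measure how much the accuracy drops. The `toy_model` function and the data are made up for illustration, and the helper names are hypothetical. In practice you would pass in the prediction function of your own fitted model and use held-out data.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the correct label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(rows)


def permutation_importance(model, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy stand-in for a trained model: it looks only at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0


rows = [[i % 2, (i * 7) % 5] for i in range(20)]  # made-up data
labels = [row[0] for row in rows]

scores = permutation_importance(toy_model, rows, labels)
print(scores)  # feature 1 is never used by toy_model, so its score is 0.0
```

Because the procedure only needs predictions, it works with any model, which is what makes it useful when comparing different algorithms.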

How to adjust feature importance in Azure AutoML

I am hoping to have some low code model using Azure AutoML, which is really just going to the AutoML tab, running a classification experiment with my dataset, after it's done, I deploy the best selected model.
The model kinda works (meaning, I publish the endpoint and then I do some manual validation, seems accurate), however, I am not confident enough, because when I am looking at the explanation, I can see something like this:
4 top features are not really closely important. The most "important" one is really not the one I prefer it to use. I am hoping it will use the Title feature more.
Is there a way to adjust the importance of individual features, for example by ranking all features before the experiment starts?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer seems to be about how to measure if a feature is important.
Hence, does it mean, if I want to customize the experiment, such as selecting which features to "focus", I should learn how to use the "designer" part in Azure ML? Or is it something I can't do, even with the designer. I guess my confusion is, with ML being such a big topic, I am looking for a direction of learning, in this case of what I am having, so I can improve my current model.
Here is a link to the documentation on feature customization.
Using the SDK you can specify "featurization": 'auto' / 'off' / 'FeaturizationConfig' in your AutoMLConfig object. Learn more about enabling featurization.
Automated ML tries out different ML models that have different settings which control for overfitting. Automated ML will pick the best overfitting parameter configuration based on the best score (e.g. accuracy) it gets from hold-out data. The kinds of overfitting settings these models have include:
Explicitly penalizing overly-complex models in the loss function that the ML model is optimizing
Limiting model complexity before training, for example by limiting the size of trees in an ensemble tree learning model (e.g. gradient boosting trees or random forest)
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls

Will correlation impact feature importance of ML models?

I am building an xgboost model with hundreds of features. For features that are highly correlated with each other (Pearson correlation), I am thinking of using feature importance (measured by Gain) to drop the ones with low importance.
My question:
1: Will correlation impact/bias feature importance (measured by Gain)?
2: Is there any good way to remove highly correlated features for ML models?
example: a's importance=120, b's importance=14, corr(a,b)=0.8. I am thinking to drop b because its importance=14. But is it correct?
Thank you.
Correlation definitely impacts feature importance. If features are highly correlated, keeping all of them introduces a high level of redundancy: when two features are correlated, a change in one tends to go along with a change in the other, so each largely represents the other. For tree ensembles such as xgboost, the gain credited to a group of correlated features also depends on which of them each split happens to pick, so an individual feature's importance can understate its relevance. There is usually no need to keep all of them, and with a few of them you can hopefully still classify your data well.
So in order to remove highly correlated features you can:
Use PCA to reduce dimensionality, or,
Use decision tree to find the important features, or,
Manually choose, based on your domain knowledge (if possible), the features that are most promising for classifying your data, or,
Manually combine some features into a new feature, so that the single combined feature makes the others unnecessary because they can likely be inferred from it.
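The idea from the question (for each highly correlated pair, keep the feature with the higher importance) can be sketched as below. The data, the importance values for "c", and the helper names `pearson` and `drop_correlated` are made up for illustration. The sketch assumes no column is constant, and since the importance scores may themselves be distorted by correlation, treat the result as a heuristic rather than a definitive selection.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, importance, threshold=0.8):
    """Keep features in descending importance, skipping any feature whose
    absolute correlation with an already kept feature reaches the threshold."""
    kept = []
    for name in sorted(columns, key=importance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 8, 10, 13],  # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 6, 3],    # only weakly related to "a"
}
importance = {"a": 120, "b": 14, "c": 40}
print(drop_correlated(columns, importance))  # ['a', 'c']
```

Here "b" is dropped because it is almost perfectly correlated with the more important "a", while "c" survives despite its lower importance.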

Machine Learning - Feature Ranking by Algorithms

I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome. I have 5 algorithms:
Neural Networks
Logistic Regression
Naive Bayes
Random Forest
Adaboost
I read a lot about Information Gain technique and it seems it is independent of the machine learning algorithm used. It is like a preprocess technique.
My question is: is it best practice to compute feature importance separately for each algorithm, or to just use Information Gain? If it should be done per algorithm, which techniques are used for each?
First of all, it's worth stressing that you have to perform the feature selection based on the training data only, even if it is a separate algorithm. During testing, you then select the same features from the test dataset.
Some approaches that spring to mind:
Mutual information based feature selection (eg here), independent of the classifier.
Backward or forward selection (see stackexchange question), applicable to any classifier but potentially costly since you need to train/test many models.
Regularisation techniques that are part of the classifier optimisation, eg Lasso or elastic net. The latter can be better in datasets with high collinearity.
Principal components analysis or any other dimensionality reduction technique that groups your features (example).
Some models compute latent variables which you can use for interpretation instead of the original features (e.g. Partial Least Squares or Canonical Correlation Analysis).
Specific classifiers can aid interpretability by providing extra information about the features/predictors, off the top of my head:
Logistic regression: you can obtain a p-value for every feature. In your interpretation you can focus on those that are 'significant' (e.g. p-value < 0.05). The same applies to two-class Linear Discriminant Analysis.
Random Forest: can return a variable importance index that ranks the variables from most to least important.
I have a dataset that contains around 30 features and I want to find out which features contribute the most to the outcome.
This will depend on the algorithm. If you have 5 algorithms, you will likely get 5 slightly different answers, unless you perform the feature selection prior to classification (eg using mutual information). One reason is that Random Forests and neural networks would pick up nonlinear relationships while logistic regression wouldn't. Furthermore, Naive Bayes is blind to interactions.
So unless your research is explicitly about these 5 models, I would rather select one model and proceed with it.
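As a minimal sketch of classifier-independent ranking with mutual information, the code below scores discrete features against the label, using training data only. The feature names and values are made up, and `mutual_information` is a hypothetical helper. Continuous features would need binning or a dedicated estimator first.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts.
        mi += (count / n) * math.log(count * n / (count_x[x] * count_y[y]))
    return mi


# Made-up training data: rank features before fitting any classifier.
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
train_features = {
    "copies_label": [0, 0, 0, 0, 1, 1, 1, 1],
    "noisy": [0, 0, 0, 1, 1, 1, 1, 0],
    "unrelated": [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = sorted(
    train_features,
    key=lambda name: mutual_information(train_features[name], train_labels),
    reverse=True,
)
print(ranking)  # ['copies_label', 'noisy', 'unrelated']
```

The same ranking can then be fed to any of the five algorithms. Note that each feature is scored on its own here, so interactions between features are not captured.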
Since your purpose is to get some intuition on what's going on, here is what you can do:
Let's start with Random Forest for simplicity, but you can do this with other algorithms too. First, you need to build a good model: good in the sense that you are satisfied with its performance, and robust, meaning that you should evaluate it on a validation and/or test set. These points are very important because we will analyse how the model makes its decisions, so if the model is bad you will get bad intuitions.
After building the model, you can analyse it at two levels: for the whole dataset (understanding your process), or for a given prediction. For this task I suggest you look at the SHAP library, which computes feature contributions (i.e. how much a feature influences the classifier's prediction) and can be used for both purposes.
For detailed instructions about this process and more tools, you can look at fast.ai's excellent machine learning course series, where lessons 2/3/4/5 are about this subject.
Hope it helps!

How to best deal with a feature relating to what type of expert labelled the data that becomes unavailable at point of classification?

Essentially I have a data set, that has a feature vector, and label indicating whether it is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used, each using a different approach to evaluate the item. The type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data has achieved a high degree of accuracy using a Random Forest algorithm.
However, it is now clear that the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all 3 approaches, the classifiers have dropped from being highly accurate, to being virtually useless.
So I am looking for any advice or intuitions as to why this has occurred and how I might approach resolving it so as to regain some of the accuracy I was previously seeing?
To be clear, I have no background in machine learning or statistics and am simply using a third-party C# code library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
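For the agreement check mentioned above, Cohen's kappa can be computed from two annotators' labels on the same items, as in the sketch below. The labels are made up and `cohens_kappa` is a hypothetical helper. It assumes both annotators labelled every item and that chance agreement is below 1, since otherwise the denominator is zero.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up labels from the two expert types on the same ten items.
expert_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam", "ham", "ham"]
expert_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "spam", "ham", "ham"]
print(round(cohens_kappa(expert_1, expert_2), 3))  # 0.583
```

In this toy case the experts agree on 8 of 10 items, but because 52% agreement is expected by chance alone, kappa comes out well below the raw 80% agreement rate.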
