Cross Validation in Feature Selection in Classification - machine-learning

I was watching a video from the well-known Intro to Statistical Learning course on doing cross validation for feature selection.
The professors said that we should form the folds before doing any model fitting or feature selection. They also said that in each split, we may end up with a different set of "best predictors".
My question is: if that is the case, how can we determine the overall best predictors for future use? In other words, if I have a new set of data, how do I know which predictors I should use?

Use the same set of features for future use. Yes, there is a trade-off: the features selected might change with time, but usually one goes with the features selected before. The important thing is that the initial data used for feature selection should be good enough, with a sufficient number of samples, that it reflects almost all cases of the problem. If that is the case, the selected features usually will not change much on new test data either.
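To make that concrete, here is a minimal sketch (assuming scikit-learn and generic data X, y): wrapping the selector and the classifier in a Pipeline means selection is re-fit inside every training fold, which is what the professors recommend, and one final fit on all the data gives the feature set to carry forward.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in data; replace with your own X, y
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),    # selection is re-fit per fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# Honest performance estimate: selection happens inside each training split
print(cross_val_score(pipe, X, y, cv=10).mean())

# For future data: fit once on everything and keep this feature set
pipe.fit(X, y)
print(pipe.named_steps["select"].get_support(indices=True))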

Related

How to adjust feature importance in Azure AutoML

I am hoping to have a low-code model using Azure AutoML, which really just means going to the AutoML tab, running a classification experiment with my dataset and, after it's done, deploying the best selected model.
The model kind of works (meaning I publish the endpoint, do some manual validation, and it seems accurate), but I am not confident enough, because when I look at the explanation I can see something like this:
The top 4 features are not really the important ones. The most "important" one is really not the one I would prefer it to use; I am hoping it will use the Title feature more.
Is there a way to adjust the importance of individual features, like ranking all the features before the experiment starts?
I would love to do more reading, but I only found this:
Increase feature importance
The only answer there seems to be about how to measure whether a feature is important.
Hence, does it mean that if I want to customize the experiment, such as selecting which features to "focus" on, I should learn how to use the "designer" part of Azure ML? Or is that something I can't do even with the designer? I guess my confusion is that, with ML being such a big topic, I am looking for a direction of learning for my particular situation, so I can improve my current model.
Here is a link to the documentation for feature customization.
Using the SDK you can specify "featurization": 'auto' / 'off' / 'FeaturizationConfig' in your AutoMLConfig object. Learn more about enabling featurization.
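For completeness, a rough sketch of that third option (this assumes the v1 azureml-sdk described in the linked docs, so check the current API names; the dataset, column names, and label below are placeholders). There is no setting that directly boosts a feature's importance, but FeaturizationConfig lets you steer how individual columns are treated:

from azureml.automl.core.featurization import FeaturizationConfig
from azureml.train.automl import AutoMLConfig

featurization_config = FeaturizationConfig()
featurization_config.add_column_purpose("Title", "Text")   # force text featurization for Title
featurization_config.drop_columns = ["RowID"]              # hypothetical column to exclude entirely

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,             # assumed: your registered tabular dataset
    label_column_name="Label",            # assumed label column name
    featurization=featurization_config,
)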
Automated ML tries out different ML models with different settings that control for overfitting, and picks the configuration with the best score (e.g. accuracy) on hold-out data. The kinds of overfitting controls these models have include:
Explicitly penalizing overly-complex models in the loss function that the ML model is optimizing
Limiting model complexity before training, for example by limiting the size of trees in an ensemble tree learning model (e.g. gradient boosting trees or random forest)
https://learn.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls

In a regression task, how do I find which independent variables are to be ignored or are not important?

In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using the ID field to estimate the dependent variable, but this is just a gut feeling; I have no strong reason for it.
What shall I do? Is there any way to decide which variables to use and which to ignore?
Well, I agree with @desertnaut. The ID attribute does not seem relevant when creating a model and provides no help in prediction.
The term you are looking for is feature selection. Since it's a comprehensive topic, I will just mention the methods most commonly used by data scientists.
For regression problems, you can try a correlation heatmap to find the features that are highly correlated with the target.
import seaborn as sns
sns.heatmap(df.corr())
There are several other ways too, like PCA, or using trees' built-in feature importance to find the right features for your model.
You can also try James Phillips' method. This approach is limited, since training time grows linearly with the number of features, but in your case, with only four usable features, you can try it out: compare the regression model trained on all four features against the models trained on only three, obtained by dropping each of the four features in turn. That means training four reduced regression models and comparing them with the full one.
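A minimal sketch of that leave-one-feature-out comparison (assuming scikit-learn, a pandas DataFrame df, and hypothetical column names):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

features = ["f1", "f2", "f3", "f4"]     # hypothetical names for your four usable columns
X, y = df[features], df["target"]       # df and "target" are placeholders

print("all four:", cross_val_score(LinearRegression(), X, y, cv=5).mean())

for col in features:
    reduced = [c for c in features if c != col]
    score = cross_val_score(LinearRegression(), df[reduced], y, cv=5).mean()
    print("without", col, ":", score)   # a clear drop suggests col matters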
As you say, the ID variable is unique for each example, so the model won't be able to learn anything from it: every example brings a new ID, and since each ID occurs only once there are no general patterns to learn.
Regarding feature elimination, it depends. If you have domain knowledge, you can engineer or remove features based on that alone. If you don't know much about the domain, you can try basic techniques like Backward Selection or Forward Selection via cross validation to find the model with the best value of the metric you're working with.
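For example, backward selection via cross validation can be sketched like this (assuming scikit-learn >= 0.24, which ships SequentialFeatureSelector; the data here is synthetic):

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, n_informative=3, random_state=0)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,    # stop once three features remain
    direction="backward",      # drop the weakest feature each round
    cv=5,                      # each candidate set is scored by cross validation
)
selector.fit(X, y)
print(selector.get_support(indices=True))   # indices of the surviving features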

Feature selection using correlation

I'm doing feature selection using correlation to train my Machine Learning (ML) models. I trained each model (SVM, NN, RF) with all features and did 10-fold cross validation to obtain a mean accuracy score.
Then I removed the features with a zero correlation coefficient (which implies there is no relationship between the feature and the class), trained each model (SVM, NN, RF) again on the remaining features, and did 10-fold cross validation to obtain a mean accuracy score.
Basically my objective is to do feature selection based on the accuracy scores I get in the above two scenarios, but I'm not sure whether this is a good approach for feature selection.
Also, I want to do a grid search to identify the best model parameters, but I'm getting confused with GridSearchCV in the scikit-learn API. Since it also does cross validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?
Please advise me on this confusion, or suggest a good reference to read.
Thanks in advance
As page 51 of this thesis says,
In other words, a feature is useful if it is correlated with or predictive of the class; otherwise it is irrelevant.
The report goes on to say that not only should you remove the features that are not correlated with the targets, you should also watch out for features that correlate heavily with each other. Also see this.
In other words, it seems to be a good idea to look at the correlation of features with the classes (targets) and remove the features that have little to no correlation.
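A minimal sketch of that filter (assuming pandas, a feature DataFrame df, and a target Series y, all placeholders; the 0.05 and 0.9 thresholds are arbitrary choices):

corr_with_target = df.corrwith(y).abs()                  # |correlation| with the class
keep = corr_with_target[corr_with_target > 0.05].index   # drop near-zero correlations
print("keeping:", list(keep))

# Features that correlate heavily with each other are largely redundant
feature_corr = df[keep].corr().abs()
redundant = [(a, b) for a in keep for b in keep
             if a < b and feature_corr.loc[a, b] > 0.9]
print("highly correlated pairs:", redundant)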
Basically my objective is to do feature selection based on the accuracy scores I get in the above two scenarios, but I'm not sure whether this is a good approach for feature selection.
Yes, you can totally run experiments with different feature sets and look at the test accuracy to select the features that perform best. It's really important that you only look at the test accuracy, i.e. the performance of the model on unseen data.
Also I want to do a grid search to identify best model parameters.
Grid search is performed for finding the best hyper-parameters. Model parameters are learned during training.
Since it also does cross validation (3 folds by default), can I use the best_score_ value obtained from a grid search in the above two scenarios to determine which features are good for model training?
If the set of hyper-parameters is fixed, the best_score_ value will be affected only by the feature set, and thus can be used to compare the effectiveness of the features.
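For instance, a sketch of that comparison (assuming scikit-learn, with X_all and X_filtered standing in for the two feature matrices from your scenarios, and y for the labels): the grid is held fixed, so any difference in best_score_ comes from the features alone.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}   # the same grid both times

for name, X in [("all features", X_all), ("filtered", X_filtered)]:
    search = GridSearchCV(SVC(), param_grid, cv=10)
    search.fit(X, y)
    print(name, search.best_score_)   # mean CV accuracy of the best setting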

How to build a good training data set for machine learning and predictions?

I have a school project to make a program that uses the Weka tools to make predictions on football (soccer) games.
Since the algorithms are already there (the J48 algorithm), I just need the data. I found a website that offers football game data for free and tried it in Weka, but the predictions were pretty bad, so I assume my data is not structured properly.
I need to extract the data from my source and format it another way in order to make new attributes and classes for my model. Does anyone know of a course/tutorial/guide on how to properly create your attributes and classes for machine learning predictions? Is there a standard that describes the best way of choosing the attributes of a data set for training a machine learning algorithm? What's the approach on this?
Here's an example of the data that I have at the moment: http://www.football-data.co.uk/mmz4281/1516/E0.csv
and here is what the columns mean: http://www.football-data.co.uk/notes.txt
The problem may be that your data set is too small. Suppose you have ten variables and each variable has a range of 10 values: there are 10^10 possible configurations of these variables. It is unlikely your data set is this large, let alone covers all of the possible configurations. The trick is to narrow the variables down to the most relevant ones to avoid this large potential search space.
A second problem is that certain combinations of variables may be more significant than others.
The J48 algorithm attempts to find the most relevant variable using entropy at each level in the tree. Each path through the tree can be thought of as an AND condition: V1==a & V2==b ...
This covers the significance due to joint interactions. But what if the outcome is a result of A&B&C OR W&X&Y? The J48 algorithm will find only one of them, and it will be the one where the first variable selected has the most overall significance when considered alone.
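To make the AND-path picture concrete, here is a rough analogue (J48 is Weka's C4.5; scikit-learn's tree is a different algorithm, but with criterion="entropy" it also chooses splits by information gain, and every root-to-leaf path prints as a chain of AND conditions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # each root-to-leaf path reads as an AND of tests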
So, to answer your question, you need to not only find a training set which will cover the most common variable configurations in the "general" population but find an algorithm which will faithfully represent these training cases. Faithful meaning it will generally apply to unseen cases.
It's not an easy task. Many people and much money are involved in sports betting. If it were as easy as selecting the proper training set, you can be sure it would have been found by now.
EDIT:
It was asked in the comments how you find the proper algorithm. The answer is: the same way you find a needle in a haystack. There is no set rule. You may be lucky and stumble across it, but in a large search space you won't ever know if you have. This is the same problem as finding the optimum point in a very convoluted search space.
A short-term answer is to
Think about what the algorithm can really accomplish. The J48 (and similar) algorithms are best suited for classification where the influence of the variables on the result are well known and follow a hierarchy. Flower classification is one example where it will likely excel.
Check the model against the training set. If it does poorly on the training set then it will likely perform poorly on unseen data. In general, you should expect the model's performance on the training set to exceed its performance on unseen data.
The algorithm needs to be tested with data it has never seen. Testing against the training set, while a quick elimination test, will likely lead to overconfidence.
Reserve some of your data for testing. Weka provides a way to do this. The best-case scenario would be to build the model on all cases except one (the Leave-One-Out approach) and then see how the model performs on average across these held-out cases.
But this assumes the data at hand are not in some way biased.
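A sketch of that leave-one-out evaluation (done here with scikit-learn for illustration; Weka has a built-in equivalent, and the iris data is just a stand-in): train on all cases but one, test on the held-out case, and average over every case.

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
print(scores.mean())   # fraction of single held-out cases predicted correctly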
A second pitfall is to let the test results bias the way you build the model, for example by trying different model parameters until you get an acceptable test response. With J48 it's not easy for this bias to creep in, but if it did, then you have just used your test set as an auxiliary training set.
Continue collecting more data and testing for as long as possible. Even after all of the above, you still won't know how useful the algorithm is unless you can observe its performance against future cases. When what appears to be a good model starts behaving poorly, it's time to go back to the drawing board.
Surprisingly, there are a large number of fields (mostly in the soft sciences) which fail to see the need to verify the model with future data. But this is a matter better discussed elsewhere.
This may not be the answer you are looking for but it is the way things are.
In summary,
(1) The training data set should cover the 'significant' variable configurations
(2) You should verify the model against unseen data
Identifying (1) and doing (2) are the tricky bits. There is no cut-and-dried recipe to follow.

How best to deal with a feature describing which type of expert labelled the data, when that feature becomes unavailable at the point of classification?

Essentially I have a data set, that has a feature vector, and label indicating whether it is spam or non-spam.
To get the labels for this data, 2 distinct types of expert were used, each using a different approach to evaluate the item; the type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data achieved a high degree of accuracy using a Random Forest algorithm.
However, it is now clear that the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all 3 approaches, the classifiers have dropped from being highly accurate to being virtually useless.
So I am looking for any advice or intuitions as to why this has occurred and how I might approach resolving it so as to regain some of the accuracy I was previously seeing?
To be clear, I have no background in machine learning or statistics and am simply using a third-party C# code library as a black box to achieve these results.
Sounds like you've completely overfit to the "who labeled what" feature (and combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than .5, indicates disagreement among the annotators, which makes machine learning very hard.
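Computing that agreement score is quick (assuming scikit-learn and two hypothetical label lists over the same items):

from sklearn.metrics import cohen_kappa_score

expert_a = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical spam labels from expert type A
expert_b = [1, 0, 0, 1, 0, 1, 1, 0]   # the same items as labelled by expert type B

print(cohen_kappa_score(expert_a, expert_b))   # well below ~0.5 means weak agreement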
Since the feature will not be available at test time, there's no easy way to get the performance back.
