Feature Subset Selection - machine-learning

Before we reduce the dimension of a data set, we apply a learning algorithm to that set and obtain an objective function that produces a result for a data sample. This may be our classifier or regressor.
After that, we apply a feature selection or feature extraction approach. What I am wondering about is the subset selection algorithm, i.e. the algorithm that implements the feature selection approach:
According to the resources I have read so far, "you start with an empty feature set, and in each step of the algorithm, the feature which most increases the performance of your objective function is selected and added to your feature set. This operation continues until adding a new feature no longer improves the performance of your classifier or regressor."
What if adding new features keeps improving the performance of my objective function? In that case, I would have to add all features to my feature set, which means choosing all of them. However, I am trying to reduce the dimension of the data samples.

It depends on the problem and your dataset, but in general, with the feature selection strategy that you are describing (Sequential Forward Selection), it is quite unlikely that the end result will be to keep all the variables. In most cases you will either hit a local optimum or run into irrelevant variables first.
However, in the rare case that this does happen, it is basically telling you that all the features in your dataset are important - i.e. removing any of them would harm the accuracy of your model.
If you still want to reduce the number of features in that case, you can either modify your objective function (so that it considers both the current accuracy and the percentage of features eliminated - perhaps as a weighted objective) or change your feature selection heuristic (you can use, for example, Sequential Backward Selection, which is very similar but starts by considering all the features and then tries to remove them one by one).
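For concreteness, below is a minimal sketch of Sequential Forward Selection with exactly that stopping rule (stop as soon as no remaining feature improves the cross-validated score). The synthetic dataset, the LogisticRegression estimator, and the 5-fold scoring are illustrative assumptions, not part of the original answer.

    # Minimal sketch of Sequential Forward Selection with an early stop.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, n_features=10,
                               n_informative=4, random_state=0)
    estimator = LogisticRegression(max_iter=1000)

    selected, remaining = [], list(range(X.shape[1]))
    best_score = float("-inf")

    while remaining:
        # Score every candidate feature added to the current subset.
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best_f, best_f_score = max(scores.items(), key=lambda kv: kv[1])
        # Stop as soon as adding a feature no longer improves performance.
        if best_f_score <= best_score:
            break
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = best_f_score

    print("selected features:", selected, "CV accuracy:", round(best_score, 3))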

Related

Feature Selection Techniques

I am completely new to statistical modelling. I wanted to know what the feature selection techniques are.
Say I have 10 variables, but I need to know which ones among them are actually important.
I have read about feature selection on the internet and came across a few techniques:
Correlation
Forward Selection
Backward Elimination
But I am not getting how I can use them. How can correlation be used in feature selection? How do I perform forward selection / backward elimination, etc.?
Which models can I use for feature selection? I just want a high-level overview: when to use what.
Can someone help me get started?
Correlation - In this approach we look at how strongly the target variable is correlated with each predictor, keep the predictors that are highly correlated with it, and ignore the others.
Forward Selection - Here we start with 0 predictors and check the model performance. Then at every stage we add the one predictor that gives the best model performance.
Backward Elimination (Backward Selection) - Here we start with all the predictors. Then at every stage we remove the one predictor whose removal gives the best model performance.
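As a rough, high-level sketch of how these three ideas can be tried in practice (assuming scikit-learn >= 0.24 for SequentialFeatureSelector; the synthetic data, the 0.3 correlation threshold, and the choice of 4 features to keep are arbitrary illustrative assumptions):

    # Correlation screening plus forward/backward selection with scikit-learn.
    import pandas as pd
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                           noise=10.0, random_state=0)
    df = pd.DataFrame(X, columns=[f"x{i}" for i in range(10)])
    df["target"] = y

    # Correlation: keep predictors whose absolute correlation with the target
    # exceeds an (arbitrary) threshold.
    corr = df.corr()["target"].drop("target").abs()
    kept_by_corr = corr[corr > 0.3].index.tolist()
    print("kept by correlation:", kept_by_corr)

    # Forward selection / backward elimination via SequentialFeatureSelector.
    forward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                        direction="forward").fit(X, y)
    backward = SequentialFeatureSelector(LinearRegression(), n_features_to_select=4,
                                         direction="backward").fit(X, y)
    print("forward: ", forward.get_support(indices=True))
    print("backward:", backward.get_support(indices=True))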

Machine learning: do unbalanced non-numeric variable classes matter?

If I have a non-numeric variable in my data set that contains many rows of one category but few of another, does this cause the same issues as when the target classes are unbalanced?
For example, suppose one of my variables is title and the aim is to identify whether a person is obese. The obese class is split 50:50 in the data, but there is only one row with the title 'Duke', and this row is in the obese class. Does this mean that an algorithm like logistic regression (after numeric encoding) would start predicting that all Dukes are obese (or give a disproportionate weighting to the title 'Duke')? If so, are some algorithms better or worse at handling this case? Is there a way to prevent this issue?
Yes, any vanilla machine learning algorithm will treat categorical data the same way as numerical data in terms of the information it extracts from a specific feature.
Consider this: before applying any machine learning algorithm you should analyse your input features and identify how much of the variance in the target each of them explains. In your case, if the label Duke is always identified as obese, then for that specific dataset it is an extremely high-information feature and will be weighted as such.
I would mitigate this issue by adding a weight to that feature, thus limiting the impact it has on the target. However, this would be a shame if it is an otherwise very informative feature for other instances.
An algorithm that can easily circumvent this problem is a random forest (decision trees): you can eliminate any rule that is based on this feature being 'Duke'.
Be very careful when mapping this feature to numbers, as this will have an impact on the importance attributed to the feature by most algorithms.
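As a tiny, made-up illustration of the concern in the question: after one-hot encoding, a title that appears in a single row still gets its own column, and the weight learned for that column is effectively estimated from that one example. The data and the choice of logistic regression are purely illustrative.

    # One-hot encode a 'title' column containing a single 'Duke' row and
    # inspect the coefficient each title receives.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.DataFrame({
        "title": ["Mr"] * 5 + ["Ms"] * 5 + ["Duke"],
        "obese": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    })
    X = pd.get_dummies(df[["title"]])          # one column per title
    model = LogisticRegression().fit(X, df["obese"])

    # Note that the weight for 'title_Duke' is learned from a single example.
    for name, coef in zip(X.columns, model.coef_[0]):
        print(f"{name:12s} {coef:+.3f}")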

Random feature mapping

I'm taking a machine learning course. In the second part of exercise 2, we are supposed to use a feature map: new features are added by mapping the original features into all polynomial terms of x1 and x2 up to the sixth power. However, my instructor told me I shouldn't use this approach and should instead add features randomly. But we add new features in order to classify better, so wouldn't adding features randomly make this more complicated? Can we add features randomly, or should we follow some rule?
Adding new features (e.g. polynomials of the existing features) helps to reduce the error by allowing a more complex hypothesis. But this may lead to overfitting the training data and may not produce good results on the test set.
So, when adding new features, the following should be considered:
1) Manually select which features to keep by analysing the results.
2) Alternatively, use all the features and then apply regularization, which will automatically give less importance to features that contribute little to the target variable and more importance to those that contribute a lot.
3) Randomly selecting features may or may not help. It is always better to choose the features that contribute most towards the target variable, so random selection is usually not the appropriate solution.
Important Note
Always use a validation set to check the error during training.
When working with polynomial features, always check the learning curve to make sure the model is not overfitting the training data. If it is, try increasing the regularization parameter (lambda). Regularization helps to reduce overfitting.
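A small sketch of the polynomial feature map combined with regularization, assuming scikit-learn; note that in scikit-learn's LogisticRegression the parameter C is the inverse of lambda, so smaller C means stronger regularization. The dataset, the degree, and the grid of C values are illustrative assumptions.

    # Degree-6 polynomial feature map with varying regularization strength.
    from sklearn.datasets import make_moons
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

    for C in (100.0, 1.0, 0.01):      # weak -> strong regularization
        model = make_pipeline(PolynomialFeatures(degree=6, include_bias=False),
                              LogisticRegression(C=C, max_iter=5000))
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"C={C:<6} cross-validated accuracy: {score:.3f}")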

Decision tree entropy calculation target

I found several examples of two types.
Single feature
Given data with only two classes of items, for example only blue and yellow balls, i.e. we have only one feature, in this case colour. This is a clear example to show the "divide and conquer" rule applied to entropy. But it is senseless for any prediction or classification problem, because if we have an object with only one feature and the value is known, we don't need a tree to decide that "this ball is yellow".
Multiple features
Given data with multiple features and a feature to predict (known for the training data). We can choose a predicate based on the minimum average entropy for each feature. Closer to real life, isn't it? It was clear to me until I tried to implement the algorithm.
And now I have a contradiction in my mind.
If we calculate entropy relative to the known features (one per node), the tree will only classify meaningfully if the unknown feature strictly depends on every known feature. Otherwise a single unrelated known feature could break the whole prediction and drive a decision the wrong way. But if we calculate entropy relative to the values of the feature we want to predict, we are back at the first, senseless example. In that case there would be no difference in which known feature to use for a node...
And a question about the tree-building process.
Should I calculate entropy only for the known features and simply trust that all the known features are related to the unknown one? Or should I ALSO calculate entropy for the unknown feature (known for the training data), to determine which feature affects the result the most?
I had the same problem (in maybe a similar programming task) some years ago: do I calculate the entropy against the complete set of features, the features relevant for a branch, or the features relevant for a level?
It turned out like this: in a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. Comparison requires equal base sets, i.e. whenever you want to compare two entropy values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets.
Otherwise, go with the whole feature set.
(Disclaimer: the above solution is a mental protocol of a problem that led to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
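For what it is worth, a minimal sketch of how entropy is usually applied in ID3-style trees may make the comparison concrete: the entropy is computed over the distribution of the target labels, and the candidate features at a node are compared by how much they reduce it (information gain). The toy data below are made up purely for illustration.

    # Entropy of the target labels and information gain per candidate feature.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(rows, feature, target):
        base = entropy([r[target] for r in rows])
        remainder = 0.0
        for v in {r[feature] for r in rows}:
            subset = [r[target] for r in rows if r[feature] == v]
            remainder += len(subset) / len(rows) * entropy(subset)
        return base - remainder

    rows = [
        {"outlook": "sunny",    "windy": True,  "play": "no"},
        {"outlook": "sunny",    "windy": False, "play": "no"},
        {"outlook": "rainy",    "windy": True,  "play": "no"},
        {"outlook": "rainy",    "windy": False, "play": "yes"},
        {"outlook": "overcast", "windy": False, "play": "yes"},
        {"outlook": "overcast", "windy": True,  "play": "yes"},
    ]

    # The feature with the highest gain ('outlook' here) is chosen for the node.
    for feature in ("outlook", "windy"):
        print(feature, round(information_gain(rows, feature, "play"), 3))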

Does it make sense to construct a learning model using only one feature?

In order to improve the accuracy of an AdaBoost classifier (for image classification), I am using genetic programming to derive new statistical measures. Every time a new feature is generated, I evaluate its fitness by training an AdaBoost classifier and testing its performance. But I want to know whether that procedure is correct; I mean the use of a single feature to train a learning model.
You can build a model on one feature. I assume that by "one feature" you mean simply one number in R (otherwise it would be completely "traditional" usage). However, this means that you are building a classifier in a one-dimensional space, and as such many classifiers will be redundant (as it is a really simple problem). More importantly, checking whether you can correctly classify objects using one particular dimension does not tell you whether it is a good or bad feature once you use a combination of them. In particular it may be the case that:
Many features may "discover" the same phenomenon in the data, so each of them separately can yield good results, but once combined they won't be any better than each of them alone (as they simply capture the same information).
Features may be useless until used in combination. Some phenomena can only be described in a multi-dimensional space, and if you analyse only one-dimensional data you will never discover their true value. As a simple example, consider four points (0,0),(0,1),(1,0),(1,1) such that (0,0),(1,1) belong to one class and the rest to the other. If you look at each dimension separately, the best possible accuracy is 0.5 (as you always have points of two different classes at exactly the same positions, 0 and 1). Once combined, you can easily separate them, as it is a XOR problem (see the sketch after the summary below).
To sum up, it is OK to build a classifier in a one-dimensional space, but:
Such a problem can be solved without "heavy machinery".
The results should not be used as a basis for feature selection (or, to be more strict, doing so can be very deceptive).
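A tiny sketch of that XOR example; the choice of a small decision tree as the two-dimensional classifier is just an illustrative assumption.

    # Each coordinate alone gives at best 50% accuracy; both together separate
    # the classes perfectly.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 1, 1, 0])   # (0,0) and (1,1) in one class, the rest in the other

    for dim in (0, 1):
        clf = DecisionTreeClassifier().fit(X[:, [dim]], y)
        print(f"feature {dim} alone: accuracy {clf.score(X[:, [dim]], y):.2f}")

    clf = DecisionTreeClassifier().fit(X, y)
    print(f"both features:   accuracy {clf.score(X, y):.2f}")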
