Feature Selection Techniques - machine-learning

I am completely new to statistical modelling. I want to know what the feature selection techniques are.
Say I have 10 variables, but I need to know which of them are actually important.
I have read about feature selection on the internet and came across a few techniques:
Correlation
Forward Selection
Backward Elimination
But I do not understand how to use them. How can correlation be used in feature selection? How do I perform Forward Selection / Backward Elimination, etc.?
Which models can I use for feature selection? I just want a high-level overview of when to use what.
Can someone help me get started?

Correlation - In this approach we look at how the target variable is correlated with each predictor, keep the ones that are highly correlated and ignore the others.
Forward Selection - Here we start with 0 predictors and check the model performance. At every stage we add the one predictor that gives the best model performance.
Backward Selection - Here we start with all the predictors. At every stage we remove the one predictor whose removal gives the best model performance.
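As a rough illustration of all three (a scikit-learn sketch on a placeholder 10-feature dataset, not your own data): the correlation filter ranks predictors by their absolute correlation with the target, while forward selection and backward elimination repeatedly refit a model and keep whichever change scores best under cross-validation.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Toy regression data with 10 predictors (placeholder for your own data).
X, y = load_diabetes(return_X_y=True, as_frame=True)

# 1) Correlation: rank predictors by |correlation with the target|.
correlations = X.corrwith(y).abs().sort_values(ascending=False)
print(correlations)          # keep the top few, drop the rest

# 2) Forward selection: start from 0 predictors, greedily add the one
#    that improves cross-validated model performance the most.
forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="forward", cv=5
)
forward.fit(X, y)
print("Forward selection kept:", list(X.columns[forward.get_support()]))

# 3) Backward elimination: start from all predictors, greedily drop the
#    one whose removal hurts performance the least.
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", cv=5
)
backward.fit(X, y)
print("Backward elimination kept:", list(X.columns[backward.get_support()]))
```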

Related

Feature Subset Selection

Before we reduce the dimension of a data set, we apply a learning algorithm to that set, and we obtain an objective function which generates a result for a data sample. This may be our classifier or regressor.
After that, we apply a feature selection or feature extraction approach. What I am wondering about is the subset selection algorithm, i.e. the algorithm that follows the feature selection approach:
According to resources I have read so far, "you start with an empty feature set, and in each step of the algorithm, the feature which increases the performance of your objective function is selected and added to your feature set. This operation continues until adding a new feature does not improve the performance of your classifier or regressor."
What if adding new features continues to improve the performance of my objective function? In this case, I would have to add all features to my feature set, which means I choose every feature. However, I am trying to reduce the dimension of the data samples.
It depends on the problem and your dataset; but in general, with the feature selection strategy that you are describing (Sequential Forward Selection) it is quite unlikely that the end result would be to keep all the variables. You are either going to find local minima or irrelevant variables in most cases.
However, in the rare case that this happens, this would be basically telling you that all the features in your dataset are important - i.e. removing any of them would harm the accuracy of your model.
If the above is not a problem for you, you can either modify your objective function (so it considers both the current accuracy and the % of features eliminated - maybe as a weighted objective) or change your feature selection heuristic (you can use, for example, Sequential Backward Selection - which is very similar but starts considering all the features initially and then tries to remove them one by one).
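A minimal hand-rolled sketch of that stopping behaviour (the dataset, model and tolerance value below are placeholders of my own, not from the question): greedy forward selection that stops as soon as the best possible addition no longer improves the cross-validated score by more than a small tolerance, which is what normally keeps the search from ending up with every feature.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)          # placeholder dataset
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
tol = 0.002          # minimum improvement required to keep adding features

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf

while remaining:
    # Score every candidate extension of the current feature set.
    scores = {
        f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
        for f in remaining
    }
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])

    # Stop when the best addition no longer improves the objective enough.
    # (Alternatively, subtract a small penalty per feature from s_best to
    # weight accuracy against the number of features kept.)
    if s_best - best_score < tol:
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print(f"kept {len(selected)} of {X.shape[1]} features, CV accuracy {best_score:.3f}")
```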

Random feature mapping

I'm taking a machine learning course. In the second part of exercise 2, we are supposed to use a feature map, which adds new features by mapping the features into all polynomial terms of x1 and x2 up to the sixth power. However, my instructor told me I shouldn't use this approach and should instead add features randomly. But we add new features in order to classify better, so wouldn't adding features randomly make this more complicated? Can we add features randomly, or should we follow some rule?
Adding new features (e.g. polynomials of the existing features) helps to reduce the error by allowing a more complex hypothesis. But this may lead to overfitting the training data and may not produce good results on the test set.
So, when adding new features, the following should be considered:
1) Manually select which features to keep by analyzing the results.
2) Another way is to use all the features and then apply regularization, which will automatically give less importance to features that contribute little to the target variable and more importance to those that contribute more.
3) Randomly selecting features may or may not help. You generally want to choose the features that contribute most towards the target variable, so random selection is usually not the appropriate solution.
Important Note
Always use a validation set to check the error during training.
While working with polynomial features, always check the learning curve to make sure the model is not overfitting the training data. If it is, try increasing the regularization parameter (lambda). Regularization helps reduce overfitting.
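As a rough scikit-learn illustration of point 2 (the dataset, the degree-6 mapping and the C values are stand-ins I chose, not the course's exact exercise): generate the polynomial terms, fit a regularized classifier, and compare training and validation accuracy to pick the regularization strength.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Two-feature toy data standing in for (x1, x2) in the exercise.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Map (x1, x2) to all polynomial terms up to the sixth power, then fit a
# regularized classifier; smaller C means stronger regularization (larger lambda).
for C in [0.01, 0.1, 1.0, 10.0]:
    clf = make_pipeline(
        PolynomialFeatures(degree=6, include_bias=False),
        StandardScaler(),
        LogisticRegression(C=C, max_iter=5000),
    )
    clf.fit(X_train, y_train)
    print(f"C={C:>5}: train={clf.score(X_train, y_train):.3f}  "
          f"val={clf.score(X_val, y_val):.3f}")

# A large gap between train and validation accuracy signals overfitting:
# increase regularization (decrease C) rather than adding random features.
```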

How many and/or what criteria does CfsSubsetEvaluator use in selecting features in each step of cross-validation while doing feature selection?

I am quite new to WEKA, and I have a dataset of 111 cases with 109 attributes. I am using the feature selection tab in WEKA with CfsSubsetEval and the BestFirst search method, with leave-one-out cross-validation.
So, how many features does WEKA pick, or what is the stopping criterion for the number of features this method selects in each step of cross-validation?
Thanks,
Gopi
The CfsSubsetEval algorithm is searching for a subset of features that work well together (have low correlation between the features and a high correlation to the target label). The score of the subset is called merit (you can see it in the output).
The BestFirst search won't let you set the number of features to select. However, you can use other methods such as GreedyStepwise, or use the InformationGain/GainRatio evaluators with the Ranker search and define the size of the feature set.
Another option you can use to influence the size of the set is the direction of the search (forward, backward...).
Good luck
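For what it's worth, the Ranker-style route mentioned above looks roughly like this outside of Weka. The scikit-learn sketch below (placeholder dataset and a k of my own choosing, not Weka's API) ranks every attribute by an information-based score and keeps a fixed number, which is the knob that CfsSubsetEval + BestFirst does not expose.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)   # placeholder for your 111 x 109 dataset

# Rank every attribute by an information-theoretic score (similar in spirit to
# Weka's InformationGain evaluator with the Ranker search) and keep exactly k.
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_reduced = selector.fit_transform(X, y)

print("scores:", selector.scores_.round(3))
print("kept attribute indices:", selector.get_support(indices=True))
```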

Decision tree entropy calculation target

I found several examples of two types.
Single feature
Given data with only two classes of items, for example only blue and yellow balls, we have only one feature, which in this case is colour. This is a clear example to show the "divide and conquer" rule applied to entropy. But it is useless for any prediction or classification problem, because if we have an object with only one feature and its value is known, we don't need a tree to decide that "this ball is yellow".
Multiple features
Given data with multiple features and a feature to predict (known for the training data), we can calculate a predicate based on the minimum average entropy for each feature. Closer to life, isn't it? It was clear to me until I tried to implement the algorithm.
And now I have a contradiction in my mind.
If we calculate entropy relative to the known features (one per node), we will only get meaningful classification results from the tree if the unknown feature is strictly dependent on every known feature. Otherwise a single unrelated known feature could break the whole prediction and drive a decision the wrong way. But if we calculate entropy relative to the values of the feature we want to predict, we are back to the first, senseless example. In that case there is no difference which known feature to use for a node...
And a question about a tree building process.
Should I calculate entropy only for the known features and just trust that all the known features are related to the unknown one? Or should I also calculate the entropy of the unknown feature (known for the training data) to determine which feature affects the result most?
I had the same problem (in maybe a similar programming task) some years ago: do I calculate the entropy against the complete set of features, the relevant features for a branch, or the relevant features for a level?
It turned out like this: in a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. Comparison requires equal base sets, i.e. whenever you want to compare two entropy values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets.
Otherwise go with the whole feature set.
(Disclaimer: the above solution is reconstructed from memory, from a problem that led to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
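To make the "entropy of what?" question concrete, here is a small sketch of the standard ID3-style calculation on made-up data: the entropy is always that of the target labels (the feature you want to predict), and each known feature is scored by how much splitting on it reduces that entropy.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of the target labels (the feature we want to predict)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """How much splitting on `feature` reduces the entropy of `target`."""
    base = entropy([row[target] for row in rows])
    split = 0.0
    for value in {row[feature] for row in rows}:
        subset = [row[target] for row in rows if row[feature] == value]
        split += len(subset) / len(rows) * entropy(subset)
    return base - split

# Tiny made-up training set: two known features, one feature to predict.
data = [
    {"colour": "blue",   "size": "big",   "label": "yes"},
    {"colour": "blue",   "size": "small", "label": "yes"},
    {"colour": "yellow", "size": "big",   "label": "no"},
    {"colour": "yellow", "size": "small", "label": "no"},
]

# The known feature with the highest gain becomes the node's split:
# here colour scores 1.0 (perfectly predictive) and size scores 0.0.
for feature in ("colour", "size"):
    print(feature, round(information_gain(data, feature, "label"), 3))
```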

How to remove redundant features using weka

I have around 300 features and I want to find the best subset of features by using feature selection techniques in weka. Can someone please tell me what method to use to remove redundant features in weka :)
There are mainly two types of feature selection techniques that you can use in Weka:
Feature selection with wrapper method:
"Wrapper methods consider the selection of a set of features as a search problem, where different combinations are prepared, evaluated and compared to other combinations. A predictive model us used to evaluate a combination of features and assign a score based on model accuracy.
The search process may be methodical such as a best-first search, it may stochastic such as a random hill-climbing algorithm, or it may use heuristics, like forward and backward passes to add and remove features.
An example if a wrapper method is the recursive feature elimination algorithm." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
Feature selection with filter method:
"Filter feature selection methods apply a statistical measure to assign a scoring to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider the feature independently, or with regard to the dependent variable.
Examples of some filter methods include the chi-squared test, information gain and correlation coefficient scores." [From http://machinelearningmastery.com/an-introduction-to-feature-selection/]
If you are using the Weka GUI, then you can take a look at two of my video casts here and here.
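If you later move from the Weka GUI to code, the two families look roughly like this in scikit-learn (a sketch on placeholder data, not Weka's own API): a filter method scores each feature independently, while a wrapper such as recursive feature elimination repeatedly refits a model and discards the weakest features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)   # placeholder for your ~300-feature data

# Filter method: chi-squared score per feature (chi2 needs non-negative inputs).
X_pos = MinMaxScaler().fit_transform(X)
filter_sel = SelectKBest(score_func=chi2, k=10).fit(X_pos, y)
print("filter kept:", filter_sel.get_support(indices=True))

# Wrapper method: recursive feature elimination around a predictive model.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10)
wrapper_sel.fit(X, y)
print("wrapper kept:", wrapper_sel.get_support(indices=True))
```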

Resources