I am developing a recommendation engine using kNN. The data, however, is sparse: around 1,500 samples and around 200 features. The target is binary, taking values 0 or 1.
What techniques could I use for feature selection here? I assume that if I choose, say, a random forest for feature selection, the selected features may differ from the features that kNN would consider important.
Also, is there any restriction on the number of features I can use, given that I have so few samples?
Feature selection techniques aim to exclude irrelevant features, redundant features, or both. One proven approach is to use supervised, entropy-based discretization to meaningfully reduce the size of your data, and then use information gain to pick the top-k features most associated with the target variable. There are at least five different methods you can try; it also depends on the ML library/framework you are using to implement your app.
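As a minimal sketch of the information-gain step, assuming scikit-learn and a dense feature matrix; the data below is a random placeholder for your own:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Placeholder data: replace with your real feature matrix and binary target
rng = np.random.RandomState(0)
X = rng.rand(1500, 200)
y = rng.randint(0, 2, size=1500)

# Keep the 20 features with the highest estimated mutual information with y
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                     # (1500, 20)
print(selector.get_support(indices=True))   # indices of the kept features
```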
I would also try the Relief algorithm, since its core component is a nearest-neighbour search, which aligns well with kNN.
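A short sketch of ReliefF, assuming the scikit-rebate (skrebate) package is available; the data and the number of selected features are placeholders:

```python
import numpy as np
from skrebate import ReliefF

# skrebate expects NumPy arrays, not DataFrames
rng = np.random.RandomState(0)
X = rng.rand(1500, 200)
y = rng.randint(0, 2, size=1500)

# ReliefF scores features by how well they separate nearest neighbours
# of the same class from nearest neighbours of a different class
fs = ReliefF(n_features_to_select=20, n_neighbors=10)
fs.fit(X, y)

top_features = np.argsort(fs.feature_importances_)[::-1][:20]
print(top_features)
```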
I am building an XGBoost model with hundreds of features. For features that are highly correlated with each other (Pearson correlation), I am thinking of using feature importance (measured by gain) to drop the one with lower importance.
My questions:
1: Will correlation impact/bias feature importance (measured by gain)?
2: Is there any good way to remove highly correlated features for ML models?
Example: a's importance = 120, b's importance = 14, corr(a, b) = 0.8. I am thinking of dropping b because its importance is only 14. Is that correct?
Thank you.
Correlation definitely impacts feature importance. If features are highly correlated, there is a high level of redundancy in keeping all of them: when two features are correlated, a change in one is accompanied by a change in the other, so they are largely representative of one another. There is no need to keep all of them; using just a few of them you can hopefully classify your data well.
So, in order to remove highly correlated features, you can:
Use PCA to reduce dimensionality, or
Use a decision tree to find the important features, or
Manually choose the features that, based on your domain knowledge, look more promising for classifying your data (if that is possible), or
Manually combine some features into a new feature, so that a single derived feature may eliminate the need for a whole set of features whose values can largely be inferred from it.
A sketch of the simple correlation-threshold approach is given below.
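As a minimal sketch of the correlation-threshold idea (not XGBoost-specific), assuming the features live in a pandas DataFrame; the 0.8 threshold and the column names are just examples:

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.8):
    """Drop one feature from every pair whose absolute Pearson
    correlation exceeds the threshold (keeps the first of each pair)."""
    corr = df.corr().abs()
    # Look only at the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Example usage with random placeholder data
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(100, 5), columns=list("abcde"))
df["f"] = df["a"] * 0.9 + rng.rand(100) * 0.1   # strongly correlated with "a"
print(drop_correlated(df).columns.tolist())      # "f" should be dropped
```

A refinement matching what you describe would be to drop, within each correlated pair, the feature with the lower gain-based importance rather than simply the later one.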
Building a classifier for classical problems, such as image classification, is quite straightforward, since by visualizing the image we know the pixel values contain information about the target.
However, for problems where there is no obvious visual pattern, how should we evaluate whether the collected features carry enough information about the target? Are there criteria by which we can conclude that the collected features do not work at all? Otherwise, we have to try different algorithms or classifiers to verify the predictability of the collected data. Is there a rule of thumb saying that if classical classifiers such as SVM, random forest and AdaBoost cannot reach a reasonable accuracy (say 70%), then we should give up and try to find other, more relevant features?
Or, using a high-dimensional visualization tool such as t-SNE, if no clear pattern appears in some low-dimensional latent space, should we give up?
First of all, there might be NO features that explain the data well enough. The data may simply be pure noise without any signal. Therefore, speaking about a "reasonable accuracy" of any fixed level, e.g. 70%, is not meaningful. For some data sets, a model that explains 40% of the variance will be fantastic.
Having said that, the simplest practical way to evaluate the input features is to calculate the correlation between each of them and the target.
Models also have their own ways of evaluating feature importance.
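A minimal sketch of that first check, assuming a pandas DataFrame with a numeric (or 0/1) target column; the column names and data below are placeholders:

```python
import numpy as np
import pandas as pd

# Placeholder data: replace with your own DataFrame
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(200, 4), columns=["f1", "f2", "f3", "f4"])
df["target"] = (df["f1"] + 0.1 * rng.rand(200) > 0.5).astype(int)

# Pearson correlation of every feature with the target, sorted by magnitude
corr_with_target = (
    df.drop(columns="target")
      .corrwith(df["target"])
      .abs()
      .sort_values(ascending=False)
)
print(corr_with_target)
```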
I am still exploring this area of machine learning, and although I know the difference between feature selection and dimensionality reduction, I am finding it difficult to grasp when to do feature selection, dimensionality reduction, or both together.
Assuming that I have a dataset with around 40 features, is it good practice to perform dimensionality reduction alone or feature selection alone? Or should there be a hybrid of both approaches (i.e. do feature selection first and then dimensionality reduction, or vice versa)?
The term feature selection is a bit misleading. It can have two meanings:
Selecting features by incorporating domain knowledge (this can involve constructing new features as well).
For example, finding rotation-invariant points in an image data set, or creating BMI as a new feature when you have height and weight as features.
Keeping only the features of high importance according to some measure.
This is one step of the dimensionality reduction process. The so-called dimensionality reduction process actually involves two steps:
Transforming the original features to new (artificial) features by changing the basis.
e.g. PCA does this by finding a set of orthogonal directions such that the variance along each axis is maximized.
Keeping only the most important features (importance being defined by some measure) resulting from the above step. This is actually a feature selection step.
e.g. in PCA, this is achieved by keeping only the top-k components with the highest explained variance.
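A minimal sketch of those two steps with scikit-learn's PCA; the choice of k = 10 and the random data are placeholders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data: 500 samples, 40 original features
rng = np.random.RandomState(0)
X = rng.rand(500, 40)

# Step 1 + 2 together: change of basis, then keep the top-k components
X_scaled = StandardScaler().fit_transform(X)  # PCA is scale-sensitive
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (500, 10)
print(pca.explained_variance_ratio_.sum())  # variance retained by 10 components
```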
As for the order in which (1) and (2) above should happen: I think this is problem dependent.
If there is enough domain knowledge to construct/select features suited to the problem at hand, we should do the manual feature engineering (plus selection) first. If this feature engineering/selection process still results in a large number of features, then the so-called dimensionality reduction can be applied to find a subspace that represents the data with an even smaller number of entirely new features, which usually have little meaning in real life.
If domain knowledge can't add anything to the data set, just doing dimensionality reduction is fine, as it already contains a feature selection step.
In a broad sense, we can think that feature selection is actually a special case of dimensionality reduction where no basis change occurs to the original data set.
Essentially I have a data set with a feature vector and a label indicating whether each item is spam or non-spam.
To obtain the labels for this data, two distinct types of expert were used, each evaluating the items with a different approach. The type of expert used then also became a feature in the vector.
Training and then testing on a separate portion of the data achieved a high degree of accuracy using a random forest algorithm.
However, it is clear now that the feature describing the expert who made the label will not be available in a live environment. So I have tried a number of approaches to reflect this:
Remove the feature from the set and retrain and test
Split the data into 2 distinct sets based on the feature, and then train and test 2 separate classifiers
For the test data, set the feature in question all to the same value
With all three approaches, the classifiers have dropped from being highly accurate to being virtually useless.
So I am looking for any advice or intuition as to why this has occurred, and how I might approach resolving it so as to regain some of the accuracy I was previously seeing.
To be clear I have no background in machine learning or statistics and am simply using a third party c# code library as a black box to achieve these results.
It sounds like you've completely overfit to the "who labeled what" feature (and to combinations of this feature with other features). You can find out for sure by inspecting the random forest's feature importances and checking whether the annotator feature ranks high. Another way to find out is to let the annotators check each other's annotations and compute an agreement score such as Cohen's kappa. A low value, say less than 0.5, indicates disagreement among the annotators, which makes machine learning very hard.
Since the feature will not be available at test time, there's no easy way to get the performance back.
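As a minimal sketch of those two checks with scikit-learn; the data, feature names, and annotator labels below are placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# Placeholder training data; "annotator" is the suspect feature (last column)
rng = np.random.RandomState(0)
X = rng.rand(300, 5)
annotator = rng.randint(0, 2, size=300)
y = annotator  # worst case: the label simply mirrors the annotator feature
X = np.column_stack([X, annotator])
feature_names = ["f1", "f2", "f3", "f4", "f5", "annotator"]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Check 1: does the annotator feature dominate the importances?
for name, imp in sorted(zip(feature_names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")

# Check 2: agreement between two annotators who labelled the same items
labels_expert_a = rng.randint(0, 2, size=100)
labels_expert_b = rng.randint(0, 2, size=100)
print("Cohen's kappa:", cohen_kappa_score(labels_expert_a, labels_expert_b))
```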
For my final thesis I am trying to build a 3D face recognition system by combining color and depth information. The first step was to realign the data head to a given model head using the iterative closest point algorithm. For the detection step I was thinking about using libsvm, but I don't understand how to combine the depth and the color information into one feature vector. They are dependent pieces of information (each point consists of color (RGB), depth information and also scan quality). What do you suggest? Something like weighting?
Edit:
Last night I read an article about SURF/SIFT features and I would like to use them. Could it work? The concept would be the following: extract these features from the color image and from the depth (range) image, and use each feature as a single feature vector for the SVM?
Concatenation is indeed a possibility. However, as you are working on 3D face recognition, you should have some strategy for how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to detect the face as a whole, or its sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc.).
Also, remember that SVMs are inherently binary (i.e. they separate two classes). Depending on your exact application, you will very likely have to employ some multi-class strategy (one-against-all or one-against-one).
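A minimal sketch of wrapping a binary SVM in an explicit one-against-all scheme with scikit-learn (rather than libsvm directly); the descriptors and number of identities are placeholders:

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder: 300 faces, 128-dimensional combined descriptors, 5 identities
rng = np.random.RandomState(0)
X = rng.rand(300, 128)
y = rng.randint(0, 5, size=300)

# One binary SVM per identity (one-against-all)
clf = OneVsRestClassifier(make_pipeline(StandardScaler(), SVC(kernel="rbf")))
clf.fit(X, y)
print(clf.predict(X[:3]))
```

Note that scikit-learn's SVC also handles multi-class internally (one-against-one); the explicit wrapper above simply makes the one-against-all strategy visible.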
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It may sound too simple, but you can just concatenate the two vectors into one. Many researchers do this.
What you have arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example, you can concatenate the vectors and then apply PCA (or some non-linear dimensionality reduction method). But it is hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features; this alone might be unreasonable for vision data that can have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
A linear SVM is characterized by the normal to the maximum-margin hyperplane; its components specify the weights/importance of the features, such that features with higher absolute weights have a larger impact on the decision function. Thus the SVM assigns weights to each feature on its own.
For this to work, you would obviously have to normalize all the attributes to the same scale (say, transform all features to the range [-1, 1] or [0, 1]).
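A minimal sketch of that pipeline, assuming the color and depth descriptors are already NumPy arrays with one row per sample; the dimensions and names are placeholders:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Placeholder descriptors: 200 samples, 64 color dims, 32 depth dims
rng = np.random.RandomState(0)
color_features = rng.rand(200, 64)
depth_features = rng.rand(200, 32) * 1000   # deliberately on a different scale
y = rng.randint(0, 2, size=200)

# Concatenate, then bring everything to a common [0, 1] scale
X = np.hstack([color_features, depth_features])
X = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# The absolute weights hint at which dimensions drive the decision
weights = np.abs(clf.coef_).ravel()
print("top weighted dimensions:", np.argsort(weights)[::-1][:5])
```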