When to do feature scaling? Before or after extraction? - mapping

I am planning to do SVM classification on multidimensional sensor data. There are two classes and 13 sensors. Suppose that I want to extract features e.g. average, standard deviation etc. I read from somewhere that we need to do feature scaling before apply to SVM. I am wondering when I should do the scaling, before extracting features or after extracting features?

As you have read, and as already pointed out, you would:
do feature derivation
do feature normalization (scaling, deskewing if necessary, etc)
hand data to training/evaluating model(s).
For the example you mentioned, just to be clear: I assume you mean that you want to derive (the same) features for each sample, so that you have e.g. a mean feature, standard deviation feature, etc. for each sample - which is how it should be done. Normalization, in turn, has to be done per feature over all samples.

Related

How to know if the feature variables are "good" enough for building a classifer

Building a classifier for classical problems, like image classification, is quite straightforward, since by visualization on the image we know the pixel values do contain the information about the target.
However, for the problems in which there is no obvious visualizable pattern, how should we evaluate or to see if the features collected are good enough for the target information? Or if there are some criterion by which we can conclude the collected features does not work at all. Otherwise, we have to try different algorithms or classifiers to verify the predictability of the collected data. Or if there is a thumb rule saying that if apply classical classifiers, like SVM, random forest and adaboost, we cannot get a classifier with a reasonable accuracy (70%) then we should give up and try to find some other more related features.
Or by some high dim visualization tool, like t-sne, if there is no clear pattern presented in some low dim latent space, then we should give up.
First of all, there might be NO features that explain the data well enough. The data may simply be pure noise without any signal. Therefore speaking about "reasonable accuracy" of any level e.g. 70% is improper. For some data sets a model that explains 40 % of its variance will be fantastic.
Having said that, the simplest practical way to evaluate the input features is to calculate correlations between each of them and the target.
Models have their own ways of evaluating features importance.

Feature selection & significant features in kNN

I am developing a recommendation engine with the help of kNN. Data, though, is sparse, have around 1500 samples and around 200 features. I have an ordinal target having values 1 or 0.
What would be the techniques to do feature selection for it? I am assuming that if i choose Random forest for feature selection then the selected features may be different that what kNN assume important features are.
Also, is there any restriction on the number of features containing i have so less number of samples?
Features selection techniques want either to exclude irrelevant features, or/and to exclude redundant ones. One proven technique is to use Supervized discretization based on entropy (some more generic explanation can be found here) to meaningfully reduce the size of your data, and then use Info Gain to get top k most correlated features with the target variable. There are at least 5 various methods that you can try, it also depends on the ml library/framework that you are using to implement your app.
I would try with Relief algorithm, since its core part is the nearest neighbour search.

Principal component analysis vs feature removal

I'm quite new to machine learning and just got introduced to principle component analysis as a dimensionality reduction method. What I don't understand, in which circumstances is PCA any better than simply removing some features from the model? If the aim is to obtain lower dimensional data, why don't we just group those features that are correlated and retain one single feature from each group?
There is a fundamental difference between feature reduction (such as PCA) and feature selection (which you describe). The crucial difference is that feature reduction (PCA) maps your data to lower dimensional through some projection of all original dimensions, for example PCA uses linear combination of each. So final data embedding has information from all features. If you perform feature selection you discard information, you completely loose anything that was present there. Furthermore, PCA guarantees you retaining given fraction of the data variance.

How to do machine learning when the inputs are of different sizes?

In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which all of our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures are of different dimensions, or if we want to do sentiment analysis but all of our sentences have different amounts of words, or if we want to do stellar classification but all of the stars have been observed a different number of times, etc.
I think the normal way would be to extract features of regular size from these irregularly sized data. But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use e.g. a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems, rather then feature design. Neural networks that can handle variable data inputs are:
1) Convolutional neural networks with pooling layers. They are usually used in image recognition context, but recently were applied to modeling sentences as well. ( I think they should also be good at classifiying stars ).
2) Recurrent neural networks. (Good for sequential data, like time series,sequence labeling tasks, also good for machine translation).
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (can be applied to sentence parse trees)
Lot of papers describing example applications can readily be found by googling.
For uncommon tasks you can select one of these based on the structure of your data, or you can design some variants and combinations of these systems.
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
if we want to do sentiment analysis but all of our sentences have different amounts of words
Keep a dictionary of the k words in all of your text data. Each instance will consist of a boolean vector of size k where the i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many are based on it). See the bag of words model.
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data deep learners are able to learn the appropriate features themselves.
I think the speaker probably referred to higher level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. You should feed it the raw pixels, and the deep learner will learn the "contains a nose" feature somewhere in the deeper layers.

machine learning - svm feature fusion techique

for my final thesis i am trying to build up an 3d face recognition system by combining color and depth information. the first step i did, is to realign the data-head to an given model-head using the iterative closest point algorithm. for the detection step i was thinking about using the libsvm. but i dont understand how to combine the depth and the color information to one feature vector? they are dependent information (each point consist of color (RGB), depth information and also scan quality).. what do you suggest to do? something like weighting?
edit:
last night i read an article about SURF/SIFT features i would like to use them! could it work? the concept would be the following: extracting this features out of the color image and the depth image (range image), using each feature as a single feature vector for the svm?
Concatenation is indeed a possibility. However, as you are working on 3d face recognition you should have some strategy as to how you go about it. Rotation and translation of faces will be hard to recognize using a "straightforward" approach.
You should decide whether you attempt to perform a detection of the face as a whole, or of sub-features. You could attempt to detect rotation by finding some core features (eyes, nose, etc).
Also, remember that SVMs are inherently binary (i.e. they separate between two classes). Depending on your exact application you will very likely have to employ some multi-class strategy (One-against-all or One-against-many).
I would recommend doing some literature research to see how others have attacked the problem (a google search will be a good start).
It sounds simple, but you can simply concatenate the two vectors into one. Many researchers do this.
What you arrived at is an important open problem. Yes, there are some ways to handle it, as mentioned here by Eamorr. For example you can concatenate and do PCA (or some non linear dimensionality reduction method). But it is kind of hard to defend the practicality of doing so, considering that PCA takes O(n^3) time in the number of features. This alone might be unreasonable for data in vision that may have thousands of features.
As mentioned by others, the easiest approach is to simply combine the two sets of features into one.
SVM is characterized by the normal to the maximum-margin hyperplane, where its components specify the weights/importance of the features, such that higher absolute values have a larger impact on the decision function. Thus SVM assigns weights to each feature all on its own.
In order for this to work, obviously you would have to normalize all the attributes to have the same scale (say transform all features to be in the range [-1,1] or [0,1])

Resources