Why in preprocessing image data for a neural network, we need to zero-centered data. Why is this?
Mean-subtraction or zero-centering is a common pre-processing technique that involves subtracting mean from each of the data point to make it zero-centered. Consider a case where inputs to a neuron are all positive or all negative. In that case the gradient calculated during back propagation will either be positive or negative (of the same sign as inputs). And hence parameter updates are only restricted to specific directions which in turn will make it difficult to converge. As a result, the gradient updates go too far in different directions which makes optimization harder. Many algorithms show better performances when the dataset is symmetric (with a zero-mean).
Related
I know that feature selection helps me remove features that may have low contribution. I know that PCA helps reduce possibly correlated features into one, reducing the dimensions. I know that normalization transforms features to the same scale.
But is there a recommended order to do these three steps? Logically I would think that I should weed out bad features by feature selection first, followed by normalizing them, and finally use PCA to reduce dimensions and make the features as independent from each other as possible.
Is this logic correct?
Bonus question - are there any more things to do (preprocess or transform)
to the features before feeding them into the estimator?
If I were doing a classifier of some sort I would personally use this order
Normalization
PCA
Feature Selection
Normalization: You would do normalization first to get data into reasonable bounds. If you have data (x,y) and the range of x is from -1000 to +1000 and y is from -1 to +1 You can see any distance metric would automatically say a change in y is less significant than a change in X. we don't know that is the case yet. So we want to normalize our data.
PCA: Uses the eigenvalue decomposition of data to find an orthogonal basis set that describes the variance in data points. If you have 4 characteristics, PCA can show you that only 2 characteristics really differentiate data points which brings us to the last step
Feature Selection: once you have a coordinate space that better describes your data you can select which features are salient.Typically you'd use the largest eigenvalues(EVs) and their corresponding eigenvectors from PCA for your representation. Since larger EVs mean there is more variance in that data direction, you can get more granularity in isolating features. This is a good method to reduce number of dimensions of your problem.
of course this could change from problem to problem, but that is simply a generic guide.
Generally speaking, Normalization is needed before PCA.
The key to the problem is the order of feature selection, and it's depends on the method of feature selection.
A simple feature selection is to see whether the variance or standard deviation of the feature is small. If these values are relatively small, this feature may not help the classifier. But if you do normalization before you do this, the standard deviation and variance will become smaller (generally less than 1), which will result in very small differences in std or var between the different features.If you use zero-mean normalization, the mean of all the features will equal 0 and std equals 1.At this point, it might be bad to do normalization before feature selection
Feature selection is flexible, and there are many ways to select features. The order of feature selection should be chosen according to the actual situation
Good answers here. One point needs to be highlighted. PCA is a form of dimensionality reduction. It will find a lower dimensional linear subspace that approximates the data well. When the axes of this subspace align with the features that one started with, it will lead to interpretable feature selection as well. Otherwise, feature selection after PCA, will lead to features that are linear combinations of the original set of features and they are difficult to interpret based on the original set of features.
I am trying to implement a binary classifier using logistic regression for data drawn from 2 point sets (classes y (-1, 1)). As seen below, we can use the parameter a to prevent overfitting.
Now I am not sure, how to choose the "good" value for a.
Another thing I am not sure about is how to choose a "good" convergence criterion for this sort of problem.
Value of 'a'
Choosing "good" things is a sort of meta-regression: pick any value for a that seems reasonable. Run the regression. Try again with a values larger and smaller by a factor of 3. If either works better than the original, try another factor of 3 in that direction -- but round it from 9x to 10x for readability.
You get the idea ... play with it until you get in the right range. Unless you're really trying to optimize the result, you probably won't need to narrow it down much closer than that factor of 3.
Data Set Partition
ML folks have spent a lot of words analysing the best split. The optimal split depends very much on your data space. As a global heuristic, use half or a bit more for training; of the rest, no more than half should be used for testing, the rest for validation. For instance, 50:20:30 is a viable approximation for train:test:validate.
Again, you get to play with this somewhat ... except that any true test of the error rate would be entirely new data.
Convergence
This depends very much on the characteristics of your empirical error space near the best solution, as well as near local regions of low gradient.
The first consideration is to choose an error function that is likely to be convex and have no flattish regions. The second is to get some feeling for the magnitude of the gradient in the region of a desired solution (normalizing your data will help with this); use this to help choose the convergence radius; you might want to play with that 3x scaling here, too. The final one is to play with the learning rate, so that it's scaled to the normalized data.
Does any of this help?
I have features with 18 dimensions after doing feature selection and will be used to train classifier, RNN, HMM, etc.
The features contain stddev, mean and derivative of accelerometer and gyroscope.
These features have different units and normalization/standardization will lose the true meaning of features.
For example, the unit of one feature vector is rotational velocity (degree/sec), the value in that feature is between -120 and 120.
Another is stddev of acceleration of x-axis, the value is mainly between 0 and 2.
If I want to do standardization, all the feature vectors will be centered near 0, with negative/positive values spread around zero. --> Even the stddev will have negative values! It totally loses actual meaning?
Am I on the wrong track? Any information is appreciated! Thanks!
It is always strongly recomendated to perform feature scaling and normalization as preprocessing step, and it will even benefit gradient descent(the most common learning algorithm),even in your case it would be useful but if you are in doubt you can perform cross validation. For example when using images and neural networks, sometimes after normalization the features(pixels) get negative values, that doesnt make the training data to lose meaning.
I have an indefinitely large training set to train a neural network.
Does it make any sense in this scenario to use regularization techniques like dropout?
Yes, it probably still does. Dropout is regularization in a sense, but much subtler than something like L1 norm. It prevents excessive co-adaptation of feature detectors as described in the original paper.
You probably don't want the network to learn to depend on just one feature or just a small combo of features, even if that is the best feature in your training set, because it may not be the case in new data. Intuitively, a network with dropout trained to recognize people in images will likely still recognize them if the face is obscured, even if there was no example image like that in the training set (because the face high level feature would have been dropped out some fraction of the time); a network trained without dropout may not (because the face feature is probably one of the best single features for detecting people). You can think of dropout as a certain degree of forced concept generalization.
Empirically, the feature detectors that are produced with dropout are much more structured (eg, for images: closer to Gabor filters, for the first few layers) when dropout is used; without dropout they are closer to random (probably because that network approximates the Gabor filter it is converging towwards using a specific linear combo of random filters, if it can rely on the elements of that combo not being dropped out then there is no gradient towards separating the filters). This is also probably a good thing since it forces features which are independent to be implemented as independent early on, which may result in lower cross-talk later on.
I have some basic conceptual queries on SVM - it will be great if any one can guide me on this. I have been studying books and lectures for a while but have not been able to get answers for these queries correctly
Suppose I have m featured data points - m > 2. How will I know if the data points are linearly separable or not?. If I have understood correctly, linearly separable data points - will not need any special kernel for finding the hyper plane as there is no need to increase the dimension.
Say, I am not sure whether the data is linearly separable or not. I try to get a hyper plane with linear kernel, once with slackness and once without slackness on the lagrange multipliers. What difference will I see on the error rates on training and test data for these two hyper planes. If I understood correctly, if the data is not linearly separable, and if I am not using slackness then there cannot be any optimal plane. If that is the case, should the svm algorithm give me different hyper planes on different runs. Now when I introduce slackness - should I always get the same hyper plane, every run ? And how exactly can I find out from the lagrange multipliers of a hyper plane, whether the data was linearly separable or not.
Now say from 2 I came to know somehow that the data was not linearly separable at m dimensions. So I will try to increase the dimensions and see if it is separable at a higher dimension. How do I know how high I will need to go ? I know the calculations do not go into that space - but is there any way to find out from 2 what should be the best kernel for 3 (i.e I want to find a linearly separating hyper plane).
What is the best way to visualize hyper planes and data points in Matlab where the feature dimensions can be as big as 60 - and the hyperplane is at > 100 dimensions (i,e data points in few hundreds and using Gaussian Kernels the feature vector changes to > 100 dimensions).
I will really appreciate if someone clears these doubts
Regards
I'm going to try to focus on your questions (1), (2) and (3). In practice the most important concern is not if the problem becomes linearly separable but how well the classifier performs on unseen data (i.e. how well it classifies). It seems you want to find a good kernel for which data is linearly separable, and you will always be able to do this (consider putting at each training point an extremely narrow gaussian RBF), but what you really want is good performance on unseen data. That being said:
If the problem is not linearly separable and not using slacks the optimization will fail. It depends on the implementation and the specific optimization algorithm how it fails, does it not converge?, does it not find a descent direction? does it run into numerical difficulties? Even if you wanted to determine cases with slacks, you can still run into numerical difficulties that would make and that alone would make your algorithm of linear separability unreliable
How high do you need to go? Well that is a fundamental question. It is called the problem of data representation. For straight forward solutions people use held out data (people don't care about linear separability they care about good performance on held out data) and do parameter search (for example an RBF kernel can is strictly more expressive than a linear kernel) under the correct gammas. So the problem becomes finding a good gamma for your data. See for example this paper: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.880
I don't think there is a trivial connection between the values of the lagrangian multipliers and linear separability. You can try an high alphas whose value is C, but I'm not sure you'll be able to say much.