How to scale features for training and classification? - machine-learning

I'm using an SVM for data classification. I've scaled the training data using the formula (x - mean) / (standard deviation), but I do not know how to scale the new data before classifying it.
Which mean and standard deviation should I use for the new data? That is, should I use the new data's own mean and standard deviation (computed from the data to be classified), or the training data's mean and standard deviation, for each component of the vector?
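A minimal sketch of the scaling described, assuming scikit-learn's StandardScaler (the toy arrays are placeholders); the usual convention is to fit the scaler on the training data and reuse those training statistics for any new data:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Toy data standing in for the training set and the new data to classify (hypothetical)
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 4))
    y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
    X_new = rng.normal(size=(10, 4))

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # (x - mean) / std, statistics from the training set

    clf = SVC(kernel="rbf").fit(X_train_scaled, y_train)

    # New data is transformed with the *training* mean and std, not its own statistics
    X_new_scaled = scaler.transform(X_new)
    print(clf.predict(X_new_scaled))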

Related

SMOTE oversampling for anomaly detection using a classifier

I have sensor data and I want to do live anomaly detection. I use LOF on the training set to detect anomalies, and then feed the labeled data to a classifier so it can classify new data points. I thought about using SMOTE because I want more anomalous points in the training data to overcome the imbalanced-classification problem, but the issue is that SMOTE created many points that lie inside the normal range.
How can I do oversampling without creating samples in the normal data range?
[Figure: the data before applying SMOTE]
[Figure: the data after applying SMOTE]
SMOTE is going to linearly interpolate synthetic points between a minority class sample's k-nearest neighbors. This means that you're going to end up with points between a sample and its neighbors. When samples are all over the place like this, it makes sense that you're going to create synthetic points in the middle.
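For reference, the interpolation SMOTE performs is roughly the following (a schematic sketch, not any particular library's implementation):

    import numpy as np

    def smote_like_point(x, neighbor, rng):
        # The synthetic point lies somewhere on the segment between x and one of its
        # minority-class nearest neighbors, which is why it can land inside the normal range.
        lam = rng.random()
        return x + lam * (neighbor - x)

    rng = np.random.default_rng(0)
    print(smote_like_point(np.array([0.0, 5.0]), np.array([4.0, 1.0]), rng))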
SMOTE is really meant for carving out more specific regions of the feature space as the decision region for the minority class. That doesn't seem to be your use case; you just want to know which points "don't belong."
This seems like a fairly nice use case for DBSCAN, a density-based clustering algorithm that will identify points beyond some distance, eps, as not belonging to the same neighborhood.
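A minimal sketch of that idea with scikit-learn's DBSCAN; the eps and min_samples values, and the toy data, are assumptions you would have to adapt to your sensor readings:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Toy 2-D data: a dense "normal" cluster plus a few scattered anomalies (hypothetical)
    rng = np.random.default_rng(0)
    normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
    anomalies = rng.uniform(low=-4, high=4, size=(5, 2))
    X = np.vstack([normal, anomalies])

    labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)
    outliers = X[labels == -1]  # DBSCAN labels points outside any dense neighborhood as -1
    print(len(outliers), "points flagged as not belonging to any cluster")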

How to use Weight vector of SVM and logistic regression for feature importance?

I have trained an SVM and a logistic regression classifier on my dataset. Both classifiers provide a weight vector whose length equals the number of features. I can use this weight vector to select the 10 most important features by simply taking the 10 features with the highest weights.
Should I use the absolute values of the weights, i.e. select the 10 features with the highest absolute values?
Second, as I have read, this only works for an SVM with a linear kernel, not with an RBF kernel: for a non-linear kernel the weights no longer combine with the features linearly. What is the exact reason that the weight vector cannot be used to determine feature importance in the case of a non-linear kernel SVM?
As I answered in a similar question, the weight vector of any linear classifier indicates feature importance, simply because the final score is a linear combination of the feature values with the weights as coefficients: the bigger the weight, the more the corresponding summand contributes to the final value.
Thus, for a linear classifier you can take the features with the biggest weights (not the features with the biggest values themselves, nor the biggest products of weight and feature value).
This also explains why SVMs with non-linear kernels such as RBF do not have this property: both the feature values and the weights are transformed into another space, and you can no longer say that a bigger weight leads to a bigger impact; see the wiki.
If you need to select the most important features for a non-linear SVM, use dedicated feature-selection methods, namely wrapper methods.
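For the linear case, a minimal sketch assuming scikit-learn's LinearSVC and synthetic data; ranking by the absolute value of the coefficients (as the question suggests) picks the features with the largest impact in either direction:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import LinearSVC

    # Synthetic data standing in for the asker's dataset (hypothetical)
    X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=0)

    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    weights = clf.coef_.ravel()  # one weight per feature

    top10 = np.argsort(np.abs(weights))[::-1][:10]  # indices of the 10 largest |weight|
    print("Most important feature indices:", top10)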

One-class Support Vector Machine Sensitivity Drops when the number of training sample increase

I am using a One-Class SVM for outlier detection. It appears that as the number of training samples increases, the sensitivity TP/(TP+FN) of the One-Class SVM drops, while the classification rate and specificity both increase.
What's the best way of explaining this relationship in terms of hyperplane and support vectors?
Thanks
The more training examples you have, the less able your classifier is to detect true positives correctly.
This means that the new data does not fit the model you are training as well as before.
Here is a simple example.
Below you have two classes, and we can easily separate them using a linear kernel.
The sensitivity of the blue class is 1.
As I add more yellow training data near the decision boundary, the generated hyperplane can't fit the data as well as before.
As a consequence, we now see two misclassified blue data points.
The sensitivity of the blue class is now 0.92.
As the number of training points increases, the support vectors generate a somewhat less optimal hyperplane. Perhaps the extra data turns a linearly separable data set into one that is not linearly separable; in such a case, trying a different kernel, such as the RBF kernel, can help.
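A hedged illustration of that last point with scikit-learn (the concentric-circles dataset is synthetic and only stands in for data that a line cannot separate): the RBF kernel typically recovers the sensitivity that the linear kernel loses.

    from sklearn.datasets import make_circles
    from sklearn.metrics import recall_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Concentric circles: not linearly separable by construction
    X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for kernel in ("linear", "rbf"):
        clf = SVC(kernel=kernel).fit(X_train, y_train)
        sens = recall_score(y_test, clf.predict(X_test))  # sensitivity = TP / (TP + FN)
        print(kernel, "kernel sensitivity:", round(sens, 2))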
EDIT: More information about the RBF kernel:
In this video you can see what happens with an RBF kernel.
The same logic applies: if the training data is not easily separable in n dimensions, you will get worse results.
You should try to select a better C using cross-validation.
In this paper, figure 3 illustrates that the results can be worse if C is not properly selected:
"More training data could hurt if we did not pick a proper C. We need to cross-validate on the correct C to produce good results."
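A minimal sketch of selecting C by cross-validation with scikit-learn's GridSearchCV; the candidate grid and the synthetic data are assumptions to keep the example self-contained:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Synthetic stand-in data (hypothetical)
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    param_grid = {"C": [0.01, 0.1, 1, 10, 100]}  # candidate penalties, assumed range
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="recall")
    search.fit(X, y)

    print("Best C:", search.best_params_["C"])
    print("Cross-validated recall:", round(search.best_score_, 3))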

Gaussian basis function selection - Linear Regression

I'm looking to set up a linear regression using 2D Gaussian basis functions. My input training variables cover a two-dimensional space. Before applying the machine learning (Bayesian linear regression), I need to select parameters for the Gaussians (mean and variance) and also decide how many basis functions to use.
I am currently spacing the means (of a preallocated number of basis Gaussians) evenly over a grid, and just assuming constant variance. This is obviously not the best approach.
Any ideas on how to calculate these variables?
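For concreteness, here is a sketch of the grid-spaced approach described above (means on an even grid, one shared variance); the grid size, sigma, and input range are illustrative assumptions, not a recommendation:

    import numpy as np

    def gaussian_design_matrix(X, centers, sigma):
        # X: (n_samples, 2) inputs; centers: (n_basis, 2) grid of means; sigma: shared width
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dist / (2.0 * sigma ** 2))  # shape (n_samples, n_basis)

    # Evenly spaced 5x5 grid of means over [0, 1]^2 with a constant variance (assumed values)
    grid = np.linspace(0.0, 1.0, 5)
    centers = np.array([(cx, cy) for cx in grid for cy in grid])
    sigma = 0.25

    X = np.random.default_rng(0).uniform(size=(200, 2))
    Phi = gaussian_design_matrix(X, centers, sigma)  # design matrix for (Bayesian) linear regression
    print(Phi.shape)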

HOG descriptor training using SVMs

I am trying to classify road signs. For this I want to train an SVM on HOG descriptors. I have extracted the HOG descriptors for training images of size 64x64. The positive training data make up 60% and the negative 40% of the whole sample. When I train with OpenCV's SVM (with a linear kernel) everything seems fine, but when I try to predict, the results fail and show only one class (the result is always 1). I have tried to feed my data into SVMlight as well, and all the negatives are misclassified. Any ideas what could be going wrong? Maybe the small number of training samples? (I am just trying to get the code working and confirm that everything is fine, without using the full training data.)
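A hedged sketch of the pipeline described, using scikit-image and scikit-learn instead of OpenCV/SVMlight (an assumption made only to keep the example short); the images here are random noise standing in for real 64x64 road-sign crops:

    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)

    def hog_features(images):
        # One HOG descriptor per 64x64 grayscale image
        return np.array([hog(img, orientations=9, pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2)) for img in images])

    # Fake positive/negative crops standing in for road signs and background (hypothetical)
    positives = rng.random((60, 64, 64))
    negatives = rng.random((40, 64, 64))
    X = hog_features(np.vstack([positives, negatives]))
    y = np.array([1] * len(positives) + [0] * len(negatives))

    clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
    print(clf.predict(X[:5]))  # sanity check: on real data this should not collapse to one class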
