Is there a need to normalise input vector for prediction in SVM? - machine-learning

For input data with features on different scales, I understand that the values used to train the classifier (SVM) have to be normalized for correct classification.
So does the input vector for prediction also need to be normalized?
My scenario is that the training data is normalized, serialized, and saved in the database. When a prediction has to be made, the serialized data is deserialized to get the normalized numpy array, the classifier is fit on that array, and the input vector is then passed to the classifier for prediction. So does this input vector also need to be normalized? If so, how do I do it, since at prediction time I don't have the original training data to normalize against?
Also, I am normalizing along axis=0, i.e. along the columns.
My code for normalizing is:
preprocessing.normalize(data, norm='l2', axis=0)
Is there a way to serialize preprocessing.normalize?

For SVMs, scaling the features is recommended for several reasons.
Many optimization methods behave better when all features are on the same scale.
Many kernel functions internally use a Euclidean distance to compare two samples (in the Gaussian kernel the Euclidean distance appears in the exponential term). If every feature has a different scale, the Euclidean distance is dominated by the features with the largest scale.
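A small numeric illustration of this point. It is only a sketch: the two features (income and age) and all values are made up for the example.

```python
import numpy as np

# Hypothetical samples: column 0 is income (order 10^4), column 1 is age.
X = np.array([
    [50_000.0, 25.0],
    [50_100.0, 65.0],   # similar income to row 0, very different age
    [52_000.0, 26.0],   # similar age to row 0, somewhat different income
])

def euclidean(a, b):
    """Plain Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw distances: the income column decides almost everything.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Standardize each column (zero mean, unit standard deviation).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
d_std_01 = euclidean(Xs[0], Xs[1])
d_std_02 = euclidean(Xs[0], Xs[2])
```

On the raw data, row 2 looks far more distant from row 0 than row 1 does, purely because of the income column. After standardization the two distances are of comparable size, so the age difference is no longer ignored.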
To put the features on the same scale, you typically subtract each feature's mean and divide by its standard deviation:
$$x_i \rightarrow \frac{x_i - m_i}{\sigma_i}$$
You must store the mean and standard deviation of every feature in the training set so that the same operation can be applied to future data.
In Python, scikit-learn provides a class that does this for you:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
To obtain the means and standard deviations:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
Then, to normalize the training set (X is a matrix where every row is a sample and every column is a feature):
X = scaler.transform(X)
After training, you must normalize any future data the same way before classification:
newData = scaler.transform(newData)
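On the serialization part of the question: preprocessing.normalize is a plain function that keeps no fitted state, so there is nothing in it to serialize. A fitted StandardScaler, on the other hand, is an ordinary Python object and can be pickled. The following is a minimal sketch of that round trip; the training values and variable names are made up, and storing the bytes in a database is assumed to happen wherever `blob` is used.

```python
import pickle

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: rows are samples, columns are features.
X_train = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])

# Fit once on the training data; the scaler now holds mean_ and scale_.
scaler = StandardScaler().fit(X_train)

# Serialize the fitted scaler to bytes (these could be saved next to the model).
blob = pickle.dumps(scaler)

# Later, at prediction time: restore it and transform the new input vector.
restored = pickle.loads(blob)
new_vector = np.array([[2.0, 3000.0]])   # shape (1, n_features)
scaled = restored.transform(new_vector)
```

With this approach the original training data is not needed at prediction time, only the stored scaler.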

Related

Data Science Scaling/Normalization real case

When doing data pre-processing, it is suggested to do either scaling or normalization. That is easy when you have the data in hand: you have all of it and can do it right away. But after the model is built and running, does the first data that comes in need to be scaled or normalized? If so, it is only a single row, so how do I scale or normalize it? How do we know the min/max/mean/stdev of each feature? And how does the incoming data relate to the min/max/mean of each feature?
Please advise.
First of all, you should know when to use scaling and normalization.
Scaling - Scaling simply transforms your features to comparable magnitudes. Say you have a feature like a person's income, and you notice that some values are of order 10^3 and some of order 10^6. If you model your problem with such features, algorithms like KNN or Ridge Regression will give more weight to the attributes with larger magnitudes. To prevent this, you first need to scale your features. The min-max scaler is one of the most commonly used scalers.
Mean Normalisation - If, after examining the distribution of a feature, you find that it is not centered around zero, then algorithms like SVM, whose objective function assumes zero mean and variances of the same order, can have trouble modeling the data. In that case you should apply mean normalisation.
Standardization - For algorithms like SVM, neural networks, and logistic regression, it is important that the variances of the features are of the same order. So why not make them all one? In standardization, we transform each feature to have zero mean and unit variance.
Now let's try to answer your question in terms of training and testing sets.
Let's say you are training your model on a 50k-sample dataset and testing on a 10k-sample dataset.
For the three transformations above, the standard approach is to fit any normalizer or scaler on the training dataset only, and to apply only the transform to the testing dataset.
In our case, if we want to use standardization, we first fit our standardizer on the 50k training dataset and then use it to transform both the 50k training dataset and the testing dataset.
Note - We shouldn't fit our standardizer to the test dataset; instead, we use the already fitted standardizer to transform the testing dataset.
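The fit-on-train, transform-both pattern can be sketched with plain NumPy so that the stored statistics are visible. The data here are random and the set sizes are scaled down from 50k/10k to 500/100; all names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 500 training rows and 100 test rows, 3 features
# with very different locations and spreads.
loc = [10.0, 200.0, -5.0]
scale = [1.0, 50.0, 0.1]
X_train = rng.normal(loc=loc, scale=scale, size=(500, 3))
X_test = rng.normal(loc=loc, scale=scale, size=(100, 3))

# "fit": the statistics come from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# "transform": the same statistics are applied to both sets.
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std
```

The transformed training set has exactly zero mean and unit variance per feature. The transformed test set is only approximately so, which is expected: it was scaled with the training statistics, exactly as a single incoming row would be in production.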
Yes, you need to apply normalization to the input data, otherwise the model will predict nonsense.
You also have to save the normalization coefficients that were computed from the training data. Then you have to apply the same coefficients to incoming data.
For example, if you use min-max normalization:
f_n = (f - min(f)) / (max(f) - min(f))
then you need to save min(f) and max(f) in order to normalize new data.
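A minimal sketch of that, assuming the coefficients are kept as two arrays (the training values, the incoming row, and the helper name `min_max` are made up for illustration):

```python
import numpy as np

# Hypothetical training matrix: rows are samples, columns are features.
X_train = np.array([[5.0, 100.0], [7.5, 300.0], [10.0, 200.0]])

# Coefficients to persist alongside the model.
f_min = X_train.min(axis=0)
f_max = X_train.max(axis=0)

def min_max(x, f_min, f_max):
    """Apply f_n = (f - min(f)) / (max(f) - min(f)) with stored coefficients."""
    return (x - f_min) / (f_max - f_min)

X_train_n = min_max(X_train, f_min, f_max)

# A single incoming row is normalized with the stored training min and max.
new_row = np.array([6.0, 350.0])
new_row_n = min_max(new_row, f_min, f_max)
```

Note that an incoming value outside the training range (350 here, above the training maximum of 300) maps outside [0, 1]. Whether to clip such values is a separate design decision.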

Neural Networks normalizing output data

I have training data for a NN along with expected outputs. Each input is a 10-dimensional vector and has 1 expected output. I have normalised the training data using Gaussian normalisation, but I don't know how to normalise the outputs since they have only a single dimension. Any ideas?
Example:
Raw Input Vector: -128.91, 71.076, -100.75, 4.2475, -98.811, 77.219, 4.4096, -15.382, -6.1477, -361.18
Normalised Input Vector: -0.6049, 1.0412, -0.3731, 0.4912, -0.3571, 1.0918, 0.4925, 0.3296, 0.4056, -2.5168
The raw expected output for the above input is 1183.6, but I don't know how to normalise that. Should I normalise the expected output as part of the input vector?
From the looks of your problem, you are trying to implement some sort of regression algorithm. For regression problems you don't normally normalize the outputs. The expected outputs in your training data should lie in the range you expect the system to produce, or simply be whatever data you have for the expected outputs.
Therefore, you can normalize the training inputs to allow the training to go faster, but you typically don't normalize the target outputs. When it comes to testing time or providing new inputs, make sure you normalize the data in the same way that you did during training. Specifically, use exactly the same normalization parameters from training for any test inputs into the network.
One important remark is that you normalized the elements of a single input vector. With a one-dimensional output space, you could not normalize the output that way.
The correct way is to take a complete batch of training data, say N input (and output) vectors, and normalize each dimension (variable) individually (using the N samples). Thus, for a one-dimensional output, you will have N samples to normalize over. In this way, the vector space of your input will not be distorted.
Normalization of the output dimensions is usually required when the scales of the output variables differ significantly. After training, you should use the same set of normalization parameters (e.g., for z-score these are "mean" and "std") that you obtained from the training data. That way, you put new (unseen) data into the same scale space as the training data.
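To make the per-dimension idea concrete, here is a small z-score sketch with made-up numbers. The batch size, the values, and the helper `denormalize` are all assumptions for illustration. It also shows mapping a prediction made in normalized space back to raw units.

```python
import numpy as np

# Hypothetical batch of N = 4 training samples with 2 input features
# and one scalar target each.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y = np.array([1000.0, 1200.0, 1100.0, 1300.0])

# Per-dimension statistics over the N samples (axis 0), not per vector.
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(), y.std()

X_norm = (X - x_mean) / x_std
y_norm = (y - y_mean) / y_std

def denormalize(pred_norm):
    """Map a prediction made in normalized space back to raw units."""
    return pred_norm * y_std + y_mean

# Round trip: normalizing and then denormalizing recovers the raw targets.
y_back = denormalize(y_norm)
```

If the network is trained on `y_norm`, its outputs have to go through `denormalize` (with the stored `y_mean` and `y_std`) before they are compared with raw values such as 1183.6.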

How to classify MNIST data set using k-means clustering?

I am applying K-Means clustering to the MNIST dataset. How can I then predict the values of my test set from it?
Well, k-means is an unsupervised technique, so technically speaking you don't use it to "classify": a k-means model isn't supplied with labeled data (if it is, it doesn't use the class labels), and moreover it doesn't return a prediction as a class label (e.g., "1").
So, to use k-means to predict the single digit encoded in a given data instance:
your k-means model consists of a set of centroids (I assume you chose 10 centroids to correspond to the numbers 0 - 9 in base 10); each centroid represents the geometric center of one cluster, one cluster per number
calculate the pairwise Euclidean distance (vector norm) between your unknown data point and each centroid in your k-means model (the centroid values from the final iteration, obviously)
the cluster whose centroid is the least distance from the unknown data point is the cluster to which the unknown data point belongs
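The steps above can be sketched as follows. The centroids are made-up 2-D points standing in for the 784-dimensional MNIST centroids, and `assign_to_centroid` is a hypothetical helper name.

```python
import numpy as np

def assign_to_centroid(x, centroids):
    """Return the index of the centroid closest to x (Euclidean distance)."""
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))

# Hypothetical centroids taken from the final k-means iteration.
centroids = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])

# The unknown point is closest to the second centroid (index 1).
cluster = assign_to_centroid(np.array([4.0, 6.0]), centroids)
```

The returned value is a cluster index, not a digit. Because k-means numbers its clusters arbitrarily, turning the index into a digit needs some labeled examples, for instance by taking the most common label among the training points assigned to each cluster.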

Support Vector Machine: Feature Transformation

How do I do the transformation in the test data when I have the trained SVM model in hand? I am trying to simulate the SVM output from mathematical equations and the trained SVM model (using RBF kernel). How do I do that?
In SVM, some of the common kernels used are:
Here xi and xj represent two samples. Now if the data has, say, 5 samples, does this transformation include all the combinations of two samples to generate the transformed feature space, like x1 and x1, x1 and x2, x1 and x3, ..., x4 and x5, x5 and x5?
If the data has two features, then a polynomial transformation of order 2 transforms the input to 3 dimensions, as explained here in slide 15: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf
Now how can I find a similar explanation for the transformation using the RBF kernel? I am trying to write code for transforming the test data so that I can apply the trained SVM model to it.
This is way more complex than that. In short, you do not map your data directly into feature space. You simply change the dot product to the one induced by the kernel. When you work with a polynomial kernel, what happens "inside" the SVM is that each point is (indirectly) transformed to an O(d^p)-dimensional space (where d is the input data dimension and p is the degree of the polynomial kernel). From the mathematical perspective you work with some (often unknown) projection phi_K(x) which has the property that K(x, y) = <phi_K(x), phi_K(y)>, and nothing more. SVM implementations do not need the actual data representation (as phi_K(x) is usually huge, sometimes even infinite, as in the RBF case); instead they need the vector of dot products of your point with each element of the training set.
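For a case where phi_K is small enough to write down, the property K(x, y) = <phi_K(x), phi_K(y)> can be checked numerically. This sketch assumes the homogeneous degree-2 polynomial kernel K(x, y) = (x . y)^2 on 2-D inputs, whose explicit map is phi(x) = (x1^2, sqrt(2) x1 x2, x2^2); the sample points are made up.

```python
import math

def phi(x):
    """Explicit 3-D feature map for K(x, y) = (x . y)^2 with 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(a, b):
    """Ordinary dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2."""
    return dot(x, y) ** 2

x = (1.0, 2.0)
y = (3.0, -1.0)
k_direct = poly_kernel(x, y)       # kernel evaluated in the 2-D input space
k_mapped = dot(phi(x), phi(y))     # dot product in the 3-D feature space
```

Both values agree (up to floating-point rounding), which is why the SVM never has to build phi(x) explicitly. For the RBF kernel no finite-dimensional phi exists, so only the kernel-evaluation route is available.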
Thus what you do (in implementations, not from the math perspective) is provide:
During training, the whole Gram matrix G, defined as G_ij = K(x_i, x_j), where x_i is the i'th training sample
During testing, when you get a new point y, you provide it to the SVM as a vector of dot products H such that H_i = K(y, x_i), where again the x_i are your training points (in fact you just need the values for the support vectors, but many implementations, like libsvm, actually require a vector of the size of the training set; you can simply put 0's for K(y, x_j) if x_j is not a support vector)
Just remember that this is not the same as training a linear SVM "on top" of the above representation. This is just the way implementations usually accept your data, as they need a definition of the dot product (a function) and it is often easier to pass numbers than functions (but some of them, like scikit-learn's SVC, actually accept a function as the kernel parameter).
So what is the RBF kernel? It is actually a mapping from points into a function space of normal distributions with means at your training points. The dot product is then just the integral from -inf to +inf of the product of two such functions. Sounds complex? It is at first sight, but it is a really nice trick, worth understanding!
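Coming back to the practical question of reproducing a trained model's output for a test point: with an RBF kernel of the form K(y, x) = exp(-gamma * ||y - x||^2) (the parameterization used by libsvm and scikit-learn), the binary decision value is a weighted sum of kernel values over the support vectors plus an intercept. The sketch below uses made-up support vectors, coefficients, and gamma. In a real model these would be read from the trained SVM, and the test point would first be scaled exactly like the training data.

```python
import numpy as np

def rbf_kernel_vector(y, X, gamma):
    """H_i = K(y, x_i) = exp(-gamma * ||y - x_i||^2) for every row x_i of X."""
    sq_dists = np.sum((X - y) ** 2, axis=1)
    return np.exp(-gamma * sq_dists)

def decision_value(y, support_vectors, dual_coef, intercept, gamma):
    """f(y) = sum_i dual_coef_i * K(y, sv_i) + intercept."""
    H = rbf_kernel_vector(y, support_vectors, gamma)
    return float(np.dot(dual_coef, H) + intercept)

# Hypothetical trained model: two support vectors, one per class.
support_vectors = np.array([[0.0, 0.0], [2.0, 0.0]])
dual_coef = np.array([-1.0, 1.0])   # alpha_i * label_i for each support vector
intercept = 0.0
gamma = 0.5

args = (support_vectors, dual_coef, intercept, gamma)
f_near_first = decision_value(np.array([0.1, 0.0]), *args)
f_near_second = decision_value(np.array([1.9, 0.0]), *args)
f_middle = decision_value(np.array([1.0, 0.0]), *args)
```

The sign of the decision value picks the class: negative near the first support vector, positive near the second, and zero at the midpoint for this symmetric toy model.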

Libsvm: SVM normalizing starts from 0 or 0.001

I am using libsvm for my document classification.
I use svm.h and svm.cc only in my project.
Its struct svm_problem requires an array of svm_node entries that are non-zero, i.e. it uses a sparse representation.
I get a vector of tf-idf values for words, let's say in the range [5,10]. If I normalize it to [0,1], all the 5's would become 0.
Should I remove these zeroes when sending it to svm_train?
Would removing them not reduce the information and lead to poor results?
Should I start the normalization from 0.001 rather than 0?
And in general, does normalizing to [0,1] in SVM not reduce information?
SVM is not Naive Bayes: feature values are not counters but dimensions in a multidimensional real-valued space, and 0's carry exactly the same amount of information as 1's (which also answers your concern regarding removing 0 values - don't do it). There is no reason to ever normalize data to [0.001, 1] for the SVM.
The only issue here is that column-wise normalization is not a good idea for tf-idf, as it will degenerate your features to the tf (for a particular i'th dimension, tf-idf is simply the tf value in [0,1] multiplied by a constant idf, and normalization will multiply by idf^-1). I would consider one of the alternative preprocessing methods:
normalizing each dimension so that it has mean 0 and variance 1
decorrelation by computing x = C^(-1/2) * x, where C is the data covariance matrix
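Both alternatives can be sketched with NumPy on a small dense matrix. The data are random stand-ins for tf-idf features. Centering the data before applying C^(-1/2) is an assumption added here, and the sketch also assumes the estimated covariance matrix is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 2 correlated features.
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(200, 2)) @ A.T

# Option 1: each dimension gets mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Option 2: decorrelation, x -> C^(-1/2) x, applied to centered data.
Xc = X - X.mean(axis=0)
C = np.cov(Xc, rowvar=False)

# C is symmetric positive definite, so C^(-1/2) follows from its eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(C)
C_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T

# Rows are samples; C_inv_sqrt is symmetric, so no transpose is needed.
X_white = Xc @ C_inv_sqrt

# Covariance of the decorrelated data (should be the identity matrix).
C_white = np.cov(X_white, rowvar=False)
```

As with any other scaler, the mean and C^(-1/2) have to be computed on the training set, stored, and then applied unchanged to test data.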
