Unlike other algorithms such as linear regression, KNN doesn't seem to perform any calculation in the training phase. Linear regression, for instance, finds its coefficients during training. But what about KNN?
During the training phase, KNN arranges the data (a sort of indexing process) so that the closest neighbours can be found efficiently during the inference phase. Otherwise it would have to compare each new case against the whole dataset at inference time, which would be quite inefficient.
You can read more about it at: https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
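As a hedged illustration (scikit-learn's KNeighborsClassifier is a reasonable stand-in here; the index it actually builds depends on the data and the algorithm parameter), you can request a tree-based index explicitly:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# "Training" mostly means building a search structure (here a KD-tree)
# so that neighbour queries at prediction time beat brute-force comparison.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X, y)

# At inference time the tree is queried for the 5 closest stored points.
print(knn.predict(X[:3]))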
KNN belongs to the group of lazy learners. As opposed to eager learners such as logistic regression, SVMs, and neural nets, lazy learners just store the training data in memory. Then, during inference, KNN finds the K nearest neighbours from the training data in order to classify the new instance.
KNN is an instance-based method, which relies entirely on the training examples; in other words, it memorizes all of them. So in the case of classification, whenever a new example appears, it computes the Euclidean distance between the input example and all the training examples, and returns the label of the closest training example.
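A minimal NumPy sketch of that 1-nearest-neighbour idea (toy data, purely illustrative):

import numpy as np

def nn_classify(x, X_train, y_train):
    # Euclidean distance from the query point to every stored training example.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Return the label of the closest training example.
    return y_train[np.argmin(distances)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nn_classify(np.array([4.0, 4.5]), X_train, y_train))  # -> 1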
KNN is a lazy learner. That means that, unlike other algorithms which learn in their training phase (linear regression etc.), KNN does not really learn anything during training. It just stores the data points in RAM at training time.
"Like in the case of linear regression it finds the coefficients in the training phase. But what about KNN?" --> In the case of KNN, the parameters are tuned at evaluation time: that is when it finds the optimal values of hyperparameters such as K and the distance-calculation technique.
Unlike other algorithms, which learn in the training phase and get tested in the testing phase, KNN has its parameters chosen and validated (e.g. via K-fold CV) at evaluation time.
Distance calculation -> https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms
KNN Python docs -> https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
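For example, a sketch of picking K and the distance metric by K-fold cross-validation with scikit-learn's GridSearchCV (the grid values here are arbitrary placeholders):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters (K, distance metric) are selected by cross-validation,
# not learned from the data the way regression coefficients are.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "metric": ["euclidean", "manhattan"]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)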
I'm following the Machine Learning course on Coursera and I just had a question.
(Picture: multiple classifiers combined to make an XOR classifier)
In this picture we can see that, in order to make an XOR classifier, we build smaller classifiers which are trained on linearly separable gates.
So each classifier has a defined job (for example AND, OR, etc.) and the network must be trained for that task.
But in a bigger neural net it's impossible to define a task for each neuron (or classifier).
So my question is: is this the task of the back-propagation algorithm (in addition to the fact that it is used to update the weights)?
If someone is wondering the same thing, yes it is the case.
The backprop algorithm effectively breaks the problem into smaller, linearly solvable sub-problems, one per neuron (or classifier).
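As a hedged sketch of that claim, a tiny multi-layer perceptron trained with back-propagation can learn XOR without anyone assigning an AND/OR role to any hidden neuron (the hyperparameters below are arbitrary, and very small nets can still get stuck in local minima):

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR truth table

# A few hidden neurons; back-propagation decides what each one ends up doing.
clf = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=0)
clf.fit(X, y)
print(clf.predict(X))  # hopefully [0, 1, 1, 0]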
I have a cyclic method running which collects a data set of 15,000 feature vectors with 30 dimensions (every 200 ms). My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The classification results are rather unconvincing, and the process is costly in terms of time. I know that the dataset isn't that big, so real-time classification should be possible with the right subsampling of the feature vectors. The goal is to speed up the entire classification process (training/prediction) to a few milliseconds. To obtain an unsupervised classification approach, I currently run k-means to label the feature vectors. I pick a few cluster results and assign them class 1, and all others class 0.
The idea is now the following:
collect all 15,000 (N) feature vectors with 30 (D) dimensions
run PCA on all N feature vectors
use the eigenvalues to determine a feature vector with (d) dimensions (d < D)
feed the new set of (n < N) feature vectors (or: the eigenvectors?) to train the SVM
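A rough sketch of that pipeline with scikit-learn (d = 10 is a placeholder, the data here is random just to show the shapes, and N is kept small so the sketch runs quickly):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N, D, d = 2000, 30, 10               # in reality N = 15,000; d < D chosen from the eigenvalues
X = np.random.rand(N, D)             # placeholder for the real feature vectors
y = np.random.randint(0, 2, N)       # placeholder for the k-means derived labels

# Scale, project onto the first d principal components, then train the SVM
# on the reduced vectors (not on the eigenvectors themselves).
clf = make_pipeline(StandardScaler(), PCA(n_components=d), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.predict(X[:5]))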
Maybe a KNN approach instead of the SVM would give similar results?
Does this approach make sense?
Any ideas to improve the process or change it in order to speed it up?
How do I determine the best value of d?
The classification accuracy shouldn't suffer too much from the time reduction.
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
Thanks!
I built a classifier with 13 features (no binary ones), normalized individually for each sample using the scikit-learn Normalizer (Normalizer().transform).
When I make predictions, it predicts all training samples as positive and all test samples as negative (irrespective of whether they are actually positive or negative).
What anomalies should I focus on: my classifier, the features, or the data?
Notes: 1) I normalize the test and training sets (individually for each sample) separately.
2) I tried cross-validation, but the performance is the same.
3) I used both linear and RBF SVM kernels.
4) I also tried without normalizing, but got the same poor results.
5) I have the same number of positive and negative training samples (400 each), and 34 positive and 1000+ negative test samples.
If you're training on balanced data the fact that "it predicts all training sets as positive" is probably enough to conclude that something has gone wrong.
Try building something very simple (e.g. a linear SVM with one or two features) and look at the model as well as a visualization of your training data; follow the scikit-learn example: http://scikit-learn.org/stable/auto_examples/svm/plot_iris.html
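For instance, a bare-bones version of that sanity check (two iris features as a stand-in for your own data):

import matplotlib.pyplot as plt
from sklearn import datasets, svm

# Two features only, so the model and the data can be inspected visually.
X, y = datasets.load_iris(return_X_y=True)
X = X[:, :2]

clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)      # look at the model itself

plt.scatter(X[:, 0], X[:, 1], c=y)    # and at the training data
plt.xlabel("sepal length")
plt.ylabel("sepal width")
plt.show()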
There's also a possibility that your input data has many large outliers impacting the transform process...
Try doing feature selection on the training data (separately from your test/validation data).
Feature selection on your whole dataset can easily lead to overfitting.
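One hedged way to keep the selection inside the training data is to put it in a Pipeline, so it is re-fit on each training fold during cross-validation (SelectKBest and k=5 are arbitrary choices here):

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, n_features=13, random_state=0)

# The selector only ever sees the training portion of each CV split,
# so no information from the held-out data leaks into feature selection.
pipe = make_pipeline(Normalizer(), SelectKBest(f_classif, k=5), SVC(kernel="linear"))
print(cross_val_score(pipe, X, y, cv=5).mean())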
We all know that the objective function of an SVM is optimized iteratively. In order to continue training on the same dataset, we can at least store all the variables used in the iterations.
But if we want to train on a slightly different dataset, what should we do to make full use of the previously trained model? Or does this idea even make sense? I think it is quite reasonable for a k-means model, but I am not sure whether it still makes sense for the SVM problem.
There is some literature on this topic:
Alpha-seeding, in which the training data is divided into chunks. After you train an SVM on the ith chunk, you take the resulting alpha values and use them to seed training on the (i+1)th chunk.
Incremental SVM works as an online learner: you update the classifier with new examples rather than retraining on the entire data set.
The SVM heavy package provides online SVM training as well.
What you are describing is what an online learning algorithm does; unfortunately, the classic formulation of the SVM is a batch one.
However, there are several SVM solvers that produce a quasi-optimal hypothesis for the underlying optimization problem in an online-learning fashion. In particular, my favourite is Pegasos-SVM, which can find a good near-optimal solution in linear time:
http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf
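If you only need the behaviour rather than Pegasos itself, a hedged stand-in is a linear SVM trained by stochastic gradient descent on the hinge loss, updated incrementally with partial_fit (the data below is a placeholder stream):

import numpy as np
from sklearn.linear_model import SGDClassifier

# Hinge loss + SGD gives a linear SVM that can be updated online,
# in the same spirit as Pegasos (this is not the Pegasos code itself).
clf = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0)
classes = np.array([0, 1])

for _ in range(10):                       # stream of small batches
    X_batch = np.random.rand(100, 5)      # placeholder data
    y_batch = (X_batch[:, 0] > 0.5).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(np.random.rand(3, 5)))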
In general this doesn't make sense. SVM training is an optimization process with regard to every training set vector. Each training vector has an associated coefficient, which as a result is either 0 (irrelevant) or > 0 (support vector). Adding another training vector imposes another, different, optimization problem.
The only way to reuse information from previous training I can think of is to choose support vectors from the previous training and add them to the new training set. I'm not sure, but this probably will negatively affect generalization - VC dimension of an SVM is related to the number of support vectors, so adding previous support vectors to the new dataset is likely to increase the support vector count.
Apparently, there are more possibilities, as noted in lennon310's answer.
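A hedged sketch of that idea, using the support_ / support_vectors_ attributes of a fitted SVC (whether this actually helps generalization is exactly the concern raised above):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X_old, y_old = make_classification(n_samples=500, random_state=0)
X_new, y_new = make_classification(n_samples=500, random_state=1)

old_model = SVC(kernel="rbf").fit(X_old, y_old)

# Carry only the old support vectors (and their labels) into the new training set.
sv = old_model.support_vectors_
sv_labels = y_old[old_model.support_]
X_combined = np.vstack([X_new, sv])
y_combined = np.concatenate([y_new, sv_labels])

new_model = SVC(kernel="rbf").fit(X_combined, y_combined)
print(len(old_model.support_), len(new_model.support_))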
I have to decide between SVM and neural networks for an image processing application. The classifier must be fast enough for a near-real-time application, and accuracy is important too. Since this is a medical application, it is important that the classifier has a low failure rate.
Which one is the better choice?
A couple of provisos:
Performance of an ML classifier can refer to either (i) performance of the classifier itself, or (ii) performance of the predicate step, i.e. the execution speed of the model-building algorithm. Particularly in this case, the answer is quite different depending on which of the two is intended in the OP, so I'll answer each separately.
Second, by neural network I'll assume you're referring to the most common implementation, i.e., a feed-forward, back-propagating, single-hidden-layer perceptron.
Training Time (execution speed of the model builder)
For SVM compared to NN: SVMs are much slower. There is a straightforward reason for this: SVM training requires solving the associated Lagrangian dual (rather than primal) problem. This is a quadratic optimization problem in which the number of variables is very large--i.e., equal to the number of training instances (the 'length' of your data matrix).
In practice, two factors, if present in your scenario, could nullify this training-speed advantage of NNs:
NN training is trivial to parallelize (via map reduce); parallelizing SVM training is not trivial, but it's also not impossible--within the past eight or so years, several implementations have been published and proven to work (https://bibliographie.uni-tuebingen.de/xmlui/bitstream/handle/10900/49015/pdf/tech_21.pdf)
Multi-class classification: SVMs are two-class classifiers. They can be adapted for multi-class problems, but this is never straightforward because SVMs use direct decision functions. (An excellent source for modifying SVMs for multi-class problems is S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.) This modification could wipe out any performance advantage SVMs have over NNs. For instance, if your data has more than two classes and you choose to configure the SVM using successive classification (aka one-against-many classification), the data is fed to a first SVM classifier which classifies each point as either class I or other; if the class is other, the point is fed to a second classifier which classifies it as class II or other, and so on.
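For reference, the usual library-level way to bolt a multi-class scheme onto a two-class SVM is one-vs-rest, e.g. with scikit-learn (the successive cascade described above would have to be coded by hand):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # three classes

# One binary SVM per class; each answers "this class vs. everything else".
clf = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
print(len(clf.estimators_), clf.predict(X[:5]))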
Prediction Performance (execution speed of the model)
Prediction performance of an SVM is substantially higher than that of an NN. For a three-layer (one hidden layer) NN, prediction requires successive multiplication of an input vector by two 2D matrices (the weight matrices). For an SVM, classification involves determining on which side of the decision boundary a given point lies, which for a linear kernel is essentially a single dot product.
Prediction Accuracy
By "failure rate" i assume you mean error rate rather than failure of the classifier in production use. If the latter, then there is very little if any difference between SVM and NN--both models are generally numerically stable.
Comparing prediction accuracy of the two models, and assuming both are competently configured and trained, the SVM will outperform the NN.
The superior resolution of SVM versus NN is well documented in the scientific literature. It is true that such a comparison depends on the data, the configuration, and parameter choice of the two models. In fact, this comparison has been so widely studied--over perhaps all conceivable parameter space--and the results so consistent, that even the existence of a few exceptions (though i'm not aware of any) under impractical circumstances shouldn't interfere with the conclusion that SVMs outperform NNs.
Why does SVM outperform NN?
These two models are based on fundamentally different learning strategies.
In NN, network weights (the NN's fitting parameters, adjusted during training) are adjusted such that the sum-of-square error between the network output and the actual value (target) is minimized.
Training an SVM, by contrast, means an explicit determination of the decision boundaries directly from the training data. This is of course required as the predicate step to the optimization problem required to build an SVM model: minimizing the aggregate distance between the maximum-margin hyperplane and the support vectors.
In practice, though, it is harder to configure an SVM for training. The reason is the large (compared to NN) number of parameters required for configuration:
choice of kernel
selection of kernel parameters
selection of the value of the margin parameter
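Those three choices are typically searched jointly, e.g. with a cross-validated grid search; the grid below is just a common starting point, not a recommendation:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Kernel, kernel parameter (gamma) and margin parameter (C) searched together.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)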