I understand that k-NN suffers from the "curse of dimensionality" on high-dimensional data. The usual justification is that the distance (e.g. Euclidean distance) is computed over all features, so unimportant features act as noise and bias the results. However, there are a few things I don't understand:
1) How is the cosine distance metric affected by the curse of dimensionality? We define cosine distance as cosDistance = 1 - cosSimilarity, and cosine similarity is said to be well suited to high-dimensional data, so how can cosine distance be affected by the curse of dimensionality?
2) Can we assign weights to features in Weka, or can I apply feature selection locally to k-NN? By "local to k-NN" I mean writing my own k-NN class that, at classification time, first converts the training instances to a lower dimension and then computes the test instance's neighbors.
Cosine does not fundamentally differ from Euclidean distance.
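In fact, it is easy to show that on data normalized to Euclidean length 1, cosine and Euclidean distance are equivalent: for unit vectors, the squared Euclidean distance equals 2 * (1 - cosSimilarity), so both produce the same neighbor ranking. In other words, cosine amounts to computing the Euclidean distance on L2-normalized vectors.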
Thus, cosine is not more robust to the curse of dimensionality than Euclidean distance. However, cosine is popular with e.g. text data that has a high apparent dimensionality - often thousands of dimensions - but the intrinsic dimensionality must be much lower. Plus, it's mostly used for ranking; the actual distance value is ignored.
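The relationship between the two distances can be checked numerically. The following is a minimal sketch in plain Python; the vectors a and b and all function names are made up for illustration:

```python
import math

def l2_normalize(v):
    """Scale v to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical example vectors.
a = [3.0, 1.0, 0.0, 2.0]
b = [1.0, 0.0, 4.0, 1.0]

cos_dist = 1.0 - cosine_similarity(a, b)
sq_euclid = euclidean_distance(l2_normalize(a), l2_normalize(b)) ** 2

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b) = 2 * cosine distance.
print(sq_euclid, 2.0 * cos_dist)
```

Because one quantity is a monotone function of the other, a nearest-neighbor ranking by cosine distance matches the ranking by Euclidean distance on the normalized vectors.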
I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an nxp matrix X. I perform mean normalization and I get the normalized matrix B. I calculate the eigenvalues and eigenvectors of the pxp covariance matrix C=(1/(n-1))B*.B where * denotes the conjugate transpose.
The eigenvectors corresponding to the descendingly ordered eigenvalues are in a pxp matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new=B.E_reduced where E_reduced is produced by choosing the first k columns of E. Here are my questions:
1) Should it be X_new=B.E_reduced or X_new=X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If testing phase is similar to training phase, then no speed-up is gained because I have to calculate all the p features for each instance in the testing phase and PCA makes the algorithm slower because of eigenvector calculation overhead.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k=p/2) or the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose the number k? I read that I can find the ratio of summation of k eigenvalues over the summation of all eigenvalues and make a decision based on this ratio.
You usually apply the multiplication to the centered data, i.e. X_new = B.E_reduced, so your projected data is also centered.
Never re-run PCA during testing. Only use it on the training data, and keep the shift vector and projection matrix. You need to apply exactly the same projection as during training, not recompute a new projection.
Decreased performance can have many reasons. E.g. did you also apply scaling using the roots of the eigenvalues? And what method did you use in the first place?
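A rough sketch of this fit-once, apply-everywhere pattern, assuming NumPy is available. The function names fit_pca and apply_pca and the random data are made up for illustration; the variable names B, C and E_reduced follow the question:

```python
import numpy as np

def fit_pca(X_train, k):
    """Learn the shift vector and projection matrix from training data only."""
    mean = X_train.mean(axis=0)                 # shift vector
    B = X_train - mean                          # centered data
    C = B.T @ B / (X_train.shape[0] - 1)        # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return mean, eigvecs[:, :k], eigvals

def apply_pca(X, mean, E_reduced):
    """Project any data (training or test) with the stored shift and projection."""
    return (X - mean) @ E_reduced

# Hypothetical data: 6 features with very different variances.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.2, 0.1])
X_train = rng.normal(size=(200, 6)) * scales
X_test = rng.normal(size=(20, 6)) * scales

mean, E_reduced, eigvals = fit_pca(X_train, k=3)
Z_train = apply_pca(X_train, mean, E_reduced)
Z_test = apply_pca(X_test, mean, E_reduced)     # no eigen-decomposition at test time

# Cumulative ratio of eigenvalues, one common heuristic for choosing k.
explained = np.cumsum(eigvals) / eigvals.sum()
print(explained)
```

At test time the only cost is one subtraction and one matrix product per batch, so the eigenvector computation is paid once during training.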
As you probably know, in K-NN, the decision is usually taken according to the "majority vote", and not according to some threshold - i.e. there is no parameter to base a ROC curve on.
Note that in the implementation of my K-NN classifier, the votes don't have equal weights. I.e. the weight for each "neighbor" is e^(-d), where d is the distance between the tested sample and the neighbor. This measure gives higher weights for the votes of the nearer neighbors among the K neighbors.
My current decision rule is that if the sum of the scores of the positive neighbors is higher than the sum of the scores of the negative samples, then my classifier says POSITIVE, else, it says NEGATIVE.
But - There is no threshold.
Then, I thought about the following idea:
Deciding on the class of the samples which has a higher sum of votes, could be more generally described as using the threshold 0, for the score computed by: (POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
So I thought of changing my decision rule to use a threshold on that measure, and plotting a ROC curve based on thresholds on the values of
(POS_NEIGHBORS_SUMMED_SCORES - NEG_NEIGHBORS_SUMMED_SCORES)
Does it sound like a good approach for this task?
Yes, it is more or less what is typically used. If you take a look at scikit-learn, its k-NN supports weights, and it also has predict_proba, which gives you a clear decision threshold. Typically, however, you do not want to threshold a difference but rather a ratio:
votes positive / (votes negative + votes positive) > T
This way, you know that you just have to "move" the threshold from 0 to 1 rather than over arbitrary values. It also has a clear interpretation: an internal probability estimate that you consider "sure enough". By default T = 0.5, i.e. if the probability is above 50% you classify as positive, but as said before, you can do anything with it.
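As a minimal sketch of that ratio with the e^(-d) vote weights from the question: the function names, the neighbor lists and the true labels below are all made up for illustration, and this is plain Python rather than the scikit-learn API.

```python
import math

def knn_positive_score(distances, labels):
    """Ratio score for one test sample from its K nearest neighbors.

    distances: distance to each neighbor; labels: 1 = positive, 0 = negative.
    Each neighbor votes with weight e^(-d).
    """
    pos = sum(math.exp(-d) for d, y in zip(distances, labels) if y == 1)
    neg = sum(math.exp(-d) for d, y in zip(distances, labels) if y == 0)
    return pos / (pos + neg)

def roc_points(scores, truths, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold."""
    n_pos = sum(truths)
    n_neg = len(truths) - n_pos
    points = []
    for t in thresholds:
        preds = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(preds, truths) if p and y == 1)
        fp = sum(1 for p, y in zip(preds, truths) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical neighbors (distances, labels) for three test samples.
neighbors = [
    ([0.1, 0.4, 0.9], [1, 1, 0]),
    ([0.2, 0.3, 0.5], [0, 0, 1]),
    ([0.5, 0.5, 0.5], [1, 0, 0]),
]
truths = [1, 0, 0]  # hypothetical true classes of the three test samples

scores = [knn_positive_score(d, y) for d, y in neighbors]
thresholds = [i / 10 for i in range(11)]  # sweep T from 0 to 1
roc = roc_points(scores, truths, thresholds)
print(roc)
```

Sweeping T over [0, 1] yields the points of the ROC curve; T = 0.5 reproduces the original "larger weighted sum wins" rule, up to ties.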
I have several Laplacian spectra of graphs (networks) to compare, and I'm looking for a distance measure between each pair of spectra.
I have tried the Hellinger distance and the Euclidean distance (the n-dimensional Euclidean distance between two vectors of eigenvalues, where n is the number of bins). They give me different distances, obviously, but also different rankings, which leads me to think that the choice of method matters a lot for the quality of the results.
I also saw that distances between histograms can be measured with other methods (correlation, chi-square, intersection, earth mover's distance, etc.)
Is there an accepted distance measure for this kind of histogram comparison problem, where relatively small features are quite important (like the peaks around eigenvalue 0.3) ?
The spectra:
1) I am using the following to measure the cosine distance between two vectors.
Let's assume we have two vectors, e.g. vector A and vector B:
cosine distance between A & B = (dot(A, B) / (Magnitude (A) * Magnitude (B)))
Is this formula right? If not, could you kindly suggest the right formula?
2) Is k-NN always more accurate than Rocchio, or are there situations where Rocchio performs better than k-NN? k-NN looks like an enhancement of Rocchio, and theory suggests that k-NN should perform much better than Rocchio, but in my practical implementation I am finding the opposite: Rocchio performs much better than k-NN.
(1) Cosine is one of several similarity measures; others include the Euclidean distance or weighted Euclidean distance. Your implementation is correct as a cosine similarity; the corresponding cosine distance is 1 minus that value.
(2) The main difference between k-NN and Rocchio is that there is no training in the former, whereas in the latter prototype vectors are generated during training. At test time, k-NN uses all the training instances, but Rocchio uses only the prototype vectors (usually one vector per class). So Rocchio is more efficient in both training and testing. However, there is not enough theoretical justification for Rocchio's stability and robustness, and it has been shown that Rocchio does not work well if the categories are not linearly separable.
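To make the difference concrete, here is a minimal sketch that assumes the simplest form of Rocchio (one centroid per class, cosine similarity, no negative-example term). The toy data and all function names are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def train_rocchio(X, y):
    """Training step: build one prototype (centroid) per class."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict_rocchio(prototypes, x):
    """Test step: compare x against the prototypes only."""
    return max(prototypes, key=lambda label: cosine_similarity(prototypes[label], x))

def predict_knn(X, y, x, k):
    """No training step: compare x against every training instance."""
    ranked = sorted(zip(X, y), key=lambda pair: cosine_similarity(pair[0], x), reverse=True)
    votes = {}
    for _, label in ranked[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical toy data with two classes.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = ["a", "a", "b", "b"]
query = [0.8, 0.2]

prototypes = train_rocchio(X, y)
print(predict_rocchio(prototypes, query), predict_knn(X, y, query, k=3))
```

Rocchio needs one similarity computation per class at test time, while k-NN needs one per training instance, which is where the efficiency difference comes from.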
In the original HOG (Histogram of Oriented Gradients) paper, http://lear.inrialpes.fr/people/triggs/pubs/Dalal-cvpr05.pdf, there are some images that show the HOG representation of an image (Figure 6). In this figure, the caption for parts f and g says "HOG descriptor weighted by respectively the positive and the negative SVM weights".
I don't understand what this means. I understand that when I train an SVM, I get a weight vector, and to classify, I have to use the features (HOG descriptors) as the input of the function. So what do they mean by positive and negative weights? And how would I plot them like in the paper?
Thanks in Advance.
The weights tell you how significant a specific element of the feature vector is for a given class. That means that if you see a high value in your feature vector, you can look up the corresponding weight:
If the weight is a high positive number, it's more likely that your object is of the class
If the weight is a high negative number, it's more likely that your object is NOT of the class
If the weight is close to zero, this position is mostly irrelevant for the classification
Now you use those weights to scale your feature vector, where the length of the gradients is mapped to the color intensity. Because you can't display negative color intensities, they decided to split the visualization into a positive and a negative part. In the visualizations you can now see which parts of the input image contribute to the class (positive) and which don't (negative).
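A minimal sketch of that split, assuming a linear SVM whose weight vector has one entry per HOG element. The function name and the toy values are made up, and the paper's actual rendering of the figure may differ in its details:

```python
def split_weighted_descriptor(hog, weights):
    """Weight a HOG descriptor by the positive and the negative SVM weights separately.

    Both outputs are non-negative, so each can be drawn as an intensity image.
    """
    positive = [h * max(w, 0.0) for h, w in zip(hog, weights)]
    negative = [h * max(-w, 0.0) for h, w in zip(hog, weights)]
    return positive, negative

# Hypothetical 4-element descriptor and weight vector.
hog = [0.5, 0.2, 0.8, 0.1]
weights = [1.5, -2.0, 0.0, 0.3]

positive, negative = split_weighted_descriptor(hog, weights)
print(positive)  # elements that speak for the class
print(negative)  # elements that speak against the class
```

Each of the two lists can then be drawn with the same routine used for the unweighted descriptor, which gives two separate images in the spirit of parts f and g of the figure.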