Is it possible to use KDTree with cosine similarity? - machine-learning

It looks like I can't use this similarity metric with sklearn's KDTree, for example, but I need it because I am measuring similarity between word vectors. What is a fast, robust alternative for this case? I know about Locality Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.

The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points to unit length first. So you can use a KD-tree to find the k nearest neighbors, but you will need to recompute the actual cosine similarity afterwards if you need its value.
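A minimal sketch of that idea (the data, shapes, and variable names here are illustrative, not from the answer): normalize the vectors, query an sklearn KDTree under Euclidean distance, and recover cosine similarity from the returned distances, since for unit vectors ||x - y||^2 = 2 - 2*cos(x, y).

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))                       # hypothetical word vectors
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto the unit sphere

tree = KDTree(X_unit)                                  # Euclidean metric by default
dist, idx = tree.query(X_unit[:1], k=5)                # nearest neighbors of the first vector

cosine_sim = 1 - dist**2 / 2                           # valid because all vectors are unit length
print(idx, cosine_sim)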
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. If done, you can then use other structures like Ball Trees to do accelerated nn with cosine similarity directly. I've implemented this in the JSAT library, if you were interested in a Java implementation.

According to the table at the end of this page, cosine support with the k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; and the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work.
Indexing support in sklearn is often not very complete (albeit improving), unfortunately, so don't take that as a reference.
While the k-d-tree can theoretically support cosine, either by
- transforming the data such that cosine becomes Euclidean distance, or
- working with the bounding boxes and the minimum angle to the bounding box (which appears to be what ELKI is doing for the R-tree),
you should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree only ever looks at one dimension per split. If you want all d dimensions to be used at least once, you need O(2^d) data points; for high d, there is no way all attributes even get used.
The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem.
So in essence, don't use either for high dimensional data.
But also don't assume that cosine magically improves your results, in particular for high-dimensional data. It's very much overrated. As the transformation above indicates, there cannot be a systematic benefit of cosine over Euclidean: cosine is a special case of Euclidean (Euclidean on length-normalized vectors).
For sparse data, inverted lists (cf. Lucene, Xapian, Solr, ...) are the way to index for cosine.
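To illustrate the inverted-list idea, here is a toy sketch (pure Python, with made-up documents; a real system would use Lucene, Solr, or Xapian): only documents that share at least one non-zero term with the query are ever scored.

from collections import defaultdict
import math

docs = {0: {"cat": 1.0, "dog": 2.0}, 1: {"dog": 1.0, "fish": 3.0}, 2: {"bird": 1.0}}

# build the inverted index: term -> list of (doc_id, weight)
index = defaultdict(list)
for doc_id, vec in docs.items():
    for term, w in vec.items():
        index[term].append((doc_id, w))
norms = {d: math.sqrt(sum(w * w for w in vec.values())) for d, vec in docs.items()}

def cosine_search(query):
    scores = defaultdict(float)
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    for term, qw in query.items():
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw
    return sorted(((s / (q_norm * norms[d]), d) for d, s in scores.items()), reverse=True)

print(cosine_search({"dog": 1.0}))   # ranks doc 0 above doc 1; doc 2 is never touched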

Related

What “information” in document vectors makes sentiment prediction work?

Sentiment prediction based on document vectors works pretty well, as examples show:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
http://linanqiu.github.io/2015/10/07/word2vec-sentiment/
I wonder what pattern in the vectors makes that possible. I thought it should be the similarity of vectors that somehow makes it possible. Gensim's similarity measures rely on cosine similarity. Therefore, I tried the following:
Randomly initialise a fixed “compare” vector, compute the cosine similarity of the “compare” vector with all other vectors in the training and test sets, use those similarities and the labels of the training set to fit a logistic regression model, and evaluate the model on the test set.
It looks like this, where train/test_arrays contain document vectors and train/test_labels contain labels that are either 0 or 1. (Note: the document vectors are obtained from gensim's doc2vec and are well trained, predicting the test set 80% right if used directly as input for the logistic regression):
import numpy
import scipy.spatial.distance
from sklearn.linear_model import LogisticRegression

fix_vec = numpy.random.rand(100)   # 1-D, so scipy's cosine() accepts it

def cos_distance_to_fix(x):
    return scipy.spatial.distance.cosine(fix_vec, x)

# collapse each document vector to a single feature: its cosine distance to fix_vec
train_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=train_arrays), newshape=(-1, 1))
test_arrays_cos = numpy.reshape(numpy.apply_along_axis(cos_distance_to_fix, axis=1, arr=test_arrays), newshape=(-1, 1))

classifier = LogisticRegression()
classifier.fit(train_arrays_cos, train_labels)
classifier.score(test_arrays_cos, test_labels)
It turns out that this approach does not work, predicting the test set only about 50% right.
So, my question is: what “information” is in the vectors that makes prediction based on them work, if it is not the similarity of the vectors? Or is my approach simply unable to capture the similarity of vectors correctly?
This is less a question about Doc2Vec than about machine-learning principles with high-dimensional data.
Your approach is collapsing 100-dimensions to a single dimension – the distance to your random point. Then, you're hoping that single dimension can still be predictive.
And roughly all LogisticRegression can do with that single-valued input is try to pick a threshold-number that, when your distance is on one side of that threshold, predicts a class – and on the other side, predicts not-that-class.
Recasting that single-threshold-distance back to the original 100-dimensional space, it's essentially trying to find a hypersphere, around your random point, that does a good job collecting all of a single class either inside or outside its volume.
What are the odds that your randomly-placed center-point, plus one adjustable radius, can do that well in a complex high-dimensional space? My hunch is: not a lot. And your results, no better than random guessing, seem to suggest the same.
The LogisticRegression with access to the full 100-dimensions finds a discriminating-frontier for assigning the class that's described by 100 coefficients and one intercept-value – and all of those 101 values (free parameters) can be adjusted to improve its classification performance.
In comparison, your alternative LogisticRegression with access to only the one 'distance-from-a-random-point' dimension can pick just one coefficient (for the distance) and an intercept/bias. It's got 1/100th as much information to work with, and only 2 free parameters to adjust.
As an analogy, consider a much simpler space: the surface of the Earth. Pick a 'random' point, like say the South Pole. If I then tell you that you are in an unknown place 8900 miles from the South Pole, can you answer whether you are more likely in the USA or China? Hardly – both of those 'classes' of location have lots of instances 8900 miles from the South Pole.
Only in the extremes will the distance tell you for sure which class (country) you're in – because there are parts of the USA's Alaska and Hawaii further north and south than parts of China. But even there, you can't manage well with just a single threshold: you'd need a rule which says, "less than X or greater than Y, in USA; otherwise unknown".
The 100-dimensional space of Doc2Vec vectors (or other rich data sources) will often only be sensibly divided by far more complicated rules. And, our intuitions about distances and volumes based on 2- or 3-dimensional spaces will often lead us astray, in high dimensions.
Still, the Earth analogy does suggest a way forward: there are some reference points on the globe that will work way better, when you know the distance to them, at deciding if you're in the USA or China. In particular, a point at the center of the US, or at the center of China, would work really well.
Similarly, you may get somewhat better classification accuracy if rather than a random fix_vec, you pick either (a) any point for which a class is already known; or (b) some average of all known points of one class. In either case, your fix_vec is then likely to be "in a neighborhood" of similar examples, rather than some random spot (that has no more essential relationship to your classes than the South Pole has to northern-Hemisphere temperate-zone countries).
(Also: alternatively picking N multiple random points, and then feeding the N distances to your regression, will preserve more of the information/shape of the original Doc2Vec data, and thus give the classifier a better chance of finding a useful separating-threshold. Two would likely do better than your one distance, and 100 might approach or surpass the 100 original dimensions.)
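A hedged sketch of those two suggestions (variable names follow the question's code and are assumed to be numpy arrays; nothing here is from the original answer): use the per-class mean vectors, or several random reference points, and feed all the resulting distances to the regression.

import numpy
from scipy.spatial.distance import cdist
from sklearn.linear_model import LogisticRegression

# (a) one reference point per class: the mean of the known training examples
train_labels = numpy.asarray(train_labels)
centroids = numpy.stack([train_arrays[train_labels == c].mean(axis=0) for c in (0, 1)])
train_feats = cdist(train_arrays, centroids, metric='cosine')   # shape (N, 2)
test_feats = cdist(test_arrays, centroids, metric='cosine')
clf = LogisticRegression().fit(train_feats, train_labels)
print(clf.score(test_feats, numpy.asarray(test_labels)))

# (b) N random reference points preserve more of the original geometry than one
n_refs = 20
refs = numpy.random.rand(n_refs, train_arrays.shape[1])
clf2 = LogisticRegression().fit(cdist(train_arrays, refs, metric='cosine'), train_labels)
print(clf2.score(cdist(test_arrays, refs, metric='cosine'), numpy.asarray(test_labels)))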
Finally, some comment about the Doc2Vec aspect:
Doc2Vec optimizes vectors that are somewhat-good, within their constrained model, at predicting the words of a text. Positive-sentiment words tend to occur together, as do negative-sentiment words, and so the trained doc-vectors tend to arrange themselves in similar positions when they need to predict similar-meaning-words. So there are likely to be 'neighborhoods' of the doc-vector space that correlate well with predominantly positive-sentiment or negative-sentiment words, and thus positive or negative sentiments.
These won't necessarily be two giant neighborhoods, 'positive' and 'negative', separated by a simple boundary –or even a small number of neighborhoods matching our ideas of 3-D solid volumes. And many subtleties of communication – such as sarcasm, referencing a not-held opinion to critique it, spending more time on negative aspects but ultimately concluding positive, etc – mean incursions of alternate-sentiment words into texts. A fully-language-comprehending human agent could understand these to conclude the 'true' sentiment, while these word-occurrence based methods will still be confused.
But with an adequate model, and the right number of free parameters, a classifier might capture some generalizable insight about the high-dimensional space. In that case, you can achieve reasonably-good predictions, using the Doc2Vec dimensions – as you've seen with the ~80%+ results on the full 100-dimensional vectors.

Fuzzy clustering using unsupervised dimensionality reduction

An unsupervised dimensionality reduction algorithm takes as input a matrix NxC1, where N is the number of input vectors and C1 is the number of components of each vector (the dimensionality of the vector). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of components.
A fuzzy clustering algorithm takes as input a matrix NxC1 where N, here again, is the number of input vectors and C1 is the number of components of each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector indicates the degree to which the vector belongs to the corresponding cluster.
I noticed that the input and output of both classes of algorithms have the same structure; only the interpretation of the results changes. Moreover, there is no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it a non-sense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or excluding principal components. We have projected into a new, variance-preserving space with essentially few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird, but one could shift and rescale the data without much difficulty.
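A hedged sketch of that shift-and-rescale step (the toy texts and the row normalization are mine, not the answer's): reduce TF-IDF vectors with TruncatedSVD, shift them to be non-negative, and rescale each row to sum to 1 so it can be read as a soft membership vector, as discussed next.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import numpy as np

texts = ["the cat sat", "a dog barked", "cats and dogs play", "stock markets fell"]
X = TfidfVectorizer().fit_transform(texts)

Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)   # N x C2
Z = Z - Z.min()                                 # shift so all entries are non-negative
memberships = Z / Z.sum(axis=1, keepdims=True)  # each row now sums to 1
print(np.round(memberships, 2))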
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering; notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>, <0,1,0>, <0,100,101>, <5,100,99>. A fuzzy clustering would give something like {p1,p2}, {p3,p4} if thresholded, but if we take the max here (i.e. treat the dimensionally-reduced axes as memberships), we get {p1,p3}, {p2,p4} for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
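A tiny sketch of that example (the single-linkage choice is mine): argmax over the reduced axes groups the points one way, while a distance-based grouping keeps the two near-origin points together.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

pts = np.array([[0, 0, 1], [0, 1, 0], [0, 100, 101], [5, 100, 99]], dtype=float)

# treat each reduced axis as a membership weight and take the max:
print(pts.argmax(axis=1))     # -> [2 1 2 1], i.e. {p1, p3} and {p2, p4}

# a distance-based grouping instead puts the two near-origin points together:
print(fcluster(linkage(pts, method='single'), t=2, criterion='maxclust'))   # {p1, p2} vs {p3, p4}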
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS which focuses on preserving distances between data points rather than variances is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.

DBSCAN using spatial and temporal data

I am looking at data points that have lat, lng, and date/time of event. One of the algorithms I came across when looking at clustering algorithms was DBSCAN. While it works ok at clustering lat and lng, my concern is it will fall apart when incorporating temporal information, since it's not of the same scale or same type of distance.
What are my options for incorporating temporal data into the DBSCAN algorithm?
Look up Generalized DBSCAN by the same authors.
Sander, Jörg; Ester, Martin; Kriegel, Hans-Peter; Xu, Xiaowei (1998). Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery (Berlin: Springer-Verlag) 2(2): 169–194. doi:10.1023/A:1009745219419.
For (Generalized) DBSCAN, you need two functions:
- findNeighbors - get all "related" objects from your database
- corePoint - decide whether this set is enough to start a cluster
Then you can repeatedly find neighbors to grow the clusters.
Function 1 is where you want to hook in, for example by using two thresholds: one that is geographic and one that is temporal (e.g. within 100 miles and within 1 hour).
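A minimal sketch of such a two-threshold neighborhood predicate (this is an illustration, not the GDBSCAN reference implementation; the flat-earth distance and parameter names are placeholders):

import numpy as np

def find_neighbors(points, i, eps_km=100.0, eps_hours=1.0):
    # points: array of (lat, lng, t_hours); returns indices of neighbors of point i
    lat0, lng0, t0 = points[i]
    km_per_deg = 111.0                                   # rough conversion, fine for a sketch
    d_geo = km_per_deg * np.hypot(points[:, 0] - lat0, points[:, 1] - lng0)
    d_time = np.abs(points[:, 2] - t0)
    return np.where((d_geo <= eps_km) & (d_time <= eps_hours))[0]

def is_core_point(neighbor_indices, min_pts=5):
    return len(neighbor_indices) >= min_pts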
tl;dr: you are going to have to modify your feature set, i.e. scale your date/time to match the magnitude of your geo data.
DBSCAN's input is simply a vector, and the algorithm itself doesn't know that one dimension (time) is orders of magnitude bigger or smaller than another (distance). Thus, when calculating the density of data points, the difference in scaling will screw it up.
Now, I suppose you can modify the algorithm itself to treat different dimensions differently. This can be done by changing the definition of "distance" between two points, i.e. supplying your own distance function instead of using the default Euclidean distance.
IMHO, though, the easier thing to do is to scale one of your dimensions to match the other: just multiply your time values by a fixed linear factor so they are on the same order of magnitude as the geo values, and you should be good to go (see the sketch below).
More generally, this is part of the feature selection process, which is arguably the most important part of solving any machine learning problem. Choose the right features, and transform them correctly, and you'd be more than halfway to a solution.
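Here is a hedged sketch of the scaling approach with sklearn's DBSCAN (the time_scale factor and the toy coordinates are purely illustrative and would need tuning for real data):

import numpy as np
from sklearn.cluster import DBSCAN

lat = np.array([40.71, 40.72, 34.05])
lng = np.array([-74.00, -74.01, -118.24])
t_hours = np.array([0.0, 0.5, 48.0])              # event times, in hours

time_scale = 1.0                                  # how many "degrees" one hour is worth; tune this
X = np.column_stack([lat, lng, t_hours * time_scale])

print(DBSCAN(eps=1.0, min_samples=2).fit_predict(X))   # first two events cluster, the third is noise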

Difference between similarity strategies in Mahout recommenditembased

I am using Mahout's recommenditembased algorithm. What are the differences between all the --similarity classes available? How do I know which is the best choice for my application? These are my choices:
SIMILARITY_COOCCURRENCE
SIMILARITY_LOGLIKELIHOOD
SIMILARITY_TANIMOTO_COEFFICIENT
SIMILARITY_CITY_BLOCK
SIMILARITY_COSINE
SIMILARITY_PEARSON_CORRELATION
SIMILARITY_EUCLIDEAN_DISTANCE
What does each one mean?
I'm not familiar with all of them, but I can help with some.
Cooccurrence is how often two items occur with the same user. http://en.wikipedia.org/wiki/Co-occurrence
Log-Likelihood is the log of the probability that the item will be recommended given the characteristics you are recommending on. http://en.wikipedia.org/wiki/Log-likelihood
I'm not sure about Tanimoto.
City block distance is the distance between two instances if you assume you can only move around a grid, as in a checkerboard-style city. http://en.wikipedia.org/wiki/Taxicab_geometry
Cosine similarity is the cosine of the angle between the two feature vectors. http://en.wikipedia.org/wiki/Cosine_similarity
Pearson Correlation is covariance of the features normalized by their standard deviation. http://en.wikipedia.org/wiki/Pearson_correlation_coefficient
Euclidean distance is the standard straight line distance between two points. http://en.wikipedia.org/wiki/Euclidean_distance
To determine which is best for your application, you most likely need to have some intuition about your data and what it means. If your data consists of continuous-valued features, then something like Euclidean distance or Pearson correlation makes sense. If you have more discrete values, then something along the lines of city block or cosine similarity may make more sense.
Another option is to set up a cross-validation experiment where you see how well each similarity metric works to predict the desired output values and select the metric that works the best from the cross-validation results.
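As a generic illustration of that cross-validation idea (not Mahout-specific; X and y stand for whatever features and target you can evaluate offline), one could score a k-NN model under each candidate metric and keep the best:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for metric in ["euclidean", "manhattan", "cosine"]:
    knn = KNeighborsClassifier(n_neighbors=10, metric=metric)
    print(metric, cross_val_score(knn, X, y, cv=5).mean())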
Tanimoto and Jaccard are similar; both are statistics used for comparing the similarity and diversity of sample sets.
https://en.wikipedia.org/wiki/Jaccard_index

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained a centroid classifier on that collection. What I want to do is feed the classifier documents from different domains and determine how relevant they are to the trained domain. I can use cosine similarity for this to get a numerical value, but my question is: what is the best way to determine the threshold value?
For this, I can download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go; is it statistically sound? What are the other approaches to this?
Actually, there is another issue with centroids of sparse vectors: they usually are significantly less sparse than the original data. For example, this increases computation costs. And it can yield vectors that are themselves actually atypical because they have a different sparsity pattern. This effect is similar to using arithmetic means of discrete data: say the mean number of doors in a car is 3.4; yet obviously no car exists that actually has 3.4 doors. So, in particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid! So how "central" is the centroid then, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
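A small sketch of the medoid suggestion (the function name and the dense-input assumption are mine): pick the actual document whose summed cosine distance to all other documents in the class is smallest, instead of the mean vector.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def medoid_index(X):
    # X: dense array of document vectors for one class; returns the row index of the medoid
    d = squareform(pdist(X, metric="cosine"))
    return int(d.sum(axis=1).argmin())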
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in the other answer (by Anony-Mousse), this won't necessarily provide you with good or usable answers, but it just might. Using an ML framework for this procedure, e.g. WEKA, will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
1. Generate a train set by finding a decent amount of documents representing each of your classes (to get valid estimations, I'd recommend at least a few dozen per class).
2. Calculate the distance from each document to each of your centroids.
3. Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature - the distance to the nearest centroid - or use all distances, if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A, would be: "0.2,A".
4. Save this set in ARFF or CSV format (a sketch of steps 2-3 follows below), load it into WEKA, and try classifying, e.g. using a J48 tree.
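A hedged sketch of steps 2-3 in Python (the names docs, labels, and centroids are placeholders for your dense document vectors, class labels, and per-class centroids): write one cosine distance per centroid, plus the class, into a CSV file that WEKA can load.

import csv
import numpy as np
from scipy.spatial.distance import cdist

dists = cdist(docs, centroids, metric="cosine")   # one row per document, one column per centroid

with open("centroid_distances.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["dist_to_centroid_%d" % i for i in range(dists.shape[1])] + ["class"])
    for row, label in zip(dists, labels):
        writer.writerow(list(np.round(row, 4)) + [label])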
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds by collecting additional train documents for problematic classes, either by recreating the centroids or by retraining the thresholds classifier.
