Speeding up word embedding comparisons using cosine distance - machine-learning

I need to find the closest match between one word embedding vector and 6,000 others. Right now I'm just doing 6,000 pairwise comparisons, which seems inefficient.
Since the SciPy spatial.distance.cosine function is single-threaded, one way to speed this up would be to parallelize it, but I'm more interested in better algorithmic ideas. Would clustering or dimensionality reduction help?
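A minimal sketch of the brute-force route done efficiently, assuming the 6,000 embeddings sit in a NumPy array: pre-normalising the rows turns the whole search into a single matrix-vector product, which is usually fast enough at this scale. The array shapes and names below are illustrative, not from the original post.

import numpy as np

# Hypothetical stand-in data: `embeddings` is (6000, d), `query` is (d,).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6000, 300))
query = rng.normal(size=300)

# Normalise once; cosine similarity then reduces to a dot product.
unit_embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

similarities = unit_embeddings @ unit_query   # shape (6000,)
best_index = int(np.argmax(similarities))     # closest match = smallest cosine distance
print(best_index, similarities[best_index])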

Related

K-Nearest Neighbor - how many reference points/features?

I want to use KNN to create a training model (I will use other ML models as well), but I'm just wondering about the following.
I have around 6 features, with a total of about 60,000 reference points (so around 10,000 reference points per feature).
I know that this is, from a computational point of view, not ideal for an algorithm like KNN, so should I use, for example, KD-trees (or is plain KNN okay for this number of features/reference points)?
If I have to calculate the distance between my test point and all the reference points (with, for example, Euclidean distance in a multi-dimensional space), I imagine it will take quite some time.
I know that other (supervised) ML algorithms are maybe more efficient, but KNN is only one of the algorithms I will use.
The time complexity of (naive) KNN would be O(kdn), where d is the dimensionality (6 in your case) and n is the number of points (60,000 in your case).
Meanwhile, building a KD-tree from n points is O(dn log n), with subsequent nearest-neighbor lookups taking O(k log n) time. This is definitely much better: you sacrifice a little time upfront to build the KD-tree, but each KNN lookup later is much faster.
This is all under the assumption that your points are distributed in a "nice" way (see: https://en.wikipedia.org/wiki/K-d_tree#Degradation_in_performance_when_the_query_point_is_far_from_points_in_the_k-d_tree for more details). If they aren't distributed in a "nice" way, then KNN in general might not be the way to go.
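As a concrete illustration of the KD-tree route, here is a minimal sketch using SciPy's cKDTree; the data is random stand-in data with the shapes described above, and k=5 is an arbitrary choice.

import numpy as np
from scipy.spatial import cKDTree

# Toy stand-in for the 60,000 x 6 data set described above.
rng = np.random.default_rng(0)
points = rng.normal(size=(60_000, 6))
test_point = rng.normal(size=6)

tree = cKDTree(points)                            # O(d n log n) build cost, paid once
distances, indices = tree.query(test_point, k=5)  # k nearest neighbours, roughly O(k log n) per query
print(indices, distances)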

How to select features for clustering?

I have time-series data, which I have aggregated into 3 weeks and transposed into features.
Now I have features: A_week1, B_week1, C_week1, A_week2, B_week2, C_week2, and so on.
Some of the features are discrete, some continuous.
I am thinking of applying K-Means or DBSCAN.
How should I approach the feature selection in such situation?
Should I normalise the features? Should I introduce some new ones that would somehow link the periods together?
Since k-means and DBSCAN are unsupervised learning algorithms, feature selection for them is usually tied to a grid search: try different feature subsets and parameters and evaluate each resulting clustering with internal measures such as the Davies-Bouldin index or the silhouette coefficient. If you're using Python, scikit-learn's exhaustive grid search utilities can drive that search (see the scikit-learn documentation).
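A minimal sketch of that idea: rather than GridSearchCV (which expects a supervised scorer), it loops over a ParameterGrid and ranks clusterings by internal measures. The synthetic data, the grid, and the choice of scores are purely illustrative.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.model_selection import ParameterGrid

# Synthetic stand-in for the aggregated weekly features.
X, _ = make_blobs(n_samples=500, n_features=9, centers=4, random_state=0)

grid = ParameterGrid({"n_clusters": [2, 3, 4, 5, 6]})
best = None
for params in grid:
    labels = KMeans(n_init=10, random_state=0, **params).fit_predict(X)
    sil = silhouette_score(X, labels)       # internal measure: higher is better
    dbi = davies_bouldin_score(X, labels)   # internal measure: lower is better
    if best is None or sil > best[0]:
        best = (sil, dbi, params)

print("best by silhouette:", best)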
Formalize your problem, don't just hack some code.
K-means minimizes the sum of squares. If the features have different scales, they get different influence on the optimization. Therefore, you need to choose the weights (scaling factors) of each variable carefully to balance their importance the way you want (and note that a 2x scaling factor does not make the variable twice as important).
For DBSCAN, the distance is only a binary decision: close enough, or not. If you use the GDBSCAN version, this is easier to understand than with distances. But with mixed variables, I would suggest using the maximum norm. Two objects are then close if they differ in each variable by at most "eps". You can set eps=1 and scale your variables such that 1 is a "too big" difference. For example, in discrete variables you may want to tolerate one or two discrete steps, but not three.
Logically, it's easy to see that the maximum-distance threshold decomposes into a conjunction of one-variable clauses:
maxdistance(x, y) <= eps
<=>
for all i: |x_i - y_i| <= eps
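A minimal sketch of that recipe with scikit-learn's DBSCAN, using the Chebyshev (maximum) metric and per-variable scaling. The data, the scale factors, and eps are purely illustrative assumptions; the domain-specific part is choosing what counts as a "too big" difference per variable.

import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two continuous columns and one discrete column.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 5, size=200),      # continuous, large scale
    rng.normal(0, 0.1, size=200),    # continuous, small scale
    rng.integers(0, 10, size=200),   # discrete steps
])

# Scale each variable so that a difference of 1.0 means "too big",
# e.g. tolerate two discrete steps -> divide the discrete column by 2.
scales = np.array([10.0, 0.2, 2.0])  # assumed, domain-specific choices
X_scaled = X / scales

# Maximum (Chebyshev) norm: two points are neighbours only if every
# variable differs by at most eps.
labels = DBSCAN(eps=1.0, metric="chebyshev", min_samples=5).fit_predict(X_scaled)
print(np.unique(labels))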

How does the NEAT speciation algorithm work?

I've been reading up on how NEAT (NeuroEvolution of Augmenting Topologies) works and I've got the main idea of it, but one thing that's been bothering me is how you split the different networks into species. I've gone through the algorithm, but it doesn't make a lot of sense to me, and the paper I read doesn't explain it very well either, so if someone could give an explanation of what each component is and what it's doing, that would be great. Thanks.
The two equations, taken from the original paper, are the compatibility distance
delta = (c1 * E) / N + (c2 * D) / N + c3 * W
where E and D are the numbers of excess and disjoint genes, W is the average weight difference of matching genes, and N normalizes for genome size, and the explicit fitness sharing formula
f'_i = f_i / sum_j sh(delta(i, j))
Speciation in NEAT is similar to fitness sharing used by other evolutionary algorithms. The idea is to penalize similar solutions, creating a pressure toward a more diverse population.
The delta term is a measure of distance between two solutions. The measure of distance used here is specialized for the variable-length genomes used by NEAT. Small delta values indicate more similar solutions.
The sharing function used in NEAT yields 1 if the distance between two solutions is below a given compatibility threshold and 0 otherwise. Each solution is compared to every other solution in the population, and its fitness is divided by the sum of the resulting sharing function values (its niche count). If a solution is similar to several other solutions in the population, its adjusted fitness will be significantly reduced.
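A rough sketch of that sharing scheme in Python. The genome representation and the distance function here are placeholders, not NEAT's actual compatibility distance; the point is only how the niche count penalizes similar solutions.

def shared_fitness(population, fitness, distance, threshold):
    """Divide each fitness by its niche count: the number of genomes
    whose compatibility distance falls under `threshold`."""
    adjusted = []
    for i, genome in enumerate(population):
        # sh(delta) is 1 when delta <= threshold, otherwise 0.
        niche_count = sum(
            1 for other in population if distance(genome, other) <= threshold
        )
        adjusted.append(fitness[i] / niche_count)  # niche_count >= 1 (includes self)
    return adjusted

# Toy usage: scalar "genomes" with absolute difference as the distance.
# The three similar solutions share their fitness; the lone one keeps it.
population = [0.0, 0.1, 0.2, 5.0]
fitness = [10.0, 10.0, 10.0, 10.0]
print(shared_fitness(population, fitness, lambda a, b: abs(a - b), threshold=0.5))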

Clustering a huge number of URLs

I have to find similar URLs like
'http://teethwhitening360.com/teeth-whitening-treatments/18/'
'http://teethwhitening360.com/laser-teeth-whitening/22/'
'http://teethwhitening360.com/teeth-whitening-products/21/'
'http://unwanted-hair-removal.blogspot.com/2008/03/breakthroughs-in-unwanted-hair-remo'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html'
'http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-by-shaving.ht'
and gather them in groups or clusters. My problems:
The number of URLs is large (1,580,000)
I don't know which clustering algorithm or method of finding similarities is best
I would appreciate any suggestion on this.
There are a few problems at play here. First, you'll probably want to "wash" the URLs with a dictionary (segment them into words), for example to convert
http://teethwhitening360.com/teeth-whitening-treatments/18/
to
teeth whitening 360 com teeth whitening treatments 18
then you may want to stem the words somehow, e.g. using the Porter stemmer:
teeth whiten 360 com teeth whiten treatment 18
Then you can use a simple vector-space model to map the URLs into an n-dimensional space and run k-means clustering on them (see the sketch below). It's a basic approach, but it should work.
The number of URLs involved shouldn't be a problem; it depends on the language/environment you're using. I would think Matlab would be able to handle it.
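A minimal sketch of that pipeline with scikit-learn, assuming NLTK is available for the Porter stemmer. Proper dictionary-based word segmentation ("washing") is skipped; the URLs are simply split on non-alphanumeric characters, and the cluster count and other parameters are illustrative.

import re

from nltk.stem import PorterStemmer          # assumed available; any stemmer would do
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

urls = [
    "http://teethwhitening360.com/teeth-whitening-treatments/18/",
    "http://teethwhitening360.com/laser-teeth-whitening/22/",
    "http://teethwhitening360.com/teeth-whitening-products/21/",
    "http://unwanted-hair-removal.blogspot.com/2008/03/unwanted-hair-removal-products.html",
]

stemmer = PorterStemmer()

def url_to_text(url):
    # Drop the scheme, split on anything that is not a letter or digit, stem each token.
    tokens = re.split(r"[^a-zA-Z0-9]+", url.split("://", 1)[-1])
    return " ".join(stemmer.stem(t) for t in tokens if t)

documents = [url_to_text(u) for u in urls]
vectors = TfidfVectorizer().fit_transform(documents)   # sparse vector-space representation

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(list(zip(labels, urls)))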
Tokenizing and stemming are obvious things to do. You can then turn these tokens into sparse TF-IDF vectors easily. Crawling the actual web pages to get additional tokens is probably too much work.
After this, you should be able to use any flexible clustering algorithm on the data set. By flexible I mean that you need to be able to use, for example, cosine distance instead of Euclidean distance (which does not work well on sparse vectors). k-means in GNU R, for example, only supports Euclidean distance and dense vectors, unfortunately. Ideally, choose a framework that is very flexible but also optimizes well. If you want to try k-means, since it is a simple (and thus fast) and well-established algorithm, I believe there is a variant called "convex k-means" that could be applicable to cosine distance and sparse TF-IDF vectors.
Classic "hierarchical clustering" (apart from being outdated and not performing very well) is usually a problem due to the O(n^3) complexity of most algorithms and implementations. There are some specialized cases where an O(n^2) algorithm is known (SLINK, CLINK), but often the toolboxes only offer the naive cubic-time implementation (including GNU R, Matlab, and SciPy, from what I just googled). Plus, again, they often have only a limited choice of distance functions available, probably not including cosine.
The methods are, however, often easy enough to implement yourself, in a way optimized for your actual use case.
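Not the "convex k-means" variant mentioned above, but a common workaround along the same lines: on L2-normalised vectors, squared Euclidean distance is a monotone function of cosine distance (||a - b||^2 = 2 - 2*cos(a, b)), so ordinary k-means on normalised TF-IDF vectors is a reasonable proxy for cosine-based clustering. A sketch with toy documents:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

documents = [
    "teeth whiten treatment", "laser teeth whiten", "teeth whiten product",
    "unwanted hair removal product", "unwanted hair removal shaving",
]

tfidf = TfidfVectorizer().fit_transform(documents)
# TfidfVectorizer already L2-normalises rows by default; the explicit call
# just makes the intent visible.
unit = normalize(tfidf)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(unit)
print(labels)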
These two research papers published by Google and Yahoo respectively go into detail on algorithms for clustering similar URLs:
http://www.google.com/patents/US20080010291
http://research.yahoo.com/files/fr339-blanco.pdf

Levenshtein Distance Algorithm better than O(n*m)?

I have been looking for an advanced Levenshtein distance algorithm, and the best I have found so far is O(n*m), where n and m are the lengths of the two strings. The reason the algorithm is at this scale is the space, not the time: it builds an (n+1) x (m+1) matrix over the two strings.
Is there a publicly available Levenshtein algorithm which is better than O(n*m)? I am not averse to looking at advanced computer-science papers and research, but I haven't been able to find anything. I have found one company, Exorbyte, which supposedly has built a super-advanced and super-fast Levenshtein algorithm, but of course that is a trade secret. I am building an iPhone app in which I would like to use the Levenshtein distance calculation. There is an Objective-C implementation available, but with the limited amount of memory on iPods and iPhones, I'd like to find a better algorithm if possible.
Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you are only interested in the edit distance and not in reconstructing the edit sequence, you only need to keep the last two rows of the matrix in memory, so the space will be O(n).
If you can afford to approximate, there are poly-logarithmic approximations.
For the O(n + d^2) algorithm, look for Ukkonen's optimization or its enhancement, Enhanced Ukkonen. The best approximation that I know of is the one by Andoni, Krauthgamer, and Onak.
If you only want the threshold function - e.g., to test whether the distance is under a certain threshold - you can reduce the time and space complexity by only calculating the values in a band around the main diagonal of the matrix. You can also use Levenshtein automata to evaluate many words against a single base word in O(n) time - and the construction of the automata can be done in O(m) time, too.
Look at Wikipedia - they have some ideas for improving this algorithm to a better space complexity:
Wiki-Link: Levenshtein distance
Quoting:
We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
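A sketch of that two-row trick in Python (the thread is about Objective-C, but the idea carries over directly): time stays O(n*m), while space drops to O(min(m, n)) because only the previous and current rows are kept.

def levenshtein(a, b):
    """Edit distance keeping only two rows of the DP matrix."""
    if len(a) < len(b):
        a, b = b, a                      # keep b as the shorter string -> O(min(m, n)) space
    previous = list(range(len(b) + 1))   # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                    # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # deletion
                               current[j - 1] + 1,      # insertion
                               previous[j - 1] + cost)) # substitution / match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3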
I found another optimization that claims to be O(max(m, n)):
http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C
(the second C implementation)
