Search vector vs. ALL others in Solr - machine-learning

Say I have a field in Solr which is an array of ints that looks something like this:
vector=array(469,323,324,119,74,58,68,59,49,40,32,26,21,17,14,12,10,9,7,5,-642,-184,-99,-84,-79,-63,-50,-38,-30,-21,-18,-16,-17,-16,14,25,52,21,15,93,53,52,32,15,61,29,346,20,69,72,38,165)
Is there a way to find either the k-nearest neighbors or the cosineSimilarity between this vector and that for all other documents matching a search in Solr?
I tried building a matrix manually but it was crashing Solr.
let(
a=search(satracks,
q="vector:*",
fl="vector",
qt="/export",
sort="vector desc"
),
b=col(a, vector),
mat1=matrix(b),
mat2=transpose(mat1),
testvector=array(469,323,324,119,74,58,68,59,49,40,32,26,21,17,14,12,10,9,7,5,-642,-184,-99,-84,-79,-63,-50,-38,-30,-21,-18,-16,-17,-16,14,25,52,21,15,93,53,52,32,15,61,29,346,20,69,72,38,165),
k=knn(mat2, testvector,5)
)
The documentation only shows random samples. I want to compare a vector to every other vector that matches a given search.

You can do this using Solr 9. First you have to add a dense vector field to every document in Solr. Then you can use knn in the query:
&q={!knn f=vectorized_field topK=10}[your_vector]
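As a minimal sketch (assuming Solr 9, the collection name satracks from the question, a DenseVectorField named vectorized_field whose vectorDimension matches your vector length, and the standard /select handler), the query could be issued from Python with requests:

import requests

# Hypothetical collection and field names; adjust to your own schema.
query_vector = [469, 323, 324, 119, 74, 58]  # truncated for brevity

params = {
    "q": "{!knn f=vectorized_field topK=10}" + str(query_vector),
    "fl": "id,score",
}
resp = requests.get("http://localhost:8983/solr/satracks/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print(doc["id"], doc["score"])

The knn query parser scores by vector similarity (as configured on the field), so this returns the topK nearest documents rather than requiring a manual matrix build.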

Related

Finding similar users based on String properties

I'm a software engineering student, new to data mining, and I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I think I cannot use k-nearest neighbors with an edit distance (Levenshtein or similar).
Could someone help with that, please?
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skill and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, whose i-th component is set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represent each user by a characteristic vector of her interests/skills.
Now you can compare users with the Jaccard index (it's just an example; you'll have to figure out what works best for you). With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
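As a rough illustration of the characteristic-vector idea (the user names and skills below are made up for the example), the Jaccard index can be computed directly from the users' property sets:

# Hypothetical users and skills, purely for illustration.
users = {
    "alice": {"java", "python", "data mining"},
    "bob": {"java", "c++"},
    "carol": {"python", "data mining", "statistics"},
}

# Canonical dictionary D of all distinct skills/interests.
dictionary = sorted(set().union(*users.values()))

def characteristic_vector(props):
    # 0/1 vector over the dictionary, as described above.
    return [1 if skill in props else 0 for skill in dictionary]

def jaccard(a, b):
    # Jaccard index between two users' property sets.
    return len(a & b) / len(a | b) if a | b else 0.0

print(characteristic_vector(users["alice"]))
print(jaccard(users["alice"], users["carol"]))  # shared: python, data mining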

Lucene full-text index: all indexed nodes with same score?

I have been trying to solve this issue for days.
I want to do a START query against a full-text index, ordered by relevance, so that I can paginate the results.
Luckily, I finally found this thread on full-text indexing and Neo4j (using Python as the driver):
[https://groups.google.com/forum/#!topic/neo4j/9G8fcjVuuLw]
I had imported my db with the batch super-importer, and got a reply from Michael Hunger, who kindly noticed there was a bug: all scores would have been imported with the same value.
So now I am recreating the index and checking the score via REST (&order=score):
http://localhost:7474/db/data/index/node/myInde?query=name:myKeyWord&order=score
and noticed that entries still have the same score.
(You have to do an AJAX query to see it, because if you use the web console you won't see all the data!)
My code to recreate a full-text Lucene index on each node's 'name' property
(here using neo4j-rest-client, but I will also try with py2neo, as in the Google discussion):
from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")
myIndex = gdb.nodes.indexes.create("myIndex", type="fulltext", provider="lucene")
# 'node' is an existing node object with a 'name' property (this runs for each node)
myIndex.add("name", node.get("name"), node)
results:
http://localhost:7474/db/data/index/node/myInde?query=name:DNA&order=score
data Object {id: 17062920, name: "DNA damage theory of aging"}
score 11.097855567932129
...
data Object {id: 17022698, name: "DNA (film)"}
score 11.097855567932129
In the documentation:
[http://neo4j.com/docs/stable/indexing-lucene-extras.html#indexing-lucene-sort]
it is written that Lucene does the sorting itself very well, so I understood that it creates a ranking by itself at import time; it does not.
What am I doing wrong or missing?
I believe the issue you are seeing is caused by a combination of the text you are indexing, the query term(s), and, as Michael Hunger pointed out, the current Lucene configuration in Neo4j, which has OMITNORMS=true. With this setting, a Lucene query like the ones in your posted examples, where the texts have different lengths but the query term appears once in each document, often results in the same Lucene relevancy score. The reason is that the size/length of the document being indexed (field length normalization) is NOT taken into account when OMITNORMS is true.
Looking at your examples it is not clear what your expected results are. For example, are you expecting documents with shorter text to appear first?
In my own experience using lucene and Neo4j I have seen many instances where the relevancy scores being returned are different across different queries.
The goal of my question is to obtain a list of results ordered by relevance of nodes' names matching the queried keywords.
@mfkilgore pointed out this workaround:
start n=node:topic('name:(keyword1* AND keyword2*)') MATCH (n) with n order by length(split(n.name," ")) asc limit 20 return n
This workaround splits the node's name on spaces and orders by the resulting word count, shortest names first.

Search queries in neo4j: how to sort results in neo4j in START query with internal TFIDF / levenshtein or other algorithms?

I am working on a model using wikipedia topics' names for my experiments in full-text index.
I set up an index on 'topic' (legacy), and do a full-text search for 'united states':
start n=node:topic('name:(united states)') return n
The first results are not relevant at all:
'List of United States National Historic Landmarks in United States commonwealths and territories, associated states, and foreign states'
[...]
and the actual 'united states' is buried deep down the list.
As such, it raises the problem that, in order to find the best match on the results (e.g. with Levenshtein, bi-gram, and similar algorithms), you first must fetch all the items matching the pattern.
That would be a serious constraint, because just in this case I have 21K rows, taking ~4 seconds.
Which algorithms does neo4j use to order the results of a full-text search (START)?
What rationale does it use to sort results, and how can I change it using Cypher?
The docs say to use the Java API to apply sort(); it would be very useful to have a tutorial pointing out which files to modify, and also to know which ranking rationale is used before any tweak.
EDITED based on comments below - pagination of results is possible as:
start n=node:topic('name:(united states)') return n skip 10 limit 50;
(skip before limit), but I need to ensure the first results are meaningful before paginating.
I don't know which ordering algorithm Lucene uses to order the results.
However, regarding the pagination, if you change the order of limit and skip as follows, it should be OK.
start n=node:topic('name:(united states)') return n skip 10 limit 50 ;
I would also add that if you are performing full-text search, maybe a solution like Solr is more appropriate.
For just a lucene index lookup with scoring you might be better off with this:
http://neo4j.com/docs/stable/rest-api-indexes.html#rest-api-find-node-by-query
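As a rough sketch (the index name 'topic', the host, and the order=score parameter are taken from the question and the linked docs; the exact response shape may differ by Neo4j version), such a scored lookup could be done from Python like this:

import requests

# Legacy Neo4j REST index lookup, ordered by Lucene relevance score.
resp = requests.get(
    "http://localhost:7474/db/data/index/node/topic",
    params={"query": "name:(united states)", "order": "score"},
)
for hit in resp.json():
    print(hit["data"].get("name"))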

How to get a search ranking based on multiple factors in sphinx?

Hello stackoverflow folks,
We have a Rails project which is growing and growing, and we are now getting our first performance problems with search, because we don't know how to utilize Sphinx properly for our needs.
We have search queries like "Java PHP Software developer". Our problem now is that the ranking should take multiple things into account.
As search fields we have tag list, description and title.
If one of the terms is inside one of the fields it should get, for example, 2 points. More points if it's in more fields, but not multiple points if it is in the same field more than once.
The next problem is that I have a big file with synonyms which should also be checked. It looks like this:
Java > Java
Java-EE > Java
...
So if Java-EE is found it should get some points too but with a penalty for being a synonym.
Maximum amount of points would be 5 as in 5 stars which get displayed.
Any speedy solution would be nice, because at the moment it's done in plain Ruby and it gets slow, because we can't rank properly in Sphinx.
If there is a solution with another search engine that would also be very nice, as it could be changed.
Thanks in advance for all efforts. All spelling corrections and clarifying questions are welcome.
Most of the performance issues can be solved by changing the way you use Sphinx. First you need to address how you index the data in Sphinx. Doing some processing while indexing will make the search quicker and the results more relevant. Second, tackle the search terms, and last but not least, decide on the ranking algorithm to use.
I am going to use the "title" field as an example, but the logic can be replicated for all fields.
Indexing
Add two fields to sphinx ("title" and "title_synonyms"). For each record in the database do the following :-
Perform a DISTINCT on the words to remove duplicates ("Ruby Developer / Java Developer" will become "Ruby Developer / Java"). This will stop records from getting two scores for duplicates when searching. This goes into "title".
Take the DISTINCT title from above and REPLACE all the words with their expanded synonym equivalents. I would suggest putting the synonyms in the DB to make the expansion easier. The text would then become "Ruby Developer / Java-EE". Each word must be replaced with all the synonyms. If Java has two synonyms, they both must be in the field. This goes into "title_synonyms"
Searching
Because there are now two fields in sphinx we can give them each a different weight; "title" can get a weight of "10" and "title_synonyms" a weight of "3". That means a record has to match 4 synonyms before it ranks higher than one with the original title. You can play around with the weights to suit your needs.
Let's assume a user was searching for "Java Developer". For the search phrase do the following :-
Remove duplicate words
Get synonyms for each word in the search phrase
Set Matching Mode in Sphinx to SPH_MATCH_EXTENDED
The above rules will mean the search in sphinx looks like this :-
#title "Java Developer" | #title_synonyms "Java-EE"
If you want to rank exact matches higher than lexemes, the search query would look like this :-
#title ("Java Developer" | "=Java =Developer") | #title_synonyms ("Java-EE" | "=Java-EE")
You will need to use SPH_RANK_PROXIMITY_BM25 or SPH_RANK_SPH04 to make this work properly though.
Ranking
You can try any of the built in ranking algorithms to see what the results look like. I recommend SPH_RANK_MATCHANY or SPH_RANK_WORDCOUNT as a start.
For Proximity and exact match ranking use SPH_RANK_PROXIMITY_BM25, SPH_RANK_SPH04 or SPH_RANK_EXPR where you can use your own algorithm.
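As a minimal sketch of the searching/ranking setup described above (assuming the official sphinxapi Python client, an index named "jobs", and the two fields title and title_synonyms; the Rails app itself would use the equivalent Ruby bindings), the weighted query could look like this:

import sphinxapi

cl = sphinxapi.SphinxClient()
cl.SetServer("localhost", 9312)
cl.SetMatchMode(sphinxapi.SPH_MATCH_EXTENDED)
cl.SetRankingMode(sphinxapi.SPH_RANK_SPH04)
# Original-title matches count much more than synonym matches.
cl.SetFieldWeights({"title": 10, "title_synonyms": 3})

# Search phrase "Java Developer", expanded with its synonym "Java-EE".
result = cl.Query('@title ("Java Developer" | "=Java =Developer") | @title_synonyms ("Java-EE")', "jobs")
if result:
    for match in result["matches"]:
        print(match["id"], match["weight"])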
Conclusion
You should now have a search that is both fast and accurate. Very little work has to be done by your Ruby application and most of the work is done inside sphinx (where it should be).
Hope this helps...
This performance problem is an algorithm problem.
If you cannot express the problem in a way that lets a backend tool, like Sphinx or the database engine, do the work, then you end up doing the processing in Ruby, and that is an easy way to get a performance problem.
First, do as much as you can with Sphinx (or whatever other search engine) and the database. The more pre-digested the data coming into Ruby, the less you have to do in Ruby code, and that will likely be faster, since databases have been highly optimized over the last half century.
So, for example, run sphinx on the key words. Also run sphinx on the synonyms. Limit all the answers to the top results, and merge the results. That way your ruby code will be limited to the likely high results instead of having to consider the whole database of entries.
Once in ruby, the most important thing is to avoid high order algorithms, that is, make sure you are using a low order algorithm.
As you process your raw data, if you hold your top results in an array and try to sort or scan the array, you are going to get quadratic-like behavior: the work will be proportional to the product of the number of raw entries and the number of elements you keep in your array.
The best algorithms for your problem are a priority queue implemented by a heap-like container, or a B-tree. Both give you N·log(k) order, i.e. the number of raw data records times the log of the number of items you will keep in your container.
A heap is a binary tree, where each node in the tree (not just the leaves but each node) has a rated record. The nodes below each record all have lower ranks. This is called the heap condition.
There are algorithms for adding elements, taking the top-ranked element out, and replacing the lowest-ranked element, all of which maintain the heap condition. Look up binary heap on Wikipedia.
Let's say your site will display the top 100 ranked results. Maintain a heap where the root is the lowest ranked. Populate the heap by adding the first 100 raw records you process.
Now for record 101 and after, compare its rank with the root. If the new record is ranked higher, use the delete algorithm to reduce your heap to 99 nodes (which will remove the lowest ranked record in the heap) and add your new record to the heap.
Once you have gone through all your records, you will have the top 100 ranked results. The heap delete algorithm will pull them out in reverse order.
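A minimal sketch of that top-k selection with Python's heapq (the rank function and the sample records are placeholders, not part of the original answer):

import heapq

def top_k(records, k=100, rank=lambda r: r["score"]):
    # Keep the k highest-ranked records using a min-heap rooted at the lowest rank.
    heap = []  # entries are (rank, tiebreaker, record) tuples
    for i, rec in enumerate(records):
        if len(heap) < k:
            heapq.heappush(heap, (rank(rec), i, rec))
        elif rank(rec) > heap[0][0]:
            # New record beats the current lowest: replace the root.
            heapq.heapreplace(heap, (rank(rec), i, rec))
    # Sort descending so the highest-ranked record comes first.
    return [rec for _, _, rec in sorted(heap, reverse=True)]

print(top_k([{"score": s} for s in (3, 9, 1, 7, 5)], k=3))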

Is there a MongoDB Trending Topics Gem?

I have a group of documents in MongoDB with a "description" value about the size of a tweet. I need to generate a trending topics list from this. Clearly this is a solved problem but I can't find a definitive answer/gem for getting the job done without writing the code myself.
I am using ruby & mongoid in my app.
Is there any ruby gem that will help with or handle this? Thanks.
I know of no such gem, but here's an algorithm you may write for yourself:
Extract n-grams from texts. Since texts are small (tweet size you said) extract all n-grams, no limit here.
"I eat icecream" => {(I), (eat), (icecream), (I eat), (eat icecream), (I eat icecream)}
Compute TF-IDF weight vectors for each text's n-grams
{(I):0.1, (eat):0.01, (icecream):0.2, (I eat):0.12, (eat icecream):0.001, (I eat icecream):0.00012}
Use cosine similarity as a measure function for an incremental clustering algorithm over your vectors; maybe script the Weka library over JRuby.
Order all clusters by population size. The n-grams in the centers of the largest clusters are your trending topics.
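A rough sketch of steps 1-3 using scikit-learn (the library choice and the sample texts are my own assumptions, not part of the original answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up tweet-sized texts for illustration.
texts = [
    "I eat icecream",
    "icecream is the best dessert",
    "I eat pizza every day",
]

# All word n-grams up to length 3, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf = vectorizer.fit_transform(texts)

# Pairwise cosine similarities, ready to feed an incremental clustering step.
print(cosine_similarity(tfidf))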
A quick search of rubygems.org revealed that you are going to have to do some programming. This is a good thing, as a system to generically detect trends would either be hopelessly difficult to set up and tune, or awful at guessing what dictates a "trend" in your application.
I'm going to make some assumptions about your application.
Let's assume users are self-categorizing their tweets by using hashtags (#). Also, let's go ahead and say a sorted count of these hashtags would determine whether a topic is trending.
Now let's talk about the computer science part. Given our assumptions above, you will need to be able to quickly query and sort a collection of hashtags to figure out what is trending.
You are using MongoDB and mongoid (with Rails), so the simplest way to do this would be to create a collection of tag documents, each containing a count of that tag's use. Create indexes on tag and count.
When someone tweets, figure out what the hash tags are, look them up in the tags collection and increment their count. To figure out what is trending, query the tags collection and sort by count. This would get you all-time trending hash tags.
If you wanted to get more specific, instead of just storing counts, store counts broken out by time deltas (week, day, hour etc) perhaps storing them separately. You could create documents that represent your time delta instead of the individual tags and store all the tags with their counts inside.
{
  start: "start datetime",
  end: "end datetime",
  tags: {
    awesome: 3,
    cool: 2,
    boring: 2
  }
}
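A small sketch of the increment step with pymongo (the database, collection, and hourly bucket are assumptions; the original question uses mongoid, but the same $inc/upsert idea applies):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["myapp"]

def record_hashtags(tags):
    # Bump per-tag counters inside the current hourly bucket document.
    start = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
    end = start + timedelta(hours=1)
    db.tag_buckets.update_one(
        {"start": start, "end": end},
        {"$inc": {"tags.%s" % tag: 1 for tag in tags}},
        upsert=True,
    )

record_hashtags(["awesome", "cool"])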
You could also use a capped collection. Hope that helps, all of this really depends on what you are trying to do. You can get really crazy and calculate the trends with time decay, etc. You could read the reddit or hacker news code to get a good idea of what that is like.
