Search queries in Neo4j: how to sort results of a START query with internal TF-IDF, Levenshtein, or other algorithms?

I am working on a model that uses Wikipedia topic names for my experiments with full-text indexing.
I set up a legacy index on 'topic' and run a full-text search for 'united states':
start n=node:topic('name:(united states)') return n
The first results are not relevant at all:
'List of United States National Historic Landmarks in United States commonwealths and territories, associated states, and foreign states'
[...]
and the actual 'united states' is buried deep down the list.
This raises the problem that, in order to find the best match on the results (e.g. with Levenshtein, bi-gram, or similar algorithms), you must first fetch all the items matching the pattern.
That would be a serious constraint, because in this case alone I get 21K rows, which takes ~4 seconds.
Which algorithm does Neo4j use to order the results of a full-text search (START)?
What rationale does it use to sort results, and how can I change it using Cypher?
The docs say to use the Java API to apply sort(); it would be very useful to have a tutorial pointing out which files to modify, and also to know which ranking rationale is used before any tweaking.
EDIT (based on comments below): pagination of results is possible as:
start n=node:topic('name:(united states)') return n skip 10 limit 50;
(SKIP before LIMIT), but I need to ensure the first results are meaningful before paginating.

I don't know which algorithm Lucene uses to order the results.
However, regarding pagination: if you change the order of LIMIT and SKIP as follows, it should be OK.
start n=node:topic('name:(united states)') return n skip 10 limit 50 ;
I would also add that if you are performing full-text search, a solution like Solr may be more appropriate.

For just a Lucene index lookup with scoring, you might be better off with this:
http://neo4j.com/docs/stable/rest-api-indexes.html#rest-api-find-node-by-query
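For example, with the question's index the REST lookup would look something like this (the order parameter accepts index, relevance, or score; the index name here follows the question and is illustrative):
http://localhost:7474/db/data/index/node/topic?query=name:united%20states&order=score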

Related

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution: simply put, analyzing on which screens users leave the application.
For this, I have modeled the data as follows (model diagram omitted):
I modeled each activity as a single node because I want to index some attributes, and relationship properties cannot be indexed in Neo4j.
With this model, I am trying to write a query that follows three successive event types:
MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})
WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate
MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))
WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000
The execution plan (screenshot omitted) shows a cartesian product. I could not figure out the reason for this. Can you help me?
Your query is doing a lot more work than it needs to.
The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.
In addition, you could first add an index for :EventType(eventName) to speed up the processing:
CREATE INDEX ON :EventType(eventName);
With these changes, your query's profile could be simpler and the processing would be faster.
Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:
MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Try profiling them both on your DB, and pick the best one or keep tweaking further.
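For instance, prepending PROFILE to either query shows the operators and db hits, so you can compare the two plans directly:
PROFILE
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;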

Neo4j Recommendation Cypher Query Optimization

I am using Neo4j Community Edition embedded in a Java application for recommendation purposes. I made a custom function that contains complex logic for comparing two entities, namely products and users. Both entities are present as nodes in the graph and have more than 20 properties each for comparison. For example, I call this function in the following format:
match (e:User {user_id:"some-id"}) with e
match (f:Product {product_id:"some-id"}) with e,f
return e,f,findComparisonValue(e,f) as pref_value;
This function call takes about 4-5 ms on average. Now, to recommend the best products to a particular user, I wrote a Cypher query that iterates over all products, calculates the pref_value, and ranks them. My Cypher query looks like this:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN distinct reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above Cypher query takes more than 10 seconds, as it iterates over all the products. On top of this query, I am using the graphaware-reco module for my recommendation needs (precompute, filtering, post-processing, etc.). I thought of parallelising this, but Community Edition does not support clustering. Now, as the number of users in the system increases day by day, I need to think about a scalable solution.
Can anyone help me out here with how to optimize the query?
As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and it does not take advantage of Neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes, as sketched below.
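One hedged sketch of that direction: precompute the expensive comparison offline and store it as a scored relationship (the :RECOMMENDS relationship type and its score property are hypothetical names), so the online read becomes a cheap ordered traversal:
// offline/batch step: persist the comparison result per user-product pair
MATCH (u:User {user_id: 'some-id'}), (p:Product {product_id: 'some-id'})
MERGE (u)-[r:RECOMMENDS]->(p)
SET r.score = findComparisonValue(u, p).score;
// online step: recommendations are now a simple ordered read
MATCH (u:User {user_id: 'some-id'})-[r:RECOMMENDS]->(p:Product)
RETURN p, r.score AS score
ORDER BY score DESC
LIMIT 10;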
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products (see the statement after these suggestions). By the way, if that property is actually supposed to be a boolean, consider storing it as a boolean rather than a string.
The RETURN clause should not need the DISTINCT operator: every reco value is already distinct, so all result rows are distinct anyway. Removing that keyword should improve performance.
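A minimal sketch of the index suggestion (label and property taken from your query):
CREATE INDEX ON :Product(is_active);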

Lucene full-text index: all indexed nodes with same score?

I have been trying to solve this issue for days.
I want to run a START query against a full-text index, ordered by relevance, so I can paginate the results.
Thankfully, I finally found this thread on full-text indexing and Neo4j (using Python as the driver):
https://groups.google.com/forum/#!topic/neo4j/9G8fcjVuuLw
I had imported my DB with the batch super-importer, and got a reply from Michael Hunger, who kindly noticed there was a bug: all scores would have been imported with the same value.
So now I am recreating the index and checking the score via REST (&order=score):
http://localhost:7474/db/data/index/node/myIndex?query=name:myKeyWord&order=score
and noticed that entries still have the same score.
(You have to do an AJAX query to see it, because the web console does not show all the data!)
My code to recreate a full-text Lucene index on each node's 'name' property (here using neo4j-rest-client, but I will also try py2neo, as in the Google discussion):
from neo4jrestclient.client import GraphDatabase

gdb = GraphDatabase("http://localhost:7474/db/data/")
myIndex = gdb.nodes.indexes.create("myIndex", type="fulltext", provider="lucene")
# for each node to be indexed (looped over elsewhere), add its 'name' property under the key "name"
myIndex.add("name", node.get("name"), node)
Results:
http://localhost:7474/db/data/index/node/myIndex?query=name:DNA&order=score
data Object {id: 17062920, name: "DNA damage theory of aging"}
score 11.097855567932129
...
data Object {id: 17022698, name: "DNA (film)"}
score 11.097855567932129
In the documentation:
http://neo4j.com/docs/stable/indexing-lucene-extras.html#indexing-lucene-sort
it is written that Lucene does the sorting itself very well, so I understood that it builds a ranking by itself on import; it does not.
What am I doing wrong or missing?
I believe the issue you are seeing is caused by a combination of the text you are indexing, the query term(s), and, as Michael Hunger pointed out, the current Lucene configuration in Neo4j, which has OMITNORMS=true. With this setting, a Lucene query like your posted examples, where the documents' texts differ in length but the query term appears once in each document, often yields the same Lucene relevancy score. The reason is that the size/length of the document being indexed (field-length normalization) is NOT taken into account when OMITNORMS is true.
Looking at your examples it is not clear what your expected results are. For example, are you expecting documents with shorter text to appear first?
In my own experience using lucene and Neo4j I have seen many instances where the relevancy scores being returned are different across different queries.
The goal of my question is to obtain a list of results ordered by relevance of nodes' names matching the queried keywords.
@mfkilgore pointed out this work-around:
start n=node:topic('name:(keyword1* AND keyword2*)') MATCH (n) with n order by length(split(n.name," ")) asc limit 20 return n
This workaround splits a node's name on spaces, counts the resulting words, and orders by that count ascending, so that shorter names come first.

Is there anything like a "do while" match pattern that satisfies an aggregated value? (properties, etc.)

I don't know if this makes sense using Cypher or graph traversal, but I was trying to do a sort of "shortest path" query based not on weighted relationships but rather on aggregated properties.
Assume I have nodes labeled People that all visit different homepages, with a VISITS relationship to each homepage node. Each homepage node has hit statistics reflecting its popularity. Now I would like to match people that have a VISITS relationship to a homepage, until I reach a maximum of X exposures (hits).
Why? Because then I know an "expected" exposure strategy for a certain group of people.
Something like
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course, this "do while" is nothing I have seen in the Cypher syntax, but wouldn't it be useful? Or should this be handled at the application level, by just returning a DESC-ordered list and doing the math in the application? Maybe it should also handle the case where the loop cannot be satisfied.
MATCH (n:People)-[:VISITS]-(sites)
WITH reduce(hits = 0, x IN collect(sites.dailyhits) | hits + x) AS totalhits
RETURN totalhits;
This returns the correct aggregated hits value (over everything), but I would like it to run over each matched pattern only until it satisfies the value, and then return the match. (Of course I would miss other possible mixes of pages, because the match never traverses the entire graph, but at least I would get a list of pages that satisfies the requirement, if that makes sense.)
Thanks!
Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc.), and you can pass their results to a second part of the Cypher query with a WITH clause.
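For example, a minimal sketch of that pattern, assuming each site node carries a dailyhits property as in your query:
MATCH (n:People)-[:VISITS]->(site)
WITH n, sum(site.dailyhits) AS totalhits
WHERE totalhits < 100000
RETURN n, totalhits;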
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.
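If you really want the "accumulate until a budget is reached" behavior in pure Cypher, here is a hedged sketch: sort the sites, collect them, and keep only the prefix whose running dailyhits total stays under the budget. (dailyhits comes from your query; site.name is an assumed property. Note this still materializes every match first, so it does not avoid the full scan.)
MATCH (:People)-[:VISITS]->(site)
WITH DISTINCT site
ORDER BY site.dailyhits DESC
WITH collect(site.name) AS names, collect(site.dailyhits) AS hits
RETURN [i IN range(0, size(names) - 1)
        WHERE reduce(total = 0, j IN range(0, i) | total + hits[j]) <= 100000
        | names[i]] AS sitesWithinBudget;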

How to paginate query results with cypher?

Is it possible to have a Cypher query paginated? For instance, a list of products, where I don't want to display/retrieve/cache all the results, as there can be a lot of them.
I'm looking for something similar to OFFSET / LIMIT in SQL.
Is Cypher SKIP + LIMIT + ORDER BY a good option? http://docs.neo4j.org/chunked/stable/query-skip.html
SKIP and LIMIT combined is indeed the way to go. Using ORDER BY inevitably makes Cypher scan every node that is relevant to your query; the same goes for using a WHERE clause. Performance should not be that bad, though.
It's like normal SQL; the syntax is as follows:
match (user:USER_PROFILE)-[:USAGE]->(uUsage)
where HAS(uUsage.impressionsPerHour) AND (uUsage.impressionsPerHour > 100)
return user
ORDER BY user.hashID
SKIP 10
LIMIT 10;
This syntax suits the latest versions (2.x).
Neo4j nowadays uses "index-backed ORDER BY", which means that if you use an alphabetical ORDER BY on indexed node properties within your SKIP/LIMIT query, Neo4j will not perform a full scan of all "relevant nodes" as others have mentioned (their responses are from long ago, so keep that in mind). The index allows Neo4j to optimize on the basis that it already stores indexed properties in ORDER BY (alphabetical) order, so your pagination will be even faster than without the index.
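A minimal sketch of index-backed pagination (the :Product(name) index and property are illustrative):
CREATE INDEX ON :Product(name);
MATCH (p:Product)
RETURN p.name
ORDER BY p.name
SKIP 100 LIMIT 20;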
