Why is the Neo4j index not used with ORDER BY?

Why is Neo4j's ORDER BY so slow for a large database? :(
Here is the example query:
PROFILE MATCH (n:Item) RETURN n ORDER BY n.name DESC LIMIT 25
The profile shows that it reads all records, even though I already have an index on the name property.
It reads all nodes, which is a real mess for a large number of records.
Is there any solution for this, or is Neo4j just not a good choice for us? :(
Also, is there any way to get the last record from the nodes?

Your question and problem are not very clear.
1) Are you sure that you added the index correctly?
CREATE INDEX ON :Item(name)
In the Neo4j browser execute :schema to see all your indexes.
2) How many Items does your database hold and what running time are you expecting and achieving?
3) What do you mean by 'last record from nodes'?

Indexes are currently only used to find entry points into the graph, not for other purposes such as ordering results.
Index-backed ORDER BY operations have been a highly requested feature for a while, and while we've been tracking and prioritizing this work, several other features took priority over it.
I believe index-backed ORDER BY operations are currently scheduled for our 3.5 release, coming in the last few months of 2018.
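To make the distinction concrete, here is a rough sketch of both cases (the 'some-item' value is just illustrative, and the operator names are approximate and depend on your version):
// the index provides an entry point here, so the plan can use an index seek:
PROFILE MATCH (n:Item) WHERE n.name = 'some-item' RETURN n
// ...but ORDER BY cannot use the index yet, so this scans all :Item nodes and sorts them:
PROFILE MATCH (n:Item) RETURN n ORDER BY n.name DESC LIMIT 25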

Related

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution: simply put, analyzing on which screens users leave the application.
For this, I have modeled each activity as its own node, since I want to index some attributes and relationship properties cannot be indexed in Neo4j.
With this model, I am trying to follow three successive event types with the query below:
MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})
WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate
MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))
WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000
The execution plan (not reproduced here) shows far more work than I expected, and I could not figure out the reason for it. Can you help me?
Your query is doing a lot more work than it needs to.
The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.
In addition, you could first add an index for :EventType(eventName) to speed up the processing:
CREATE INDEX ON :EventType(eventName);
With these changes, your query's profile could be simpler and the processing would be faster.
Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:
MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Try profiling them both on your DB, and pick the best one or keep tweaking further.

Can "DISTINCT" in a CYPHER query be responsible of a memory error when the query returns no result?

Working on a pretty small graph of 5000 nodes with low density (mean connectivity < 5), I get the following error, which I never saw before upgrading to Neo4j 3.3.0. The graph contains 900 molecules and their scaffold hierarchy, down to 5 levels.
(:Molecule)<-[:substructureOf*1..5]-(:Scaffold)
Neo.TransientError.General.StackOverFlowError
There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag.
The query is actually very simple; I use DISTINCT because several paths may lead to a single scaffold.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20
This query returns the above error message whereas the next query does work.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return s limit 20
Interestingly, the query works on a much larger database; in this small one, the deepest hierarchy happened to be 2, so the result of the last query is "(no changes, no records)".
How come adding DISTINCT to the query fails with that memory error? Is there a way to avoid it? I cannot guess the depth of the hierarchy, which can be different for each molecule.
I tried the following heap settings, as suggested in other posts:
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m

dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4096m

dbms.memory.heap.initial_size=4096m
dbms.memory.heap.max_size=4096m
None of these addressed the issue.
Thanks in advance for any help or clues.
Thanks for the additional info. I was able to replicate this on Neo4j 3.3.0 and 3.3.1. It likely has to do with the behavior of the pruning-var-expand operation (introduced in 3.2.x to help when variable-length expansions are combined with distinct results), and it occurs only when using an exact number of expansions (not a range). Neo4j engineering will be looking into this.
In the meantime, your requirement is such that a different kind of query can get the results you want while avoiding this operation. Give this one a try:
match (s:Scaffold)
where (s)-[:substructureOf*3]->(:Molecule)
return distinct s limit 20
And if you do need to perform queries that may produce this error, you may be able to circumvent them by prepending your query with CYPHER 3.1, which executes it with a plan produced by an older version of Cypher that doesn't use the pruning-var-expand operation.
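For example, the original query with the planner-version prefix would look like this:
CYPHER 3.1
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20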

Neo4j performance - counting nodes - linked list performance - alternatives?

UPDATED: Wes hit a home run here! Thanks. I've added a Rails version I was developing using the Neography gem. It accomplishes the same thing, but his version is much faster; see the comparison below.
I am using a linked list in Neo4j (1.9, REST, Cypher) to help keep the comments in proper order (Yes I know I can sort on the time etc).
(object node)---[:comment]--->(comment)--->(comment)--->(comment).... etc
Currently I have 900 comments, and it's taking 7 seconds to get through the whole list, which is completely unacceptable. I'm just returning the ID of the node (I know, don't do this, but it's not the point of my post).
What I'm trying to do is find the IDs of users who commented so I can return a count (like "Joe and 405 others commented on your post"). I'm not even counting the unique nodes at this point; I'm just returning the author_id for each record. (I'll worry about counting later; first I want to take care of the basic performance issue.)
start object=node(15837) match object-[:COMMENTS*]->comments return comments.author_id
7 seconds is waaaay too long..
Instead of using a linked list, I could simply link all the comments directly to the object node, but that could lead to a bogged-down supernode, and then finding the most recent comments, even with skip and limit, would be dog slow.
Will relationship indexes help here? I've never used them other than to ensure a unique relationship, or to see if a relationship exists, but can I use them in a cypher query to help speed things up?
If not, what else can I do to decrease the time it takes to return the IDs?
COMPARISON: Here is the Rails version using "phase II" methods of the Neography gem:
next_node_id = 18233
@neo = Neography::Rest.new
start_node = Neography::Node.load(next_node_id, @neo)
all_nodes = start_node.outgoing(:COMMENTS).depth(10000)  # walk the :COMMENTS list
raise all_nodes.size.to_s  # raise used here just as a quick way to surface the count
Result: 526 nodes found in 290 ms.
Wes' solution took 5 ms. :-)
Relationship indexes will not help. I'd suggest using an unmanaged extension and the traversal API--it will be a lot faster than Cypher for this particular query on long lists. This example should get you close:
https://github.com/wfreeman/linkedlistlength
I based it on Mark Needham's example here:
http://www.markhneedham.com/blog/2014/07/20/neo4j-2-1-2-finding-where-i-am-in-a-linked-list/
If you're only doing this to return a count, the best solution is not to compute it on every query, since it isn't changing that often. Cache the result in a total_comments property on the node, and update that count every time a comment relationship is added or removed. If you want to know whether any of the current user's friends commented on it, so you can say "Joe and 700 others commented on this," you could do a second query:
start joe=node(15830), object=node(15838) match joe-[:FRIENDS]->friend-[:POSTED_COMMENT]->comment<-[:COMMENTS]-object RETURN friend LIMIT 1
You limit it to 1 since you only need the name of one friend who commented. If it returns someone, reduce the displayed count by 1 and include that friend's name. You could do that with JS so it doesn't delay your page load. Sorry if my Cypher is a little off; I'm not used to pre-2.0 syntax.
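For the cached-count approach above, a minimal sketch of the bookkeeping statement in 1.9-era Cypher (the node id is the one from the question; the total_comments property name is illustrative), to be run whenever a comment is added:
start object=node(15837)
set object.total_comments = coalesce(object.total_comments?, 0) + 1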

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes with multiple source IDs in one array-valued property called "sourceIds", because a node could be derived from multiple resources (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH (n:<label>) WHERE "testid" IN n.sourceIds RETURN n
the query takes between 300 ms and 500 ms, which is too long for an index lookup (other schema index lookups run three to five times faster). With
MATCH (n:<label>) WHERE n.sourceIds = "testid" RETURN n
I don't get a result. That's expected, since it's an array property, but I gave it a try because it would make sense if array properties were broken down into their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing, or are there plans for it, or will I just have to stick with legacy indexing here? My problem with the legacy Lucene index was that I hit the maximum number of boolean clauses (1024). A follow-up question, then: can I raise this number? Lucene allows it, but can I do so with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean-clauses limit: I need to export specific parts of the database into custom file formats for text-processing pipelines. These pipelines use components I cannot change (be it for reasons of accessibility or time) to query Neo4j directly, so I'd rather stick with the defined file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why use batches at all? If I don't, things slow down significantly because of connection overhead, so large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
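To illustrate the distinction, a sketch using the question's placeholder label (the scalar sourceId property is hypothetical):
// an exact-match predicate on an indexed property can use the schema index:
MATCH (n:<label>) WHERE n.sourceId = "testid" RETURN n
// a membership test over an array property cannot, so it falls back to a label scan plus filter:
MATCH (n:<label>) WHERE "testid" IN n.sourceIds RETURN n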
I've actually hit a lower maximum in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is just to run more than one query in the rare cases that actually go over 512 IDs. What query are you running where you need more?
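A minimal sketch of that batching workaround (the legacy index name sourceIds and the parameter are illustrative): split the full ID list into chunks of at most 512 on the client, then run the same legacy-index lookup once per chunk:
// {luceneQuery} is built per chunk, e.g. "sourceId:(id1 id2 id3)"
START n=node:sourceIds({luceneQuery})
RETURN n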

How to improve performance when fetching 3rd order connections in neo4j database?

I'm using a Neo4j database to track connections between people. I need to track 3rd-order connections (similar to what LinkedIn does), but I've faced some performance issues. In my test database I have approximately 3,000 users, each with 3 to 8 first-order connections (contacts). When fetching 2nd-order connections, performance seems fine, but fetching 3rd-order connections takes a long time. I use Cypher queries to fetch the data. Only profile IDs and the connections between them are stored in the database.
Here is the query itself:
THIRD_ORDER_CONNECTIONS = <<-CYPHER
START n=node:profile(id='%{id}')
MATCH n-[:contacts]-common_contact_1-[:contacts]-common_contact_2-[:contacts]-profile
WHERE common_contact.id <> %{exclude_id} AND common_contact_1.id <> common_contact_2.id
RETURN COLLECT(DISTINCT profile.id)
CYPHER
It takes 48 seconds on my local machine. So the question is: how can I improve the performance, or change the query, to get 3rd-order connections in a reasonable time?
A few observations and suggestions (a corrected sketch follows the note below):
1) Your query is not valid: common_contact.id does not resolve to an identifier.
2) How many results do you get back?
3) How does the query time change if you add a direction --> to your query?
4) Please use parameters instead of Ruby substitution.
5) Try RETURN profile.id (DISTINCT needs to keep everything in memory for the unique filtering).
6) Normally Cypher takes care of uniqueness, so common_contact_1.id <> common_contact_2.id might be unnecessary.
Have you tried Neo4j version 1.9.M01? It has Cypher performance improvements for straightforward patterns like this, off-loading more work to the traversal framework, which could make a huge difference.
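Putting those suggestions together, a corrected sketch (parameter names are illustrative):
START n=node:profile(id = {profileId})
MATCH n-[:contacts]-common_contact_1-[:contacts]-common_contact_2-[:contacts]-profile
WHERE common_contact_1.id <> {excludeId}
RETURN profile.id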
