Simple Cypher Query for apoc dijkstra taking FOREVER - neo4j

Maybe I am very stupid or Neo4j is not supposed to be fast. (Disclaimer: I am a Neo4j noob)
I have the following simple dijkstra query which is taking forever to run. I have to atleast wait for 5-10 minutes for it to execute.Sometimes my Chrome browser crashes because of it.
Sample Graph
Cypther Query
profile MATCH (startNode:Stop)--(st:Stoptime),
(endNode:Stop)--(et:Stoptime)
where endNode.name = 'Hauptbahnhof Süd' and
(startNode.name = 'Schlump' or startNode.name = 'U Schlump')
call apoc.algo.dijkstra(st, et, 'PRECEDES', 'weight') YIELD path, weight
return startNode, endNode, path, weight
limit 100;
Computer Config
I am using a Ubuntu VM on windows machine which has 24GB Ram and 6 Cpus.
Indexes
Sysinfo
When I run profile on the above Query, i get the following information:
Profile Information
For the love of God, I cant figure out, where the bottleneck lies. I have checked all other answers on this, but to no avail.

Since I don't have the data set to test out my suggestion with, I can only point you in the direction that I would look. Hopefully, it leads you to the answer.
In looking at the profile and query I see that startNode and endNode are both type :Stop and that the Stop.name property is indexed.
When looking for endNode.name = 'Hauptbahnhof Süd' there are 3 estimated rows and 3 rows are returned.
However when looking for (startNode.name = 'Schlump' or startNode.name = 'U Schlump') there are 6 estimated rows, but 14827 returned.
Are there indeed 14827 :Stop nodes that contain either 'Schlump' or 'U Schlump'?
Or is it the 6 estimated rows? If the latter is the case can you run the query without the OR:
where endNode.name = 'Hauptbahnhof Süd' and startNode.name = 'Schlump'
to see what the profiler comes up with.
If that performs as expected then the solution may be to rewrite the query to include that OR logic in a different format?
Perhaps
where endNode.name = 'Hauptbahnhof Süd' and startNode.name IN ['Schlump','U Schlump']
Also found this older answer indicating an issue with the OR operator and indexes prior to 3.2.
I had remembered seeing another recent answer about some issue with OR, but can't seem to locate it now.
Good luck!

Related

Neo4j removing direction to relationship significantly takes more time

I have 2 queries which take significantly different amount of time. What dont I understand in relationships in Cypher to have this issue
less than 1 second with 5 results:
allshortestpaths((et)-[*]->(st))
less than 1 second with 3 results:
allshortestpaths((et)<-[*]-(st))
Takes forever (timeout):
allshortestpaths((et)-[*]-(st))
Why does this take forever. I would assume that this just has to return 8 results!!
Complete sample query:
profile match (s:Stop)--(st:Stoptime),
(e:Stop)--(et:Stoptime)
where s.name IN [ 'Schlump', 'U Schlump']
and e.name IN [ 'Hamburg Hbf', 'Hauptbahnhof Nord']
match p = allshortestpaths((et)-[*]->(st))
return p
With allshortestpaths((et)-[*]->(st)) you will have a paths that look like that : (a)-->(b)-->(c)-->(d)
With allshortestpaths((et)<-[*]-(st)) you will have a path that look like that : (a)<--(b)<--(c)<--(d)
With allshortestpaths((et)<-[*]-(st)) you will have a paths that look like that :
(a)-->(b)-->(c)-->(d)
(a)<--(b)<--(c)<--(d)
(a)-->(b)<--(c)-->(d)
(a)<--(b)<--(c)-->(d)
(a)-->(b)-->(c)<--(d)
(a)-->(b)<--(c)--<(d)
Do you the differences ? The last one is a more complexe query than the previous ones, that's why it can take a long time, specially when you not specify the relationship type and the max depth of the path.
So here it'spossible that you are asking to Neo4j to traverse all your database ...

neo4j cypher - Differing query plan behavior

Nodes with the Location node label have an index on Label.name
Profiling the following query gives me a smart plan, with a NodeHashJoin between the two sides of the graph on either side of Trip nodes. Very clever. Works great.
PROFILE MATCH (rosen:Location)<-[:OCCURS_AT]-(ev:Event)<-[:HAS]-(trip:Trip)-[:OPERATES_ON]->(date:Date)
WHERE rosen.name STARTS WITH "U Rosent" AND
ev.scheduled_departure_time > "07:45:00" AND
date.date = '2015-11-20'
RETURN rosen.name, ev.scheduled_departure_time, trip.headsign
ORDER BY ev.scheduled_departure_time
LIMIT 20;
However, just changing one line of the query from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE id(rosen) = 4752371 AND
seems to alter the entire behavior of the query plan, which now appears to become more "sequential", losing the parallel execution of (Trip)-[:OPERATES_ON]->(Date)
Much slower. 6x more DB hits in total.
Question
Why does changing the retrieval of one, seemingly-unrelated Location node via a different index/mechanism alter the behavior of the whole query?
(I'm not sure how best to convey more information about the graph model, but please advise, and I'd be happy to add details that are missing)
Edit:
It gets better. Changing that query line from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE rosen.name = "U Rosenthaler Platz." AND
results in the same loss of parallelism in the query plan!
Seems odd that a LIKE query is faster than an = ?

Low performance of neo4j

I am server engineer in company that provide dating service.
Currently I am building a PoC for our new recommendation engine.
I try to use neo4j. But performance of this database does not meet our needs.
I have strong feeling that I am doing something wrong and neo4j can do much better.
So can someone give me an advice how to improve performance of my Cypher’s query or how to tune neo4j in right way?
I am using neo4j-enterprise-2.3.1 which is running on c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with others users - LIKE, DISLIKE, BLOCK and MATCH.
Also he has a properties like countryCode, birthday and gender.
I made import of all our users and relationships from RDBMS to neo4j using neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships..
I made 3 indexes in neo4j :
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I try to build online recommendation engine using this query :
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
here is the execution plan for one of the user :
plan
When I executed this query for list of users I had the result :
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for Real-time recommendations..
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what it gotten via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

How to find nodes being contained in a node's properties interval?

I'm currently developing some kind of a configurator using neo4j as a backend. Now I ran into a problem, I don't know how to solve best.
I've got nodes created like this:
(A:Product {name:'ProductA', minWidth:20, maxWidth:200, minHeight:10, maxHeight:400})
(B:Product {name:'ProductB', minWidth:40, maxWidth:100, minHeight:20, maxHeight:300})
...
There is an interface where the user can input a desired width & height, f.e. Width=30, Height=250. Now I'd like to check which products match the input criteria. As the input might be any long value, the approach used in http://neo4j.com/blog/modeling-a-multilevel-index-in-neoj4/ with dates doesn't seem to be suitable for me. How can I run a cypher query giving me all the nodes matching the input criteria?
I don't know if I understand well what you are asking for, but if I do, here a simple query to get this:
Assuming the user wants width = 30 and height = 50
Match (p:Product)
WHERE
p.minWidth < 30 AND p.maxWidth > 30 AND
p.minHeight < 50 AND p.maxHeight > 50
RETURN
p
If this is not what you are looking for, feel free to say it as comment.

Neo4j doesn't look for multiple starting points if one of them is null

In the following scenario, node "x" does not exist.
start x=node:node_auto_index(key="x"), y=node(*)
return count(x), count(y)
It seems that if any of the starting points can't be found, nothing is returned.
Any suggestions how to work around this issue?
This is like saying the below (in SQL)--what do you expect will happen if table X is empty?
select count(x), count(y)
from x, y
I'm not sure exactly what you're trying to query here, but you might need to get your counts one at a time, if there's a chance that x will come back with no results:
start x=node:node_auto_index(key="x")
with count(x) as cntx
start y=node(*)
return cntx, count(y) as cnty
Thanks to Wes, I figured out how to do a "conditional add" with the old Cypher syntax (pre 2.0):
START x=node:node_auto_index(key="x")
with count(x) as exists
start y=node:node_auto_index(key="y")
where exists = 0
create (n {key:"y"})<-[:rel]-y
return n, y
The crux here is that you can't fire another "start" after a "where" clause. You need to query for the second node before checking for conditions (which is kind of bad for performance). This is remedied in 2.0 with if-then-else statements anyway...

Resources