neo4j cypher - Differing query plan behavior - neo4j

Nodes with the Location label have an index on Location.name.
Profiling the following query gives me a smart plan, with a NodeHashJoin between the two sides of the graph on either side of Trip nodes. Very clever. Works great.
PROFILE MATCH (rosen:Location)<-[:OCCURS_AT]-(ev:Event)<-[:HAS]-(trip:Trip)-[:OPERATES_ON]->(date:Date)
WHERE rosen.name STARTS WITH "U Rosent" AND
ev.scheduled_departure_time > "07:45:00" AND
date.date = '2015-11-20'
RETURN rosen.name, ev.scheduled_departure_time, trip.headsign
ORDER BY ev.scheduled_departure_time
LIMIT 20;
However, just changing one line of the query from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE id(rosen) = 4752371 AND
seems to alter the behavior of the entire query plan, which now becomes more "sequential" and loses the parallel execution of (Trip)-[:OPERATES_ON]->(Date).
It is much slower: 6x more DB hits in total.
Question
Why does changing the retrieval of one, seemingly-unrelated Location node via a different index/mechanism alter the behavior of the whole query?
(I'm not sure how best to convey more information about the graph model, but please advise, and I'd be happy to add details that are missing)
Edit:
It gets better. Changing that query line from:
WHERE rosen.name STARTS WITH "U Rosent" AND
to
WHERE rosen.name = "U Rosenthaler Platz." AND
results in the same loss of parallelism in the query plan!
It seems odd that a LIKE-style prefix query (STARTS WITH) is faster than an equality match (=).

Related

Simple Cypher Query for apoc dijkstra taking FOREVER

Maybe I am very stupid or Neo4j is not supposed to be fast. (Disclaimer: I am a Neo4j noob)
I have the following simple dijkstra query which is taking forever to run. I have to wait at least 5-10 minutes for it to execute. Sometimes my Chrome browser crashes because of it.
Sample Graph
Cypher Query
profile MATCH (startNode:Stop)--(st:Stoptime),
(endNode:Stop)--(et:Stoptime)
where endNode.name = 'Hauptbahnhof Süd' and
(startNode.name = 'Schlump' or startNode.name = 'U Schlump')
call apoc.algo.dijkstra(st, et, 'PRECEDES', 'weight') YIELD path, weight
return startNode, endNode, path, weight
limit 100;
Computer Config
I am using an Ubuntu VM on a Windows machine which has 24 GB RAM and 6 CPUs.
Indexes
Sysinfo
When I run PROFILE on the above query, I get the following information:
Profile Information
For the love of God, I can't figure out where the bottleneck lies. I have checked all the other answers on this, but to no avail.
Since I don't have the data set to test out my suggestion with, I can only point you in the direction that I would look. Hopefully, it leads you to the answer.
In looking at the profile and query I see that startNode and endNode are both type :Stop and that the Stop.name property is indexed.
When looking for endNode.name = 'Hauptbahnhof Süd' there are 3 estimated rows and 3 rows are returned.
However when looking for (startNode.name = 'Schlump' or startNode.name = 'U Schlump') there are 6 estimated rows, but 14827 returned.
Are there indeed 14827 :Stop nodes that contain either 'Schlump' or 'U Schlump'?
Or is it the 6 estimated rows? If the latter is the case can you run the query without the OR:
where endNode.name = 'Hauptbahnhof Süd' and startNode.name = 'Schlump'
to see what the profiler comes up with.
If that performs as expected then the solution may be to rewrite the query to include that OR logic in a different format?
Perhaps
where endNode.name = 'Hauptbahnhof Süd' and startNode.name IN ['Schlump','U Schlump']
I also found an older answer indicating an issue with the OR operator and indexes prior to 3.2.
I had remembered seeing another recent answer about some issue with OR, but can't seem to locate it now.
Good luck!

ActiveRecord count analysis in rails query

I am trying to compare the execution speed of two queries, using EXPLAIN ANALYZE and benchmarks, because I got a timeout and I am not sure whether this query caused it.
queue_count = purchase.purchase_items.where("queue_id = ?", queue.id).count
The equivalent SQL query:
SELECT COUNT(*) FROM "purchase_items" WHERE "purchase_items"."purchase_id" = 1241422 AND (queue_id = 3479783)
So I wanted to avoid the COUNT query. One solution I found is to load all the records into an array and count them there, which gives a query like this:
queue_count = purchase.purchase_items.where("queue_id = ?", queue.id).all.count
The equivalent SQL query:
SELECT "purchase_items".* FROM "purchase_items" WHERE "purchase_items"."purchase_id" = 1241422 AND (queue_id = 3479783)
I saw only a slight difference between the two when checking with EXPLAIN ANALYZE and benchmarks. Is this the correct approach, or am I doing something wrong?
In terms of performance, the second query will be quite terrible. It loads all the records into memory and counts them in Ruby. The database is designed to do this kind of thing quickly.
To analyze the query, you can run EXPLAIN ANALYZE in the psql console. My guess is that you're missing some indexes (on purchase_id and queue_id). You can check this by running:
EXPLAIN ANALYZE SELECT COUNT(*) FROM purchase_items WHERE purchase_id = 1241422 AND (queue_id = 3479783)
If you see that PostgreSQL is scanning the whole table, then performance will not be optimal. Try adding indexes:
CREATE INDEX purchase_id_purchase_items_idx ON purchase_items (purchase_id);
CREATE INDEX queue_id_purchase_items_idx ON purchase_items (queue_id);
and then examine performance again using EXPLAIN ANALYZE. But never load all the records into Ruby just to do a simple .count on them.
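To make the two points above concrete, here is a minimal sketch in Python using an in-memory SQLite database as a stand-in for PostgreSQL. The table and column names come from the question; the row data, the chosen ids and the helper name query_plan are made up for illustration, and SQLite's EXPLAIN QUERY PLAN output is not the same as PostgreSQL's EXPLAIN ANALYZE. It only illustrates the idea: count in the database rather than in application code, and check whether the plan changes from a full scan to an index lookup once the indexes exist.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here; table and column names
# follow the question, the row data is made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchase_items "
    "(id INTEGER PRIMARY KEY, purchase_id INTEGER, queue_id INTEGER)"
)
conn.executemany(
    "INSERT INTO purchase_items (purchase_id, queue_id) VALUES (?, ?)",
    [(i % 50, i % 7) for i in range(5000)],
)

COUNT_SQL = "SELECT COUNT(*) FROM purchase_items WHERE purchase_id = ? AND queue_id = ?"
ROWS_SQL = "SELECT * FROM purchase_items WHERE purchase_id = ? AND queue_id = ?"
PARAMS = (3, 3)  # hypothetical ids


def query_plan(sql, params):
    """Return the human-readable detail lines of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


# Counting in the database: only a single integer is returned.
db_count = conn.execute(COUNT_SQL, PARAMS).fetchone()[0]

# Counting in application code: every matching row is materialised first.
loaded_rows = conn.execute(ROWS_SQL, PARAMS).fetchall()
app_count = len(loaded_rows)

plan_before = query_plan(COUNT_SQL, PARAMS)

conn.execute("CREATE INDEX purchase_id_purchase_items_idx ON purchase_items (purchase_id)")
conn.execute("CREATE INDEX queue_id_purchase_items_idx ON purchase_items (queue_id)")

plan_after = query_plan(COUNT_SQL, PARAMS)

print(db_count, app_count)
print(plan_before)
print(plan_after)
```

Both counts agree, but the second one had to pull every matching row into the application first. On a large table that difference, together with a missing index, is where I would look for the timeout.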

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
I am currently building a PoC for our new recommendation engine.
I am trying to use neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that neo4j can do much better.
Can someone advise me on how to improve the performance of my Cypher query or how to tune neo4j properly?
I am using neo4j-enterprise-2.3.1, running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users: LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties such as countryCode, birthday and gender.
I imported all our users and relationships from an RDBMS into neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships.
I created 3 indexes in neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
plan
When I executed this query for a list of users, I got this result:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest run is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths first and then filtering what was found via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

How to find nodes being contained in a node's properties interval?

I'm currently developing a kind of configurator using neo4j as a backend. I have now run into a problem that I don't know how best to solve.
I've got nodes created like this:
(A:Product {name:'ProductA', minWidth:20, maxWidth:200, minHeight:10, maxHeight:400})
(B:Product {name:'ProductB', minWidth:40, maxWidth:100, minHeight:20, maxHeight:300})
...
There is an interface where the user can input a desired width and height, e.g. Width=30, Height=250. Now I'd like to check which products match the input criteria. As the input might be any long value, the approach used in http://neo4j.com/blog/modeling-a-multilevel-index-in-neoj4/ with dates doesn't seem to be suitable for me. How can I run a Cypher query giving me all the nodes matching the input criteria?
I don't know if I have understood what you are asking for, but if I have, here is a simple query to do it.
Assuming the user wants width = 30 and height = 50:
Match (p:Product)
WHERE
p.minWidth < 30 AND p.maxWidth > 30 AND
p.minHeight < 50 AND p.maxHeight > 50
RETURN
p
If this is not what you are looking for, feel free to say so in a comment.
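To make the boundary behaviour of that range check explicit, here is the same test written as a plain Python predicate over the two products from the question. The Product class, the function name matching_products and the inclusive flag are illustrative names and not part of any Neo4j API. With strict comparisons, as in the query above, an input that sits exactly on a product's minimum or maximum is rejected; if the bounds are meant to be allowed values, the comparisons would need to be <= and >= instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    min_width: int
    max_width: int
    min_height: int
    max_height: int


# The two products from the question.
PRODUCTS = [
    Product("ProductA", 20, 200, 10, 400),
    Product("ProductB", 40, 100, 20, 300),
]


def matching_products(products, width, height, inclusive=True):
    """Return names of products whose width and height intervals contain the input.

    inclusive=True treats the interval ends as allowed values (<=);
    inclusive=False mirrors the strict < and > comparisons in the query.
    """
    def contains(low, value, high):
        return low <= value <= high if inclusive else low < value < high

    return [
        p.name
        for p in products
        if contains(p.min_width, width, p.max_width)
        and contains(p.min_height, height, p.max_height)
    ]


print(matching_products(PRODUCTS, 30, 250))                   # the question's example input
print(matching_products(PRODUCTS, 20, 250))                   # width on ProductA's lower bound
print(matching_products(PRODUCTS, 20, 250, inclusive=False))  # strict comparison rejects it
```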

Can I add where clauses after putting limit on a scoped query?

I have a model called Game in which I build up a scoped query.
Something like:
games = Game.scoped
games = games.team(team_name) if team_name
games = games.opponent(opponent_name) if opponent_name
total_games = games
I then calculate several subsets like:
wins = games.where("team_score > opponent_score").count
losses = games.where("opponent_score > team_score").count
Everything is great. Then I decided that I want to limit the original scope to show the last X number of games.
total_games = games.limit(10)
If there are 100 games that match what I want for total_games, and then I add .limit(10) - it gets the last 10. Great. But now calling
total_games.where("team_score > opponent_score").count
will reach back beyond the last 10, and into results that aren't part of total_games. Since adding .limit(10), I'll always get 10 total games, but also 10 wins, and 10 losses.
After typing this all out, I've realized that the cases where I want to use limit are for showing a smaller set of results - so I'll probably end up just looping through the results to calculate things like wins and losses (instead of doing separate queries as in my subsets above).
I tried this out when total_games had hundreds or thousands of results, and it's significantly slower to loop through than it is to just do separate queries for the subsets.
So, now I must know: what is the best way to limit a scoped query, and then run further queries that restrict themselves to the results returned by the original .limit(x)?
I don't think you can do what you want to do without separating your query into two steps, first getting 10 games from total_games and making the DB query with all:
last_10_games = total_games.limit(10).all
then selecting from the resulting array and getting the size of the result:
wins = last_10_games.select { |g| g.team_score > g.opponent_score }.count
losses = last_10_games.select { |g| g.opponent_score > g.team_score }.count
I know this is not exactly what you asked for, but I think it's probably the most straightforward solution to the problem.
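If looping in application code turns out to be too slow, another option is to push the "limit first, then filter" order into SQL with a subquery. The sketch below uses Python with an in-memory SQLite database to show the difference at the SQL level; the games table, its made-up rows and the assumption that a higher id means a more recent game are all illustrative, and how you would express the subquery through ActiveRecord scopes is left open here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games "
    "(id INTEGER PRIMARY KEY, team_score INTEGER, opponent_score INTEGER)"
)
# 100 made-up games with ids 1..100; a higher id is assumed to be more recent.
# Even ids are wins (3-1), odd ids are losses (0-2).
conn.executemany(
    "INSERT INTO games (id, team_score, opponent_score) VALUES (?, ?, ?)",
    [(i, 3, 1) if i % 2 == 0 else (i, 0, 2) for i in range(1, 101)],
)

# WHERE is applied before LIMIT, so this searches all 100 games and keeps
# up to 10 wins: roughly the behaviour the question describes.
wins_reaching_back = conn.execute(
    "SELECT COUNT(*) FROM ("
    "  SELECT id FROM games WHERE team_score > opponent_score"
    "  ORDER BY id DESC LIMIT 10)"
).fetchone()[0]

# Limiting first in a subquery and filtering afterwards keeps the counts
# inside the last 10 games.
LAST_10 = "SELECT * FROM games ORDER BY id DESC LIMIT 10"
wins_in_last_10 = conn.execute(
    f"SELECT COUNT(*) FROM ({LAST_10}) WHERE team_score > opponent_score"
).fetchone()[0]
losses_in_last_10 = conn.execute(
    f"SELECT COUNT(*) FROM ({LAST_10}) WHERE opponent_score > team_score"
).fetchone()[0]

print(wins_reaching_back, wins_in_last_10, losses_in_last_10)
```

With the subquery form, wins and losses always add up to at most the number of games in the limited set, which is the property the question is after.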
