I'm using TheMovieDB database downloaded here. It has ~60k nodes and ~100k relationships, and I need to find all the paths of a given length k between two nodes a and b with a given name property.
Let's say I need to find all the paths of length 2 between Keanu Reeves and Laurence Fishburne. I used the following Cypher query:
MATCH (k)-[e*2..2]-(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l
and it took 40 seconds.
I decided to try a different approach and used the following query instead:
MATCH (k)--(m)--(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,m,l
and it took 252 milliseconds!
Those two queries gave the same results and had the same meaning, yet the first one took almost 200x longer. How is that possible?
I need to run some tests in which I have to find all the paths with a given maximum (but not a minimum) length between two given nodes. This is a problem because I cannot use the second approach I described (it works only with a fixed-length path), and the first one is far too slow.
I also cannot use allShortestPath because it doesn't return any path whose length is greater than the shorter one.
It's driving me crazy...
Any idea how to solve it?
Edit
Another example of how big this issue is: finding a path of length 4 between Robert Downey Jr. and Harrison Ford.
Method #2: ~500 milliseconds
Method #1: >360 seconds (after those 6 minutes I brutally unplugged the pc power adaptor)
The reason your first query is taking so long is because it is not using any indexes at all; you are scanning the entire database.
If you change your query slightly to include the Actor label in the path you are matching you will significantly improve the query performance.
MATCH (k:Actor)-[e*2..2]-(l:Actor)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l
If you list the indexes by executing the :schema command in the browser, you will see the indexes that are in place. The first one is on :Actor(name); within the Actor label, the name property is indexed.
Indexes
ON :Actor(name) ONLINE
ON :Director(name) ONLINE
ON :Movie(title) ONLINE
ON :Person(name) ONLINE
ON :User(login) ONLINE (for uniqueness constraint)
Constraints
ON (user:User) ASSERT user.login IS UNIQUE
If you profile your query
profile
MATCH (k)-[e*2..2]-(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l
and then profile the one with the :Actor label added it will be abundantly clear why the two perform differently.
profile
MATCH (k:Actor)-[e*2..2]-(l:Actor)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,e,l
I forgot to add that you should also profile your second (faster) query:
profile
MATCH (k)--(m)--(l)
WHERE k.name = "Keanu Reeves" AND l.name = "Laurence Fishburne"
RETURN k,m,l
You will see that the query plans are significantly different. I think simply adding an asterisk to a relationship probably sends the database engine down a different optimization path.
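For the maximum-length case raised in the question, the same label hint can be combined with an upper bound only. This is a minimal sketch, assuming both people carry the :Actor label (so the :Actor(name) index listed above can be used) and taking 4 as a purely illustrative maximum length:

```cypher
// Assumption: both endpoints are :Actor nodes, so the name lookup can use the index.
// [*..4] matches paths of 1 to 4 relationships; the 4 is an illustrative cap.
MATCH p = (k:Actor {name: "Keanu Reeves"})-[*..4]-(l:Actor {name: "Laurence Fishburne"})
RETURN p
```

Prefixing this with PROFILE, as with the queries above, shows whether the planner really starts from index lookups on both ends.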
Good luck!
Related
I find it hard to explain, so consider the following picture
I'm trying to select all products that fulfill the warehouse requirements
In this example I need to select all products that have a maximum size of 5 AND maximum weight of 10.
To simplify, I only have MAX (no MIN or EQ) constraints, so the operator can be hardcoded.
I've tried to group the requirement subgraph using COLLECT and using the ALL operator, but failed.
Query to create the graph
CREATE
// NODES
(warehouse:WAREHOUSE{name:'My Warehouse'}),
(smallProduct:PRODUCT{name:'Small Product'}),
(largeProduct:PRODUCT{name:'Large Product'}),
(size:CONSTRAINT{name:'Size'}),
(weight:CONSTRAINT{name:'Weight'}),
// RELATIONSHIPS
(warehouse)-[:LIMIT{value:5}]->(size),
(warehouse)-[:LIMIT{value:5}]->(weight),
(smallProduct)-[:AMOUNT{value:3}]->(size),
(smallProduct)-[:AMOUNT{value:2}]->(weight),
(largeProduct)-[:AMOUNT{value:10}]->(size),
(largeProduct)-[:AMOUNT{value:4}]->(weight)
UPDATE
The following query apparently solves the problem:
MATCH (warehouse:WAREHOUSE)
MATCH rel = ((warehouse)-[limit:LIMIT]->(constraint:CONSTRAINT)<-[amount:AMOUNT]-(product:PRODUCT))
WITH warehouse, product, collect(relationships(rel)) as paths
WHERE all( p in paths WHERE p[0].value > p[1].value )
return product
I am wondering if there is a better solution.
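One possible simplification, offered as a sketch rather than a definitive answer: the path variable and the relationships() call are not strictly needed, because the comparison can be collected directly as a boolean per constraint. It keeps the same `>` comparison as the query above and, like that query, only checks constraints that the warehouse and the product both have:

```cypher
MATCH (warehouse:WAREHOUSE)-[limit:LIMIT]->(:CONSTRAINT)<-[amount:AMOUNT]-(product:PRODUCT)
// One boolean per shared constraint: is the product's amount below the warehouse limit?
WITH warehouse, product, collect(limit.value > amount.value) AS checks
// Keep only products for which every collected check is true.
WHERE all(ok IN checks WHERE ok)
RETURN product
```

Grouping by both warehouse and product keeps the checks separate per warehouse if more than one warehouse exists.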
I have a simple query
MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m
and when executing the query "manually" (i.e. using the browser interface to follow edges) I only get a single node as a result as there are no further connections. Checking this with the query
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o
shows no results and
MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE) RETURN m
shows a single node so I have made no mistake doing the query manually.
However, the issue is that the first query takes ages to finish and I do not understand why.
Consequently: What is the reason such a trivial query takes so long even though the maximum result would be one?
Bonus: How to fix this issue?
As Tezra mentioned, the variable-length pattern match isn't in the same category as the other two queries you listed, because no restrictions are given on any of the nodes between n and m; they can be of any type. Given that your query is taking a long time, you likely have a fairly dense graph of :CONNECTION relationships between nodes of different types.
If you want to make sure all nodes in your path are of the same label, you need to add that yourself:
MATCH path = (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE)
WHERE all(node in nodes(path) WHERE node:TYPE)
RETURN m
Alternately you can use APOC Procedures, which has a fairly efficient means of finding connected nodes (and restricting nodes in the path by label):
MATCH (n:TYPE {id:123})
CALL apoc.path.subgraphNodes(n, {labelFilter:'TYPE', relationshipFilter:'<CONNECTION'}) YIELD node
RETURN node
SKIP 1 // to avoid returning `n`
The query MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:TYPE)<-[:CONNECTION]-(o:TYPE) RETURN m,o is not a fair test of MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN m, because it excludes the possibility of MATCH (n:TYPE {id:123})<-[:CONNECTION]-(m:ANYTHING_ELSE)<-[:CONNECTION]-(o:TYPE) RETURN m,o.
For your main query, you should be returning DISTINCT results: MATCH (n:TYPE {id:123})<-[:CONNECTION*]-(m:TYPE) RETURN DISTINCT m.
This is for 2 main reasons.
Without DISTINCT, each node is returned once for every possible path that leads to it.
Because of the previous point, that is a lot of extra work for no additional meaningful information.
If you use RETURN DISTINCT, it gives the cypher planner the choice to do a pruning search instead of an exhaustive search.
You can also limit the depth of the exhaustive search using ..# so that it doesn't kill your query if you run against a much older version of Neo4j where the Cypher planner hasn't learned pruning search yet. Example use: MATCH (n:TYPE {id:123})<-[:CONNECTION*..10]-(m:TYPE) RETURN m
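The two suggestions can be combined. A minimal sketch, reusing the depth cap of 10 from the example above (the right cap depends on your data):

```cypher
// DISTINCT lets the planner consider a pruning search where it is supported;
// the ..10 upper bound limits the expansion depth in any case.
MATCH (n:TYPE {id: 123})<-[:CONNECTION*..10]-(m:TYPE)
RETURN DISTINCT m
```

Note that this still allows intermediate nodes of any label; only the end node m is restricted to :TYPE.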
I'm trying to build a database that works along the lines of http://static.echonest.com/BoilTheFrog/ where you enter 2 artists and you get a list of tracks that seamlessly transitions between them.
I have built a Neo4j database with around 100k inter-related Artists. The artists and their relationships with each other were taken from the Spotify Web API.
Each artist has a popularity value (between 0 and 100).
I want to find a path that is at least 20 connections long between two artists where all the artists on the path have a minimum popularity.
Here's what I've done so far. It makes sense in my head, but it runs indefinitely and never finishes.
MATCH (start:Artist {name: 'Ed Sheeran'}), (end:Artist {name: 'The Strokes'})
MATCH path = shortestPath((start)-[:`HAS SIMILAR ARTIST`*..20]-(end))
WHERE ALL(x in nodes(path) WHERE x.popularity > 20)
AND LENGTH(path) = 20
RETURN path
LIMIT 1
My guess is that MATCH path =.. is finding the same path every time and it is then applying the WHERE filter, so it never succeeds.
I have seen approaches that filter based on the relationship itself, but the properties I want to filter are on the nodes themselves.
If I instead use
MATCH (start:Artist {name: 'Ed Sheeran'}), (end:Artist {name: 'The Strokes'})
MATCH path = shortestPath((start)-[:`HAS SIMILAR ARTIST`*..20]-(end))
WHERE LENGTH(path) = 20
RETURN path
LIMIT 1
It succeeds, but some of the connections are extremely obscure, so I was hoping to strengthen the relationships with the popularity requirement.
Since you only want paths of exactly length 20, you should specify 20 as the lower bound (as well as the upper bound) for the variable-length path pattern. That should eliminate (or greatly reduce the number of) repeated traversals of shorter paths. Note that shortestPath() only accepts a minimum length of 0 or 1, so the pattern has to be matched without it:
MATCH (start:Artist {name: 'Ed Sheeran'}), (end:Artist {name: 'The Strokes'})
MATCH path = (start)-[:`HAS SIMILAR ARTIST`*20..20]-(end)
WHERE ALL(x in nodes(path) WHERE x.popularity > 20)
RETURN path
LIMIT 1;
I have some questions regarding Neo4j's Query profiling.
Consider below simple Cypher query:
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
and output is:
So according to Neo4j's documentation:
3.7.2.2. Expand Into
When both the start and end node have already been found, expand-into is used to find all connecting relationships between the two nodes.
Query:
MATCH (p:Person { name: 'me' })-[:FRIENDS_WITH]->(fof)-->(p) RETURN fof
So here in the above query (in my case), first of all, it should find both the StartNode & the EndNode before finding any relationships. But unfortunately, it's just finding the StartNode, and then going to expand all connected :HAS_CONTACT relationships, which results in not using "Expand Into" operator. Why does this work this way? There is only one :HAS_CONTACT relationship between the two nodes. There is a Unique Index constraint on :Consumer{mobileNumber}. Why does the above query expand all 7 relationships?
Another question is about the Filter operator: why does it require 12 db hits although all nodes/relationships are already retrieved? Why does this operation require 12 db calls for just 6 rows?
Edited
This is the complete Graph I am querying:
Also I have tested different versions of same above query, but the same Query Profile result is returned:
1
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
MATCH (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
2
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"}), (m:Consumer{mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
3
PROFILE
MATCH (n:Consumer{mobileNumber: "yyyyyyyyy"})
WITH n
MATCH (n)-[r:HAS_CONTACT]->(m:Consumer{mobileNumber: "xxxxxxxxxxx"})
RETURN n,m,r;
The query you are executing and the example provided in the Neo4j documentation for Expand Into are not the same. The example query starts and ends at the same node.
If you want the planner to find both nodes first and see if there is a relationship then you could use shortestPath with a length of 1 to minimize the DB hits.
PROFILE
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"}),
(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
WITH n,m
MATCH Path=shortestPath((n)-[r:HAS_CONTACT*1]->(m))
RETURN n,m,r;
Why does this do this?
It appears that this behaviour relates to how the query planner performs a database search in response to your Cypher query. Cypher provides an interface to search and perform operations in the graph (alternatives include the Java API, etc.); queries are handled by the query planner and then turned into graph operations by Neo4j's internals. It makes sense that the query planner will look for what is likely to be the most efficient way to search the graph (hence why we love neo), so just because a Cypher query is written one way, it won't necessarily search the graph the way we imagine it will.
The documentation on this seemed a little sparse (or, rather I couldn't find it properly), any links or further explanations would be much appreciated.
Examining your query, I think you're trying to say this:
"Find two nodes each with a :Consumer label, n and m, with contact numbers x and y respectively, using the mobileNumber index. If you find them, try and find a -[:HAS_CONTACT]-> relationship from n to m. If you find the relationship, return both nodes and the relationship, else return nothing."
Running this query in this way requires a cartesian product to be created (i.e., a little table of all combinations of n and m - in this case only one row - but for other queries potentially many more), and then relationships to be searched for between each of these rows.
Rather than doing that, since a MATCH clause must be met in order to continue with the query, neo knows that the two nodes n and m must be connected via the -[:HAS_CONTACT]-> relationship if the query is to return anything. Thus, the most efficient way to run the query (and avoid the cartesian product) is as below, which is what your query can be simplified to.
"Find a node n with the :Consumer label, and value x for the index mobileNumber, which is connected via a -[:HAS_CONTACT]-> relationship to a node m with the :Consumer label, and value y for its property mobileNumber. Return both nodes and the relationship, else return nothing."
So, rather than perform two index searches, a cartesian product and a set of expand into operations, neo performs only one index search, an expand all, and a filter.
You can see the result of this simplification by the query planner through the presence of AUTOSTRING parameters in your query profile.
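Written out by hand, the simplified form described above would look roughly like the following. This is only a sketch of the equivalent single-pattern query, using the placeholder numbers from the question; whether it yields exactly the same plan depends on the Neo4j version and its statistics:

```cypher
PROFILE
// One pattern: index seek on n, expand along HAS_CONTACT, then filter on m.
MATCH (n:Consumer {mobileNumber: "yyyyyyyyy"})-[r:HAS_CONTACT]->(m:Consumer {mobileNumber: "xxxxxxxxxxx"})
RETURN n, m, r;
```

Comparing its profile with that of the original query is a quick way to check that the planner treats the two as equivalent.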
How to Change Query to Implement Search as Desired
If you want to change the query so that it must use an expand into relationship, make the requirement for the relationship optional, or use explicitly iterative execution. Both these queries below will produce the initially expected query profiles.
Optional example:
PROFILE
MATCH (n:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
WITH n,m
OPTIONAL MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
Iterative example:
PROFILE
MATCH (n1:Consumer{mobileNumber: "xxx"})
MATCH (m:Consumer{mobileNumber: "yyy"})
UNWIND COLLECT(n1) AS n
MATCH (n)-[r:HAS_CONTACT]->(m)
RETURN n,m,r;
I have a Neo4j graph (ver. 2.2.2) with a large number of relationships. For example: 1 node "Group", 300000 nodes "Data", and 300000 relationships from "Group" to all existing "Data" nodes. I need to check whether there is a relationship between a set of Data nodes (for example, 200 nodes) and a specific Group node. But the Cypher query I used is very slow. I tried many modifications of this query, with no result.
Cypher to create graph:
FOREACH (r IN range(1,300000) | CREATE (:Data {id:r}));
CREATE (:Group);
MATCH (g:Group),(d:Data) create (g)-[:READ]->(d);
Query 1: COST. 600003 total db hits in 730 ms.
Acceptable but I asked only for 1 node.
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000] AND id(g)=300000 RETURN id(d);
Query 2: COST. 600003 total db hits in 25793 ms.
Not acceptable.
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] AND id(g)=300000 RETURN id(d);
Query 3: COST. 1000 total db hits in 309 ms.
This is the only solution I found to make the query acceptable. I returned all ids of the "Group" nodes and manually filtered the result in my code to return only relationships to the node with id 300000.
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] RETURN id(d), id(g);
Question 1: The total number of DB hits in query 1 is surprising, but I accept that the physical model of Neo4j defines how this query is executed: it needs to look into every existing relationship from the "Group" node. But why is there such a big difference in execution time between query 1 and query 2 if the number of db hits is the same (and the execution plan is the same)? I'm only returning the id of a node, not a large set of properties.
Question 2: Is a query 3 the only one solution to optimize this query?
Apparently there is an issue with Cypher in 2.2.x with the seekById.
You can prefix your query with PLANNER RULE in order to make use of the previous Cypher planner, but you'll have to split your pattern in two for making it really fast, tested e.g. :
PLANNER RULE
MATCH (d:Data) WHERE id(d) IN [30]
MATCH (g:Group) WHERE id(g) = 300992
MATCH (d)<-[:READ]-(g)
RETURN id(d)