I have been experimenting pattern comprehensions for optimization, but seems getting even more confused
Here is my initial query:
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE 2000 <= m.year <= 2005 AND a.born.year >= 1980
RETURN a.name AS Actor, a.born AS Born,
collect(DISTINCT m.title) AS Movies ORDER BY Actor
from profiling, I am getting:
Cypher version: , planner: COST, runtime: PIPELINED. 41944 total db hits in 152 ms.
I attempted the following rewrite:
profile MATCH (a:Actor)
WHERE a.born.year >= 1980
// Add a WITH clause to create the list using pattern comprehension
with a
match (a)-[:ACTED_IN]-(m:Movie)
where 2000 <= m.year <= 2005
// filter the result of the pattern comprehension to return only lists with elements
// return the Actor, Born, and Movies
return a.name as Actor, a.born as Born, [(a)-[:ACTED_IN]-(m) | m.title] as Movies
order by a
from profiling, I am getting:
Cypher version: , planner: COST, runtime: PIPELINED. 47879 total db hits in 47 ms.
Then I try another rewrite:
profile MATCH (a:Actor)
WHERE a.born.year >= 1980
// Add a WITH clause to create the list using pattern comprehension
// filter the result of the pattern comprehension to return only lists with elements
// return the Actor, Born, and Movies
with a, [ (a)-[:ACTED_IN]-(m:Movie) where 2000 <= m.year <= 2005 | m.title] as Movies
return a.name as Actor, a.born as Born, Movies
order by a
Cypher version: , planner: COST, runtime: PIPELINED. 59251 total db hits in 6 ms.
Each performance is worse than another. While I can review the query plan to understand the differences. Is there a way to use pattern comprehension to actually reduce my DB hits comparing to the initial query using collect statement?
Please show us the profile result on your last query; I tested it in Movie database and it worked well vs the orig query(46ms vs orig: 120db hits). Also, check if Actor.born.year has an index.
profile MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE 2000 <= m.released <= 2005 AND a.born >= 1980
RETURN a.name AS Actor, a.born AS Born,
collect(DISTINCT m.title) AS Movies ORDER BY Actor
planner: COST, runtime: PIPELINED. 120 total db hits in 9 ms
profile MATCH (a:Person)
WHERE a.born >= 1980
RETURN a.name AS Actor, a.born AS Born,
[(a)-[:ACTED_IN]-(m:Movie) where 2000 <= m.released <= 2005 | m.title] AS Movies ORDER BY Actor
planner: COST, runtime: PIPELINED. 43 total db hits in 6 ms
Related
This cypher query on the neo4j movie dataset will return 822 rows
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE m.year = 2000
AND a.born IS NOT NULL
RETURN DISTINCT a.name AS Actor, a.born AS Born
order by a.born
I modify the query using pattern comprehension below and it returns 15443 rows, but all of them are empty arrays.
MATCH (a:Actor)
with a, [ (a where a.born is not null)-[:ACTED_IN]->(m:Movie where m.year = 2000) | a.name ] as Actors
return Actors
My intent is to return a list of actors just like the first query. What went wrong in the second query?
[UPDATED]
This will give you the distinct actors:
WITH [(a:Actor WHERE a.born IS NOT NULL)-[:ACTED_IN]->(m:Movie WHERE m.year = 2000)|a.name] AS actorList
UNWIND actorList AS actor
RETURN DISTINCT actor
I have a Cypher query:
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile )
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 20003 total db hits in 14 ms
and the second query:
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
MATCH (childD)-[:CONTAINS]->(childDStat:JobableStatistic)
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 224367 total db hits in 68 ms.
as you may see DB hits incresses from 20003 total db hits to 224367.. But I have one_2_one relationship between childD and childDStat and 10k childD and 10K childDStat for them. What am I doing wrong in my query and how to decrease DB hits?
Using multiple relationships types can help you optimize your queries, especially if you are only counting relationships and not doing anything else. What i've seen in practice is having really specific relationships like:
(dg:DecisionGroup {id: -2})-[:DECISIONGROUP_HAS_PROFILE]->(childD:Profile )
So something like that. Then you can quickly count relationships by utilizing the relationship count store:
PROFILE MATCH (dg:DecisionGroup {id: -2})
WITH dg, size((dg)-[DECISIONGROUP_HAS_PROFILE]->()) AS c
RETURN sum(c) AS result
Take a look at: https://neo4j.com/developer/kb/fast-counts-using-the-count-store/
It seems they have added a few more Cypher options to access the count store, but anyway, count store is much more performant than expanding each relationship.
You can get creative with more "complex" queries and rewrite the
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
MATCH (childD)-[:CONTAINS]->(childDStat:JobableStatistic)
WITH childD
RETURN count(childD)
into
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
WITH childD, size((childD)-[:CONTAINS]->()) AS count
RETURN sum(count) AS result
Notice that you are not checking the label of the node at the end of the relationship, so your model must ensure that is always correct.
Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is a the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] # set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns the pair of [ id, type ] count like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).
Problem
Getting the relationship histogram on smaller graphs is fast but can be very slow on bigger datasets. For example if I want to create the histogram where the set-size is about 100 ids/nodes and each of those nodes have around 1000 relationships the cypher query took about 5 minutes to execute.
Is there more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With java code or using UNION?)
Is something wrong with how I set up my neo4j database, should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
Bit of background :
Using count, will execute an expandAll, which is as expensive as the number of relationships a node has
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size and a relationship type, uses internally getDegree which is a statistic a node has locally, and thus is very efficient
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Morale of the story, for using size you need to know the relationship types a labeled node can have. So, you need to know the schema of the database ( in general you will want that, it makes things easily predictable and building dynamically efficient queries becomes a joy).
But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.
The flow is :
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
I have the following query:
CALL apoc.index.relationships('TO','user:37f0ce60-b428-11e8-bb45-9394d4f42b57') YIELD rel, start, end
WITH DISTINCT rel, start, end
MATCH (ctx:Context)
WHERE rel.context = ctx.uid AND (ctx.name="iG9CE55wbtY" )
RETURN DISTINCT start.uid AS source_id, start.name AS source_name, end.uid AS target_id, end.name AS target_name, rel.uid AS edge_id, ctx.name AS context_name, rel.statement AS statement_id, rel.weight AS weight;
Which uses indexed relationships. However, it takes about 4 to 10 seconds to process.
Here's the results with PROFILE:
Cypher version: CYPHER 3.3, planner: COST, runtime: INTERPRETED. 470705 total db hits in 2758 ms.
Is there anything I could optimize in this query, for instance, using parameters or rewriting it in any way that could improve the performance?
I am writing a Cypher query in Neo4j 2.0.4 that attempts to get the total number of inbound and outbound relationships for a selected node. I can do this easily when I only use this query one-node-at-a-time, like so:
MATCH (g1:someIndex{name:"name1"})
MATCH g1-[r1]-()
RETURN count(r1);
//Returns 305
MATCH (g2:someIndex{name:"name2"})
MATCH g2-[r2]-()
RETURN count(r2);
//Returns 2334
But when I try to run the query with 2 nodes together (i.e. get the total number of relationships for both g1 and g2), I seem to get a bizarre result.
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
For some reason, the number is much much greater than the total of 305+2334.
It seems like other Neo4j users have run into strange issues when using multiple MATCH clauses, so I read through Michael Hunger's explanation at https://groups.google.com/d/msg/neo4j/7ePLU8y93h8/8jpuopsFEFsJ, which advised Neo4j users to pipe the results of one match using WITH to avoid "identifier uniqueness". However, when I run the following query, it simply times out:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
I suspect this query doesn't return because there's a lot of records returned by r1. In this case, how would I operate my "get-number-of-relationships" query on 2 nodes? Am I just using some incorrect syntax, or is there some fundamental issue with the logic of my "2 node at a time" query?
Your first problem is that you are returning a Cartesian product when you do this:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
If there are 305 instances of r1 and 2334 instances of r2, you're returning (305 * 2334) == 711870 rows, and because you are summing this (count(r1)+count(r2)) you're getting a total of 711870 + 711870 == 1423740.
Your second problem is that you are not carrying over g2 in the WITH clause of this query:
MATCH (g1:gene{name:"SV422_HUMAN"}),(g2:gene{name:"BRCA1_HUMAN"})
MATCH g1-[r1]-()
WITH r1
MATCH g2-[r2]-()
RETURN count(r1)+count(r2);
You match on g2 in the first MATCH clause, but then you leave it behind when you only carry over r1 in the WITH clause at line 3. Then, in line 4, when you match on g2-[r2]-() you are matching literally everything in your graph, because g2 has been unbound.
Let me walk through a solution with the movie dataset that ships with the Neo4j browser, as you have not provided sample data. Let's say I want to get the total count of relationships attached to Tom Hanks and Hugo Weaving.
As separate queries:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
RETURN COUNT(r)
=> 13
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN COUNT(r)
=> 5
If I try to do it your way, I'll get (13 * 5) * 2 == 90, which is incorrect:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(r1) + COUNT(r2)
=> 90
Again, this is because I've matched on all combinations of r1 and r2, of which there are 65 (13 * 5 == 65) and then summed this to arrive at a total of 90 (65 + 65 == 90).
The solution is to use DISTINCT:
MATCH (:Person {name:'Tom Hanks'})-[r1]-(),
(:Person {name:'Hugo Weaving'})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2)
=> 18
Clearly, the DISTINCT modifier only counts the distinct instances of each entity.
You can also accomplish this with WITH if you wanted:
MATCH (:Person {name:'Tom Hanks'})-[r]-()
WITH COUNT(r) AS r1
MATCH (:Person {name:'Hugo Weaving'})-[r]-()
RETURN r1 + COUNT(r)
=> 18
TL;DR - Beware of Cartesian products. DISTINCT is your friend:
MATCH (:someIndex{name:"name1"})-[r1]-(),
(:someIndex{name:"name2"})-[r2]-()
RETURN COUNT(DISTINCT r1) + COUNT(DISTINCT r2);
The explosion of results you're seeing can be easily explained:
MATCH (g1:someIndex{name:"name1"}), (g2:someIndex{name:"name2"})
MATCH g1-[r1]-(), g2-[r2]-()
RETURN count(r1)+count(r2);
//Returns 1423740
In the 2nd line every combination of any relationship from g1 is combined with any relationship of g2, this explains the number since 1423740 = 305 * 2334 * 2. So you're evaluating basically a cross product here.
The right way to calculate the sum of all relationships for name1 and name2 is:
MATCH (g:someIndex)-[r]-()
WHERE g.name in ["name1", "name2"]
RETURN count(r)