I have the following query:
CALL apoc.index.relationships('TO','user:37f0ce60-b428-11e8-bb45-9394d4f42b57') YIELD rel, start, end
WITH DISTINCT rel, start, end
MATCH (ctx:Context)
WHERE rel.context = ctx.uid AND (ctx.name="iG9CE55wbtY" )
RETURN DISTINCT start.uid AS source_id, start.name AS source_name, end.uid AS target_id, end.name AS target_name, rel.uid AS edge_id, ctx.name AS context_name, rel.statement AS statement_id, rel.weight AS weight;
It uses indexed relationships, but it takes about 4 to 10 seconds to process.
Here are the results with PROFILE:
Cypher version: CYPHER 3.3, planner: COST, runtime: INTERPRETED. 470705 total db hits in 2758 ms.
Is there anything I could optimize in this query, for instance, using parameters or rewriting it in any way that could improve the performance?
I have been experimenting with pattern comprehensions for optimization, but I seem to be getting even more confused.
Here is my initial query:
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE 2000 <= m.year <= 2005 AND a.born.year >= 1980
RETURN a.name AS Actor, a.born AS Born,
collect(DISTINCT m.title) AS Movies ORDER BY Actor
From profiling, I get:
Cypher version: , planner: COST, runtime: PIPELINED. 41944 total db hits in 152 ms.
I attempted the following rewrite:
profile MATCH (a:Actor)
WHERE a.born.year >= 1980
// Add a WITH clause to create the list using pattern comprehension
with a
match (a)-[:ACTED_IN]-(m:Movie)
where 2000 <= m.year <= 2005
// filter the result of the pattern comprehension to return only lists with elements
// return the Actor, Born, and Movies
return a.name as Actor, a.born as Born, [(a)-[:ACTED_IN]-(m) | m.title] as Movies
order by a
From profiling, I get:
Cypher version: , planner: COST, runtime: PIPELINED. 47879 total db hits in 47 ms.
Then I try another rewrite:
profile MATCH (a:Actor)
WHERE a.born.year >= 1980
// Add a WITH clause to create the list using pattern comprehension
// filter the result of the pattern comprehension to return only lists with elements
// return the Actor, Born, and Movies
with a, [ (a)-[:ACTED_IN]-(m:Movie) where 2000 <= m.year <= 2005 | m.title] as Movies
return a.name as Actor, a.born as Born, Movies
order by a
Cypher version: , planner: COST, runtime: PIPELINED. 59251 total db hits in 6 ms.
Each rewrite needs more DB hits than the one before, even though the elapsed time drops. I can review the query plan to understand the differences, but is there a way to use pattern comprehension to actually reduce DB hits compared to the initial query using collect?
Please show us the PROFILE result for your last query. I tested it on the Movie database and it did well against the original query (43 db hits vs. 120 db hits for the original). Also, check whether Actor.born.year has an index.
profile MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
WHERE 2000 <= m.released <= 2005 AND a.born >= 1980
RETURN a.name AS Actor, a.born AS Born,
collect(DISTINCT m.title) AS Movies ORDER BY Actor
planner: COST, runtime: PIPELINED. 120 total db hits in 9 ms
profile MATCH (a:Person)
WHERE a.born >= 1980
RETURN a.name AS Actor, a.born AS Born,
[(a)-[:ACTED_IN]-(m:Movie) where 2000 <= m.released <= 2005 | m.title] AS Movies ORDER BY Actor
planner: COST, runtime: PIPELINED. 43 total db hits in 6 ms
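One difference worth noting when comparing these versions: a pattern comprehension yields an empty list for a person with no matching movie, whereas the original MATCH drops that person entirely, so the two forms do not return the same rows. A minimal sketch that restores the filter, assuming the Actor/Movie schema from the question (not verified against that dataset):

```cypher
MATCH (a:Actor)
WHERE a.born.year >= 1980
// Build the list per actor; actors without a matching movie get []
WITH a, [(a)-[:ACTED_IN]->(m:Movie) WHERE 2000 <= m.year <= 2005 | m.title] AS Movies
// Drop the empty lists so the rows match the original MATCH
WHERE size(Movies) > 0
RETURN a.name AS Actor, a.born AS Born, Movies
ORDER BY Actor
```

Unlike collect(DISTINCT m.title), the comprehension does not remove duplicate titles, so the lists can still differ if an actor has several ACTED_IN relationships to movies with the same title.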
I have a Cypher query:
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile )
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 20003 total db hits in 14 ms
and the second query:
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
MATCH (childD)-[:CONTAINS]->(childDStat:JobableStatistic)
WITH childD
RETURN count(childD)
Cypher version: CYPHER 4.4, planner: COST, runtime: INTERPRETED. 224367 total db hits in 68 ms.
As you can see, DB hits increase from 20003 to 224367. But I have a one-to-one relationship between childD and childDStat, with 10k childD nodes and 10k childDStat nodes. What am I doing wrong in my query, and how can I decrease DB hits?
Using multiple relationship types can help you optimize your queries, especially if you are only counting relationships and not doing anything else. What I've seen in practice is very specific relationship types, such as:
(dg:DecisionGroup {id: -2})-[:DECISIONGROUP_HAS_PROFILE]->(childD:Profile )
So something like that. Then you can quickly count relationships by utilizing the relationship count store:
PROFILE MATCH (dg:DecisionGroup {id: -2})
WITH dg, size((dg)-[:DECISIONGROUP_HAS_PROFILE]->()) AS c
RETURN sum(c) AS result
Take a look at: https://neo4j.com/developer/kb/fast-counts-using-the-count-store/
It seems they have added a few more Cypher options to access the count store, but anyway, count store is much more performant than expanding each relationship.
You can get creative with more "complex" queries and rewrite the
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
MATCH (childD)-[:CONTAINS]->(childDStat:JobableStatistic)
WITH childD
RETURN count(childD)
into
PROFILE MATCH (dg:DecisionGroup {id: -2})-[rdgd:CONTAINS]->(childD:Profile)
WITH childD, size((childD)-[:CONTAINS]->()) AS count
RETURN sum(count) AS result
Notice that you are not checking the label of the node at the end of the relationship, so your model must ensure that is always correct.
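On Neo4j 5, where size() no longer accepts a pattern, the same degree lookup can be written with a COUNT subquery. A sketch, assuming the illustrative DECISIONGROUP_HAS_PROFILE relationship type suggested above; whether the planner answers it from the stored degree rather than expanding each relationship should be confirmed with PROFILE:

```cypher
// Neo4j 5 syntax; DECISIONGROUP_HAS_PROFILE is the illustrative type from above
MATCH (dg:DecisionGroup {id: -2})
// Count outgoing relationships of that type without binding the end nodes
WITH dg, COUNT { (dg)-[:DECISIONGROUP_HAS_PROFILE]->() } AS c
RETURN sum(c) AS result
```

As with the size() version, the end node's label is not checked, so the model has to guarantee it.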
I have the following request:
CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel, start, end WITH DISTINCT rel, start, end
RETURN DISTINCT start.uid AS source_id,
start.name AS source_name,
end.uid AS target_id,
end.name AS target_name,
rel.uid AS edge_id,
rel.context AS context_id,
rel.statement AS statement_id,
rel.weight AS weight
Which returns a table of results such as
The question:
Is there a way to filter out the top 150 most connected nodes (source_name/source_id and target_name/edge_id nodes)?
I don't think it would work with frequency as each table row is unique (because of the different edge_id) but maybe there's a function inside Neo4J / Cypher that allows me to count the top most frequent (source_name/source_id and target_name/edge_id) nodes?
Thank you!
This query might do what you want:
CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel, start, end
WITH start, end, COLLECT(rel) AS rs
ORDER BY SIZE(rs) DESC LIMIT 50
RETURN
start.uid AS source_id,
start.name AS source_name,
end.uid AS target_id,
end.name AS target_name,
[r IN rs | {edge_id: r.uid, context_id: r.context, statement_id: r.statement, weight: r.weight}] AS rels
The query uses the aggregating function COLLECT to collect all the relationships for each pair of start/end nodes, keeps the data for the 50 node pairs with the most relationships, and returns a row of data for each pair (with the data for the relationships in a rels list).
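Since the question asks for the most connected nodes rather than node pairs, a variant can count each node once per relationship it appears in, on either end. A sketch using the same index call as above; the limit of 150 comes from the question, and the column names node_id, node_name, and mentions are illustrative:

```cypher
CALL apoc.index.relationships('TO','context:34b4a5b0-0dfa-11e9-98ed-7761a512a9c0')
YIELD rel, start, end
// Turn each relationship into two rows, one per endpoint
UNWIND [start, end] AS node
// Count the relationships each node takes part in
WITH node, count(rel) AS mentions
ORDER BY mentions DESC LIMIT 150
RETURN node.uid AS node_id, node.name AS node_name, mentions
```

The resulting node list could then be used to filter the original edge table so that only edges between those nodes remain.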
You could always use size( (node)-[:REL]->() ) to get the degree.
And if you compute the top-n degrees first, you can filter those out by comparing:
WHERE min < size( (node)-[:REL]->() ) < max
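Written out as a full query, the degree filter might look like the following. This is a sketch: $minDegree and $maxDegree are hypothetical parameters holding bounds computed beforehand, TO is the relationship type from the question, and size() over a pattern is valid in Neo4j 3.x and 4.x only.

```cypher
// $minDegree / $maxDegree: hypothetical parameters supplied by the caller
MATCH (node)
// Outgoing degree for the TO relationship type
WITH node, size((node)-[:TO]->()) AS degree
WHERE $minDegree < degree < $maxDegree
RETURN node.uid AS node_id, node.name AS node_name, degree
ORDER BY degree DESC
```

Note that this counts outgoing relationships only; dropping the arrow direction would count both directions.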
I have a Neo4J cypher request of the following kind that I use in my app:
START rel=relationship:relationship_auto_index(user='6dbe5450-852d-11e4-9c48-b552fc8c2b90')
WHERE TYPE(rel)='TO' WITH rel MATCH (ctx:Context) WHERE rel.context = ctx.uid
RETURN
DISTINCT STARTNODE(rel).uid AS source_id, STARTNODE(rel).name AS source_name,
ENDNODE(rel).uid AS target_id, ENDNODE(rel).name AS target_name, rel.uid AS edge_id,
ctx.name AS context_name, rel.statement AS statement_id, rel.weight AS weight;
It returns 112 rows, which are relationships between the nodes, as well as the context where each relationship appears and the statement where it occurred.
I know that I can limit the number of rows I get in this table using LIMIT 50.
However, what I need is to sort the rows so that I only get the 50 most frequently mentioned nodes, which can appear in both the source_name and target_name columns.
So what I need to do is to count how many of each kind of node I have both in source_name and in target_name, collect them, and only display the top frequently mentioned 50 of them.
Does anyone have an idea how I could do that?
Thank you!
START rel=relationship:relationship_auto_index(user='6dbe5450-852d-11e4-9c48-b552fc8c2b90')
WHERE TYPE(rel)='TO' WITH rel MATCH (ctx:Context) WHERE rel.context = ctx.uid
UNWIND [startnode(rel),endnode(rel)] as node
RETURN node.uid, node.name,
collect({edge_id: rel.uid, statement_id: rel.statement, weight: rel.weight,
context_name: ctx.name}) as aggregated_data,
count(*)
ORDER BY count(*) desc limit 50;
I currently have this query:
START n=node(*)
MATCH (p:Person)-[:is_member]->(g:Group)
WHERE g.name ='FooManGroup'
RETURN p, count(p)
LIMIT 5
Say there are 42 people in FooManGroup, I want to return 5 of these people, with a count of 42.
Is this possible to do in one query?
Running this now returns 5 rows, which is fine, but a count of 104, which is the total number of nodes of any type in my DB.
Any suggestions?
You can use a WITH clause to do the counting of the persons, followed by an identical MATCH clause to do the matching of each person. Notice that you need to START on the p nodes and not just some n that will match any node in the graph:
MATCH (p:Person )-[:is_member]->(g:Group)
WHERE g.name ='FooManGroup'
WITH count(p) as personsInGroup
MATCH (p:Person)-[:is_member]->(g:Group)
WHERE g.name ='FooManGroup'
RETURN p, personsInGroup
LIMIT 5
It may not be the best or most elegant way to do this, but it works. If you use Cypher 2.0, it can be a bit more compact, like this:
MATCH (p:Person)-[:is_member]->(g:Group {name: 'FooManGroup'})
WITH count(p) as personsInGroup
MATCH (p:Person)-[:is_member]->(g:Group {name: 'FooManGroup'})
RETURN p, personsInGroup
LIMIT 5
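By convention, relationship types are written in upper case in Cypher, so :is_member would be :IS_MEMBER, which I think is more readable. Types are case-sensitive, so this only matches if the relationships in your data are named that way: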
MATCH (p:Person)-[:IS_MEMBER]->(g:Group {name: 'FooManGroup'})
WITH count(p) as personsInGroup
MATCH (p:Person)-[:IS_MEMBER]->(g:Group {name: 'FooManGroup'})
RETURN p, personsInGroup
LIMIT 5
Try this:
MATCH (p:Person)-[:is_member]->(g:Group)
WHERE g.name ='FooManGroup'
RETURN count(p), collect(p)[0..5]
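This returns a single row holding the count and a list of up to five people. If one row per person is needed, as in the question, the list can be unwound again. A sketch using the same pattern; the aliases personsInGroup and sample are illustrative:

```cypher
MATCH (p:Person)-[:is_member]->(g:Group)
WHERE g.name = 'FooManGroup'
// Aggregate once: total count plus a sample of up to five people
WITH count(p) AS personsInGroup, collect(p)[0..5] AS sample
// Expand the sample back into one row per person
UNWIND sample AS p
RETURN p, personsInGroup
```

This matches the pattern only once, instead of repeating the MATCH as in the WITH-based answers above.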