I have the following graph setup:
start root=node(0)
create (F {name:'FRAME'}), (I {name: 'INTERACTION'}), (A {name: 'A'}), (B {name: 'B'}),
root-[:ROOT]->F, F-[:FRAME_INTERACTION]->I, I-[:INTERACTION_ACTOR]->A, I-[:INTERACTION_ACTOR]->B
And the following query returns duplicated results:
START actor=node:node_auto_index(name='A')
MATCH actor<-[:INTERACTION_ACTOR]-interaction-[:INTERACTION_ACTOR]->actor2,
frame-[:FRAME_INTERACTION]->interaction
RETURN frame, interaction
Query Results
+-----------------------------------------------------+
| frame | interaction |
+-----------------------------------------------------+
| Node[1]{name:"FRAME"} | Node[2]{name:"INTERACTION"} |
| Node[1]{name:"FRAME"} | Node[2]{name:"INTERACTION"} |
+-----------------------------------------------------+
2 rows
52 ms
Even if I add one more start node trying to limit the results, I have the same:
START actor=node:node_auto_index(name='A'), frame=node:node_auto_index(name='FRAME')
MATCH actor<-[:INTERACTION_ACTOR]-interaction-[:INTERACTION_ACTOR]->actor2,
frame-[:FRAME_INTERACTION]->interaction
RETURN frame, interaction
I would like to understand why the query returns duplicated results.
I know that it is possible to return unique results by using distinct, but is it possible to change the query in order to return only one result by matching path, without applying an additional operation (distinct)?
(setup and query can be tested at http://console.neo4j.org/?id=q2e0ay)
If you add actor2 to your return list you'll see what the problem is:
frame interaction actor actor2
(7 {name:"FRAME"}) (8 {name:"INTERACTION"}) (9 {name:"A"}) (9 {name:"A"})
(7 {name:"FRAME"}) (8 {name:"INTERACTION"}) (9 {name:"A"}) (10 {name:"B"})
Actor "A" is being included as a value for actor2! But this makes sense when you think about it, because nowhere in your query did you tell neo4j that actor and actor2 need to be distinct entities.
Luckily it's easy to do:
START actor=node:node_auto_index(name='A')
MATCH actor<-[:INTERACTION_ACTOR]-interaction-[:INTERACTION_ACTOR]->actor2,
frame-[:FRAME_INTERACTION]->interaction
WHERE actor <> actor2 //like this!
RETURN frame, interaction
Related
Background
I want to create a histogram of the relationships starting from a set of nodes.
Input is a set of node ids, for example set = [ id_0, id_1, id_2, id_3, ... id_n ].
The output is a the relationship type histogram for each node (e.g. Map<Long, Map<String, Long>>):
id_0:
- ACTED_IN: 14
- DIRECTED: 1
id_1:
- DIRECTED: 12
- WROTE: 5
- ACTED_IN: 2
id_2:
...
The current cypher query I've written is:
MATCH (n)-[r]-()
WHERE id(n) IN [ id_0, id_1, id_2, id_3, ... id_n ] # set
RETURN id(n) as id, type(r) as type, count(r) as count
It returns the pair of [ id, type ] count like:
id | rel type | count
id0 | ACTED_IN | 14
id0 | DIRECTED | 1
id1 | DIRECTED | 12
id1 | WROTE | 5
id1 | ACTED_IN | 2
...
The result is collected using java and merged to the first structure (e.g. Map<Long, Map<String, Long>>).
Problem
Getting the relationship histogram on smaller graphs is fast but can be very slow on bigger datasets. For example if I want to create the histogram where the set-size is about 100 ids/nodes and each of those nodes have around 1000 relationships the cypher query took about 5 minutes to execute.
Is there more efficient way to collect the histogram for a set of nodes?
Could this query be parallelized? (With java code or using UNION?)
Is something wrong with how I set up my neo4j database, should these queries be this slow?
There is no need for parallel queries, just the need to understand Cypher efficiency and how to use statistics.
Bit of background :
Using count, will execute an expandAll, which is as expensive as the number of relationships a node has
PROFILE
MATCH (n) WHERE id(n) = 21
MATCH (n)-[r]-(x)
RETURN n, type(r), count(*)
Using size and a relationship type, uses internally getDegree which is a statistic a node has locally, and thus is very efficient
PROFILE
MATCH (n) WHERE id(n) = 0
RETURN n, size((n)-[:SEARCH_RESULT]-())
Morale of the story, for using size you need to know the relationship types a labeled node can have. So, you need to know the schema of the database ( in general you will want that, it makes things easily predictable and building dynamically efficient queries becomes a joy).
But let's assume you don't know the schema, you can use APOC cypher procedures, allowing you to build dynamic queries.
The flow is :
Get all the relationship types from the database ( fast )
Get the nodes from id list ( fast )
Build dynamic queries using size ( fast )
CALL db.relationshipTypes() YIELD relationshipType
WITH collect(relationshipType) AS types
MATCH (n) WHERE id(n) IN [21, 0]
UNWIND types AS type
CALL apoc.cypher.run("RETURN size((n)-[:`" + type + "`]-()) AS count", {n: n})
YIELD value
RETURN id(n), type, value.count
In neo4j my database consists of chains of nodes. For each distinct stucture/layout (does graph theory has a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships as this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure | n |
------------------------
| (:a)->(:b) | 2 |
| (:b)->(:a) | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would need to first add formatting around the extraction of the first label to add the prefixed : as well as the parenthesis. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):
MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b
I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster and I expected that Neo4j would be much more efficient for these kind of standard graph queries. So is there something that I am missing, maybe some tuning of the Neo4j server or a better way to write the query?
Edit
Here is the result of EXPLAINing the above query:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms
Compiler CYPHER 3.3
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+------------------+--------------------------------+
| Operator | Estimated Rows | Variables | Other |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults | 14069 | a, b | |
| | +----------------+------------------+--------------------------------+
| +Distinct | 14069 | a, b | a, b |
| | +----------------+------------------+--------------------------------+
| +Filter | 14809 | anon[11], a, b | ID(a) < ID(b) |
| | +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) | 49364 | anon[11], b -- a | (a)-[:E*]-(b) |
| | +----------------+------------------+--------------------------------+
| +AllNodesScan | 40012 | a | |
+-----------------------+----------------+------------------+--------------------------------+
Total database accesses: ?
You can limit the direction, but it requires the graph to be directed.
After doing some testing and profiling of my own, I found that for even very small sets of data (Randomly-generated sets of 10 nodes with 2 random edges on each), making the query be only for a single direction cut down on database hits by a factor of 10000 (from 2266909 to 149 database hits).
Adding a direction to your query (and thus forcing the graph to be directed) cuts down the search space by a great deal, but it requires the graph to be directed.
I also tried simply adding a reverse relationship for each directed one, to see if that would have similar performance. It did not; it did not complete before 5 minutes had passed, at which point I killed it.
Unfortunately, you are not doing anything wrong, but your query is massive.
Neo4J being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.
The query you have written is an unbounded path search for every single pair of nodes. The node pairs are bounded, but not in a very meaningful way (the bound of ID(a) < ID(b) just means that the search only needs to be done one way; there are still 10k! (as in factorial) possible sets of nodes in the result set.
And then, that's only after every single path is checked. Searching for the entire transitive closure of a graph the size that you specified will be extremely expensive performance-wise.
The SQL that you posted is not performing the same operation.
You mentioned in the comments that you tried this query in a relational table in a recursive form:
WITH RECURSIVE temp_tc AS (
SELECT v AS a, v AS b FROM nodes
UNION SELECT a,b FROM edges g
UNION SELECT t.a,g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;
I should note that this query is not performing the same thing that Neo4J does when it tries to find all paths. Before Neo4J can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.
The SQL and relational query does not do that; it starts from the list of links, but that recursive query has the effect of removing any potential duplicate links; it discovers other links as its searching for the links of others; e.g. if the graph looks like (A)-(B)-(C), that query will find that B connects to C in the process of finding that A connects to C.
With the Neo4J, every path must be discovered separately.
If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.
I am struggling to get the proper cypher that is both efficient and allows pagination through skip and limit.
Here is the simple scenario: I have the related nodes (company)<-[A]-(set)<-[B]-(job) where there are multiple instances of (set) with distinct (job) instances related to them. The (job) nodes have a specific status property that can hold one of several states. We need to count the number of (job) nodes in a particular state per (set) and use skip and limit to paginate on the distinct (set) nodes.
So we can get a very efficient query for job.status counts using this.
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
return s.Description, j.Status, count(*) as StatusCount;
Which will give us a rows of the Set.Description, Job.Status, and JobStatus count. But we will get multiple rows for the Set based on the Job.Status. This is not conducive to paging over distinct sets though. Something like:
s.Description j.Status StatusCount
-------------------+--------------+----------------
Set 1 | Unassigned | 10
Set 1 | Completed | 2
Set 2 | Unassigned | 3
Set 1 | Reviewed | 10
Set 3 | Completed | 4
Set 2 | Reviewed | 7
What we are trying to achieve with the proper cypher is result rows based on distinct Sets. Something like this:
s.Description Unassigned Completed Reviewed
-------------------+--------------+-------------+----------
Set 1 | 10 | 2 | 10
Set 2 | 3 | 0 | 7
Set 3 | 0 | 4 | 0
This would then allow us to paginate over Sets using skip and limit properly.
I have tried many different approaches and cannot seem to find the right combination for this type of result. Anyone have any ideas? Thanks!
** EDIT - Using the answer provided by MIchael, here's how to get the status count values in java **
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount]) as StatusCounts;
List<Object> statusMaps = (List<Object>) row.get("StatusCounts");
for(Object statusEntry : statusMaps ) {
Map<String,Object> statusMap = (Map<String,Object>) statusEntry;
String status = (String) statusMap.get("Status");
Number count = statusMap.get("StatusCount");
}
You can use WITH and aggregation, and optionally a map result
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect([Status,StatusCount]);
or
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount]);
how do I express the following in Cypher
"Return all nodes with at least one incoming edge of type A and no outgoing edges".
Best Regards
You can use a pattern to exclude nodes from the result subset like this:
MATCH ()-[:A]->(n) WHERE NOT (n)-->() RETURN n
Try
MATCH (n)
WHERE ()-[:A]->n AND NOT n-->()
RETURN n
or
MATCH ()-[:A]->(n)
WHERE NOT n-->()
RETURN DISTINCT n
Edit
Pattern expressions can be used both for pattern matching and as predicates for filtering. If used in the MATCH clause, the paths that answer the pattern are included in the result. If used for filtering, in the WHERE clause, the pattern serves as a limiting condition on the paths that have previously been matched. The result is limited, not extended to include the filter condition. When a pattern is used as a predicate for filtering, the negation of that predicate is also a predicate that can be used as a filter condition. No path answers to the negation of a pattern (if there is such a thing) so negations of patterns cannot be used in the MATCH clause. The phrase
Return all nodes with at least one incoming edge of type A and no outgoing edges
involves two patterns on nodes n, namely any incoming relationship [:A] on n and any outgoing relationship on n. The second must be interpreted as a pattern for a predicate filter condition since it involves a negation, not any outgoing relationship on n. The first, however, can be interpreted either as a pattern to match along with n, or as another pattern predicate filter condition.
These two interpretations give rise to the two cypher queries above. The first query matches all nodes and uses both patterns to filter the result. The second matches the incoming relationship on n along with n and uses the second pattern to filter the results.
The first query will match every node only once before the filtering happens. It will therefore return one result item per node that meets the criteria. The second query will match the pattern any incoming relationship [:A] on n once for each path, i.e. once for each incoming relationship on n. It may therefore contain a node multiple times in the result, hence the DISTINCT keyword to remove doubles.
If the items of interest are precisely the nodes, then using both patterns for predicates in the WHERE clause seems to me the correct interpretation. It is also more efficient since it needs to find only zero or one incoming [:A] on n to resolve the predicate. If the incoming relationships are also of interest, then some version of the second query is the right choice. One would need to bind the relationship and do something useful with it, such as return it.
Below are the execution plans for the two queries executed on a 'fresh' neo4j console.
First query:
----
Filter
|
+AllNodes
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Filter | 0 | 0 | | (nonEmpty(PathExpression((17)-[ UNNAMED18:A]->(n), true)) AND NOT(nonEmpty(PathExpression((n)-[ UNNAMED36]->(40), true)))) |
| AllNodes | 6 | 7 | n, n | |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
Second query:
----
Distinct
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Distinct | 0 | 0 | | |
| Filter | 0 | 0 | | NOT(nonEmpty(PathExpression((n)-[ UNNAMED30]->(34), true))) |
| TraversalMatcher | 0 | 13 | | n, UNNAMED8, n |
+------------------+------+--------+-------------+--------------------------------------------------------------+