I'm new to Neo4j and I'm trying to extract a single directed acyclic subgraph from a graph in such a way that I can iterate over the resulting graph (or the collection of nodes and paths) and keep track of the layer in which each node resides (where layers are based on the number of hops from the nearest root node). Is this possible in Cypher? I'm using the REST API.
The example data structure I have is something like the following subgraph, where A and H would be the 'root' nodes:
A -+ B -+ D
     |
     +
H -+ C -+ E
     |
     +
     F

S*
B and C would each be layer 1 and D, E, F would be layer 2. A and H have a :SUBGRAPH_ENTER relationship to S*, and all nodes in the subgraph I want have a :MEMBER_OF_SUBGRAPH relationship to S*.
The following query returns the subgraph I want (piecemeal); however, I'm not sure how to go about ordering the nodes in the paths.
MATCH p = (n)-[r:ARROW_TO*]-(t)-[:SUBGRAPH_ENTER]-(s)
where
(n)-[:MEMBER_OF_SUBGRAPH]->(s)
RETURN p
Can anyone advise?
[EDITED]
If you create your sample acyclic graph with these 2 queries:
CREATE (s:Subgraph),
(a:Foo {id:'A'}), (b:Foo {id:'B'}), (c:Foo {id:'C'}), (d:Foo {id:'D'}), (e:Foo {id:'E'}), (f:Foo {id:'F'}), (h:Foo {id:'H'}),
(a)-[:ARROW_TO]->(b)-[:ARROW_TO]->(d),
(h)-[:ARROW_TO]->(c)-[:ARROW_TO]->(e),
(b)-[:ARROW_TO]->(c)-[:ARROW_TO]->(f),
(a)<-[:SUBGRAPH_ENTER]-(s),
(h)<-[:SUBGRAPH_ENTER]-(s);
MATCH (f:Foo), (s:Subgraph)
CREATE (f)-[:MEMBER_OF_SUBGRAPH]->(s);
then this query will return the subgraph nodes, ordered by distance from the nearest root:
MATCH p=(s)-[:SUBGRAPH_ENTER]->(root)-[:ARROW_TO*]->(leaf)
WHERE (NOT (leaf)-[:ARROW_TO]->()) AND ALL(n IN NODES(p)[1..] WHERE (n)-[:MEMBER_OF_SUBGRAPH]->(s))
WITH s, NODES(p)[2..] AS nodes
WITH s, REDUCE(s = [], i IN RANGE(0, SIZE(nodes)-1) | s + {node: nodes[i], dist: i+1}) AS data
UNWIND data AS datum
RETURN s, datum.node AS node, MIN(datum.dist) AS distance
ORDER BY distance;
The WHERE clause filters out paths that are partial or have nodes from other subgraphs.
The first WITH clause collects the nodes in each path starting after the root node.
The second WITH clause generates a collection of node/distance pairs for each node in the first collection.
The UNWIND transforms the latter collection into data rows, for processing by the MIN aggregation function.
Here are the results:
+------------------------------------------+
| s | node | distance |
+------------------------------------------+
| Node[38]{} | Node[41]{id:"C"} | 1 |
| Node[38]{} | Node[40]{id:"B"} | 1 |
| Node[38]{} | Node[44]{id:"F"} | 2 |
| Node[38]{} | Node[43]{id:"E"} | 2 |
| Node[38]{} | Node[42]{id:"D"} | 2 |
+------------------------------------------+
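As a cross-check outside Cypher, the same layering can be reproduced with a multi-source breadth-first search. This is only a minimal sketch: the edge list is copied from the CREATE statement above, and the names `ARROW_TO`, `ROOTS` and `layers` are made up for illustration.

```python
from collections import deque

# ARROW_TO relationships copied from the sample CREATE statement.
ARROW_TO = {
    "A": ["B"],
    "B": ["D", "C"],
    "H": ["C"],
    "C": ["E", "F"],
}
ROOTS = ["A", "H"]  # the nodes reached via SUBGRAPH_ENTER


def layers(edges, roots):
    """Return {node: hops from the nearest root} via multi-source BFS."""
    dist = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in dist:  # the first visit is the shortest distance
                dist[child] = dist[node] + 1
                queue.append(child)
    return dist


result = layers(ARROW_TO, ROOTS)
for node, distance in sorted(result.items(), key=lambda kv: (kv[1], kv[0])):
    print(node, distance)
```

Dropping the entries with distance 0 gives the same node/distance pairs as the query result above (B and C at 1; D, E and F at 2).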
Including the root nodes
If you want to include the root nodes in the output, this query will do that:
MATCH p=(s)-[:SUBGRAPH_ENTER]->(root)-[:ARROW_TO*]->(leaf)
WHERE (NOT (leaf)-[:ARROW_TO]->()) AND ALL (n IN NODES(p)[1..] WHERE (n)-[:MEMBER_OF_SUBGRAPH]->(s))
WITH s, NODES(p)[1..] AS nodes
WITH s, REDUCE(s =[], i IN RANGE(0, SIZE(nodes)-1)| s + { node: nodes[i], dist: i }) AS data
UNWIND data AS datum
RETURN s, datum.node AS node, MIN(datum.dist) AS distance
ORDER BY distance;
Here are the results:
+-----------------------------------------+
| s | node | distance |
+-----------------------------------------+
| Node[6]{} | Node[7]{id:"A"} | 0 |
| Node[6]{} | Node[13]{id:"H"} | 0 |
| Node[6]{} | Node[8]{id:"B"} | 1 |
| Node[6]{} | Node[9]{id:"C"} | 1 |
| Node[6]{} | Node[10]{id:"D"} | 2 |
| Node[6]{} | Node[12]{id:"F"} | 2 |
| Node[6]{} | Node[11]{id:"E"} | 2 |
+-----------------------------------------+
Sample Data:
CREATE (a1:A {title: "a1"})
CREATE (a2:A {title: "a2"})
CREATE (a3:A {title: "a3"})
CREATE (b1:B {title: "b1"})
CREATE (b2:B {title: "b2"});
MATCH (a:A {title: "a1"}), (b:B {title: "b1"})
CREATE (a)-[r:LINKS]->(b);
MATCH (a:A {title: "a2"}), (a1:A {title: "a1"})
CREATE (a)-[:CONNECTED]->(a1);
Sample Query:
MATCH (a:A), (b:B) RETURN a, b;
Objective: Finding some connections in the where clause
Now let's write some variations to find A's not directly connected to B (a2 and a3)
// Q1. Both work fine
MATCH (a:A) WHERE (a)--(:B) RETURN a
MATCH (a:A) WHERE (:B)--(a) RETURN a
// Q2. Works
MATCH (a:A)-[r]-(b:B) WHERE (a)-[r]-(b) RETURN a
// Q3. Fails
MATCH (a:A)-[r]-(b:B) WHERE (b)-[r]-(a) RETURN a
Any idea why Q2 and Q3 do not behave the same way even though the relationship pattern is undirected? Is this a Neo4j bug?
All credits to stdob at this answer for narrowing down the anomaly that was happening in my other query.
Update: Posted the same to the Neo4j GitHub issues
Update: Neo4j has accepted this as a bug and will be fixing it in 3.1
While this might not be a complete answer, it is too much info for a comment. This should hopefully provide some helpful insight though.
I would consider this a bug. Below are some variations that should all give the same results from the sample data. They should all pass with the given data ("pass" meaning the query returns anything).
MATCH (a:A)-[r]-(b:B) WHERE (b)-[r]-(a) RETURN * -> fails
remove r
MATCH (a:A)--(b:B) WHERE (b)--(a) RETURN * -> pass
MATCH (a:A)-[r]-(b:B) WHERE (b)--(a) RETURN * -> pass
add direction
MATCH (a:A)-[r]-(b:B) WHERE (b)<-[r]-(a) RETURN * -> pass
reverse order
MATCH (a:A)-[r]-(b:B) WHERE (a)-[r]-(b) RETURN * -> pass
And, from the profile of the failed test
+---------------------+----------------+------+---------+-----------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+---------------------+----------------+------+---------+-----------+--------------+
| +ProduceResults | 1 | 0 | 0 | a | a |
| | +----------------+------+---------+-----------+--------------+
| +SemiApply | 1 | 0 | 0 | a, b, r | |
| |\ +----------------+------+---------+-----------+--------------+
| | +ProjectEndpoints | 1 | 0 | 0 | a, b, r | r, b, a |
| | | +----------------+------+---------+-----------+--------------+
| | +Argument | 2 | 1 | 0 | a, b, r | |
| | +----------------+------+---------+-----------+--------------+
| +Filter | 2 | 1 | 1 | a, b, r | a:A |
| | +----------------+------+---------+-----------+--------------+
| +Expand(All) | 2 | 1 | 3 | a, r -- b | (b)-[r:]-(a) |
| | +----------------+------+---------+-----------+--------------+
| +NodeByLabelScan | 2 | 2 | 3 | b | :B |
+---------------------+----------------+------+---------+-----------+--------------+
and the equivalent passed test (reverse order)
+---------------------+----------------+------+---------+-----------+--------------+
| Operator | Estimated Rows | Rows | DB Hits | Variables | Other |
+---------------------+----------------+------+---------+-----------+--------------+
| +ProduceResults | 1 | 1 | 0 | a | a |
| | +----------------+------+---------+-----------+--------------+
| +SemiApply | 1 | 1 | 0 | a, b, r | |
| |\ +----------------+------+---------+-----------+--------------+
| | +ProjectEndpoints | 1 | 0 | 0 | a, b, r | r, a, b |
| | | +----------------+------+---------+-----------+--------------+
| | +Argument | 2 | 1 | 0 | a, b, r | |
| | +----------------+------+---------+-----------+--------------+
| +Filter | 2 | 1 | 1 | a, b, r | a:A |
| | +----------------+------+---------+-----------+--------------+
| +Expand(All) | 2 | 1 | 3 | a, r -- b | (b)-[r:]-(a) |
| | +----------------+------+---------+-----------+--------------+
| +NodeByLabelScan | 2 | 2 | 3 | b | :B |
+---------------------+----------------+------+---------+-----------+--------------+
Notice the row count after step 1 in each. The same plan should not produce different results. I can speculate that it is a bug related to the graph-pruning shortcuts: once Neo4j traverses an edge in one direction, it will not traverse back on the same edge in the same match (an anti-cycle fail-safe and performance feature). So, in theory, after reversing the order in the WHERE part relative to the MATCH part, Neo4j has to traverse a pruned edge to validate the relationship. If it is the same direction, it auto-passes. If Neo4j tries to do the same check in reverse, it fails because that edge has been pruned. (This is just a theory, though. The validation that is failing is technically the validation of r in reverse.)
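To make the expected behaviour concrete, here is a small in-memory model of what the undirected pattern predicate should mean. It is a sketch of the intended semantics only, not of how Neo4j's planner works; `RELS`, `LABELS`, `connects` and `query` are hypothetical names, and the data mirrors the sample data from the question.

```python
# Each relationship is (id, start node, end node), as in the sample data.
RELS = [("r1", "a1", "b1"), ("r2", "a2", "a1")]
LABELS = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "b2": "B"}


def connects(rel, x, y):
    """Undirected check: does rel join x and y, in either direction?"""
    _, start, end = rel
    return (start, end) in ((x, y), (y, x))


def query(where_order):
    """Model MATCH (a:A)-[r]-(b:B) WHERE <pattern> RETURN a.

    where_order "ab" stands for WHERE (a)-[r]-(b),
    where_order "ba" stands for WHERE (b)-[r]-(a).
    """
    rows = []
    for rel in RELS:
        _, start, end = rel
        for a, b in ((start, end), (end, start)):
            if LABELS[a] == "A" and LABELS[b] == "B":
                left, right = (a, b) if where_order == "ab" else (b, a)
                if connects(rel, left, right):
                    rows.append(a)
    return rows


print(query("ab"))  # ['a1']
print(query("ba"))  # ['a1']
```

Because an undirected check is symmetric in its endpoints, both orderings return the same row in this model, which is why the differing results in Neo4j look like a bug.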
I'm trying to find all possible paths between two nodes. I've used a few Cypher queries that do the required job, but they take a lot of time as the number of hops increases. This is the query:
match p = (n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"}) return p
Also, if I use shortestPath, it only returns the path with the minimum number of hops. So I don't get the paths with two or more hops if a direct connection (1 hop) exists between the nodes.
match p = shortestpath((n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"})) return p
If I increase the minimum hop count to 2 or more, it throws an exception:
shortestPath(...) does not support a minimal length different from 0 or 1
Is there an alternative framework or algorithm that gets all paths in minimal time?
P.S. I'm looking for something on the order of milliseconds. Currently, all queries with more than 3 hops take a few seconds to complete.
I gather you are trying to speed up your original query involving variable-length paths. The shortestpath function is not appropriate for your query, as it literally tries to find a shortest path -- not all paths up to a certain length.
The execution plan for your original query (using sample data) looks like this:
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| +ProduceResults | 0 | 1 | 0 | p | p |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Projection | 0 | 1 | 0 | anon[30], b, n, p | ProjectedPath(Set(anon[30], n),) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 2 | anon[30], b, n | n.name == { AUTOSTRING0} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +VarLengthExpand(All) | 0 | 2 | 7 | anon[30], b, n | (b)<-[:Route*]-(n) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 3 | b | b.name == { AUTOSTRING1} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +AllNodesScan | 3 | 3 | 4 | b | |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
So, your original query is scanning through every node to find the node(s) that match the b pattern. Then, it expands all variable-length paths starting at b. And then it filters those paths to find the one(s) that end with a node that matches the pattern for n.
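To get a feel for what the variable-length expansion has to do, here is a rough sketch of bounded path enumeration. It assumes Cypher's usual rule that a relationship is not reused within a single path; the tiny graph and the names `ROUTES`, `build_adjacency` and `all_paths` are invented for illustration and are not your data.

```python
# Hypothetical undirected :Route relationships as (id, node, node).
ROUTES = [("r1", "Node1", "Node2"), ("r2", "Node1", "X"), ("r3", "X", "Node2")]


def build_adjacency(rels):
    """Index every relationship from both of its endpoints."""
    adjacency = {}
    for rel_id, left, right in rels:
        adjacency.setdefault(left, []).append((rel_id, right))
        adjacency.setdefault(right, []).append((rel_id, left))
    return adjacency


def all_paths(adjacency, start, end, max_hops):
    """Enumerate paths from start to end using 1..max_hops relationships."""
    found = []

    def walk(node, used, path):
        if used and node == end:
            found.append(list(path))
        if len(used) == max_hops:
            return
        for rel_id, neighbour in adjacency.get(node, []):
            if rel_id not in used:  # never reuse a relationship in one path
                used.add(rel_id)
                path.append(neighbour)
                walk(neighbour, used, path)
                path.pop()
                used.remove(rel_id)

    walk(start, set(), [start])
    return found


adjacency = build_adjacency(ROUTES)
print(all_paths(adjacency, "Node1", "Node2", 1))  # [['Node1', 'Node2']]
print(all_paths(adjacency, "Node1", "Node2", 5))
```

Every extra hop multiplies the number of partial paths by roughly the average node degree, so the work grows quickly with the hop limit no matter how fast the end nodes are found.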
Here are a few suggestions that should speed up your query, although you'll have to test it on your data to see by how much:
Give each node a label. For example, Foo.
Create an index that can speed up the search for your end nodes. For example:
CREATE INDEX ON :Foo(name);
Modify your query to force the use of the index on both end nodes. For example:
MATCH p =(n:Foo { name:"Node1" })-[:Route*1..5]-(b:Foo { name:"Node2" })
USING INDEX n:Foo(name)
USING INDEX b:Foo(name)
RETURN p;
After the above changes, the execution plan is:
+-----------------+------+---------+-----------------------------+-----------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+-----------------+------+---------+-----------------------------+-----------------------------+
| +ColumnFilter | 1 | 0 | p | keep columns p |
| | +------+---------+-----------------------------+-----------------------------+
| +ExtractPath | 1 | 0 | anon[33], anon[34], b, n, p | |
| | +------+---------+-----------------------------+-----------------------------+
| +PatternMatcher | 1 | 3 | anon[33], anon[34], b, n | |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | b, n | { AUTOSTRING1}; :Foo(name) |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | n | { AUTOSTRING0}; :Foo(name) |
+-----------------+------+---------+-----------------------------+-----------------------------+
This query plan uses the index to directly get the b and n nodes -- without scanning. This, by itself, should provide a speed improvement. And then this plan uses the "PatternMatcher" to find the variable-length paths between those end nodes. You will have to try this query out to see how efficient the "PatternMatcher" is in doing that.
From your description, I assume that you want to get a shortest path based on some weight, such as a duration property on the :Route relationships.
If that is true, using shortestPath in Cypher is not helpful, since it only takes the number of hops into account. Weighted shortest paths are not yet available in Cypher in an efficient way.
The Java API supports weighted shortest paths via Dijkstra or A* through the GraphAlgoFactory class. For the simple case where your cost function is just the value of a relationship property (as mentioned above), you can also use an existing REST endpoint.
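For readers unfamiliar with it, this is roughly what a Dijkstra-style weighted search computes. It is a plain-Python sketch, not the Neo4j Java API; the graph, the duration values and the name `ROUTES` are assumptions, and relationships are treated as directed to keep it short.

```python
import heapq

# Hypothetical :Route relationships: node -> [(neighbour, duration)].
ROUTES = {
    "Node1": [("Node2", 10.0), ("X", 3.0)],
    "X": [("Node2", 4.0)],
    "Node2": [],
}


def dijkstra(graph, source, target):
    """Return (total_cost, path) for the cheapest path, or (inf, []) if none."""
    best = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbour, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                previous[neighbour] = node
                heapq.heappush(heap, (new_cost, neighbour))
    if target not in best:
        return float("inf"), []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


print(dijkstra(ROUTES, "Node1", "Node2"))  # (7.0, ['Node1', 'X', 'Node2'])
```

Note that the direct one-hop route costs 10 while the two-hop route costs 7, which is exactly the case a hop-count-based shortestPath would get wrong.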
I have a Neo4J database with a large number of datasets (~15M), where I want to perform a greater than search on one of its properties. I have the corresponding property indexed. The property is a float value.
When I do an exact match like MATCH (i:Label) WHERE i.property = $value RETURN count(i) I get the result within a very short time. But when I do the same search with greater than, i.e. MATCH (i:Label) WHERE i.property > $value RETURN count(i) it just takes forever. What is the correct way to do this in Cypher?
Edit: Execution plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
74 ms
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter
|
+NodeByLabelScan
+------------------+---------------+-------------+------------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+------------------+---------------+-------------+------------------------------+
| EagerAggregation | 2064 | count(r) | |
| Filter | 4260557 | r | r.date > Subtract(Divide( |
| | | | TimestampFunction(),{ |
| | | | AUTOINT0}),Literal(86400)) |
| NodeByLabelScan | 14201858 | r | :Request |
+------------------+---------------+-------------+------------------------------+
Total database accesses: ?
Another approach is to create additional aggregation nodes for that property and search through those nodes.
Example
Let's say the property is a value from 0 to 100.
Create the following nodes:
* 0to30
* 31to60
* 61to100
Create relationships from your nodes to these 'aggregate' nodes.
Then search through those nodes:
MATCH (l:Label)-[i:IN]->(a:`0to30`)
RETURN l
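The bucket assignment itself could look something like the sketch below. The bucket names come from the example; the helper names and the treatment of fractional values are assumptions (since the property is a float, a value such as 30.5 is placed in the next bucket up).

```python
from bisect import bisect_left, bisect_right

UPPER_BOUNDS = [30, 60, 100]  # inclusive upper bound of each bucket
BUCKET_LABELS = ["0to30", "31to60", "61to100"]


def bucket_for(value):
    """Return the bucket a property value belongs to."""
    if not 0 <= value <= 100:
        raise ValueError("value outside the 0-100 range of the example")
    return BUCKET_LABELS[bisect_left(UPPER_BOUNDS, value)]


def buckets_above(value):
    """Return the buckets that may hold values strictly greater than value."""
    return BUCKET_LABELS[bisect_right(UPPER_BOUNDS, value):]


print(bucket_for(30))     # 0to30
print(bucket_for(30.5))   # 31to60
print(buckets_above(45))  # ['31to60', '61to100']
```

For a greater-than search you would then match only through the buckets returned by `buckets_above`. The lowest of those buckets can still hold values below the threshold, so the property comparison is still needed there.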
Unfortunately, Neo4j doesn't use its indexes for inequalities in 2.2.x. In the upcoming 2.3.x this should be supported.
I'm curious how filters work in neo4j queries. They result in db hits (according to PROFILE), and it seems that they shouldn't.
An example query:
PROFILE MATCH (a:act)<-[r:relationship]-(n)
WHERE a.chapter='13' and a.year='2009'
RETURN r, n
NodeIndexSeek: (I created an index on the label act for the chapter property) returns 6 rows.
Filter: a.year == {AUTOSTRING1} which results in 12 db hits.
Why does it need to do any db hits if it's already fetched the 6 matching instances of a in earlier db reads, shouldn't it just filter them down without going back to do more db reads?
I realise I'm equating 'db hits' with 'db reads' here, which may not be accurate. If not, what exactly are 'db hits'?
Lastly, the number of db hits incurred by a filter appears to approximately match:
<number of filtering elements> * 2 * <number of already queried nodes to filter on>
where 'number of filtering elements' is the number of filters provided, i.e.
WHERE a.year='2009' and a.property_x='thing'
is two elements.
Thanks for any help.
EDIT:
Here are the results of PROFILE and EXPLAIN on the query.
This is just an example query. I've found the behaviour of
filter db hits = <number of filtering elements> * 2 * <number of already queried nodes to filter on>
to be generally true in queries I've run.
PROFILE MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
8 rows
55 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+------+--------+-------------+---------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------+---------------+------+--------+-------------+---------------------------+
| Projection | 1 | 8 | 0 | a, n, r | r; n |
| Expand(All) | 1 | 8 | 9 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | 1 | 12 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | 6 | 7 | a | :act(chapter) |
+---------------+---------------+------+--------+-------------+---------------------------+
Total database accesses: 28
EXPLAIN MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
4 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+-------------+---------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------+---------------+-------------+---------------------------+
| Projection | 1 | a, n, r | r; n |
| Expand(All) | 1 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | a | :act(chapter) |
+---------------+---------------+-------------+---------------------------+
Total database accesses: ?
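As a quick sanity check of the rule of thumb against the PROFILE output above (the function and variable names are just for illustration; the numbers are copied from the plan):

```python
def predicted_filter_hits(filter_elements, rows_in):
    """Rule of thumb from the question: elements * 2 * rows being filtered."""
    return filter_elements * 2 * rows_in


# DbHits column from the PROFILE table.
PROFILE_DB_HITS = {
    "Projection": 0,
    "Expand(All)": 9,
    "Filter": 12,
    "NodeIndexSeek": 7,
}

# One filter element (a.year) applied to the 6 rows from NodeIndexSeek.
print(predicted_filter_hits(1, 6))    # 12
print(sum(PROFILE_DB_HITS.values()))  # 28
```

The prediction matches the 12 db hits reported for the Filter operator, and the per-operator hits add up to the reported total of 28.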
Because reading a node (record) and reading its property (records) are not the same db operation.
You are right that the filter hits should be at most 6, though.
Usually Neo4j pulls filters and predicates to the earliest possible moment, so it should filter directly after the index lookup.
In some situations, though (due to the predicate), it can only filter after finding the paths; then the number of db hits might equal the number of checked paths.
Which Neo4j version are you using? Can you share your full query plan?
I have a tree with about 300 nodes. For an arbitrary node, I need to draw a subtree from that node to the root (including all possible paths).
For example if I have this tree (edited):
a
|
-----
| | |
b c d
| |
---
|
e
|
f
and the e node is selected, I need to draw:
a
|
---
| |
b c
| |
---
|
e
I am using this Cypher query:
start n=node({nodeId}) optional match n-[r:DEPENDS*]->p return n,r,p
Although it works, depending on the depth of the searched node, it is very very slow (more than 10 seconds).
How can I achieve this efficiently?
Your query will compute all paths, while you are only interested in the one path to the root. So get the root and the node and the shortest path in between.
MATCH path=shortestPath((root)<-[:DEPENDS*]-(n))
WHERE id(root) = {rootId} and id(n) = {nodeId}
RETURN path
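For intuition, a shortest-path search over the example tree can be sketched as a breadth-first search that stops as soon as the root is reached. This is an illustration in plain Python, not what Neo4j runs internally; the `DEPENDS` map is my reading of the example tree (each node points at the nodes it depends on) and the function name is hypothetical.

```python
from collections import deque

# DEPENDS relationships from the example: child -> parents.
DEPENDS = {"b": ["a"], "c": ["a"], "d": ["a"], "e": ["b", "c"], "f": ["e"]}


def shortest_path_to_root(depends, node, root):
    """Return one shortest node-to-root path, or None if root is unreachable."""
    previous = {node: None}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        if current == root:
            path = []
            while current is not None:  # walk back to the starting node
                path.append(current)
                current = previous[current]
            return path[::-1]
        for parent in depends.get(current, []):
            if parent not in previous:  # visit each node only once
                previous[parent] = current
                queue.append(parent)
    return None


print(shortest_path_to_root(DEPENDS, "e", "a"))  # ['e', 'b', 'a']
```

Each node is visited at most once, which is why this stays fast on a 300-node tree. It returns a single path, though, so in the example the branch through c is not part of the result.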