I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):
MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b
I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster, and I expected Neo4j to be much more efficient for this kind of standard graph query. Is there something I am missing, maybe some tuning of the Neo4j server or a better way to write the query?
Edit
Here is the result of EXPLAINing the above query:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms
Compiler CYPHER 3.3
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+------------------+--------------------------------+
| Operator              | Estimated Rows | Variables        | Other                          |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults       |          14069 | a, b             |                                |
| |                     +----------------+------------------+--------------------------------+
| +Distinct             |          14069 | a, b             | a, b                           |
| |                     +----------------+------------------+--------------------------------+
| +Filter               |          14809 | anon[11], a, b   | ID(a) < ID(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) |          49364 | anon[11], b -- a | (a)-[:E*]-(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +AllNodesScan         |          40012 | a                |                                |
+-----------------------+----------------+------------------+--------------------------------+
Total database accesses: ?
You can limit the direction, but it requires the graph to be directed.
After doing some testing and profiling of my own, I found that even for very small data sets (randomly generated sets of 10 nodes with 2 random edges on each), restricting the query to a single direction cut database hits by a factor of roughly 15,000 (from 2266909 to 149).
Adding a direction to the pattern (and thus treating the graph as directed) shrinks the search space a great deal, but it only gives the right answer if the graph really is directed.
I also tried simply adding a reverse relationship for each directed one, to see if that would have similar performance. It did not; it did not complete before 5 minutes had passed, at which point I killed it.
Unfortunately, you are not doing anything wrong, but your query is massive.
Neo4J being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.
The query you have written is an unbounded path search between every pair of nodes. The node pairs are bounded, but not in a very meaningful way: the condition ID(a) < ID(b) only means that each pair is reported once. There are still about 50 million candidate pairs (10,000 × 9,999 / 2), and the number of distinct paths connecting them can grow exponentially with the number of edges (factorially, in a dense graph).
And then, that's only after every single path is checked. Searching for the entire transitive closure of a graph the size that you specified will be extremely expensive performance-wise.
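To get a feel for how quickly the number of paths outgrows the number of node pairs, here is a small Python sketch (not Neo4j code). It assumes that a variable-length pattern enumerates every path that does not reuse a relationship, and counts those paths on small complete graphs. The helper names count_trails and complete_graph are made up for this illustration.

```python
from itertools import combinations

def count_trails(edges):
    """Count every path that never reuses an edge, over all start nodes."""
    # Undirected adjacency list holding (neighbour, edge_index) entries.
    adjacency = {}
    for index, (u, v) in enumerate(edges):
        adjacency.setdefault(u, []).append((v, index))
        adjacency.setdefault(v, []).append((u, index))

    def extend(node, used):
        total = 0
        for neighbour, index in adjacency[node]:
            if index not in used:
                # One path ends here, plus every longer path that continues from it.
                total += 1 + extend(neighbour, used | {index})
        return total

    return sum(extend(start, frozenset()) for start in adjacency)

def complete_graph(n):
    """Edge list of an undirected graph in which every pair of nodes is connected."""
    return list(combinations(range(n), 2))

for n in range(2, 6):
    pairs = n * (n - 1) // 2
    print(n, "nodes:", pairs, "pairs,", count_trails(complete_graph(n)), "edge-unique paths")
```

For a triangle this already gives 18 paths for only 3 node pairs, and the gap widens with every node added.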
The SQL that you posted is not performing the same operation.
You mentioned in the comments that you tried this query in a relational table in a recursive form:
WITH RECURSIVE temp_tc AS (
SELECT v AS a, v AS b FROM nodes
UNION SELECT a,b FROM edges g
UNION SELECT t.a,g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;
I should note that this query is not performing the same thing that Neo4J does when it tries to find all paths. Before Neo4J can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.
The relational SQL query does not do that. It starts from the list of links, and the UNION in the recursive query removes any duplicate rows, so each (a, b) pair is stored only once. It also discovers links while it is searching for others: if the graph looks like (A)-(B)-(C), the query finds that B connects to C in the process of finding that A connects to C.
With Neo4j, every path must be discovered separately.
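The deduplicating behaviour of the recursive SQL query can be sketched in Python as a fixpoint over a set of (a, b) rows. This is only an approximation of how a database might evaluate the CTE, not the actual execution strategy of any particular engine, and the function name transitive_closure is made up for the example. Like the SQL, it treats edges as directed, so an undirected graph would need each edge listed in both directions.

```python
def transitive_closure(nodes, edges):
    """Evaluate the recursive query as a fixpoint over a set of (a, b) rows."""
    # The two non-recursive branches: every node reaches itself, plus the edge list.
    closure = {(v, v) for v in nodes} | set(edges)
    frontier = set(closure)
    while frontier:
        # Join only the rows found in the previous round against the edge list.
        derived = {(a, d) for (a, b) in frontier for (c, d) in edges if b == c}
        # UNION semantics: rows that are already known are dropped, not revisited.
        frontier = derived - closure
        closure |= frontier
    return closure

rows = transitive_closure(["A", "B", "C"], [("A", "B"), ("B", "C")])
print(sorted(rows))
```

Because a pair is never processed again once it is in the set, the work is bounded by the number of pairs rather than by the number of paths.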
If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.
In Neo4j my database consists of chains of nodes. For each distinct structure/layout (does graph theory have a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships like this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure        | n |
------------------------
| (:a)->(:b)       | 2 |
| (:b)->(:a)       | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would first need to add formatting around the extraction of the first label, to add the : prefix as well as the parentheses. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
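If you would rather do the grouping and formatting on the client side, the same idea can be sketched in plain Python. This assumes you have already fetched each chain as a list of its nodes' first labels; the helper names format_structure and count_structures are made up for the example.

```python
from collections import Counter

def format_structure(labels):
    """Turn ['a', 'b'] into '(:a)->(:b)', mirroring the apoc.text.join() step."""
    return "->".join("(:" + label + ")" for label in labels)

def count_structures(chains):
    """Count how many chains share the same formatted label sequence."""
    return Counter(format_structure(chain) for chain in chains)

# Label sequences of the four chains from the test data in the question.
chains = [["a", "b"], ["b", "a"], ["a", "b"], ["a", "b", "b"]]
counts = count_structures(chains)
for structure, n in counts.items():
    print(structure, n)
```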
I am trying to load 500000 nodes, but the query does not execute successfully. Can anyone tell me the limit on the number of nodes in the Neo4j Community Edition database?
I am running these queries
result = session.run("""
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
merge (s:Start {ac:row.START})
on create set s.START=row.START
merge (e:End {en:row.END})
on create set e.END=row.END
FOREACH (_ in CASE row.TYPE WHEN "PAID" then [1] else [] end |
MERGE (s)-[:PAID {cr:row.CREDIT}]->(e))
FOREACH (_ in CASE row.TYPE WHEN "UNPAID" then [1] else [] end |
MERGE (s)-[:UNPAID {db:row.DEBIT}]->(e))
RETURN s.START as index, count(e) as connections
order by connections desc
""")
I don't think the community edition is more limited than the enterprise edition in that regard, and most of the limits have been removed in 3.0.
Anyway, I can easily create a million nodes (in one transaction):
neo4j-sh (?)$ unwind range(1, 1000000) as i create (n:Node) return count(n);
+----------+
| count(n) |
+----------+
| 1000000 |
+----------+
1 row
Nodes created: 1000000
Labels added: 1000000
3495 ms
Running that 10 times, I've definitely created 10 million nodes:
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 10000000 |
+----------+
1 row
3 ms
Your problem is most likely related to the size of the transaction: if it's too large, it can result in an OutOfMemory error, and before that it can slow the instance to a crawl because of all the garbage collection. Split the node creation into smaller batches, e.g. with USING PERIODIC COMMIT if you use LOAD CSV.
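If you create nodes from client code rather than with LOAD CSV, the batching idea looks roughly like the sketch below. The helper batches is made up for the example, and the step that would send each batch to the database in its own transaction is deliberately left out.

```python
def batches(rows, size):
    """Yield consecutive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # The last batch may be smaller than `size`.
        yield batch

# Each batch would be written in a separate transaction.
sizes = [len(batch) for batch in batches(range(25000), 10000)]
print(sizes)
```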
Update:
Your query already includes USING PERIODIC COMMIT and only creates 2 nodes and 1 relationship per line of the CSV file, so the problem most likely lies in the performance of the query itself rather than in the size of the transaction.
You have Start nodes with 2 properties set to the same value from the CSV (ac and START), and End nodes also with 2 properties set to the same value (en and END). Is there a uniqueness constraint on the property used for the MERGE? Without one, processing each line takes longer and longer as nodes are created, because every MERGE has to scan all the existing nodes with the wanted label (an O(n^2) algorithm overall, which is pretty bad for 500K nodes).
CREATE CONSTRAINT ON (n:Start) ASSERT n.ac IS UNIQUE;
CREATE CONSTRAINT ON (n:End) ASSERT n.en IS UNIQUE;
That's probably the main improvement to apply.
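To illustrate why the constraint (and the index behind it) matters so much, here is a toy Python model of the two situations. It is not how Neo4j is implemented: a list scan stands in for the label scan and a set stands in for the index, and both function names are made up.

```python
def merge_with_scan(keys):
    """MERGE without an index: compare against every existing node."""
    nodes, comparisons = [], 0
    for key in keys:
        found = False
        for existing in nodes:
            comparisons += 1
            if existing == key:
                found = True
                break
        if not found:
            nodes.append(key)
    return len(nodes), comparisons

def merge_with_index(keys):
    """MERGE backed by a unique index: one lookup per row."""
    index, lookups = set(), 0
    for key in keys:
        lookups += 1
        if key not in index:
            index.add(key)
    return len(index), lookups

print(merge_with_scan(range(1000)))   # comparisons grow quadratically with the row count
print(merge_with_index(range(1000)))  # lookups grow linearly with the row count
```

With 1,000 distinct keys the scan version needs 499,500 comparisons against 1,000 index lookups, and the gap grows quadratically from there.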
However, do you actually need to MERGE the relationships (instead of CREATE)? Either the CSV contains a snapshot of the current credit relationships between all Start and End nodes (in which case there's a single relationship per pair), or it contains all transactions and there's no real reason to merge those for the same amount.
Finally, do you actually need to report the sorted, aggregated result from that loading query? It requires more memory and could be split into a separate query, after the loading has succeeded.
I'm using Neo4j 2.1.7. Recently I was experimenting with MATCH queries, searching for nodes with several labels, and I found that the queries
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
take different amounts of time, especially in cases where you have, for example, 2 million nodes labeled A and none labeled B.
So does the order of the labels affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store, which is basically a lookup to quickly get all nodes carrying a given label A.
When doing a query like
MATCH (n:A:B) return count(n)
labelscanstore is used to find all A nodes and then they're filtered if those nodes carry label B as well. If n(A) >> n(B) it's way more efficient to do MATCH (n:B:A) instead since you look up only a few B nodes and filter those for A.
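The effect of the label order can be sketched with a toy model in Python, in which the label scan store is just a dictionary from label to a set of node ids. The function name count_with_labels and the data are made up, and this is not Neo4j's actual implementation.

```python
def count_with_labels(nodes_by_label, first, second):
    """Scan all nodes carrying `first`, then filter them by `second`."""
    matches, checks = 0, 0
    for node in nodes_by_label[first]:
        checks += 1
        if node in nodes_by_label[second]:
            matches += 1
    return matches, checks

# 100000 nodes labeled A, 3 nodes labeled B, exactly one node carrying both.
nodes_by_label = {
    "A": set(range(100000)),
    "B": {0, 100000, 100001},
}

print(count_with_labels(nodes_by_label, "A", "B"))  # scans every A node
print(count_with_labels(nodes_by_label, "B", "A"))  # scans only the B nodes
```

Both orders return the same count, but starting from the rarer label does 3 checks instead of 100000.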
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We now have roughly 100 B nodes, 1M A nodes and one node carrying both labels (range() is inclusive, so the exact counts are one higher). In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator         | EstimatedRows | Rows | DbHits | Identifiers | Other         |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation |             3 |    1 |      0 | count(n)    |               |
| Filter           |            12 |    1 |     12 | n           | hasLabel(n:A) |
| NodeByLabelScan  |            12 |   12 |     13 | n           | :B            |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
How do I express the following in Cypher?
"Return all nodes with at least one incoming edge of type A and no outgoing edges".
Best Regards
You can use a pattern to exclude nodes from the result subset like this:
MATCH ()-[:A]->(n) WHERE NOT (n)-->() RETURN n
Try
MATCH (n)
WHERE ()-[:A]->(n) AND NOT (n)-->()
RETURN n
or
MATCH ()-[:A]->(n)
WHERE NOT (n)-->()
RETURN DISTINCT n
Edit
Pattern expressions can be used both for pattern matching and as predicates for filtering. If used in the MATCH clause, the paths that answer the pattern are included in the result. If used for filtering, in the WHERE clause, the pattern serves as a limiting condition on the paths that have previously been matched. The result is limited, not extended to include the filter condition. When a pattern is used as a predicate for filtering, the negation of that predicate is also a predicate that can be used as a filter condition. No path answers to the negation of a pattern (if there is such a thing) so negations of patterns cannot be used in the MATCH clause. The phrase
Return all nodes with at least one incoming edge of type A and no outgoing edges
involves two patterns on nodes n, namely any incoming relationship [:A] on n and any outgoing relationship on n. The second must be interpreted as a pattern for a predicate filter condition since it involves a negation, not any outgoing relationship on n. The first, however, can be interpreted either as a pattern to match along with n, or as another pattern predicate filter condition.
These two interpretations give rise to the two cypher queries above. The first query matches all nodes and uses both patterns to filter the result. The second matches the incoming relationship on n along with n and uses the second pattern to filter the results.
The first query matches every node only once before the filtering happens. It therefore returns one result item per node that meets the criteria. The second query matches the pattern "any incoming relationship [:A] on n" once for each path, i.e. once for each incoming relationship on n. Its result may therefore contain the same node multiple times, hence the DISTINCT keyword to remove duplicates.
If the items of interest are precisely the nodes, then using both patterns for predicates in the WHERE clause seems to me the correct interpretation. It is also more efficient since it needs to find only zero or one incoming [:A] on n to resolve the predicate. If the incoming relationships are also of interest, then some version of the second query is the right choice. One would need to bind the relationship and do something useful with it, such as return it.
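The difference between the two interpretations can be mimicked in plain Python on a tiny made-up graph, where each edge is a (source, type, target) tuple. The function names are hypothetical and the code only imitates the row semantics, not how Cypher executes the queries.

```python
def has_outgoing(edges, node):
    return any(source == node for source, _, _ in edges)

def has_incoming_a(edges, node):
    return any(target == node and kind == "A" for _, kind, target in edges)

def first_query(nodes, edges):
    """Both patterns used as predicates: every node is tested exactly once."""
    return [n for n in nodes if has_incoming_a(edges, n) and not has_outgoing(edges, n)]

def second_query_rows(edges):
    """Incoming [:A] matched as a pattern: one row per matching relationship."""
    return [target for _, kind, target in edges
            if kind == "A" and not has_outgoing(edges, target)]

nodes = ["x", "y", "z", "w", "n1", "n2"]
edges = [("x", "A", "n1"), ("y", "A", "n1"), ("z", "A", "n2"), ("n2", "B", "w")]

print(first_query(nodes, edges))                        # n1 appears once
print(second_query_rows(edges))                         # n1 appears once per incoming A edge
print(list(dict.fromkeys(second_query_rows(edges))))    # the effect of DISTINCT
```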
Below are the execution plans for the two queries executed on a 'fresh' neo4j console.
First query:
----
Filter
|
+AllNodes
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Filter | 0 | 0 | | (nonEmpty(PathExpression((17)-[ UNNAMED18:A]->(n), true)) AND NOT(nonEmpty(PathExpression((n)-[ UNNAMED36]->(40), true)))) |
| AllNodes | 6 | 7 | n, n | |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
Second query:
----
Distinct
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Distinct | 0 | 0 | | |
| Filter | 0 | 0 | | NOT(nonEmpty(PathExpression((n)-[ UNNAMED30]->(34), true))) |
| TraversalMatcher | 0 | 13 | | n, UNNAMED8, n |
+------------------+------+--------+-------------+--------------------------------------------------------------+
I'm getting started with Neo4j 2.0.1 and I'm already running into performance problems that make me think my approach is wrong. I have a single node type so far (all with the label NeoPerson) and one type of relationship (all of type NeoWeight). In my test setup, there are about 100,000 nodes and each node has between 0 and 300 relationships to other nodes. There is a Neo4j 2.0-style index on NeoPerson's only field, called profile_id (e.g. CREATE INDEX ON :NeoPerson(profile_id)). Looking up a NeoPerson by profile_id is reasonably fast:
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38}) return n;
+----------------------------+
| n |
+----------------------------+
| Node[23840]{profile_id:38} |
+----------------------------+
1 row
45 ms
However, once I throw relationships into the mix, it gets quite slow.
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38})-[e:NeoWeight]->() return n, e;
+----------------------------------------------------------------------------+
| n | e |
+----------------------------------------------------------------------------+
| Node[23840]{profile_id:38} | :NeoWeight[8178324]{value:384} |
| Node[23840]{profile_id:38} | :NeoWeight[8022460]{value:502} |
...
| Node[23840]{profile_id:38} | :NeoWeight[54914]{} |
+----------------------------------------------------------------------------+
244 rows
2409 ms
My understanding was that traversing relationships from a single node should be quite efficient (isn't that the point of using a graph database?), so why is it taking over 2 seconds for such a simple query on a small data set? I didn't see a way to add an index on a relationship whose keys are the source and/or destination nodes.
People use Neo4j in production without issues. If they have the requirement that the first user query has to return in a few ms, they warm up the caches after server start. E.g. by running their most important use-case queries upfront.
It takes some time to load the nodes and relationships from disk, especially if the relationships (and their properties) of the single node are scattered across the relationship store file and have to be read from a spinning disk.
For the first query it also takes a bit longer as its query plan has to be built and compiled.
That's why in production you usually use parameters to allow query caching.
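Why parameters help with caching can be shown with a toy plan cache keyed by the query text. This is a simplified Python model, not Neo4j's real cache; the class PlanCache is made up, and the {id} placeholder follows the 2.x parameter syntax.

```python
class PlanCache:
    """Toy cache: a query is compiled only the first time its text is seen."""

    def __init__(self):
        self.plans = {}
        self.compilations = 0

    def plan_for(self, query_text):
        if query_text not in self.plans:
            self.compilations += 1  # stands in for the expensive planning step
            self.plans[query_text] = "plan for: " + query_text
        return self.plans[query_text]

# Literal values baked into the text: every query string is different.
literal = PlanCache()
for profile_id in range(100):
    literal.plan_for("MATCH (n:NeoPerson {profile_id: %d}) RETURN n" % profile_id)

# A parameter placeholder: the text stays the same, only the value changes.
parameterised = PlanCache()
for profile_id in range(100):
    parameterised.plan_for("MATCH (n:NeoPerson {profile_id: {id}}) RETURN n")

print(literal.compilations, parameterised.compilations)
```

In this model the literal version compiles 100 plans while the parameterised version compiles one and reuses it.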
What is the use-case you're trying to address?