I'm using Neo4j 2.1.7. Recently I was experimenting with MATCH queries, searching for nodes with several labels, and I found that the queries
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
take different amounts of time, most noticeably in cases where you have, for example, 2 million A nodes and 0 B nodes.
So does the order of the labels affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store. That's basically a lookup to quickly get all nodes carrying a given label A.
When doing a query like
MATCH (n:A:B) return count(n)
labelscanstore is used to find all A nodes and then they're filtered if those nodes carry label B as well. If n(A) >> n(B) it's way more efficient to do MATCH (n:B:A) instead since you look up only a few B nodes and filter those for A.
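To make the cost difference concrete, here is a minimal Python sketch of the "scan one label, filter for the other" strategy described above. It is only an in-memory illustration, not Neo4j's implementation; the data set, the `label_index` dictionary and the `count_with_labels` helper are all made up for this example.

```python
def count_with_labels(label_index, node_labels, scan_label, filter_label):
    """Scan every node carrying scan_label, then filter for filter_label.

    Returns (matching_count, nodes_visited).
    """
    visited = 0
    matches = 0
    for node_id in label_index.get(scan_label, ()):
        visited += 1
        if filter_label in node_labels[node_id]:
            matches += 1
    return matches, visited


# Hypothetical data: 1000 A-only nodes, 5 B-only nodes, 1 node with both labels.
node_labels = {}
for i in range(1000):
    node_labels[i] = {"A"}
for i in range(1000, 1005):
    node_labels[i] = {"B"}
node_labels[1005] = {"A", "B"}

# Stand-in for the label scan store: label -> list of node ids.
label_index = {}
for node_id, labels in node_labels.items():
    for label in labels:
        label_index.setdefault(label, []).append(node_id)

a_first = count_with_labels(label_index, node_labels, "A", "B")  # like (n:A:B)
b_first = count_with_labels(label_index, node_labels, "B", "A")  # like (n:B:A)
print(a_first)  # same count, many nodes visited
print(b_first)  # same count, few nodes visited
```

Both orders return the same count, but the number of nodes that have to be looked at depends entirely on which label is scanned first.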
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We now have about 100 B nodes, about 1M A nodes and 1 AB node. In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator         | EstimatedRows | Rows | DbHits | Identifiers | Other         |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation |             3 |    1 |      0 | count(n)    |               |
| Filter           |            12 |    1 |     12 | n           | hasLabel(n:A) |
| NodeByLabelScan  |            12 |   12 |     13 | n           | :B            |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
Related
In Neo4j my database consists of chains of nodes. For each distinct structure/layout (does graph theory have a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships like this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure        | n |
------------------------
| (:a)->(:b)       | 2 |
| (:b)->(:a)       | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would first need to add formatting around the extraction of the first label, to add the : prefix as well as the parentheses. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
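For anyone who wants to check the expected output without a database, the same grouping can be sketched in plain Python. This is only an illustration of the idea (walk each chain from its start node, build the label string, count identical strings); the node ids and the `chain_structure` helper are made up, and it assumes simple chains with at most one outgoing relationship per node and no cycles.

```python
from collections import Counter

# Hypothetical in-memory copy of the test data: node id -> label, plus directed edges.
labels = {1: "a", 2: "b", 3: "b", 4: "a", 5: "a", 6: "b", 7: "a", 8: "b", 9: "b"}
edges = [(1, 2), (3, 4), (5, 6), (7, 8), (8, 9)]

# Assumption: each node has at most one outgoing edge, so a dict is enough.
successor = dict(edges)
has_incoming = {dst for _, dst in edges}


def chain_structure(start):
    """Follow the chain from start and render it like (:a)->(:b)."""
    parts = []
    node = start
    while node is not None:
        parts.append("(:" + labels[node] + ")")
        node = successor.get(node)
    return "->".join(parts)


# Chain starts are the nodes without an incoming edge (NOT ()-->(start)).
starts = [n for n in labels if n not in has_incoming]
structure_counts = Counter(chain_structure(s) for s in starts)

for structure, n in structure_counts.items():
    print(structure, n)
```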
I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):
MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b
I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster, and I expected that Neo4j would be much more efficient for this kind of standard graph query. So is there something that I am missing, maybe some tuning of the Neo4j server or a better way to write the query?
Edit
Here is the result of EXPLAINing the above query:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms
Compiler CYPHER 3.3
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+------------------+--------------------------------+
| Operator              | Estimated Rows | Variables        | Other                          |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults       |          14069 | a, b             |                                |
| |                     +----------------+------------------+--------------------------------+
| +Distinct             |          14069 | a, b             | a, b                           |
| |                     +----------------+------------------+--------------------------------+
| +Filter               |          14809 | anon[11], a, b   | ID(a) < ID(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) |          49364 | anon[11], b -- a | (a)-[:E*]-(b)                  |
| |                     +----------------+------------------+--------------------------------+
| +AllNodesScan         |          40012 | a                |                                |
+-----------------------+----------------+------------------+--------------------------------+
Total database accesses: ?
You can limit the direction, but it requires the graph to be directed.
After doing some testing and profiling of my own, I found that for even very small sets of data (Randomly-generated sets of 10 nodes with 2 random edges on each), making the query be only for a single direction cut down on database hits by a factor of 10000 (from 2266909 to 149 database hits).
Adding a direction to your query (and thus forcing the graph to be directed) cuts down the search space by a great deal, but it requires the graph to be directed.
I also tried simply adding a reverse relationship for each directed one, to see if that would have similar performance. It did not; it did not complete before 5 minutes had passed, at which point I killed it.
Unfortunately, you are not doing anything wrong, but your query is massive.
Neo4j being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.
The query you have written is an unbounded path search between every pair of nodes. The node pairs are bounded, but not in a very meaningful way: the bound of ID(a) < ID(b) just means that each pair only needs to be searched one way. With 10k nodes that still leaves up to 10,000 × 9,999 / 2 ≈ 50 million candidate pairs, and the number of distinct paths that may have to be explored between them can grow factorially with the size of the graph.
And then, that's only after every single path is checked. Searching for the entire transitive closure of a graph the size that you specified will be extremely expensive performance-wise.
The SQL that you posted is not performing the same operation.
You mentioned in the comments that you tried this query in a relational table in a recursive form:
WITH RECURSIVE temp_tc AS (
SELECT v AS a, v AS b FROM nodes
UNION SELECT a,b FROM edges g
UNION SELECT t.a,g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;
I should note that this query is not doing the same thing that Neo4j does when it tries to find all paths. Before Neo4j can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.
The relational SQL query does not do that. It starts from the list of links, but the recursive query has the effect of removing any potential duplicate links; it discovers other links while it's searching for the links of others. For example, if the graph looks like (A)-(B)-(C), that query will find that B connects to C in the process of finding that A connects to C.
With Neo4j, every path must be discovered separately.
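The gap between "which pairs are connected" and "which paths exist" can be seen on a tiny example. The Python sketch below is my own illustration, not what Neo4j or the SQL engine does internally: `reachable_pairs` keeps a visited set (similar in spirit to the de-duplicating UNION in the recursive SQL), while `count_trails` enumerates every path that never reuses an edge, which is roughly the kind of work an unbounded variable-length pattern asks for. Both function names and the test graph are made up.

```python
from collections import deque


def reachable_pairs(nodes, edges):
    """Return all pairs (a, b) with a < b that are connected in an undirected graph."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    pairs = set()
    seen = set()
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search; each node of the component is expanded once.
        component = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for nxt in neighbours[current]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        members = sorted(component)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                pairs.add((a, b))
    return pairs


def count_trails(nodes, edges):
    """Count every path (from every start node) that uses no edge twice."""
    incident = {n: [] for n in nodes}
    for idx, (u, v) in enumerate(edges):
        incident[u].append((idx, v))
        incident[v].append((idx, u))

    def extend(node, used):
        total = 0
        for idx, other in incident[node]:
            if idx not in used:
                total += 1 + extend(other, used | {idx})
        return total

    return sum(extend(n, frozenset()) for n in nodes)


# Hypothetical test graph: 4 nodes, every node connected to every other node.
k4_nodes = [1, 2, 3, 4]
k4_edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]

pair_count = len(reachable_pairs(k4_nodes, k4_edges))
trail_count = count_trails(k4_nodes, k4_edges)
print(pair_count, trail_count)
```

Even on four nodes the number of enumerated paths is already far larger than the number of connected pairs, and that gap widens quickly as the graph gets denser.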
If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.
I have
50K Post nodes
40K Tag nodes
125K TAGGED relationships (meaning an average of 2.5 tags per post)
in my graph, and the query below causes a "Java heap space" error.
match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10
What I expected was some repeated rows depending on number of shared tags. I was not sure how limit would work (stop after first 10 posts or tags). But, since I have limit 10 I did not expect this query to traverse all the graph. It seems like it does.
UPDATE 1
With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.
// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post)
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;
I thought the "ORDER BY" clause might cause traversal of all possible paths, so I removed the clause, but it still takes 15 sec. It also takes 15 sec. when I set the limit value to 1 or 1000 without sorting.
What I expected from Neo4j was: "Start from any Post node, then jump to its Tags and find otherPosts that are tagged with the same tag. When 10 have been found, stop traversing and return the results." I am pretty sure it is not doing this.
To make my expectation clear, assume that graph is this small and we use Limit 3 in the cypher query.
p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]
What I expect is:
Start from p1 (or any Post node)
Jump to t1
No other posts are tagged with t1
Jump to t2
p2 is tagged with t2 (1 of 3)
No other posts are tagged with t2
Jump to t3
p2 is tagged with t3 (2 of 3)
p3 is tagged with t3 (3 of 3)
we reached the limit, break
But, it seems like Limit is applied after traversing all data.
So, my question is now: Did Neo4j find all the matches and return 10 of them, or did it stop searching after the first 10 matches? And of course, why?
UPDATE 2
After the helpful answers I managed to narrow the scope of my question, so I tried the queries below.
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1;
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000;
// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1;
// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;
So I still do not know why, but using aggregation functions (I also tried collect(t.Name) instead of count) breaks the behaviour I expected from LIMIT.
This query will result in a global graph lookup, at least for neo4j 2.1.7 and below.
I would first match the nodes and then expand the path:
MATCH (post:Post)
MATCH (post)-[:TAGS]->(t)<-[:TAGS]-(otherPost)
RETURN post, count(t) as cnt, otherPost
ORDER BY cnt DESC
LIMIT 10;
And this is the execution plan. As you can see, by matching the Post nodes first (via the label index), the only cost is retrieving those nodes and following their relationships:
ColumnFilter
|
+Top
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+NodeByLabel
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter | 10 | 0 | | keep columns post, cnt, otherPost |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEc24f01bf-69cc-4bd9-9aed-be257028194b of type Integer) |
| EagerAggregation | 9900 | 0 | | post, otherPost |
| Filter | 134234 | 0 | | NOT( UNNAMED30 == UNNAMED43) |
| SimplePatternMatcher | 134234 | 0 | t, UNNAMED43, UNNAMED30, post, otherPost | |
| NodeByLabel | 100 | 101 | post, post | :Post |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
Total database accesses: 101
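The EagerAggregation step in the plan is also the reason LIMIT does not shorten the work once count() or collect() is involved: a grouped count for a post is only known after every matching row has been seen. The following Python sketch illustrates that difference with a lazy generator. It is a simplified model rather than Neo4j's executor, and the `matched_rows` helper and the data are made up.

```python
from collections import Counter
from itertools import islice


def matched_rows(posts, consumed):
    """Lazily yield (post, tag) rows and record how many rows were produced."""
    for post, tags in posts.items():
        for tag in tags:
            consumed[0] += 1
            yield post, tag


# Hypothetical data: 1000 posts with 3 tags each.
posts = {"p%d" % i: ["t1", "t2", "t3"] for i in range(1000)}

# Like RETURN p.Title, t.Name LIMIT 1: the row stream can stop after one row.
plain_consumed = [0]
plain = list(islice(matched_rows(posts, plain_consumed), 1))

# Like RETURN p.Title, count(t) LIMIT 1: in this model the grouping step
# has to read every row before it can emit the first group.
agg_consumed = [0]
counts = Counter(post for post, _ in matched_rows(posts, agg_consumed))
aggregated = list(islice(counts.items(), 1))

print(plain, plain_consumed[0])
print(aggregated, agg_consumed[0])
```

This matches the timings in the question: without aggregation the limit value matters, with aggregation the run time is the same for LIMIT 1 and LIMIT 1000.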
And here is a blog post explaining why I removed the labels from everything except the first part of the query: http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
What Christophe said and
Try to reduce the cardinality in between:
match (p1:Post)-[r1:TAGGED]->(tag:Tag)
WITH tag, count(*) as freq, collect(distinct p1.Title) as posts
MATCH (tag)<-[r2:TAGGED]-(p2:Post)
return posts, freq, p2.Title, count(r2)
limit 10
How do I express the following in Cypher:
"Return all nodes with at least one incoming edge of type A and no outgoing edges".
Best Regards
You can use a pattern to exclude nodes from the result subset like this:
MATCH ()-[:A]->(n) WHERE NOT (n)-->() RETURN n
Try
MATCH (n)
WHERE ()-[:A]->(n) AND NOT (n)-->()
RETURN n
or
MATCH ()-[:A]->(n)
WHERE NOT (n)-->()
RETURN DISTINCT n
Edit
Pattern expressions can be used both for pattern matching and as predicates for filtering. If used in the MATCH clause, the paths that answer the pattern are included in the result. If used for filtering, in the WHERE clause, the pattern serves as a limiting condition on the paths that have previously been matched. The result is limited, not extended to include the filter condition. When a pattern is used as a predicate for filtering, the negation of that predicate is also a predicate that can be used as a filter condition. No path answers to the negation of a pattern (if there is such a thing) so negations of patterns cannot be used in the MATCH clause. The phrase
Return all nodes with at least one incoming edge of type A and no outgoing edges
involves two patterns on nodes n, namely any incoming relationship [:A] on n and any outgoing relationship on n. The second must be interpreted as a pattern for a predicate filter condition since it involves a negation, not any outgoing relationship on n. The first, however, can be interpreted either as a pattern to match along with n, or as another pattern predicate filter condition.
These two interpretations give rise to the two cypher queries above. The first query matches all nodes and uses both patterns to filter the result. The second matches the incoming relationship on n along with n and uses the second pattern to filter the results.
The first query will match every node only once before the filtering happens. It will therefore return one result item per node that meets the criteria. The second query will match the pattern "any incoming relationship [:A] on n" once for each path, i.e. once for each incoming relationship on n. It may therefore contain a node multiple times in the result, hence the DISTINCT keyword to remove duplicates.
If the items of interest are precisely the nodes, then using both patterns for predicates in the WHERE clause seems to me the correct interpretation. It is also more efficient since it needs to find only zero or one incoming [:A] on n to resolve the predicate. If the incoming relationships are also of interest, then some version of the second query is the right choice. One would need to bind the relationship and do something useful with it, such as return it.
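The difference in row counts between the two interpretations can be sketched outside the database. The Python below is only an illustration on a made-up list of (source, type, target) triples; `has_outgoing` and `has_incoming_a` are hypothetical helpers standing in for the two pattern predicates.

```python
# Hypothetical graph: (source, type, target) triples.
relationships = [
    ("x", "A", "n1"),
    ("y", "A", "n1"),
    ("z", "A", "n2"),
    ("n2", "B", "x"),
]
nodes = ["x", "y", "z", "n1", "n2"]


def has_outgoing(node):
    """True if the node has any outgoing relationship."""
    return any(src == node for src, _, _ in relationships)


def has_incoming_a(node):
    """True if the node has at least one incoming relationship of type A."""
    return any(rel_type == "A" and dst == node for _, rel_type, dst in relationships)


# Style of the first query: one row per node, both patterns act as predicates.
predicate_rows = [n for n in nodes if has_incoming_a(n) and not has_outgoing(n)]

# Style of the second query: one row per matched incoming :A relationship.
match_rows = [dst for _, rel_type, dst in relationships
              if rel_type == "A" and not has_outgoing(dst)]

# Equivalent of DISTINCT: drop repeated nodes while keeping order.
distinct_rows = list(dict.fromkeys(match_rows))

print(predicate_rows)
print(match_rows)
print(distinct_rows)
```

Node n1 has two incoming :A relationships, so it shows up twice in the match-style result until the duplicates are removed.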
Below are the execution plans for the two queries executed on a 'fresh' neo4j console.
First query:
----
Filter
|
+AllNodes
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
| Filter | 0 | 0 | | (nonEmpty(PathExpression((17)-[ UNNAMED18:A]->(n), true)) AND NOT(nonEmpty(PathExpression((n)-[ UNNAMED36]->(40), true)))) |
| AllNodes | 6 | 7 | n, n | |
+----------+------+--------+-------------+------------------------------------------------------------------------------------------------------------------------------+
Second query:
----
Distinct
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+--------------------------------------------------------------+
| Distinct | 0 | 0 | | |
| Filter | 0 | 0 | | NOT(nonEmpty(PathExpression((n)-[ UNNAMED30]->(34), true))) |
| TraversalMatcher | 0 | 13 | | n, UNNAMED8, n |
+------------------+------+--------+-------------+--------------------------------------------------------------+
I have a graph like this:
(2)<-[0:CHILD]-(1)-[1:CHILD]->(3)
In words: Node 1,2 and 3 (all with names); Edges 0 and 1
I write the following cypher-query:
START nodes = node(1,2,3), relationship = relationship(0,1)
RETURN nodes, relationship
and got as a result:
==> +-----------------------------------------------+
==> | nodes                          | relationship |
==> +-----------------------------------------------+
==> | Node[1]{name->"Risikogruppe2"} | :CHILD[0] {} |
==> | Node[1]{name->"Risikogruppe2"} | :CHILD[1] {} |
==> | Node[2]{name->"Beruf 1"}       | :CHILD[0] {} |
==> | Node[2]{name->"Beruf 1"}       | :CHILD[1] {} |
==> | Node[3]{name->"Beruf 2"}       | :CHILD[0] {} |
==> | Node[3]{name->"Beruf 2"}       | :CHILD[1] {} |
==> +-----------------------------------------------+
==> 6 rows, 0 ms
Now my question:
Why do I get every node twice and every relationship three times? I just want to get each of them once.
thanks for your time ^^
The way Cypher works is very similar to SQL. When you create your variables in your START clause, you're sort of doing a from nodes, relationships in SQL (tables). The reason you're getting a cartesian product of all of the possible values for the two, is because you're not doing any sort of match or where to filter them, so it's basically like:
select *
from nodes, relationships
Where you forgot to put the foreign key relationship between the tables.
In Cypher, you do this by doing a match, usually:
start n=node(1,2,3), r=relationship(0,1)
match (n)-[r]-(m) // find where the n nodes and the r relationships point (to m)
return *
But since you have no match, you get a cartesian product.
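The same effect is easy to reproduce with two plain lists. This Python sketch uses the ids from the question ((2)<-[0]-(1)-[1]->(3)); it is only an illustration of the cartesian product, and the endpoint check stands in loosely for what a MATCH would constrain.

```python
from itertools import product

nodes = [1, 2, 3]
# relationship id -> (start node, end node), mirroring (2)<-[0]-(1)-[1]->(3)
relationships = {0: (1, 2), 1: (1, 3)}

# Two unrelated START variables: every node is paired with every relationship.
cartesian = list(product(nodes, relationships))

# With a MATCH-like constraint: keep only rows where the node is an
# endpoint of the relationship.
matched = [(n, r) for n, r in cartesian if n in relationships[r]]

print(len(cartesian), cartesian)
print(len(matched), matched)
```

Three nodes times two relationships gives the 6 rows from the question; once the two variables are tied together, only the combinations that actually exist in the graph remain.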
You should only see the nodes and relationships once, unless you do some matching.
Tried to reproduce your problem, but I haven't been able to.
http://tinyurl.com/cobd8oq
Would it be possible for you to create a console.neo4j.org example of your problem?
Thanks,
Andrés