Looking for a little assistance on a Cypher query. Given a set of customers peer who own book p, I am able to retrieve a set of customers target who own at least one book also owned by peer but who don't own p. This is accomplished using the following query:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
return target.name
limit 10;
My next step is to determine how many other books each member of the target set owns, and order those members accordingly. I've attempted several variations based on the Neo4j documentation and SO answers, but am having no luck. For instance, I tried using WITH:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
WITH target, count(o) as co
WHERE co > 1
return target.name
limit 10;
I also tried what seemed, to my novice eye, to be the most reasonable query:
match
(p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
-[:owns]->(other:Book)<-[o:owns]-(target:Customer)
WHERE NOT( (target)-[:owns]->(p))
return target.name, count(o)
limit 10;
In both of these cases, the query just runs without end (upwards of 10 minutes before I stop execution). Any insight into what I'm doing wrong?
EDIT
As it turns out this latter query does execute but takes 15 minutes to complete and is reporting incorrect numbers, as evidenced here:
+--------------+----------+
| target.name  | count(o) |
+--------------+----------+
| "John Smith" | 12840    |
| "Mary Moore" | 11501    |
+--------------+----------+
I'm looking for the number of books each customer actually owns, and I'm not sure where the 12840 and 11501 figures come from. Any thoughts?
How about this one:
MATCH (p:Book {isbn:"123456"})<-[:owns]-(peer:Customer)
WITH distinct peer, p
MATCH (peer)-[:owns]->(other:Book)
WITH distinct other, p
MATCH (other)<-[o:owns]-(target:Customer)
WHERE NOT((target)-[:owns]->(p))
RETURN target.name, count(o)
LIMIT 10;
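The inflated numbers in the question are consistent with count(o) counting matched paths rather than books: every (peer, other, target) combination produces its own row. Here is a minimal Python sketch of that effect. The ownership data and the function names are made up for illustration and are not part of the original graph.

```python
from collections import Counter

# Hypothetical ownership data: customer -> set of books owned.
owns = {
    "peer1": {"p", "b1", "b2"},
    "peer2": {"p", "b1", "b2"},
    "target": {"b1", "b2", "b3"},
}

def path_counts(owns, book):
    """One row per (peer, other, target) path, which is what count(o) sees."""
    peers = [c for c, books in owns.items() if book in books]
    counts = Counter()
    for peer in peers:
        for other in owns[peer] - {book}:
            for target, books in owns.items():
                # target owns `other` but does not own `book`
                if other in books and book not in books:
                    counts[target] += 1
    return counts

def distinct_shared(owns, book):
    """Deduplicate the peers' books first, then count shared books per target."""
    peers = [c for c, books in owns.items() if book in books]
    others = set().union(*(owns[c] for c in peers)) - {book}
    return {
        target: len(books & others)
        for target, books in owns.items()
        if book not in books and books & others
    }

print(path_counts(owns, "p"))      # target is counted once per path
print(distinct_shared(owns, "p"))  # target is counted once per shared book
```

With this sample data the path-based count for "target" is 4, the deduplicated count is 2, and the customer actually owns 3 books. Deduplicating as in the query above removes the multiplication, but it still counts shared books only. If the goal is the total number of books per target, a further count over all of that customer's owns relationships would presumably be needed.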
Related
I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):
MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b
I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster, and I expected Neo4j to be much more efficient for this kind of standard graph query. Is there something I am missing, maybe some tuning of the Neo4j server or a better way to write the query?
Edit
Here is the result of EXPLAINing the above query:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms
Compiler CYPHER 3.3
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+------------------+--------------------------------+
| Operator | Estimated Rows | Variables | Other |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults | 14069 | a, b | |
| | +----------------+------------------+--------------------------------+
| +Distinct | 14069 | a, b | a, b |
| | +----------------+------------------+--------------------------------+
| +Filter | 14809 | anon[11], a, b | ID(a) < ID(b) |
| | +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) | 49364 | anon[11], b -- a | (a)-[:E*]-(b) |
| | +----------------+------------------+--------------------------------+
| +AllNodesScan | 40012 | a | |
+-----------------------+----------------+------------------+--------------------------------+
Total database accesses: ?
You can limit the direction, but it requires the graph to be directed.
After doing some testing and profiling of my own, I found that for even very small sets of data (Randomly-generated sets of 10 nodes with 2 random edges on each), making the query be only for a single direction cut down on database hits by a factor of 10000 (from 2266909 to 149 database hits).
Adding a direction to your query cuts down the search space by a great deal, because each relationship can then be traversed one way only.
I also tried simply adding a reverse relationship for each directed one, to see if that would have similar performance. It did not; it did not complete before 5 minutes had passed, at which point I killed it.
Unfortunately, you are not doing anything wrong, but your query is massive.
Neo4j being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.
The query you have written is an unbounded path search between every pair of nodes. The node pairs are bounded, but not in a very meaningful way: the condition ID(a) < ID(b) only means that each pair is considered in one direction. For 10k nodes that still leaves roughly 50 million candidate pairs (10,000 × 9,999 / 2), and the number of distinct paths between them can grow factorially with the size of the graph.
And then, that's only after every single path is checked. Searching for the entire transitive closure of a graph the size that you specified will be extremely expensive performance-wise.
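To get a feel for how fast the number of paths grows, here is a rough Python sketch that counts simple paths (no repeated node) between all pairs a < b in a small complete graph. This is an illustration only: the complete graph is an assumption, not the graph from the question, and Cypher actually enumerates paths without repeated relationships, which is at least as many.

```python
from itertools import combinations

def complete_graph(n):
    """Adjacency lists for an undirected complete graph on nodes 0..n-1."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def count_simple_paths(adj, src, dst, visited=None):
    """Count paths from src to dst that never visit a node twice."""
    visited = (visited or set()) | {src}
    if src == dst:
        return 1
    return sum(
        count_simple_paths(adj, nxt, dst, visited)
        for nxt in adj[src]
        if nxt not in visited
    )

def total_paths(n):
    """Total simple paths over all unordered pairs in a complete graph."""
    adj = complete_graph(n)
    return sum(count_simple_paths(adj, a, b) for a, b in combinations(range(n), 2))

for n in range(3, 7):
    pairs = n * (n - 1) // 2
    print(n, "nodes:", pairs, "pairs,", total_paths(n), "paths")
```

Going from 3 to 6 nodes, the number of pairs goes from 3 to 15 while the number of paths goes from 6 to 975. The result set only needs the pairs, but the query has to walk the paths.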
The SQL that you posted is not performing the same operation.
You mentioned in the comments that you tried this query in a relational table in a recursive form:
WITH RECURSIVE temp_tc AS (
SELECT v AS a, v AS b FROM nodes
UNION SELECT a,b FROM edges g
UNION SELECT t.a,g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;
I should note that this query is not doing the same thing that Neo4j does when it tries to find all paths. Before Neo4j can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.
The relational query does not do that. It starts from the list of links, and the UNION in the recursive query removes duplicate pairs as it goes, so each pair is expanded only once. It also discovers links while it is searching for others; e.g. if the graph looks like (A)-(B)-(C), that query will find that B connects to C in the process of finding that A connects to C.
With Neo4j, every path must be discovered separately.
If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.
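For an undirected graph, two nodes are in each other's transitive closure exactly when they are in the same connected component, so the pairs can be computed without enumerating paths at all. Here is a minimal client-side sketch in Python. It assumes the nodes and edges have already been exported as plain lists, and the function names are mine, not part of any Neo4j API.

```python
from collections import deque
from itertools import combinations

def connected_components(nodes, edges):
    """Breadth-first search over an undirected graph; each node is visited once."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components

def closure_pairs(nodes, edges):
    """All pairs (a, b) with a < b that are connected by some path."""
    return [
        pair
        for component in connected_components(nodes, edges)
        for pair in combinations(component, 2)
    ]

print(closure_pairs([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

The search itself is linear in the number of nodes and edges; the cost is then dominated by writing out the pairs, which is unavoidable if the full closure is really what you need.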
I am working on a project that uses Node.js, Cypher, and Neo4j. The project's front end occasionally needs to QUICKLY pull a random user. I have seen this query on the internet:
MATCH (n:User) WHERE rand() < 0.1 RETURN n LIMIT 21
but I have no idea what this does. It seems pretty fast, but I would like to understand it. A breakdown of what I know:
MATCH | Match some nodes
(n:User) | Let's call this node n, and it has to be of type User
WHERE | Specify conditions for node match
rand() | Return a random number from 0 to 0.9999...
< | Less than
0.1 | ??
RETURN | Give back the matched node(s)
n | Our node(s)
LIMIT 21 | Don't return more than 21 nodes
What do rand() and 0.1 do? Do they somehow limit the potential nodes to return?
If this helps, I have around 10,000 nodes
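As your question already states, a WHERE clause specifies the conditions for a MATCH to succeed. rand() is evaluated separately for each candidate node, so WHERE rand() < 0.1 means each User node has a 10% probability of passing the filter. LIMIT 21 then keeps at most 21 of the nodes that passed.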
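The same filter-then-limit behaviour can be mimicked in a few lines of Python. This is only a sketch of the logic, not of how Neo4j executes it; sample_users and its parameters are names I made up.

```python
import random
from itertools import islice

def sample_users(users, threshold=0.1, limit=21, rng=random):
    """Keep each user with probability `threshold`, stop after `limit` hits."""
    passing = (user for user in users if rng.random() < threshold)
    return list(islice(passing, limit))

users = range(10_000)  # stand-in for roughly 10,000 User nodes
picked = sample_users(users)
print(len(picked), picked[:5])
```

One thing the sketch makes visible: because the scan stops as soon as 21 users have passed, the picks come from roughly the first couple of hundred users in scan order, so the result is random but not uniformly spread over all 10,000. If Neo4j stops early in the same way, that would also explain why the query feels fast.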
I have
50K Post nodes
40K Tag nodes
125K TAGGED relationships (meaning an average of 2.5 tags per post)
in my graph and below query causes a "Java heap space" error.
match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10
What I expected was some repeated rows depending on number of shared tags. I was not sure how limit would work (stop after first 10 posts or tags). But, since I have limit 10 I did not expect this query to traverse all the graph. It seems like it does.
UPDATE 1
With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.
// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post)
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;
I thought the "ORDER BY" clause might cause traversal of all possible paths, so I removed it, but the query still takes 15 sec. It also takes 15 sec. when I set the limit to 1 or 1000 without sorting.
What I expected from Neo4j was: "Start from any Post node, then jump to its Tags and find otherPosts that are tagged with the same tag. When 10 are found, stop traversing and return the results." I am pretty sure it is not doing this.
To make my expectation clear, assume that graph is this small and we use Limit 3 in the cypher query.
p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]
What I expect is:
Start from p1 (or any Post node)
Jump to t1
No other posts are tagged with t1
Jump to t2
p2 is tagged with t2 (1 of 3)
No other posts are tagged with t2
Jump to t3
p2 is tagged with t3 (2 of 3)
p3 is tagged with t3 (3 of 3)
we reached the limit, break
But, it seems like Limit is applied after traversing all data.
So, my question is now: did Neo4j find all the matches and return 10 of them, or did it stop searching after the first 10 matches? And, of course, why?
UPDATE 2
After helpful answers I managed to decrease the scope of my question so I tried below queries.
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1;
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000;
// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1;
// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;
So, I still do not know why, but using an aggregation function (I also tried collect(t.Name) instead of count) breaks the behaviour I expected from LIMIT: the query takes the same time regardless of the limit value.
This query will result in a global graph lookup, at least for Neo4j 2.1.7 and below.
I would first match the nodes and then expand the path:
MATCH (post:Post)
// TAGGED is the relationship type used in the question
MATCH (post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost)
RETURN post, count(t) as cnt, otherPost
ORDER BY cnt DESC
LIMIT 10;
And this is the execution plan. As you can see, by matching only the Post nodes first (so using the label index), the only cost is retrieving those nodes and following their relationships:
ColumnFilter
|
+Top
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+NodeByLabel
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter | 10 | 0 | | keep columns post, cnt, otherPost |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEc24f01bf-69cc-4bd9-9aed-be257028194b of type Integer) |
| EagerAggregation | 9900 | 0 | | post, otherPost |
| Filter | 134234 | 0 | | NOT( UNNAMED30 == UNNAMED43) |
| SimplePatternMatcher | 134234 | 0 | t, UNNAMED43, UNNAMED30, post, otherPost | |
| NodeByLabel | 100 | 101 | post, post | :Post |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
Total database accesses: 101
And here a blog post explaining why I removed labels except for the first part of the query : http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
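The EagerAggregation step in the plan above is the likely reason LIMIT does not stop the traversal early: a count per group is not known until every matching row has been seen. Here is a small Python analogy using the three-post example from the question. It is a sketch of the principle only, and the matches helper is hypothetical.

```python
from collections import Counter
from itertools import islice

posts = {
    "p1": ["t1", "t2", "t3"],
    "p2": ["t2", "t3", "t4"],
    "p3": ["t3", "t4", "t5"],
}

def matches(posts, produced):
    """Yield (title, tag) rows lazily and record every row actually produced."""
    for title, tags in posts.items():
        for tag in tags:
            produced.append((title, tag))
            yield title, tag

# RETURN p.Title, t.Name LIMIT 1: rows can stream, so one row is enough.
produced_plain = []
plain = list(islice(matches(posts, produced_plain), 1))

# RETURN p.Title, count(t) LIMIT 1: all rows are consumed before the limit applies.
produced_agg = []
counts = Counter(title for title, _ in matches(posts, produced_agg))
aggregated = list(islice(counts.items(), 1))

print(plain, len(produced_plain))
print(aggregated, len(produced_agg))
```

The plain version touches 1 row, the aggregated version touches all 9, and both return a single result. That matches the timings in UPDATE 2, where the aggregated queries take 3 sec. whether the limit is 1 or 1000.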
In addition to what Christophe said, try to reduce the cardinality in between:
// the tag node is bound as `tag` so that the WITH clause can refer to it
match (p1:Post)-[r1:TAGGED]->(tag:Tag)
WITH tag, count(*) as freq, collect(distinct p1.Title) as posts
MATCH (tag)<-[r2:TAGGED]-(p2:Post)
return posts, freq, p2.Title, count(r2)
limit 10
I have a graph which has states following each other in time. Each of the states can have a number of actions that happened (0..n) and a number of recommendations (0..n) assigned by some software.
I can do a query on cypher like this
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n
s-[l?:hasRecommendation]->r
where l.likelihood>0.2
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
which gives me a table like this
state | actions | recommendation | likelihood
--------------------------------------------------
State 1 | [a1,a2,a3] | a1 | 0.25
State 1 | [a1,a2,a3] | a4 | 0.05
State 2 | [a2,a3] | a3 | 0.56
State 2 | [a2,a3] | a2 | 0.34
State 2 | [a2,a3] | a1 | 0.15
If I process that table manually, I can filter these results and keep only the top 2 results for each state, for example. This is time-consuming and very inelegant.
My problem is that I never know how many recommendations a state has, so I can't use limit/skip here. Ideally I'd like the query to return only a set number of states (e.g. 100) including their top recommendations, so it could return between 0 and 100*n lines.
Is there a better way to achieve this in cypher?
The easy way to achieve this is to first select the states that have recommendations and limit the result to 100, and then retrieve only the top 2 recommendations for these 100 states by dynamically computing a percentile threshold for each state. Something like this:
start n=node:name(name="State")
Match s-[:isA]->n, s-[?:hasRecommendation]->r
With distinct s
Order by s.name
limit 100
Match s-[?:hasRecommendation]->r
With s, (count(r)-1.0) / count(r) as p
Match s-[l?:hasRecommendation]->r
With s, percentile_disc(l.likelihood, p) as m
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n,
s-[l?:hasRecommendation]->r
where l.likelihood>= m
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
It's a bit verbose, but Cypher does not support nested aggregation functions, so I have to get the "count" and the "percentile" in two separate steps.
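To sanity-check the output of a query like this on a small sample, the intended "top 2 per state" result can be reproduced client-side. Here is a minimal Python sketch using the rows from the table in the question; top_n_per_state is a name I made up, and this is a reference for comparison rather than a replacement for the query.

```python
from collections import defaultdict
from heapq import nlargest

# (state, recommendation, likelihood) rows copied from the question's table.
rows = [
    ("State 1", "a1", 0.25),
    ("State 1", "a4", 0.05),
    ("State 2", "a3", 0.56),
    ("State 2", "a2", 0.34),
    ("State 2", "a1", 0.15),
]

def top_n_per_state(rows, n=2):
    """Group rows by state and keep the n highest-likelihood recommendations."""
    by_state = defaultdict(list)
    for state, recommendation, likelihood in rows:
        by_state[state].append((likelihood, recommendation))
    return {
        state: [(rec, lik) for lik, rec in nlargest(n, recs)]
        for state, recs in sorted(by_state.items())
    }

for state, recs in top_n_per_state(rows).items():
    print(state, recs)
```

For the sample table this keeps a1 and a4 for State 1, and a3 and a2 for State 2.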
I'm currently experimenting a bit with Cypher. I have a simple setup of components connected to a merchant by a relationship "sells" that has a property "price":
(merchant-[:sells{price:10}]->component)
I wrote a Cypher query that calculates the total price per merchant, lowest first, assuming you buy all products from the same merchant.
MATCH sup-[s:sells]->component
WITH SUM(s.price) AS total, sup
RETURN sup, total
ORDER BY total ASC
While this works, I have an issue finding the cheapest price(s) when 2 or more suppliers are tied. I'd like to get something like
_________________________
| price | supplier |
-------------------------
| 60 | conrad |
| | amazon |
-------------------------
You can view my setup here:
http://console.neo4j.org/?id=wpz165
EDIT:
OK, I found a way, although it isn't pretty.
MATCH sup-[s:sells]->component
WITH SUM(s.price) AS minprice, sup
ORDER BY minprice
LIMIT 1
MATCH sup2-[s2:sells]->component2
WITH SUM(s2.price) AS total2, sup2, minprice
WHERE total2 = minprice
RETURN minprice, sup2
How does this work? The first part finds the lowest price (by ordering and returning only the first row). The second part runs the whole query again and filters out suppliers that don't have the lowest price, so the whole query is run twice.
Any better ideas?
To my taste this is less ugly, though it does require three WITH clauses. The steps are:
Sum price by supplier for all components
Find Minimum price
Return all suppliers with minimum price
MATCH sup-[s:sells]->component
WITH sup, SUM(s.price) AS price_sum
// a second WITH can follow directly; no MATCH is needed to aggregate again
WITH MIN(price_sum) AS price_min
MATCH sup2-[s2:sells]->component2
WITH sup2, SUM(s2.price) AS price_sum2, price_min
WHERE price_sum2 = price_min
RETURN sup2, price_sum2
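The three steps above (sum per supplier, find the minimum, return every supplier that matches it) can be sketched in plain Python to check the logic on sample data. The prices below are invented for illustration and are not taken from the linked console setup.

```python
from collections import defaultdict

# Hypothetical (supplier, component, price) rows; not the data from the console link.
sells = [
    ("conrad", "resistor", 20),
    ("conrad", "led", 40),
    ("amazon", "resistor", 30),
    ("amazon", "led", 30),
    ("ebay", "resistor", 50),
    ("ebay", "led", 25),
]

def cheapest_suppliers(sells):
    """Return (minimum total, all suppliers whose total equals it)."""
    # Step 1: sum price by supplier for all components.
    totals = defaultdict(int)
    for supplier, _component, price in sells:
        totals[supplier] += price
    if not totals:
        return None, []
    # Step 2: find the minimum total.
    price_min = min(totals.values())
    # Step 3: keep every supplier tied at that minimum.
    return price_min, sorted(s for s, total in totals.items() if total == price_min)

print(cheapest_suppliers(sells))
```

With these sample prices, conrad and amazon both total 60 and are returned together, while ebay at 75 is dropped. Unlike the Cypher version, the totals are computed once and reused for both the minimum and the filter.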