I got a lot of nodes, some with similar values in field X, I want to select by distinct X values and take all the popular nodes (order by some other field Y) with all their properties.
Example:
ID | X | Y | Name
1 | A | 100 | David
2 | A | 10 | Chris
3 | B | 5 | Brad
4 | B | 25 | Amber
Should return:
1 | A | 100 | David
4 | B | 25 | Amber
I managed to get the list by distinct X:
MATCH (u:NodeType)
RETURN DISTINCT u.X
I need to find the most popular (highest value of Y) nodes to join with my distinct nodes (which are now only a single property) and return whole nodes (with all the properties).
You are looking for an arg max-style query. I recently answered a similar problem using collect:
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u)[0] AS u
RETURN u
The idea is the following:
Order by the value of Y (descending).
Implicitly group by the values of X and for the aggregating function, use collect to gather other values to a list. The elements of the list are the nodes (which are still stored according to a descending order of Y).
For each collected list, select the first element with [0].
Maybe the query is a bit easier to read if you perform the last step in a separate clause (and not in the WITH clause that performs the collect):
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u) AS us
RETURN us[0] AS u
Related
In neo4j my database consists of chains of nodes. For each distinct stucture/layout (does graph theory has a better word?), I want to count the number of chains. For example, the database consists of 9 nodes and 5 relationships as this:
(:a)->(:b)
(:b)->(:a)
(:a)->(:b)
(:a)->(:b)->(:b)
where (:a) is a node with label a. Properties on nodes and relationships are irrelevant.
The result of the counting should be:
------------------------
| Structure | n |
------------------------
| (:a)->(:b) | 2 |
| (:b)->(:a) | 1 |
| (:a)->(:b)->(:b) | 1 |
------------------------
Is there a query that can achieve this?
Appendix
Query to create test data:
create (:a)-[:r]->(:b), (:b)-[:r]->(:a), (:a)-[:r]->(:b), (:a)-[:r]->(:b)-[:r]->(:b)
EDIT:
Thanks for the clarification.
We can get the equivalent of what you want, a capture of the path pattern using the labels present:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
RETURN [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
This will give you a list of the labels of the nodes (the first label present for each...remember that nodes can be multi-labeled, which may throw off your results).
As for getting it into that exact format in your example, that's a different thing. We could do this with some text functions in APOC Procedures, specifically apoc.text.join().
We would need to first add formatting around the extraction of the first label to add the prefixed : as well as the parenthesis. Then we could use apoc.text.join() to get a string where the nodes are joined by your desired '->' symbol:
MATCH path = (start)-[*]->(end)
WHERE NOT ()-->(start) and NOT (end)-->()
WITH [node in nodes(path) | labels(node)[0]] as structure, count(path) as n
RETURN apoc.text.join([label in structure | '(:' + label + ')'], '->') as structure, n
I have company metric data which I want to query
----------------------------------------------------
|Metrics | Year | Qtr | Department | Value|
----------------------------------------------------
|Revenue| 2017 | Q1 | Dep1 | 2000045|
|Revenue| 2017 | Q2 | Dep1 | 2000046|
|Revenue| 2017 | Q2 | Dep2 | 2000047|
|Revenue| 2017 | Q3 | Dep2 | 2000048|
|Revenue| 2017 | Q3 | Dep3 | 2000049|
|Sales | 2017 | Q1 | Dep1 | 2000041|
|Sales | 2017 | Q1 | Dep2 | 2000052|
|Sales | 2017 | Q2 | Dep1 | 2000053|
-----------------------------------------------------
Now I model the above data like this
Year, Qtr and Department as nodes like
(d:DIM {name:"2012","type":"year")
Value as nodes like
(v:VALUE {value:2000053})
and metrics as relationships like
(d:DIM {name:"2012","type":"year") - [:REVENUE]-> (v:VALUE {value:2000053})
So for each record there might be three relationships with VALUE node.
Now comes the query part:
Given dimension the query should get the values, like given Year 2017, Qtr q1 should return values corresponding to this filter if I add Dep 1 then it should further filter the result.
I tried some queries like
Match (d:DIM)-[:REVENUE]->(v:VALUE)
where d.name in ["2017","q1"]
Return d,v
But this query provides UNION of 2017 and q1 not the intersection I am looking for.
And further, I might do group by using the type attributes.
There are a couple ways you could do this. While I'd personally recommend using separate node labels for Year, Qtr, and Department (maybe Metrics too), I'll use your current model.
The piece you need for an intersection is the ALL() predicate, which operates on a list of values. In this case, we'll collect all matching :DIM nodes. To make the match efficient, we'll want to match to v nodes from the first item of the list, then ensure v is connected to the remaining items in the list (faster than filtering from all :VALUE nodes).
match (d:DIM)
where d.name in ["2017","q1"]
with collect(d) as dims
with head(dims) as head, tail(dims) as dims
match (head)-[:REVENUE]->(v)
where all(dim in dims where (dim)-[:REVENUE]->(v))
return dims, v
Alternately, if you have APOC Procedures installed, you can make use of an intersection function:
match (d:DIM)-[:REVENUE]->(v)
where d.name in ["2017","q1"]
with d, collect(v) as values
with collect(d) as dims, collect(values) as allValues // list of lists
with dims, reduce(inter = head(allValues), values in tail(allValues) | apoc.coll.intersection(inter, values)) as values
return dims, values
EDIT
I remembered another approach that is likely to be faster. Give this one a try:
with ["2017","q1"] as dimInput
with dimInput, size(dimInput) as dimCnt
match (d:DIM)
where d.name in dimInput
match (d)-[:REVENUE]->(v)
with v, dimCnt, count(distinct d) as cnt
where dimCnt = cnt
return v
I have
50K Post nodes
40K Tag nodes
125K TAGGED relationships (meaning average 2,5 tags per post)
in my graph and below query causes a "Java heap space" error.
match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10
What I expected was some repeated rows depending on number of shared tags. I was not sure how limit would work (stop after first 10 posts or tags). But, since I have limit 10 I did not expect this query to traverse all the graph. It seems like it does.
UPDATE 1
With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.
// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post)
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;
I thought "ORDER BY" clause may cause traversal of all possible paths so I removed the clause but it is still 15 sec. It is also 15 sec. when I make the limit value 1 or 1000 without sorting.
What I expect from Neo4j was: "Start from any Post node, then jump to its Tags and find otherPosts that are tagged with the same tag. When there are 10 found stop traversing and return the results." I am pretty sure it is not doing this.
To make my expectation clear, assume that graph is this small and we use Limit 3 in the cypher query.
p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]
What I expect is:
Start form p1 (or any Post node)
Jump to t1
No other posts are tagged with t1
Jump to t2
p2 is tagged with t2 (1 of 3)
No other posts are tagged with t2
Jump to t3
p2 is tagged with t3 (2 of 3)
p3 is tagged with t3 (3 of 3)
we reached the limit, break
But, it seems like Limit is applied after traversing all data.
So, my question is now: Did Neo4j found all the matches and returned 10 of them or did it stop searching after first 10 matches? And of course, Why?
UPDATE 2
After helpful answers I managed to decrease the scope of my question so I tried below queries.
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1;
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000;
// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1;
// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;
So, I still do not know why but, using aggregation methods (I tried collect(t.Name) instead of count) breaks the expected (at least my expectations :) behaviour of limit functionality.
This query will result in a global graph lookup, at least for neo4j 2.1.7 and below.
I would first matching the nodes and then expanding the path
MATCH (post:Post)
MATCH (post)-[:TAGS]->(t)<-[:TAGS]-(otherPost)
RETURN post, count(t) as cnt, otherPost
ORDER BY cnt DESC
LIMIT 10;
And this is the execution plan, as you can see by matching first the post nodes only (so labels index) it costs you only retrieving those and following relationships
ColumnFilter
|
+Top
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+NodeByLabel
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter | 10 | 0 | | keep columns post, cnt, otherPost |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEc24f01bf-69cc-4bd9-9aed-be257028194b of type Integer) |
| EagerAggregation | 9900 | 0 | | post, otherPost |
| Filter | 134234 | 0 | | NOT( UNNAMED30 == UNNAMED43) |
| SimplePatternMatcher | 134234 | 0 | t, UNNAMED43, UNNAMED30, post, otherPost | |
| NodeByLabel | 100 | 101 | post, post | :Post |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
Total database accesses: 101
And here a blog post explaining why I removed labels except for the first part of the query : http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
What Christophe said and
Try to reduce the cardinality in between:
match (p1:Post)-[r1:TAGGED]->(t:Tag)
WITH tag, count(*) as freq, collect(distinct p1.Title) as posts
MATCH (tag)<-[r2:TAGGED]-(p2:Post)
return posts, freq, p2.Title, count(r2)
limit 10
I'm using neo4j 2.1.7 Recently i was experimenting with Match queries, searching for nodes with several labels. And i found out, that generally query
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
works different time, extremely in cases when you have for example 2 millions of Nodes A and 0 of Nodes B.
So do labels order effects search time? Is this future is documented anywhere?
Neo4j internally maintains a labelscan store - that's basically a lookup to quickly get all nodes carrying a definied label A.
When doing a query like
MATCH (n:A:B) return count(n)
labelscanstore is used to find all A nodes and then they're filtered if those nodes carry label B as well. If n(A) >> n(B) it's way more efficient to do MATCH (n:B:A) instead since you look up only a few B nodes and filter those for A.
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We have now 100 B nodes, 1M A nodes and 1 AB node. In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation | 3 | 1 | 0 | count(n) | |
| Filter | 12 | 1 | 12 | n | hasLabel(n:A) |
| NodeByLabelScan | 12 | 12 | 13 | n | :B |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
I have a graph which has states following each other in time. Each of the states can have a number of actions that happened (0..n) and a number of recommendations (0..n) assigned by some software.
I can do a query on cypher like this
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n
s-[l?:hasRecommendation]->r
where l.likelihood>0.2
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
which gives me a table like this
state | actions | recommendation | likelihood
--------------------------------------------------
State 1 | [a1,a2,a3] | a1 | 0.25
State 1 | [a1,a2,a3] | a4 | 0.05
State 2 | [a2,a3] | a3 | 0.56
State 2 | [a2,a3] | a2 | 0.34
State 2 | [a2,a3] | a1 | 0.15
If I process that table manually, I can filter these results and have only the top 2 results for each state for example. This is time consuming and very unelegant.
My problem is, that I never know how many recommendations a state has, so I can't use limit/skip here. Ideally I'd like it to return only a set amount of states (e.g 100) including their top recommendations - this query could return between 0 and 100*n lines.
Is there a better way to achieve this in cypher?
The easy way to achieve this is to select the states that have recommendations and limit the result to 100 first, and then only retrieve the top 2 recommendations for these 100 states by dynamically computing the percentiles for each state, something like this,
start n=node:name(name="State")
Match s-[:isA]->n, s-[?:hasRecommendation]->r
With distinct s
Order by s.name
limit 100
Match s-[?:hasRecommendation]->r
With s, (count(r)-1.0) / count(r) as p
Match s-[l?:hasRecommendation]->r
With s, percentile_disc(l.likelihood, p) as m
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n,
s-[l?:hasRecommendation]->r
where l.likelihood>= m
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
It's a bit verbose, but Cypher does not support nested functions for aggregations. so I have to get the "count" and the "percentile" with two separate queries.