Using cypher to get set amount of top results - neo4j

I have a graph which has states following each other in time. Each of the states can have a number of actions that happened (0..n) and a number of recommendations (0..n) assigned by some software.
I can do a query on cypher like this
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n,
s-[l?:hasRecommendation]->r
where l.likelihood>0.2
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
which gives me a table like this
state | actions | recommendation | likelihood
--------------------------------------------------
State 1 | [a1,a2,a3] | a1 | 0.25
State 1 | [a1,a2,a3] | a4 | 0.05
State 2 | [a2,a3] | a3 | 0.56
State 2 | [a2,a3] | a2 | 0.34
State 2 | [a2,a3] | a1 | 0.15
If I process that table manually, I can filter these results and keep only the top 2 results for each state, for example. This is time-consuming and very inelegant.
My problem is that I never know how many recommendations a state has, so I can't use limit/skip here. Ideally I'd like it to return only a set number of states (e.g. 100) including their top recommendations, so this query could return between 0 and 100*n lines.
Is there a better way to achieve this in cypher?

The easy way to achieve this is to select the states that have recommendations and limit the result to 100 first, and then only retrieve the top 2 recommendations for these 100 states by dynamically computing the percentiles for each state, something like this,
start n=node:name(name="State")
Match s-[:isA]->n, s-[?:hasRecommendation]->r
With distinct s
Order by s.name
limit 100
Match s-[?:hasRecommendation]->r
With s, (count(r)-1.0) / count(r) as p
Match s-[l?:hasRecommendation]->r
With s, percentile_disc(l.likelihood, p) as m
start n=node:name(name="State")
match a<-[:hasAction]-s-[:isA]->n,
s-[l?:hasRecommendation]->r
where l.likelihood>= m
return distinct s.name as state, collect(a.name) as actions,
r.name as recommendation, l.likelihood as likelihood
order by s.name asc, l.likelihood desc
It's a bit verbose, but Cypher does not support nested aggregation functions, so I have to get the "count" and the "percentile" in two separate steps.
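The intent of the query (cap the number of states first, then keep only the top few recommendations per state) can also be sketched outside Cypher. This is a minimal Python sketch, assuming the rows have already been fetched as (state, recommendation, likelihood) tuples; the function name `top_n_per_state` is hypothetical and the sample rows only mirror the table in the question. States without any recommendation do not appear in this simplified version.

```python
from collections import defaultdict

def top_n_per_state(rows, n=2, max_states=100):
    """Keep the n most likely recommendations for the first max_states states."""
    by_state = defaultdict(list)
    for state, recommendation, likelihood in rows:
        by_state[state].append((recommendation, likelihood))

    result = []
    # Limit the states first (ordered by name), then rank within each state.
    for state in sorted(by_state)[:max_states]:
        ranked = sorted(by_state[state], key=lambda rec: rec[1], reverse=True)
        for recommendation, likelihood in ranked[:n]:
            result.append((state, recommendation, likelihood))
    return result

# Sample rows taken from the table in the question.
rows = [
    ("State 1", "a1", 0.25), ("State 1", "a4", 0.05),
    ("State 2", "a3", 0.56), ("State 2", "a2", 0.34), ("State 2", "a1", 0.15),
]
top = top_n_per_state(rows, n=2)
```

The order of operations matters here: limiting states before ranking keeps the result bounded by max_states * n rows, which is the same bound the question asks for.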

Related

Query distinct property and return complete nodes

I have a lot of nodes, some of which share the same value in field X. I want to select the distinct X values and, for each of them, take the most popular node (ordered by another field Y) with all its properties.
Example:
ID | X | Y | Name
1 | A | 100 | David
2 | A | 10 | Chris
3 | B | 5 | Brad
4 | B | 25 | Amber
Should return:
1 | A | 100 | David
4 | B | 25 | Amber
I managed to get the list by distinct X:
MATCH (u:NodeType)
RETURN DISTINCT u.X
I need to find the most popular (highest value of Y) nodes to join with my distinct nodes (which are now only a single property) and return whole nodes (with all the properties).
You are looking for an arg max-style query. I recently answered a similar problem using collect:
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u)[0] AS u
RETURN u
The idea is the following:
Order by the value of Y (descending).
Implicitly group by the values of X and for the aggregating function, use collect to gather other values to a list. The elements of the list are the nodes (which are still stored according to a descending order of Y).
For each collected list, select the first element with [0].
Maybe the query is a bit easier to read if you perform the last step in a separate clause (and not in the WITH clause that performs the collect):
MATCH (u:NodeType)
WITH u
ORDER BY u.Y DESC
WITH u.X AS X, collect(u) AS us
RETURN us[0] AS u
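The same order-then-collect idea can be sketched in Python to show why taking element [0] works. This is an illustration only; the function name `arg_max_per_group` is hypothetical and the data is the example table from the question.

```python
def arg_max_per_group(nodes, group_key, order_key):
    """Return, for each group, the node with the highest order_key value."""
    # Step 1: order by Y, descending.
    ordered = sorted(nodes, key=lambda node: node[order_key], reverse=True)
    # Step 2: group by X and collect; each list keeps the descending order.
    collected = {}
    for node in ordered:
        collected.setdefault(node[group_key], []).append(node)
    # Step 3: the first element of each collected list is the arg max.
    return [group[0] for group in collected.values()]

nodes = [
    {"ID": 1, "X": "A", "Y": 100, "Name": "David"},
    {"ID": 2, "X": "A", "Y": 10, "Name": "Chris"},
    {"ID": 3, "X": "B", "Y": 5, "Name": "Brad"},
    {"ID": 4, "X": "B", "Y": 25, "Name": "Amber"},
]
best = arg_max_per_group(nodes, "X", "Y")
```

The sketch relies on the sort being applied before grouping, just as the Cypher version relies on collect preserving the order established by ORDER BY.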

Neo4j Cypher - How to Count Multiple Property Values With Cypher Efficiently And Paginate Properly

I am struggling to get the proper cypher that is both efficient and allows pagination through skip and limit.
Here is the simple scenario: I have the related nodes (company)<-[A]-(set)<-[B]-(job) where there are multiple instances of (set) with distinct (job) instances related to them. The (job) nodes have a specific status property that can hold one of several states. We need to count the number of (job) nodes in a particular state per (set) and use skip and limit to paginate on the distinct (set) nodes.
So we can get a very efficient query for job.status counts using this.
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
return s.Description, j.Status, count(*) as StatusCount;
This gives us rows of Set.Description, Job.Status, and the Job.Status count. But we get multiple rows per Set, one for each Job.Status, which is not conducive to paging over distinct sets. Something like:
s.Description      | j.Status     | StatusCount
-------------------+--------------+----------------
Set 1              | Unassigned   | 10
Set 1              | Completed    | 2
Set 2              | Unassigned   | 3
Set 1              | Reviewed     | 10
Set 3              | Completed    | 4
Set 2              | Reviewed     | 7
What we are trying to achieve with the proper cypher is result rows based on distinct Sets. Something like this:
s.Description      | Unassigned   | Completed   | Reviewed
-------------------+--------------+-------------+----------
Set 1              | 10           | 2           | 10
Set 2              | 3            | 0           | 7
Set 3              | 0            | 4           | 0
This would then allow us to paginate over Sets using skip and limit properly.
I have tried many different approaches and cannot seem to find the right combination for this type of result. Anyone have any ideas? Thanks!
** EDIT - Using the answer provided by Michael, here's how to get the status count values in Java **
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount}) as StatusCounts;
List<Object> statusMaps = (List<Object>) row.get("StatusCounts");
for (Object statusEntry : statusMaps) {
    // Each entry is a map with the keys "Status" and "StatusCount"
    Map<String, Object> statusMap = (Map<String, Object>) statusEntry;
    String status = (String) statusMap.get("Status");
    Number count = (Number) statusMap.get("StatusCount");
}
You can use WITH and aggregation, and optionally a map result
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect([Status,StatusCount]);
or
match (c:Company {id: 'MY.co'})<-[:type_of]-(s:Set)<-[:job_for]-(j:Job)
with s, j.Status as Status,count(*) as StatusCount
return s.Description, collect({Status:Status,StatusCount:StatusCount});
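Once the counts are collected per Set, turning them into one row per Set with a column per status, and paging over those rows, is a small client-side step. This is a minimal Python sketch under the assumption that the rows arrive as (description, status, count) tuples; the function name `pivot_status_counts` and its skip/limit parameters are hypothetical, and the data is the sample table from the question.

```python
def pivot_status_counts(rows, statuses, skip=0, limit=None):
    """Build one entry per set with a count for every status, then paginate."""
    pivoted = {}
    for description, status, count in rows:
        # Missing statuses default to 0, as in the desired output table.
        counts = pivoted.setdefault(description, {s: 0 for s in statuses})
        counts[status] = count
    ordered = sorted(pivoted.items())
    end = None if limit is None else skip + limit
    return ordered[skip:end]

rows = [
    ("Set 1", "Unassigned", 10), ("Set 1", "Completed", 2),
    ("Set 2", "Unassigned", 3), ("Set 1", "Reviewed", 10),
    ("Set 3", "Completed", 4), ("Set 2", "Reviewed", 7),
]
statuses = ["Unassigned", "Completed", "Reviewed"]
all_sets = pivot_status_counts(rows, statuses)
second_page = pivot_status_counts(rows, statuses, skip=1, limit=1)
```

Because each Set is now a single entry, skip and limit count Sets rather than (Set, Status) pairs.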

Why does this Cypher query take too much time?

I have
50K Post nodes
40K Tag nodes
125K TAGGED relationships (meaning an average of 2.5 tags per post)
in my graph and below query causes a "Java heap space" error.
match (p1:Post)-[r1:TAGGED]->(t:Tag)<-[r2:TAGGED]-(p2:Post)
return p1.Title, count(r1), p2.Title, count(r2)
limit 10
What I expected was some repeated rows depending on number of shared tags. I was not sure how limit would work (stop after first 10 posts or tags). But, since I have limit 10 I did not expect this query to traverse all the graph. It seems like it does.
UPDATE 1
With a few changes, Christophe Willemsen's query returns 10 rows in 15 sec.
// I need label for the otherPost because Users are also TAGGED
MATCH (post:Post)-[:TAGGED]->(t)<-[:TAGGED]-(otherPost:Post)
RETURN post.Title, count(t) as cnt, otherPost.Title
// ORDER BY cnt DESC // for now I do not need this
LIMIT 10;
I thought "ORDER BY" clause may cause traversal of all possible paths so I removed the clause but it is still 15 sec. It is also 15 sec. when I make the limit value 1 or 1000 without sorting.
What I expected from Neo4j was: "Start from any Post node, then jump to its Tags and find otherPosts that are tagged with the same tag. When 10 have been found, stop traversing and return the results." I am pretty sure it is not doing this.
To make my expectation clear, assume that graph is this small and we use Limit 3 in the cypher query.
p1 - [t1, t2, t3] // Post1 is tagged with t1, t2 and t3
p2 - [t2, t3, t4]
p3 - [t3, t4, t5]
What I expect is:
Start from p1 (or any Post node)
Jump to t1
No other posts are tagged with t1
Jump to t2
p2 is tagged with t2 (1 of 3)
No other posts are tagged with t2
Jump to t3
p2 is tagged with t3 (2 of 3)
p3 is tagged with t3 (3 of 3)
we reached the limit, break
But, it seems like Limit is applied after traversing all data.
So, my question is now: Did Neo4j find all the matches and return 10 of them, or did it stop searching after the first 10 matches? And of course, why?
UPDATE 2
After helpful answers I managed to decrease the scope of my question so I tried below queries.
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1;
// 3 sec.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, count(t)
LIMIT 1000;
// 100 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1;
// 150 ms.
MATCH (p:Post)-[:TAGGED]->(t:Tag)
RETURN p.Title, t.Name
LIMIT 1000;
So I still do not know why, but using aggregation functions (I also tried collect(t.Name) instead of count) breaks the expected (at least my expectations :) behaviour of limit.
This query will result in a global graph lookup, at least for neo4j 2.1.7 and below.
I would first match the nodes and then expand the path:
MATCH (post:Post)
MATCH (post)-[:TAGS]->(t)<-[:TAGS]-(otherPost)
RETURN post, count(t) as cnt, otherPost
ORDER BY cnt DESC
LIMIT 10;
And this is the execution plan. As you can see, by first matching only the post nodes (via the label index), the only cost is retrieving those nodes and following their relationships:
ColumnFilter
|
+Top
|
+EagerAggregation
|
+Filter
|
+SimplePatternMatcher
|
+NodeByLabel
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter | 10 | 0 | | keep columns post, cnt, otherPost |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEc24f01bf-69cc-4bd9-9aed-be257028194b of type Integer) |
| EagerAggregation | 9900 | 0 | | post, otherPost |
| Filter | 134234 | 0 | | NOT( UNNAMED30 == UNNAMED43) |
| SimplePatternMatcher | 134234 | 0 | t, UNNAMED43, UNNAMED30, post, otherPost | |
| NodeByLabel | 100 | 101 | post, post | :Post |
+----------------------+--------+--------+----------------------------------------------+------------------------------------------------------------------------------------------------+
Total database accesses: 101
And here is a blog post explaining why I removed the labels except in the first part of the query: http://graphaware.com/neo4j/2015/01/16/neo4j-graph-model-design-labels-versus-indexed-properties.html
In addition to what Christophe said, try to reduce the cardinality in between:
match (p1:Post)-[r1:TAGGED]->(tag:Tag)
WITH tag, count(*) as freq, collect(distinct p1.Title) as posts
MATCH (tag)<-[r2:TAGGED]-(p2:Post)
return posts, freq, p2.Title, count(r2)
limit 10
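The timings in UPDATE 2 fit the EagerAggregation step in the plan above: a plain projection can stop as soon as LIMIT is satisfied, while an aggregate such as count(t) cannot emit any row until all matches have been consumed. The following Python sketch is a toy model of that difference using generators, not a description of Neo4j's internals; the `matches` helper and the generated data are hypothetical.

```python
from collections import Counter
from itertools import islice

def matches(pairs, consumed):
    """Yield (post, tag) rows lazily, counting how many rows were produced."""
    for pair in pairs:
        consumed[0] += 1
        yield pair

# 1000 hypothetical (post, tag) matches spread over 50 posts.
pairs = [(f"post{i % 50}", f"tag{i}") for i in range(1000)]

# Plain projection: LIMIT 1 stops the stream after a single row.
plain_consumed = [0]
plain = list(islice(matches(pairs, plain_consumed), 1))

# Aggregation: every row must be seen before any count is final,
# so the limit is applied only after the whole stream is consumed.
agg_consumed = [0]
counts = Counter(post for post, _tag in matches(pairs, agg_consumed))
aggregated = list(islice(counts.items(), 1))
```

In this model the limited projection touches one row and the limited aggregation touches all 1000, which is the same pattern as the 100 ms versus 3 sec. measurements, regardless of whether the limit is 1 or 1000.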

Does label order affect search time?

I'm using Neo4j 2.1.7. Recently I was experimenting with MATCH queries, searching for nodes with several labels, and I found that, in general, the queries
Match (p:A:B) return count(p) as number
and
Match (p:B:A) return count(p) as number
take different amounts of time, especially in cases where you have, for example, 2 million A nodes and 0 B nodes.
So does label order affect search time? Is this feature documented anywhere?
Neo4j internally maintains a label scan store, which is basically a lookup to quickly get all nodes carrying a defined label A.
When doing a query like
MATCH (n:A:B) return count(n)
labelscanstore is used to find all A nodes and then they're filtered if those nodes carry label B as well. If n(A) >> n(B) it's way more efficient to do MATCH (n:B:A) instead since you look up only a few B nodes and filter those for A.
You can use PROFILE MATCH (n:A:B) return count(n) to see the query plan. For Neo4j <= 2.1.x you'll see a different query plan depending on the order of the labels you've specified.
Starting with Neo4j 2.2 (milestone M03 available as of writing this reply) there's a cost based Cypher optimizer. Now Cypher is aware of node statistics and they are used to optimize the query.
As an example I've used the following statements to create some test data:
create (:A:B);
with 1 as a foreach (x in range(0,1000000) | create (:A));
with 1 as a foreach (x in range(0,100) | create (:B));
We have now 100 B nodes, 1M A nodes and 1 AB node. In 2.2 the two statements:
MATCH (n:B:A) return count(n)
MATCH (n:A:B) return count(n)
result in the exact same query plan (and therefore in the same execution speed):
+------------------+---------------+------+--------+-------------+---------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+------------------+---------------+------+--------+-------------+---------------+
| EagerAggregation | 3 | 1 | 0 | count(n) | |
| Filter | 12 | 1 | 12 | n | hasLabel(n:A) |
| NodeByLabelScan | 12 | 12 | 13 | n | :B |
+------------------+---------------+------+--------+-------------+---------------+
Since there are only a few B nodes, it's cheaper to scan for B's and filter for A. Smart Cypher, isn't it ;-)
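The cost difference between the two label orders can be modelled in a few lines of Python. This is a simplified sketch, assuming a label index that maps each label to a set of node ids; the function name `count_with_both_labels` and the node ids are hypothetical and much smaller than the test data above.

```python
def count_with_both_labels(label_index, scan_label, filter_label):
    """Scan one label's nodes, filter by the other; return (count, nodes_scanned)."""
    scanned = label_index[scan_label]
    other = label_index[filter_label]
    matching = sum(1 for node in scanned if node in other)
    return matching, len(scanned)

label_index = {
    "A": set(range(0, 1000)),     # many A nodes
    "B": set(range(995, 1005)),   # few B nodes, 5 of which also carry A
}

scan_a_first = count_with_both_labels(label_index, "A", "B")
scan_b_first = count_with_both_labels(label_index, "B", "A")

# A statistics-aware planner would start from the label with the fewest nodes.
cheapest_label = min(label_index, key=lambda label: len(label_index[label]))
```

Both orders return the same count, but scanning B first inspects far fewer nodes; choosing the start label from node statistics is what removes the dependence on the order written in the query.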

Cypher: group by cheapest price

I'm currently experimenting a bit with Cypher. I have a simple setup of components connected to a merchant by a relationship "sells" that has a property "price":
(merchant-[:sells{price:10}]->component)
I made a cypher query which calculates the lowest price, if you buy products from the same merchant.
MATCH sup-[s:sells]->component
WITH SUM(s.price) AS total, sup
RETURN sup, total
ORDER BY total ASC
While this works, I have trouble finding the cheapest price(s) when 2 or more suppliers are tied. I'd like to get something like
_________________________
| price | supplier |
-------------------------
| 60 | conrad |
| | amazon |
-------------------------
You can view my setup here:
http://console.neo4j.org/?id=wpz165
EDIT:
OK, I found a way, although it isn't pretty.
MATCH sup-[s:sells]->component
WITH SUM(s.price) AS minprice, sup
ORDER BY minprice
LIMIT 1
MATCH sup2-[s2:sells]->component2
WITH SUM(s2.price) AS total2, sup2, minprice
WHERE total2 = minprice
RETURN minprice, sup2
How does this work? The first part finds the lowest price (by ordering and returning only the first row). The second part runs the whole query again and filters out suppliers that don't have the lowest price, so the whole query is run twice.
Any better ideas?
For my aesthetic this is less ugly though it does require three WITH clauses.
Sum price by supplier for all components
Find Minimum price
Return all suppliers with minimum price
MATCH sup-[s:sells]->component
WITH sup, SUM(s.price) AS price_sum
MATCH sup, price_sum
WITH MIN(price_sum) AS price_min
MATCH sup2-[s2:sells]->component2
WITH sup2, SUM(s2.price) AS price_sum2, price_min
WHERE price_sum2 = price_min
RETURN sup2, price_sum2
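The three steps (sum per supplier, find the minimum, return every supplier matching it) are easy to check against a small Python sketch. The function name `cheapest_suppliers`, the component names, and the "ebay" supplier are hypothetical; only the tied result of 60 for conrad and amazon comes from the question.

```python
from collections import defaultdict

def cheapest_suppliers(offers):
    """Return (lowest total price, all suppliers tied at that total)."""
    # Step 1: sum the price per supplier over all components.
    totals = defaultdict(int)
    for supplier, _component, price in offers:
        totals[supplier] += price
    # Step 2: find the minimum total.
    cheapest = min(totals.values())
    # Step 3: keep every supplier whose total equals the minimum.
    tied = sorted(s for s, total in totals.items() if total == cheapest)
    return cheapest, tied

offers = [
    ("conrad", "cpu", 40), ("conrad", "ram", 20),
    ("amazon", "cpu", 35), ("amazon", "ram", 25),
    ("ebay", "cpu", 50), ("ebay", "ram", 30),
]
price, suppliers = cheapest_suppliers(offers)
```

Unlike the ORDER BY ... LIMIT 1 approach, comparing against the minimum keeps all tied suppliers, and the totals are computed only once.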
