Neo4j complains about implicit grouping expressions

I copied one of the example queries from the Neo4j Cypher aggregation class:
https://www.youtube.com/watch?v=wfMTg0ujVjk
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x), sum(x)/NumActors
However, in the Neo4j web console provided by the class, I get this error:
Aggregation column contains implicit grouping expressions. For example, in 'RETURN n.a, n.a + n.b + count(*)' the aggregation expression 'n.a + n.b + count(*)' includes the implicit grouping key 'n.b'. It may be possible to rewrite the query by extracting these grouping/aggregation expressions into a preceding WITH clause. Illegal expression(s): NumActors (line 6, column 8 (offset: 183))
"Return sum(x), sum(x)/NumActors"
^
This query, however, works:
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x)
So it complains when sum(x) is combined with NumActors, which it treats as an implicit grouping key. How can I update the query to achieve the same goal (computing both the sum and the average)? My initial query syntax looks perfectly fine to me.

Use the avg() function for this purpose:
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x), avg(x)
Sample result:
╒══════════╤═════════╕
│"sum(x)" │"avg(x)" │
╞══════════╪═════════╡
│P276Y5M14D│P69Y1M11D│
└──────────┴─────────┘
reference:
https://neo4j.com/docs/cypher-manual/current/functions/aggregating/#functions-avg-duration
As for the explanation of your error: Neo4j allows aggregation expressions only if they meet the requirements described in this guideline: https://neo4j.com/docs/cypher-manual/current/functions/aggregating/#grouping-keys
If you prefer to keep your original query, the fix is the one suggested in the Neo4j error message: use WITH to compute sum() first, then use that result to compute the average.
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
UNWIND Ages as x
WITH sum(x) as total, NumActors
RETURN total, total/NumActors as avg

This is a bit tricky, so I will try to simplify it as much as possible. Let's look at how aggregation works for this query:
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x)
We first match actors whose name starts with 'Tom' and who have a birth date. Let's say 10 actors remain after the MATCH and WHERE clauses execute. Next comes the statement:
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
This performs two aggregations, count and collect, but note that we have not specified any grouping key. Hence the output of this clause is a single row with two columns, NumActors and Ages. We then unwind the Ages list, which gives us 10 rows, and finally we return the sum. Since no grouping key is specified, the sum is calculated over all the rows. Hence it works.
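The row counts described above can be traced with a small Python model. This is only a sketch of the semantics, not how Neo4j executes the query; the ages are plain integers and the three sample values are made up for illustration.

```python
# Toy model of the WITH / UNWIND / RETURN pipeline. Ages are plain integers
# here (an assumption for illustration; the real query uses durations).
ages_of_matched_actors = [61, 67, 45]  # hypothetical rows left after MATCH/WHERE

# WITH count(a) AS NumActors, collect(...) AS Ages
# No grouping key, so every input row folds into one output row.
rows_after_with = [{"NumActors": len(ages_of_matched_actors),
                    "Ages": list(ages_of_matched_actors)}]

# UNWIND Ages AS x: one output row per list element, other columns repeated.
rows_after_unwind = [{"NumActors": row["NumActors"], "x": x}
                     for row in rows_after_with
                     for x in row["Ages"]]

# RETURN sum(x): again no grouping key, so all rows fold into one result row.
result = [{"sum(x)": sum(row["x"] for row in rows_after_unwind)}]

print(len(rows_after_with), len(rows_after_unwind), result)
```

Note that after the UNWIND step every row carries its own copy of NumActors, which is why a later clause has to say how that column should be treated.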
Let's consider your query:
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x), sum(x)/NumActors
Everything up to the UNWIND stage is the same as above. Now notice the RETURN statement:
Return sum(x), sum(x)/NumActors
Here, too, no explicit grouping key is specified. So sum(x) will include all the rows, but the term sum(x)/NumActors is rejected because Neo4j cannot determine which row's NumActors value to use, since NumActors is not specified as a grouping key. That's why you get the error. The message:
Aggregation column contains implicit grouping expressions. For example, in 'RETURN n.a, n.a + n.b + count(*)' the aggregation expression 'n.a + n.b + count(*)' includes the implicit grouping key 'n.b'. It may be possible to rewrite the query by extracting these grouping/aggregation expressions into a preceding WITH clause. Illegal expression(s): NumActors (line 6, column 8 (offset: 183))
"Return sum(x), sum(x)/NumActors"
says that the key n.b is being used as an implicit grouping key, so it can't appear inside an aggregation expression. The same applies here: NumActors is an implicit grouping key, so we can't use it in an aggregation expression. Try this:
Match (a:Actor)
Where a.born is not null
And a.name starts with 'Tom'
with count(a) as NumActors, collect(duration.between(date(a.born), date())) as Ages
Unwind Ages AS x
Return sum(x), sum(x)/COLLECT(DISTINCT NumActors)[0]
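To make the role of the grouping key concrete, here is a rough Python model of the alternative fix that moves the aggregation into a preceding WITH clause (WITH sum(x) AS total, NumActors, followed by a plain division). The helper sum_by and the sample rows are hypothetical and exist only for this sketch.

```python
from collections import defaultdict

def sum_by(rows, key_columns, value_column):
    """Group rows by key_columns and sum value_column within each group."""
    groups = defaultdict(int)
    for row in rows:
        groups[tuple(row[k] for k in key_columns)] += row[value_column]
    return [dict(zip(key_columns, key), total=total)
            for key, total in groups.items()]

# Hypothetical rows after UNWIND: NumActors is repeated on every row.
rows = [{"NumActors": 3, "x": 61},
        {"NumActors": 3, "x": 67},
        {"NumActors": 3, "x": 45}]

# WITH sum(x) AS total, NumActors  -> NumActors acts as the grouping key.
grouped = sum_by(rows, ["NumActors"], "x")

# RETURN total, total / NumActors AS avg -> plain arithmetic, no aggregation left.
answer = [{"total": g["total"], "avg": g["total"] / g["NumActors"]}
          for g in grouped]
print(answer)
```

Because NumActors has the same value on every row, grouping by it yields a single group, and the division then operates on ordinary per-row values.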

The error is caused by the fact that you used NumActors in the RETURN clause without making it an explicit grouping key.
OPTION 1
Specify NumActors as an explicit grouping key:
...
RETURN sum(x), sum(x)/NumActors, NumActors
OPTION 2
Don't do aggregation in the RETURN clause:
MATCH (a:Actor)
WHERE a.born IS NOT NULL AND a.name STARTS WITH 'Tom'
WITH COUNT(a) AS numActors, SUM(duration.between(DATE(a.born), DATE())) as sumAges
RETURN sumAges, sumAges/numActors
NOTE: The code above also simplifies (and speeds up) the query by doing the SUM in the first (and now, only) WITH clause. You don't need the wasteful COLLECT -> UNWIND -> SUM sequence.
OPTION 3
If you don't need to have the sum of all the ages returned, just directly calculate the average age:
MATCH (a:Actor)
WHERE a.born IS NOT NULL AND a.name STARTS WITH 'Tom'
RETURN AVG(duration.between(DATE(a.born), DATE())) as avgAge
ADDENDUM
There is another performance improvement that could be made to all the solutions, and it would also remove a possible error.
We should calculate the DATE() value once, put it in a variable, and use that variable to calculate all ages. Not only would it avoid repeated calls to DATE(), but it would guarantee that all ages are calculated using the same current date.
Without this tweak, it is possible to use more than one current date during the calculation, which would give an inconsistent result.
So, for instance, Option 3 could be turned into this:
WITH DATE() AS now
MATCH (a:Actor)
WHERE a.born IS NOT NULL AND a.name STARTS WITH 'Tom'
RETURN AVG(duration.between(DATE(a.born), now)) as avgAge
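The same pattern, evaluating the reference date once and passing it to every calculation, can be sketched in Python. The function name average_age_in_days and the dates are made up for illustration, and ages are measured in days rather than as durations.

```python
from datetime import date

def average_age_in_days(birth_dates, today):
    """Average age in days, with every age measured against the same `today`."""
    ages = [(today - born).days for born in birth_dates]
    return sum(ages) / len(ages)

# Evaluate the reference date once (date.today() in real use; fixed here so the
# example is reproducible), then pass it to every calculation.
now = date(2024, 1, 31)
births = [date(2024, 1, 1), date(2024, 1, 21)]  # toy values, not real actors

print(average_age_in_days(births, now))
```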

Related

cypher distinct is returning duplicate using with parameter

MATCH (c:someNode) WHERE LOWER(c.erpId) contains (LOWER("1"))
OR LOWER(c.constructionYear) contains (LOWER("1"))
OR LOWER(c.label) contains (LOWER("1"))
OR LOWER(c.name) contains (LOWER("1"))
OR LOWER(c.description) contains (LOWER("1"))
with collect(distinct c) as rows, count(c) as total
MATCH (c:someNode)-[adtype:OFFICIAL_someNode_ADDRESS]->(ad:anotherObject)
WHERE toString(ad.streetAddress) contains "1"
OR toString(ad.postalCity) contains "1"
with distinct rows+collect( c) as rows, count(c) +total as total
UNWIND rows AS part
RETURN part order by part.name SKIP 20 Limit 20
When I run this Cypher query, it returns duplicate results. Also, the SKIP does not seem to work. What am I doing wrong?
When you use WITH DISTINCT a, b, c (or RETURN DISTINCT a, b, c), that just means that you want each resulting record ({a: ..., b: ..., c: ...}) to be distinct -- it does not affect in any way the contents of any lists that may be part of a, b, or c.
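A rough Python analogy of that point, with made-up values: removing duplicate rows leaves the contents of any list inside a row untouched, so the list itself has to be de-duplicated separately. The helper distinct_rows is hypothetical and only models the row-level behaviour.

```python
def distinct_rows(rows):
    """Drop duplicate rows, keeping first occurrences (models WITH DISTINCT)."""
    seen, out = set(), []
    for row in rows:
        # Lists are not hashable, so compare them as tuples.
        key = tuple(tuple(v) if isinstance(v, list) else v for v in row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

# Two identical rows, each holding a list that contains "c1" twice.
rows = [(["c1", "c2", "c1"], 3), (["c1", "c2", "c1"], 3)]

deduped = distinct_rows(rows)
print(deduped)       # one row is left, but its list still repeats "c1"

# De-duplicating the list contents is a separate step.
unique_items = list(dict.fromkeys(deduped[0][0]))
print(unique_items)
```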
Below is a simplified query that might work for you. It does not use the LOWER() and TOSTRING() functions at all, as they appear to be superfluous. It also uses only a single MATCH/WHERE pair to find all the nodes of interest. The pattern comprehension syntax is used as part of the WHERE clause to get a non-empty list of true value(s) iff there are any anotherObject node(s) of interest. Notice that DISTINCT is not needed.
MATCH (c:someNode)
WHERE
ANY(
x IN [c.erpId, c.constructionYear, c.label, c.name, c.description]
WHERE x CONTAINS "1") OR
[(c)-[:OFFICIAL_someNode_ADDRESS]->(ad:anotherObject)
WHERE ad.streetAddress CONTAINS "1" OR ad.postalCity CONTAINS "1"
| true][0]
RETURN c AS part
ORDER BY part.name SKIP 20 LIMIT 20;

Comparison operator within properties of a Neo4j query

I need to get data where either no relationship exists between two labels, or a condition holds on one of the labels. I found the following answer:
MATCH (n:Label1)
WHERE NOT (n)-[:REL_1]-(:Label2)
OR (n)-[:REL_1]-(e:Label2 {id:1})
RETURN count(DISTINCT(n))
But what I need is for all data with id >= 5 to appear in the result.
If I run a query like this:
MATCH (n:Label1)
WHERE NOT (n)-[:REL_1]-(:Label2)
OR (n)-[:REL_1]-(e:Label2)
WHERE e.id >= 5
RETURN count(DISTINCT(n))
It produces this error:
Invalid input 'H': expected 'i/I' (line 1, column 94 (offset: 93))
[UPDATED]
A Cypher query cannot have 2 WHERE clauses in a row. In addition, you have not defined the e identifier.
This query should work (I assume you only want to count n if it does not have such a relationship OR if it has at least one in which e.id is at least 5):
MATCH (n:Label1)
OPTIONAL MATCH (n)-[:REL_1]-(e:Label2)
WITH n, e
WHERE e IS NULL OR e.id >= 5
RETURN count(DISTINCT n);
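As a sanity check on the stated assumption, here is a small Python model of what this query counts. The dictionary neighbours is hypothetical data mapping each Label1 node to the ids of the Label2 nodes it is connected to through REL_1.

```python
# Hypothetical data: Label1 node -> ids of the Label2 nodes it reaches via REL_1.
neighbours = {"n1": [], "n2": [1, 7], "n3": [2, 3], "n4": [5]}

# OPTIONAL MATCH: one row per (n, e) pair, or (n, None) when nothing matches.
rows = [(n, e) for n, ids in neighbours.items() for e in (ids or [None])]

# WHERE e IS NULL OR e.id >= 5
kept = [(n, e) for n, e in rows if e is None or e >= 5]

# RETURN count(DISTINCT n)
count = len({n for n, _ in kept})
print(kept, count)
```

In this model n1 is counted because it has no relationship, n2 and n4 because at least one neighbour has an id of 5 or more, and n3 is dropped.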
You can nest WHERE filters; you just have to give up syntactic sugar to do it.
MATCH (n:Label1)
WHERE ALL(p IN (n) - [:REL_1] - (:Label2) WHERE LAST(NODES(p))['id'] >= 5)
RETURN COUNT(n)
You can reconstruct any match-filter construct you can dream of with the ANY, ALL, NONE toolset, which also lets you apply filters internally and nest the iterable component (between IN and WHERE) to multiple depths.
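A rough Python model of the ALL-based filter, using hypothetical data, shows which nodes it keeps. Like Cypher's ALL over an empty list, Python's all() over an empty list is true, so nodes without any REL_1 relationship pass as well.

```python
# Hypothetical data: Label1 node -> ids of the Label2 nodes it reaches via REL_1.
neighbours = {"n1": [], "n2": [1, 7], "n3": [2, 3], "n4": [5]}

# ALL(p IN <paths> WHERE <id of last node> >= 5): every path must pass.
# all() over an empty list is True, so nodes without relationships pass too.
matching = [n for n, ids in neighbours.items() if all(i >= 5 for i in ids)]
print(matching)
```

Here n2 is excluded because one of its neighbours has an id below 5, whereas a filter that keeps a node as soon as one neighbour qualifies would include it. Which behaviour is wanted depends on the intent of the question.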

Neo4j indices slow when querying across 2 labels

I've got a graph where each node has label either A or B, and an index on the id property for each label:
CREATE INDEX ON :A(id);
CREATE INDEX ON :B(id);
In this graph, I want to find the node(s) with id "42", but I don't know the label a priori. To do this I am executing the following query:
MATCH (n {id:"42"}) WHERE (n:A OR n:B) RETURN n;
But this query takes 6 seconds to complete. However, doing either of:
MATCH (n:A {id:"42"}) RETURN n;
MATCH (n:B {id:"42"}) RETURN n;
takes only ~10ms.
Am I not formulating my query correctly? What is the right way to formulate it so that it takes advantage of the installed indices?
Here is one way to use both indices. The result column will be a collection of matching nodes.
OPTIONAL MATCH (a:B {id:"42"})
OPTIONAL MATCH (b:A {id:"42"})
RETURN
(CASE WHEN a IS NULL THEN [] ELSE [a] END) +
(CASE WHEN b IS NULL THEN [] ELSE [b] END)
AS result;
You should use PROFILE to verify that the execution plan for your neo4j environment uses the NodeIndexSeek operation for both OPTIONAL MATCH clauses. If not, you can use the USING INDEX clause to give a hint to Cypher.
You should use UNION to make sure that both indexes are used. In your question you almost had the answer.
MATCH (n:A {id:"42"}) RETURN n
UNION
MATCH (n:B {id:"42"}) RETURN n
;
This will work. To check whether the indexes are used, put PROFILE or EXPLAIN before your query statement.
Indexes are created and used via a node label and property, and to use them you need to form your query the same way. That means queries without a label will scan all nodes, with the results you got.
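The difference between a label-based lookup and a label-less one can be sketched in Python with a dictionary standing in for each index. This is only an analogy under simplifying assumptions; the names seek and scan and the three sample nodes are made up, and a real query planner is considerably more involved.

```python
nodes = [{"label": "A", "id": "42"},
         {"label": "B", "id": "42"},
         {"label": "B", "id": "7"}]

# Build one index per label, keyed by the id property
# (a stand-in for CREATE INDEX ON :A(id) and CREATE INDEX ON :B(id)).
index = {}
for node in nodes:
    index.setdefault(node["label"], {}).setdefault(node["id"], []).append(node)

def seek(label, node_id):
    """Indexed lookup: touches only the entries for one label and one id."""
    return index.get(label, {}).get(node_id, [])

def scan(node_id, labels):
    """Label-less lookup: has to inspect every node."""
    return [n for n in nodes if n["id"] == node_id and n["label"] in labels]

# UNION of two indexed lookups versus a scan over everything.
via_union = seek("A", "42") + seek("B", "42")
via_scan = scan("42", {"A", "B"})
print(via_union == via_scan, len(via_union))
```

Both approaches return the same nodes; the indexed version simply avoids looking at nodes that cannot match.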

Neo4j, querying multiple lucene indexes while returning a pageable result

I've been trying to write a cypher query which enables me to get results from multiple lucene indexes, while enabling a pageable result.
This is as far as I got:
START u=node:Index1(lucene_expression1)
RETURN COLLECT(u) as clt
START u=node:Index2(lucene_expression2)
RETURN clt + COLLECT(u) as clt
UNWIND clt AS u
WITH DISTINCT u
RETURN u ORDER BY u.name SKIP 0 LIMIT 10
The problem is that when the second index doesn't return any results, no results are returned at all, and the results from the first index are ignored.
I think this is because of the order of execution: unless COLLECT or COUNT are the only returned fields, an empty result set always produces an empty result set.
Just to clarify, I know I can use UNION in order to get the full data set, but then I'll need to apply the paging outside of Neo4j, which I wish to avoid.
Thanks
Works for me:
START n=node:node_auto_index(name="Neo2")
WITH collect(n) AS c
START n=node:node_auto_index("name:Neo")
WITH c + collect(n) AS c2 UNWIND c2 AS n
RETURN n
SKIP 0 LIMIT 10
see: http://console.neo4j.org/r/wrokab
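The behaviour the question describes for empty inputs can be modelled in Python. This is a sketch of the general grouping rule as I understand it, not of Neo4j's implementation: an aggregation with no grouping key always yields one row, while an aggregation with a grouping key yields one row per group and therefore no rows for empty input. The helper collect_grouped is hypothetical.

```python
def collect_grouped(rows, key_columns, value_column):
    """Model of `WITH <keys>, collect(<value>)` over a list of row dicts."""
    if not key_columns:
        # No grouping key: exactly one output row, even for empty input.
        return [{"collected": [r[value_column] for r in rows]}]
    groups = {}
    for r in rows:
        key = tuple(r[k] for k in key_columns)
        groups.setdefault(key, []).append(r[value_column])
    # One output row per distinct key, so empty input gives no rows at all.
    return [dict(zip(key_columns, k), collected=v) for k, v in groups.items()]

no_rows = []  # the second index lookup matched nothing

print(collect_grouped(no_rows, [], "u"))       # one row holding an empty list
print(collect_grouped(no_rows, ["clt"], "u"))  # no rows at all
```

In the second call the carried-over column clt plays the role of a grouping key, which would explain why the earlier results disappear when the second lookup is empty.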

Neo4j Converting Boolean to an Int

I want to return the count of the union statement, but I'm having a little trouble with my return statement.
Given a venn diagram, the union is the sum of the "areas" of the two circles minus the intersection between them. I'm trying to emulate this, but I ran into a little trouble because Booleans don't convert into ints.
I'm trying to return something like this:
COUNT(DISTINCT a.name) + COUNT(DISTINCT b.name) - (a.name == b.name)
You can use CASE WHEN a.name = b.name THEN 1 ELSE 0 END (and take the sum of that, or something similar). However, you might get duplicates if you're applying DISTINCT to the other two. You may need to adjust something in the rest of your query to avoid duplicates; more detail about the query would help.
If we assume your original query looked like the first UNION example in the neo4j 2.1.5 cheat sheet:
MATCH (a)-[:KNOWS]->(b)
RETURN b.name
UNION
MATCH (a)-[:LOVES]->(b)
RETURN b.name
Then you can get the count of the number of distinct names in the UNION this way:
OPTIONAL MATCH (a)-[:KNOWS]->(b)
WITH COLLECT(DISTINCT b.name) AS n1
OPTIONAL MATCH (c)-[:LOVES]->(d)
WITH COLLECT(DISTINCT d.name) AS n2, n1
RETURN LENGTH(filter(x IN n2 WHERE NOT (x IN n1))) + LENGTH(n1)
I don't see a way to use an actual UNION statement to calculate the answer.
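The counting logic of that query can be checked with a short Python sketch. The name lists are invented for illustration; the point is that "items only in the second set plus all items of the first set" gives the same number as the Venn-diagram formula from the question.

```python
# Hypothetical names returned by the two MATCH clauses (duplicates included).
knows = ["Ann", "Bob", "Bob", "Cid"]
loves = ["Bob", "Dee"]

n1 = list(dict.fromkeys(knows))  # COLLECT(DISTINCT b.name)
n2 = list(dict.fromkeys(loves))  # COLLECT(DISTINCT d.name)

# filter(x IN n2 WHERE NOT (x IN n1)): names that only the second set has.
only_in_n2 = [x for x in n2 if x not in n1]
union_size = len(only_in_n2) + len(n1)

# Venn-diagram formula: |A| + |B| - |A intersect B|.
venn_size = len(set(n1)) + len(set(n2)) - len(set(n1) & set(n2))
print(union_size, venn_size)
```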
This may be a bit more Cypher than you planned to write, but I was recently in a similar situation and ended up putting the sets and the intersection in collections and working out the resulting difference.
I am sure there is a better way, but this is what I came up with. Essentially, I found set 1 and set 2 and put each in a collection. Then I found the intersection by collecting all of the things that were the same into another collection called intersection. Then I filtered each of set1 and set2 against the intersection. In the end I am left with two sets that contain the nodes outside of the intersection.
match (things_in_set_1)
where <things in set 1 criteria>
with collect(things_in_set_1.name) as set1
match (things_in_set_2)
where <things in set 2 criteria>
with collect(things_in_set_2.name) as set2, set1
optional match (things_in_set_1),(things_in_set_2)
where things_in_set_1.name = things_in_set_2.name
with collect(things_in_set_1.name) as intersection, set1, set2
with filter( id IN set1 WHERE not(id in(intersection)) ) as set_unique_nodes1, set2, intersection
with filter( id IN set2 WHERE not(id in(intersection)) ) as set_unique_nodes2, set_unique_nodes1
return length(set_unique_nodes2) + length(set_unique_nodes1)
