Cypher: One aggregate function on top of the other - neo4j

I am looking for a way to perform one aggregate function on top of the results of another one. In particular, I would like to join the following two queries:
MATCH (e :Event) - [:ATTENDED_BY] -> (a :Person)
WITH e, collect(a) AS attendants
WHERE ALL (a in attendants WHERE a.Company="XYZ")
RETURN e.name AS name, count(*) as number_occurrences
ORDER BY number_occurrences DESC;
MATCH (e:Event) - [:ATTENDED_BY] -> (a :Person)
WITH e, collect(a) AS attendants
WHERE ALL (a in attendants WHERE a.Company="XYZ")
WITH e.name AS name, count(*) as number_occurrences
RETURN percentileDisc(number_occurrences,0.95) as percentile;
The first query gives the names of all events attended only by people from a single company ("XYZ"), as well as the number of occurrences of those events. The second one returns the minimum number of occurrences for the top 5% most frequent events. What I would like to get is the names and number of occurrences of these 5% most frequent events. Any suggestions?

I managed to solve the query using the WITH clause, of course, but the key was to understand its usage properly. The idea is that only the variables passed in the most recent WITH clause remain visible afterwards. That is why, after we get the "percentile" variable in the first part of the query, we need to keep passing it along in every subsequent WITH clause in the second part of the query.
MATCH (e :Event) - [:ATTENDED_BY] -> (a :Person)
WITH e, collect(a) AS attendants
WHERE ALL (a in attendants WHERE a.Company="XYZ")
WITH e.name AS name, count(*) as number_occurrences
WITH percentileDisc(number_occurrences,0.95) as percentile
MATCH (e :Event) - [:ATTENDED_BY] -> (a :Person)
WITH percentile, e, collect(a) AS attendants
WHERE ALL (a in attendants WHERE a.Company="XYZ")
WITH percentile, e.name AS name, count(*) as number_occurrences
WHERE number_occurrences > percentile
RETURN name, number_occurrences
ORDER BY number_occurrences DESC
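
As a variation, the second MATCH can be avoided by collecting the per-name counts alongside the percentile and unwinding them again. This is a minimal sketch under the same schema as above, not the method of the original answer; the names rows and row are introduced here for illustration, and it uses >= so that events sitting exactly at the 95th-percentile value are kept.

```cypher
// Count occurrences per event name, restricted to events attended only by "XYZ" people
MATCH (e:Event)-[:ATTENDED_BY]->(a:Person)
WITH e, collect(a) AS attendants
WHERE all(x IN attendants WHERE x.Company = "XYZ")
WITH e.name AS name, count(*) AS number_occurrences
// Aggregate once: the threshold plus every (name, count) pair in a list
WITH percentileDisc(number_occurrences, 0.95) AS percentile,
     collect({name: name, number_occurrences: number_occurrences}) AS rows
// Turn the list back into rows, keeping percentile visible
UNWIND rows AS row
WITH percentile, row
WHERE row.number_occurrences >= percentile
RETURN row.name AS name, row.number_occurrences AS number_occurrences
ORDER BY number_occurrences DESC
```

Because percentileDisc returns a value that actually occurs in the data, a strict > comparison would drop the events whose count equals the threshold.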

Related

cypher: not possible to access variables declared before the WITH/RETURN

I have a neo4j DB where I have the following relations:
(:journal)<-[:BELONGS_TO_JOURNAL]-(:article)
(:person)-[:WROTE]->(article)
I would like to write a query that finds, among the authors of articles belonging to the journal with the most articles, the ones who have written the highest number of articles.
The following query gives the journal having the highest number of articles:
match (j:journal)-[:BELONGS_TO_JOURNAL]-()
return j.name,
count(*) as articlesCount
order by articlesCount desc limit 1
And I thought about this other query to get the desired result:
match (j:journal)-[:BELONGS_TO_JOURNAL]-()
with j as j, count(*) as articlesCount
match (j)<-[:BELONGS_TO_JOURNAL]-(a:article)<-[:WROTE]-(p:person)
return p, count(*) as authorsCount order by articlesCount, authorsCount limit 1
but it gives problems because articlesCount cannot be used in the return since count() is used.
Any suggestions?
Try this:
MATCH (j:journal)-[:BELONGS_TO_JOURNAL]-(a:article)<-[:WROTE]-(p:person)
WITH j, p, count(a) AS articlesCount
ORDER BY articlesCount DESC
WITH j, COLLECT({author: p, articlesCount: articlesCount})[0] AS authorWithMostArticlesForAJournal
RETURN authorWithMostArticlesForAJournal.author AS author, authorWithMostArticlesForAJournal.articlesCount AS articlesCount
In this query, we first calculate the articlesCount for each distinct combination of journal and person nodes. Then we sort the records in descending order of count. Finally, for each journal we get the top author by collecting all the records into a list and picking the first element of the list.
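
The answer above returns one top author for every journal. If only the journal with the most articles matters, as in the question, one possible sketch is to pick that journal first and carry articlesCount forward through WITH so that it is still visible at the end. The alias authoredCount is introduced here for illustration.

```cypher
// Step 1: find the single journal with the most articles
MATCH (j:journal)<-[:BELONGS_TO_JOURNAL]-(a:article)
WITH j, count(a) AS articlesCount
ORDER BY articlesCount DESC
LIMIT 1
// Step 2: count articles per author within that journal only
MATCH (j)<-[:BELONGS_TO_JOURNAL]-(a:article)<-[:WROTE]-(p:person)
// articlesCount is listed in RETURN, so it acts as a grouping key next to p
RETURN p AS author, count(a) AS authoredCount, articlesCount
ORDER BY authoredCount DESC
LIMIT 1
```

If several journals or authors tie for first place, LIMIT 1 keeps only one of them.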

Cypher count instances greater than

I am writing a query to display a graph comprising all the journals and their publication places (cities). I would like to filter the query by selecting only the cities which are the publication place of more than 3 journals.
My attempt does give me the cities and the count, but I cannot manage to include journal.name and the relationship in the result.
MATCH (j:journal)-[p:publication_city]->(c:City)
WITH c, count(c) as cnt
WHERE cnt > 3
RETURN c, cnt
ORDER BY cnt
Any change that adds the journal variable to the query above (e.g. WITH c, count(c) as cnt, j) leads to an empty result.
Does anyone know what I am doing wrong?
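You can use the collect() function to gather the journals of each city while counting them, keep only the cities with more than 3 journals, and then UNWIND the list to get one row per journal again. UNWIND works like a "for loop" over the list. Adding j directly to the WITH returns nothing because j then becomes part of the grouping key, so cnt is computed per city and journal pair and never exceeds 3 unless a journal is linked to the same city more than three times.
MATCH (j:journal)-[:publication_city]-(c:City)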
WITH c, count(c) as cnt, collect(j) as journals WHERE cnt > 3
UNWIND journals as journal
RETURN journal, c, cnt
ORDER BY cnt
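
Since the question also asks for the relationship, another option is to filter the cities first and then match the pattern a second time. This is only a sketch that assumes the labels and relationship type from the question; the second MATCH rebinds j and p for the cities that passed the filter.

```cypher
// Count journals per city and keep the cities with more than 3
MATCH (:journal)-[:publication_city]->(c:City)
WITH c, count(*) AS cnt
WHERE cnt > 3
// Re-match from the surviving cities to get the journals and relationships back
MATCH (j:journal)-[p:publication_city]->(c)
RETURN j, p, c, cnt
ORDER BY cnt
```

Returning j, p and c together lets a graph view draw the journals, the cities and the edges between them.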

DELETE the MIN counted data in neo4j

I want to delete some data after doing some counting in neo4j. This can be done manually (count the data, then delete it), but I need someone to tell me whether it is possible to do it automatically (counting and deleting in the same query). I couldn't find a way to return the least frequent data after counting, using the min() function in neo4j. I can probably work around it using ORDER BY and LIMIT, but I need to be sure that there is no other option before I go that way.
This is the link to the data. The data is a custom event log that only consists of case_id and activity name.
So this is what i've already tried:
//LOAD DATA
LOAD CSV with headers FROM "file:///*.csv"
AS line
Create (:Activity {CaseId:line.Case_ID,
Name:line.Activity })
LOAD CSV with headers FROM "file:///*.csv"
AS line
Create (:CaseActivity {CaseId:line.Case_ID,
Name:line.Activity })
//SEQUENCE DISCOVERY
match (c:Activity)
with collect(c) AS Caselist
unwind range(0,Size(Caselist) - 2) as idx
with Caselist[idx] AS s1, Caselist[idx+1] AS s2
match (b:CaseActivity),(a:CaseActivity)
where s1.CaseId = s2.CaseId AND
s1.Name = a.Name AND
s2.Name = b.Name AND
s1.CaseId = a.CaseId AND
s2.CaseId = b.CaseId
merge (a)-[:NEXT {relation:"NEXT"}]->(b)
match(a:Activity)
with a.CaseId as id,
collect (a.Name) as Trace_Type
match(b:CaseActivity)
where id = b.CaseId
return count (distinct b.CaseId) as Frequencies, Trace_Type, Collect(distinct b.CaseId) as CaseId
order by Frequencies desc
Your question did not specify what you wanted to delete. This query assumes that you wanted your last query to delete the b nodes (and return some data about the deleted b nodes):
MATCH (a:Activity)
WITH a.CaseId as id, COLLECT(a.Name) AS Trace_Type
MATCH (b:CaseActivity)
WHERE id = b.CaseId
WITH
COUNT(distinct b.CaseId) AS Frequencies,
Trace_Type,
COLLECT(distinct b.CaseId) AS CaseId,
COLLECT(DISTINCT b) AS bs
FOREACH(x IN bs | DELETE x)
RETURN Frequencies, Trace_Type, CaseId
ORDER BY Frequencies DESC;
Variables containing values obtained from deleted b nodes (like Frequencies and CaseId) will still be valid after the nodes are deleted.
A tricky thing to note about your specific example is that your last WITH clause was using aggregation, with Trace_Type as the grouping key. In order for my answer to avoid changing the grouping key (and thereby possibly changing your returned results), I just added COLLECT(DISTINCT b) AS bs to the WITH clause. Then, since each bs is a list of b nodes (for a Trace_Type), I used FOREACH to delete the nodes in each list.
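
If the goal is to delete only the least frequent trace, the ORDER BY and LIMIT workaround mentioned in the question can be combined with the delete in a single query. This is a sketch that assumes the CaseActivity nodes of that trace are what should be removed. It uses DETACH DELETE because the NEXT relationships created earlier would otherwise block a plain DELETE, and when several traces tie for the minimum, LIMIT 1 picks only one of them.

```cypher
MATCH (a:Activity)
WITH a.CaseId AS id, collect(a.Name) AS Trace_Type
MATCH (b:CaseActivity)
WHERE b.CaseId = id
// Group by trace, keeping the nodes that belong to each trace
WITH Trace_Type,
     count(DISTINCT b.CaseId) AS Frequencies,
     collect(DISTINCT b) AS bs
// Keep only the trace with the lowest frequency
ORDER BY Frequencies ASC
LIMIT 1
// Remove its nodes together with any attached relationships
FOREACH (x IN bs | DETACH DELETE x)
RETURN Trace_Type, Frequencies
```

It is worth running the query without the FOREACH line first to check which trace would be selected.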

Sorting by Elements in Collection in Cypher Query

I'm working on an application using Neo4J and I'm having problems with the sorting in some of the queries. I have a list of stores that have promotions so I have a node for each store and a node for each promotion. Each store can have multiple promotions so it's a one to many relationship. Some of the promotions are featured (featured property = true) so I want those to appear first. I'm trying to construct a query that does the following:
Returns a list of stores with the promotions as a collection (returning it like this is ideal for paging)
Sorts the stores so the ones with most featured promotions appear first
Sorts the collection so that the promotions that are featured appear first
So far I have the following:
MATCH (p:Promotion)-[r:BELONGS_TO_STORE]->(s:Store) WITH p, s, collect(p.featured) as featuredCol WITH p, s, LENGTH(FILTER(i IN featuredCol where i='true')) as featuredCount ORDER BY p.featured DESC, featuredCount DESC RETURN s, collect(p) skip 0 limit 10
First, I try to create a collection using the featured property using a WITH clause. Then, I try to create a second collection where the featured property is equal to true and then get the length in a second WITH clause. This sorts the collection with the promotions correctly but not the stores. I get an error if I try to add another sort at the end like this because the featuredCount variable is not in the RETURN clause. I don't want the featuredCount variable in the RETURN clause because it throws my pagination off.
Here is my second query:
MATCH (p:Promotion)-[r:BELONGS_TO_STORE]->(s:Store) WITH p, s, collect(p.featured) as featuredCol WITH p, s, LENGTH(FILTER(i IN featuredCol where i='true')) as featuredCount ORDER BY p.featured DESC, featuredCount DESC RETURN s, collect(p) ORDER BY featuredCount skip 0 limit 10
I'm very new to Neo4J so any help will be greatly appreciated.
Does this query (see this console) work for you?
MATCH (p:Promotion)-[r:BELONGS_TO_STORE]->(s:Store)
WITH p, s
ORDER BY p.featured DESC
WITH s, COLLECT(p) AS pColl
WITH s, pColl, REDUCE(n = 0, p IN pColl | CASE
WHEN p.featured
THEN n + 1
ELSE n END ) AS featuredCount
ORDER BY featuredCount DESC
RETURN s, pColl
LIMIT 10
This query performs the following steps:
It orders the matched rows so that the rows with featured promotions are first.
It aggregates all the p nodes for each distinct s into a pColl collection. The featured promotions still appear first within each pColl.
It calculates the number of featured promotions in each pColl, and orders the stores so that the ones with the most featured promotions appear first.
It then returns the results.
Note: This query assumes that featured has a boolean value, not a string. (FYI: ORDER BY considers true to be greater than false). If this assumption is not correct, you can change the WHEN clause to WHEN p.featured = 'true'.
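
The same steps can also be written with a list comprehension instead of REDUCE. This is a minimal sketch, still assuming that featured is a boolean; the coalesce() call is an addition here, since a missing featured property is null and null sorts first in descending order, which would otherwise put such promotions ahead of the featured ones.

```cypher
MATCH (p:Promotion)-[:BELONGS_TO_STORE]->(s:Store)
WITH s, p
// Featured promotions first; a missing property is treated as false
ORDER BY coalesce(p.featured, false) DESC
WITH s, collect(p) AS pColl
// Count the featured promotions of each store
WITH s, pColl, size([x IN pColl WHERE x.featured = true]) AS featuredCount
ORDER BY featuredCount DESC
// featuredCount is used for sorting only and is not returned
RETURN s, pColl
SKIP 0 LIMIT 10
```

Changing the SKIP value pages through the stores without adding featuredCount to the result.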

Is there any way to find the incoming relationship property, say "rel_name", for which all nodes in a collection have the same value?

Each node has multiple incoming relationships with different properties. I want to find the incoming relationship property, say "rel_name",
for which ALL (x IN nodes(p)) have the same value.
You can try this (it returns rel_name with count of nodes and collection of nodes):
MATCH (a)<-[r]-()
WITH r.rel_name AS name, count(a) AS count, collect(a) AS coll
WHERE count > 1
RETURN name, count, coll
ORDER BY name
or this (it returns all duplicated nodes):
MATCH (a)<-[r]-()
WITH r.rel_name AS name, count(a) AS count, collect(a) AS coll
WHERE count > 1
UNWIND coll AS c
RETURN c
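
If the collection of nodes is known in advance, one way to read the question is: which rel_name values appear on an incoming relationship of every node in the collection? The sketch below follows that reading. The id property and the $ids parameter are hypothetical and only stand in for whatever identifies the nodes of interest.

```cypher
// Build the collection of target nodes (hypothetical id property and $ids parameter)
MATCH (n)
WHERE n.id IN $ids
WITH collect(n) AS targets
UNWIND targets AS t
MATCH (t)<-[r]-()
WHERE r.rel_name IS NOT NULL
// For each rel_name value, gather the distinct targets that have it
WITH targets, r.rel_name AS name, collect(DISTINCT t) AS covered
// Keep the values that are present on every target node
WHERE size(covered) = size(targets)
RETURN name
```

A value is returned only when every node in the collection has at least one incoming relationship carrying it.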
