why DISTINCT is needed in this Cypher query? - neo4j

The below query is taken from neo4j movie review dataset sandbox:
MATCH (u:User {name: "Some User"})-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean
MATCH (u)-[r:RATED]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WHERE r.rating > mean
WITH u, g, COUNT(*) AS score
MATCH (g)<-[:IN_GENRE]-(rec:Movie)
WHERE NOT EXISTS((u)-[:RATED]->(rec))
RETURN rec.title AS recommendation, rec.year AS year, COLLECT(DISTINCT g.name) AS genres, SUM(score) AS sscore
ORDER BY sscore DESC LIMIT 10
what I can not understand is: why the DISTINCT keyword is required in the query's return statement?. Because the expected results from the last MATCH statement is something like this:
g1,x
g1,y
...
g2,z
g2,v
g2,m
...
gn,m
gn,b
gn,x
where g1,g2,..gn are the set of genres and x,y,z,v,m,b... are a set of movies (in addition there is a user and score column deleted for readability).
So according to my understanding what this query is returning: For each movie return its genres and the sum of their scores.

Assumptions:
Every Movie has a unique title. (This is required for the query to work as is.)
Every Genre has a unique name.
Every Movie has at most one IN_GENRE relationship to each distinct Genre.
Given the above assumptions, you are correct that the DISTINCT is not necessary. That is because the RETURN clause is using rec.title as one of the aggregation grouping keys.

Related

Neo4j: query to find nodes with most relationship

I'm trying to find which movie has the most number of actors in it in my database.Here's what i came up with but it kept giving me blank.
MATCH (m:Movie)
WITH m, SIZE(()-[:ACTED_IN]->(m)) as actorCnt
MATCH (a)-[:ACTED_IN]->(m)
RETURN m, a
Maybe you did not wait long enough, because your query is trying to return all the actors for every movie.
This query should return a list of the actors for the (single) movie with the most actors:
MATCH (m:Movie)
WITH m
ORDER BY SIZE(()-[:ACTED_IN]->(m)) DESC
LIMIT 1
RETURN m, [(a)-[:ACTED_IN]->(m)|a] AS actors
It orders the movies by descending number of actors, takes just the first one, and returns it and a list of all its actors.

aggregate count changes when multiple returns

I am sending a cypher query through php.
match (n:person)-[:watched]->(m:movie)
where m.Title in $mycollection
return count(distinct n.id);
this returns the number of people who have watched movies in my collection.
I actually want to return the list of names, and return n.name works fine.
When I try to return n.name and count(distinct n.id) at the same time, I lose the total count and get the count per row.
match (n:person)-[:watched]->(m:movie)
where m.Title in $mycollection
return n.name, count(distinct n.id);
does not work. The count column appears as 1 for each row.
As I'm using php, I've also tried:
$count = $result->getNodesCount();
to no avail. So I'm using php to count the array. But it feels like Cypher should be able to do it, right?
return n.name, count(distinct n.name) means "return each distinct n.name value and its number of distinct values". The number must always be 1, since a distinct value is, obviously, distinct.
If you are actually looking for the number of times each person had an outgoing relationship to a movie whose title is in $mycollection, do this instead (where count(*) counts the number of times a given n.name was matched):
MATCH (n:person)-->(m:movie)
WHERE m.Title in $mycollection
RETURN n.name, count(*);
Note that the above query omits the [watched] pattern found in your query, since that syntax (with no colon before watched) does no filtering at all. It merely assigns the relationship to a variable named watched, but that variable is not otherwise used, and is therefore superfluous.
If you had intended to use watched as the relationship type, then do this instead:
MATCH (n:person)-[:watched]->(m:movie)
WHERE m.Title in $mycollection
RETURN n.name, count(*);
This modified query returns the number of times each person watched a movie whose title is in $mycollection

Neo4j: Query to find the nodes with most relationships, and their connected nodes

I am using Neo4j CE 3.1.1 and I have a relationship WRITES between authors and books. I want to find the N (say N=10 for example) books with the largest number of authors. Following some examples I found, I came up with the query:
MATCH (a)-[r:WRITES]->(b)
RETURN r,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
When I execute this query in the Neo4j browser I get 10 books, but these do not look like the ones written by most authors, as they show only a few WRITES relationships to authors. If I change the query to
MATCH (a)-[r:WRITES]->(b)
RETURN b,
COUNT(r) ORDER BY COUNT(r) DESC LIMIT 10
Then I get the 10 books with the most authors, but I don't see their relationship to authors. To do so, I have to write additional queries explicitly stating the name of a book I found in the previous query:
MATCH ()-[r:WRITES]->(b)
WHERE b.title="Title of a book with many authors"
RETURN r
What am I doing wrong? Why isn't the first query working as expected?
Aggregations only have context based on the non-aggregation columns, and with your match, a unique relationship will only occur once in your results.
So your first query is asking for each relationship on a row, and the count of that particular relationship, which is 1.
You might rewrite this in a couple different ways.
One is to collect the authors and order on the size of the author list:
MATCH (a)-[:WRITES]->(b)
RETURN b, COLLECT(a) as authors
ORDER BY SIZE(authors) DESC LIMIT 10
You can always collect the author and its relationship, if the relationship itself is interesting to you.
EDIT
If you happen to have labels on your nodes (you absolutely SHOULD have labels on your nodes), you can try a different approach by matching to all books, getting the size of the incoming :WRITES relationships to each book, ordering and limiting on that, and then performing the match to the authors:
MATCH (b:Book)
WITH b, SIZE(()-[:WRITES]->(b)) as authorCnt
ORDER BY authorCnt DESC LIMIT 10
MATCH (a)-[:WRITES]->(b)
RETURN b, a
You can collect on the authors and/or return the relationship as well, depending on what you need from the output.
You are very close: after sorting, it is necessary to rediscover the authors. For example:
MATCH (a:Author)-[r:WRITES]->(b:Book)
WITH b,
COUNT(r) AS authorsCount
ORDER BY authorsCount DESC LIMIT 10
MATCH (b)<-[:WRITES]-(a:Author)
RETURN b,
COLLECT(a) AS authors
ORDER BY size(authors) DESC

Get the full graph of a query in Neo4j

Suppose tha I have the default database Movies and I want to find the total number of people that have participated in each movie, no matter their role (i.e. including the actors, the producers, the directors e.t.c.)
I have already done that using the query:
MATCH (m:Movie)<-[r]-(n:Person)
WITH m, COUNT(n) as count_people
RETURN m, count_people
ORDER BY count_people DESC
LIMIT 3
Ok, I have included some extra options but that doesn't really matter in my actual question. From the above query, I will get 3 movies.
Q. How can I enrich the above query, so I can get a graph including all the relationships regarding these 3 movies (i.e.DIRECTED, ACTED_IN,PRODUCED e.t.c)?
I know that I can deploy all the relationships regarding each movie through the buttons on each movie node, but I would like to know whether I can do so through cypher.
Use additional optional match:
MATCH (m:Movie)<--(n:Person)
WITH m,
COUNT(n) as count_people
ORDER BY count_people DESC
LIMIT 3
OPTIONAL MATCH p = (m)-[r]-(RN) WHERE type(r) IN ['DIRECTED', 'ACTED_IN', 'PRODUCED']
RETURN m,
collect(p) as graphPaths,
count_people
ORDER BY count_people DESC

How to group and count relationships in cypher neo4j

How can I quickly count the number of "posts" made by one person and group them by person in a cypher query?
Basically I have message label nodes and users that posted (Relationship) those messages. I want to count the number of messages posted by each user.
Its a group messages by sender ID and count the number of messages per user.
Here is what I have so far...
START n=node(*) MATCH (u:User)-[r:Posted]->(m:Message)
RETURN u, r, count(r)
ORDER BY count(r)
LIMIT 10
How about this?
MATCH (u:User)-[r:POSTED]->(m:Message)
RETURN id(u), count(m)
ORDER BY count(m)
Have you had a chance to check out the current reference card?
https://neo4j.com/docs/cypher-refcard/current/
EDIT:
Assuming that the relationship :POSTED is only used for posts then one could do something like this instead
MATCH (u:User {name: 'my user'})
RETURN u, size((u)-[:POSTED]->())
This is significantly cheaper as it does not force a traversal to the actual Message.

Resources