Using distinct in Neo4j queries - neo4j

I'm learning Neo4j and I see that some MATCH clauses can return the same node multiple times, so you need to specify DISTINCT to eliminate duplicates, whether they are nodes or aggregated values, as in:
MATCH (p:Person)-[:ACTED_IN]->(m)<-[:DIRECTED]-(d:Person)
WITH p, d, collect(DISTINCT m.title) as movies
RETURN p.name as Actor, movies AS Movies, d.name AS Director
I'm wondering in what cases one would want to keep duplicates.
Many thanks

There are no specific cases in which you always want duplicates. It depends on the functionality you are trying to achieve.
Consider this scenario: we want all the movie titles to which a person is directly linked in any way. In this case you'll probably use DISTINCT, because a Person can be linked to the same movie both as an actor and as a director. The query will be:
MATCH (p:Person)-[]->(m)
WITH p, collect(DISTINCT m.title) as movies
RETURN p.name as Actor, movies AS Movies
In another scenario, you only want the movies a person is linked to as an actor. In this case there is no need for DISTINCT, because a person should be linked to a given movie as an actor only once. So this would suffice:
MATCH (p:Person)-[:ACTED_IN]->(m)
WITH p, collect(m.title) as movies
RETURN p.name as Actor, movies AS Movies
DISTINCT is mostly used inside aggregations (counts, collected lists, etc.) to remove duplicates. You can also use it to remove duplicate rows from the output, but again that depends on the functionality you need; there are no hard and fast rules. If your query returns duplicates and you don't want them, use DISTINCT; otherwise, leave it as it is.
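To make the difference concrete, here is a minimal sketch against the movies example graph. The ACTED_IN and DIRECTED relationship types come from the queries above; the :Movie label and the column aliases are assumptions made for illustration. It puts a case where duplicates are wanted (counting links) next to cases where they are not (listing or counting movies).

```cypher
// Row-level de-duplication: each title appears once in the result,
// even if a person both acted in and directed the movie.
// (Older Neo4j versions write the type list as [:ACTED_IN|:DIRECTED].)
MATCH (p:Person)-[:ACTED_IN|DIRECTED]->(m:Movie)
RETURN DISTINCT m.title AS title;

// Aggregation: count(m) counts every matched row (duplicates kept),
// while count(DISTINCT m) counts each movie only once per person.
MATCH (p:Person)-[:ACTED_IN|DIRECTED]->(m:Movie)
RETURN p.name AS person, count(m) AS links, count(DISTINCT m) AS movies;
```

A person who both acted in and directed the same movie contributes two rows for it, so links can be larger than movies for that person. Keeping the duplicates is the right choice whenever the number of matches is itself the answer.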

Related

Linking nodes of a particular type to each other sequentially based on a node of another type

I want to link all nodes of one type that are associated with a node of a different type to each other. I'll explain this with the help of a diagram. Given below is a dummy representation of a graph that I have created:
COMMAND:
// Movies are unique values in the dataset.
LOAD CSV WITH HEADERS FROM "file:///actors_movies.csv" AS dataset
CREATE (m:Movie{movie:dataset.name})
MERGE (a:Actor{name:dataset.actor})
MERGE (a)-[:ACTED{year:dataset.year}]->(m)
I want my graph to look like the following, where if I query an actor, I should be able to traverse all the movies they've acted in as a sequence:
I'd like a query that creates the graph described above.
Okay, if you only need this for the graphical results, on a per-query, per-actor basis, then you may be able to use virtual relationships via APOC Procedures. These let you create virtual relationships that do not actually exist in the graph but can be visualized in the graph result view. Keep in mind that they only last for the duration of the query and are not saved to the graph, so you'll need to create the virtual relationships in each query where you want to view them.
Here's an example which works for the movies graph (from :play movies in the Neo4j Browser):
MATCH (k:Person{name:'Keanu Reeves'})-[:ACTED_IN]->(m:Movie)
WITH k, m
ORDER BY m.released ASC
WITH k, apoc.coll.pairsMin(collect(m)) as pairs // list of pairs of adjacent nodes
UNWIND pairs as pair
CALL apoc.create.vRelationship(pair[0], 'NEXT_MOVIE', {year:pair[1].released}, pair[1]) YIELD rel
RETURN k, pair[0] as m1, pair[1] as m2, rel
Keep in mind that if you actually want these saved in the graph, you need a separate path through the movies for each actor, so the relationships you create need something like an actorId property. Then, when you MATCH the path of movies for an actor, you can require that every :NEXT_MOVIE relationship on the path carries that actor's id.
That's the only sane way to do this; otherwise you wouldn't know which relationships to traverse, since you need the context of which relationship belongs to which actor.
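A minimal sketch of the persisted variant, using the :Actor, :Movie and :ACTED names from the question's load script. The dataset has no actor id, so this assumes the actor's name is unique and stores it in a hypothetical actor property on each relationship; 'Some Actor' is a placeholder value.

```cypher
// Chain one actor's movies in order of the year on the ACTED relationship.
MATCH (a:Actor {name: 'Some Actor'})-[r:ACTED]->(m:Movie)
WITH a, m, r
ORDER BY r.year ASC
WITH a, collect(m) AS movies
// range(0, size - 2) is empty when the actor has only one movie
UNWIND range(0, size(movies) - 2) AS i
WITH a, movies[i] AS m1, movies[i + 1] AS m2
MERGE (m1)-[:NEXT_MOVIE {actor: a.name}]->(m2);

// Traverse only the chain that belongs to that actor.
MATCH path = (:Movie)-[:NEXT_MOVIE*]->(:Movie)
WHERE all(rel IN relationships(path) WHERE rel.actor = 'Some Actor')
RETURN path
ORDER BY length(path) DESC
LIMIT 1;
```

Note that year is loaded from CSV as a string, so the ordering is lexicographic; that works for four-digit years, but converting with toInteger at load time would be more robust.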

How to find relationships of 2 types of nodes in Neo4j?

I have two types of nodes stored in Neo4j db: Person and Group. I defined a relationship "IN" between Person and Group: (p:Person)-[:IN]->(g:Group).
The question is: given 2 Person nodes, how can I find the relationships between them?
I know I can use Cypher queries such as
MATCH (p1:Person{pid: '11231'})-[:IN]->(g:Group)<-[:IN]-(p2:Person{pid: '1231231'})
RETURN p1,p2,g;
But how can I describe a multi-hop relationship so that Neo4j can find links between two not directly linked Person nodes? I don't know how many hops are required to link these two Person nodes.
You can use the Cypher query below:
MATCH (p1:Person{pid: '11231'})-[:IN*]-(p2:Person{pid: '1231231'}) return p1,p2;
Note: the * after the relationship type IN makes the pattern variable-length, so it matches any number of hops.
However, an unbounded pattern is inefficient, so you should limit the number of hops like this:
MATCH (p1:Person{pid: '11231'})-[:IN*1..5]-(p2:Person{pid: '1231231'}) return p1,p2;
More relationship match patterns can be found in the patterns section of the Cypher refcard:
https://neo4j.com/docs/cypher-refcard/current/
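If only the closest connection between the two people matters, Cypher's built-in shortestPath function is one option. The sketch below reuses the pid values from the question; the upper bound of 10 hops is an arbitrary assumption and should be tuned to the data.

```cypher
// Look up both endpoints first, then ask for a single shortest path
// made only of IN relationships, in either direction.
MATCH (p1:Person {pid: '11231'}), (p2:Person {pid: '1231231'})
MATCH path = shortestPath((p1)-[:IN*..10]-(p2))
RETURN path, length(path) AS hops;
```

Because every IN relationship connects a Person to a Group, any path between two Person nodes alternates between people and groups, so hops will be an even number.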

matching between arrays in neo4j

So, I'm trying to build a basic recommendation system. I first get what the people who liked this movie also liked (user-based collaborative filtering), which gives me a chunk of varied movies, because, let's say, people who liked Toy Story may also like sci-fi movies. But movies of that type have very little to do with Toy Story, so I want to filter the results again by genre. Toy Story has 5 genres (Animation, Action, Adventure, etc.), and I want to get only the movies that share these genres.
This is my Cypher query:
match (x)<-[:HAS_GENRE]-(ee:Movie{id:1})<-[:RATED{rating: 5}]
-(usr)-[:RATED{rating: 5}]->(another_movie)<-[:LINK]-(l2:Link),
(another_movie)-[:HAS_GENRE]->(y:Genre)
WHERE ALL (m IN x.name WHERE m IN y.name)
return distinct y.name, another_movie, l2.tmdbId limit 200
The first record I get back is Star Wars (1977), in which only the Adventure genre matches Toy Story's genres. Please help me write better Cypher.
There are a few things we can do to improve the query.
Collecting the genres should allow for the correct WHERE ALL clause later. We can also hold off on matching to the recommended movie's Link node until we filter down to the movies we want to return.
Give this one a try:
MATCH (x)<-[:HAS_GENRE]-(ee:Movie{id:1})
// collect genres so only one result row so far
WITH ee, COLLECT(x) as genres
MATCH (ee)<-[:RATED{rating: 5}]-()-[:RATED{rating: 5}]->(another_movie)
WITH DISTINCT genres, another_movie
// don't match on genre until previous query filters results on rating
MATCH (another_movie)-[:HAS_GENRE]->(y:Genre)
WITH genres, another_movie, COLLECT(y) as gs
WHERE size(genres) <= size(gs) AND ALL (genre IN genres WHERE genre IN gs)
WITH another_movie limit 200
// only after we limit results should we match to the link
MATCH (another_movie)<-[:LINK]-(l2:Link)
RETURN another_movie, l2.tmdbId
As movies are likely going to have many many ratings, the match to find movies both rated 5 is going to be the most expensive part of the query. If many of your queries rely on a rating of 5, you may want to consider creating a separate [:MAX_RATED] relationship whenever a user rates a movie a 5, and use those [:MAX_RATED] relationships for queries like these. That ensures that you don't initially match to a ton of rated movies that all have to be filtered by their rating value.
Alternately, if you want to consider recommendations based on average ratings for movies, you may want to consider caching a computed average of all ratings for every movie (maybe the computation gets rerun for all movies a couple times a day). If you add an index on the average rating property on movie nodes, it should provide faster matching to movies that are rated similarly.
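Both suggestions can be sketched as one-off maintenance queries. The :RATED relationship and its rating property come from the question; the avgRating property name is hypothetical, and in a real system these statements would have to be rerun (or maintained on write) as new ratings arrive.

```cypher
// Materialize a dedicated relationship for every 5-star rating,
// so later queries can traverse MAX_RATED without filtering on a property.
MATCH (u)-[:RATED {rating: 5}]->(m:Movie)
MERGE (u)-[:MAX_RATED]->(m);

// Cache the average rating on each movie node.
MATCH (m:Movie)<-[r:RATED]-()
WITH m, avg(r.rating) AS avgRating
SET m.avgRating = avgRating;
```

The index on the cached property is created with a separate statement whose syntax depends on the Neo4j version (for example CREATE INDEX ON :Movie(avgRating) in 3.x, and CREATE INDEX FOR (m:Movie) ON (m.avgRating) in 4.x and later).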

In Neo4j for every disjoint subgraph return the node with the most relationships

I’m new to Neo4j and graph theory and I’m trying to figure out if I can use Neo4j to solve a problem I have. Please correct me if I’m using the wrong words to describe stuff. Since I’m new to the subject I haven’t really wrapped my head around what to call everything.
I think the easiest way to describe my problem is with a lot of pictures.
Let’s say you have two disjoint subgraphs that look like this.
From the subgraphs above I want to get a list of subgraphs that fulfill one of two criteria.
Criterion 1.
If a node has a unique relationship to another node, the nodes and relationship should be returned as a subgraph.
Criterion 2.
If the relationships are not unique, I'd like the node with the most relationships to be returned, as a subgraph with its relationships and related nodes.
If several nodes tie under criterion 2, I want all of their subgraphs to be returned.
Or put in the context of this graph,
Give me the people who have unique games, and if there are other people having the same games, give me back the person with the most games. If they come in tie, return all people who come in tie.
Or actually, return the whole subgraph, not only the person.
To clarify what I am after here is a picture that describes the result I want to get. The ordering of the result is not important.
Disjoint subgraph A, because of criterion 1: Andrew is the only person who has Bubble Bobble.
Disjoint subgraph B, because of criterion 1: Johan is the only person who has Puzzle Bobble 1.
Disjoint subgraph C, because of criterion 2: Julia, since she has the most games.
Disjoint subgraph D, because of criterion 2: Anna, since she ties with Julia for the most games.
Worth noting is that Johan's relationship to Puzzle Bobble 2 is not returned, because it's not unique and he does not have the most games.
Is this a problem you could solve with only Neo4j and is it a good idea?
If you could solve it how would you do it in Cypher?
Create script:
CREATE (p1:Person {name:"Johan"}),
(p2:Person {name:"Julia"}),
(p3:Person {name:"Anna"}),
(p4:Person {name:"Andrew"}),
(v1:Videogame {name:"Puzzle Bobble 1"}),
(v2:Videogame {name:"Puzzle Bobble 2"}),
(v3:Videogame {name:"Puzzle Bobble 3"}),
(v4:Videogame {name:"Puzzle Bobble 4"}),
(v5:Videogame {name:"Bubble Bobble"}),
(p1)-[:HAS]->(v1),
(p1)-[:HAS]->(v2),
(p2)-[:HAS]->(v2),
(p2)-[:HAS]->(v3),
(p2)-[:HAS]->(v4),
(p3)-[:HAS]->(v2),
(p3)-[:HAS]->(v3),
(p3)-[:HAS]->(v4),
(p4)-[:HAS]->(v5)
I feel like this solution might not be quite what you're looking for, but it could be a good start:
MATCH (game:Videogame)<-[:HAS]-(owner:Person)
OPTIONAL MATCH (owner)-[:HAS]->(other_game:Videogame)
WITH game, owner, count(other_game) AS other_game_count
ORDER BY other_game_count DESC
RETURN game, collect(owner)[0]
Here the query:
Finds all of the games and their owners (games without owners will not be matched)
Does an OPTIONAL MATCH against any other games those owners might own (by doing an optional match we're saying that it's OK if they own zero)
Passes through each game/owner pair along with a count of the number of other games owned by that owner, sorting so that those with the most games come first
Returns the first owner for each game (the order is preserved when doing the collect)
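The query above keeps only one owner per game, so ties are lost. One possible variant, sketched here against the create script from the question, keeps every owner whose game count equals the maximum for that game. The aliases game_count, candidates, top and top_owners are illustrative names, not part of the original answer.

```cypher
// Count how many games each person owns.
MATCH (owner:Person)-[:HAS]->(g:Videogame)
WITH owner, count(g) AS game_count
// For each game, gather its owners with their counts and find the largest count.
MATCH (owner)-[:HAS]->(game:Videogame)
WITH game,
     collect({owner: owner, games: game_count}) AS candidates,
     max(game_count) AS top
// Keep all owners that reach the largest count, so ties are preserved.
RETURN game, [c IN candidates WHERE c.games = top | c.owner] AS top_owners;
```

With the sample data, Puzzle Bobble 2 should come back with both Julia and Anna (three games each), while Puzzle Bobble 1 and Bubble Bobble each have a single owner. Assembling the full subgraphs per person from these rows is left as a further step.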

Cypher - odd behavior when matching for nodes that don't have a relationship

I've been trying to filter all nodes that are not connected to nodes of a given type, and discovered an odd behavior.
Specifically in my current small example, I had two actors connected to a single movie, and another movie with nothing connected to it.
This query worked fine:
MATCH (a:Actor) WHERE NOT (a)-->(:Movie) RETURN a
It returned no actors, as both actors starred in my one movie.
However, when I wrote it like this
MATCH (a:Actor),(m:Movie) WHERE NOT (a)-->(m) RETURN a
it returned both actors.
In reverse, the query
MATCH (m:Movie) WHERE NOT (m)<--(:Actor) RETURN m
worked as expected, returning the movie nobody starred in, but this time,
MATCH (m:Movie),(a:Actor) WHERE NOT (m)<--(a) RETURN m
also returned only the movie nobody starred in! What was odd, though, is that it returned 2 rows, both of them being the movie nobody starred in.
All in all, I'm completely confused.
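The simple answer is that you are generating a cartesian product when you ask for all movies and all actors: every actor is paired with every movie. The filter removes the pairs where the actor is connected to the movie, but each actor is still paired with the second movie, in which neither actor appeared, so both actors are returned. For the same reason, the movie nobody starred in is returned twice, once for each actor it is paired with.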
MATCH (a:Actor),(m:Movie) WHERE NOT (a)-->(m) RETURN a
What was your express goal with the second query?
If your data set really is only 4 nodes, try these queries in succession and I think you will see what is happening.
Full cartesian product
MATCH (a:Actor),(m:Movie) RETURN m,a
With the where clause added
MATCH (a:Actor),(m:Movie) WHERE NOT (a)-->(m) RETURN m,a
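As a further illustration, grouping the surviving pairs per actor makes it visible what the filtered cartesian product actually answers: not "which actors have no movie" but "for each actor, which movies are they not connected to". This is a sketch using only the labels from the question; the aliases are illustrative.

```cypher
// Each remaining row is an (actor, movie) pair with no relationship between them.
MATCH (a:Actor), (m:Movie)
WHERE NOT (a)-->(m)
RETURN a AS actor, collect(m) AS unconnected_movies;
```

In the four-node example, each actor should appear once with a list containing the movie nobody starred in, which is why the original query returned both actors.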
