Is there a simpler version of this cypher query? - neo4j

I have constructed a query to find the people who follow each other and who have read books in the same genre. Here it is:
MATCH (u1:User)-[:READ]->(b1:Book)
WITH collect(DISTINCT b1.genre) AS genres,u1 AS user1
MATCH (u2:User)-[:READ]->(b2:Book)
WHERE (user1)<-[:FOLLOWS]->(u2) AND b2.genre IN genres
RETURN DISTINCT user1.username AS user1,u2.username AS user2
The idea is that we collect all the book genres for one of them, and if a book read by the other is in that list of genres (and they follow each other), then we return those users. This seems to work: we get a list of distinct pairs of individuals. I wonder, though, if there is a quicker way to do this? My solution seems somewhat clumsy, but I found it surprisingly finicky trying to specify that they have read a book in the same genre without getting back all the pairs of books and duplicating individuals. For example, I first wrote the following:
MATCH (b1:Book)<-[:READ]-(u1:User)-[:FOLLOWS]-(u2:User)-[:READ]->(b2:Book)
WHERE b1.genre = b2.genre
RETURN DISTINCT u1.username AS user1, u2.username AS user2
Which seems simpler, but in fact it returned repeated names for all the books that were read in the same genre. Is my solution the simplest, or is there a simpler one?

This is one way of rewriting the query:
MATCH (n1:User)-[:FOLLOWS]-(n2:User)
MATCH (n1)-[:READ]->(book), (n2)-[:READ]->(book2)
WHERE book.genre = book2.genre
RETURN n1.username, n2.username, count(*)
Here is another, collecting the genres for each user:
MATCH (n1:User)-[:FOLLOWS]-(n2:User)
WITH n1, n2,
[(n1)-[:READ]->(book) | book.genre] AS g1,
[(n2)-[:READ]->(book) | book.genre] AS g2
WHERE ANY(x IN g1 WHERE x IN g2)
RETURN n1, n2, count(*)
Note that query length is not a good measure of quality: what matters is that the way the data is retrieved makes sense to you.
Your model, however, clearly shows that you would benefit from a bit of graph refactoring, extracting the genre into its own node, for example:
MATCH (n:Book)
MERGE (g:Genre {name: n.genre})
MERGE (n)-[:HAS_GENRE]->(g)
And this would be the new query, which takes advantage of the refactored graph model:
PROFILE
MATCH (n1:User)-[:FOLLOWS]-(n2:User)
WHERE (n1)-[:READ]->()-[:HAS_GENRE]->()<-[:HAS_GENRE]-()<-[:READ]-(n2)
RETURN n1.username, n2.username, count(*)
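If you would rather keep the original model (genre as a property on Book), the duplicates described in the question can also be avoided with an existential subquery. This is only a sketch: it assumes a Neo4j version that supports EXISTS { ... } subqueries (4.0 or later) and that usernames are unique, so that the username comparison keeps one row per mutual pair instead of two mirrored ones.

```cypher
// Each pair of users connected by FOLLOWS (in either direction) is matched once per ordering;
// the username comparison keeps only one ordering (assumes unique usernames).
MATCH (u1:User)-[:FOLLOWS]-(u2:User)
WHERE u1.username < u2.username
  // The subquery only tests whether at least one same-genre pair of books exists,
  // so it never multiplies the result rows.
  AND EXISTS {
    MATCH (u1)-[:READ]->(b1:Book), (u2)-[:READ]->(b2:Book)
    WHERE b1.genre = b2.genre
  }
RETURN DISTINCT u1.username AS user1, u2.username AS user2
```

The DISTINCT is still needed in case two users follow each other in both directions, since each FOLLOWS relationship produces its own row.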

Related

Efficient way to find common relationship

I've recently started learning Cypher. I have a database containing four users and films. Users can have [:WATCHED] / [:WATCHLISTED] / [:FAVORITED] relationships with films.
I want to get the films which all four users have watched. Here's a working query I've written:
match (u1)-[:WATCHED]->(f)<-[:WATCHED]-(u2),
(u3)-[:WATCHED]->(f)<-[:WATCHED]-(u4)
return u1, u2, u3, u4, f
I wanted to know if there was a more efficient way to do this, or any other way that I can't think of. I'm asking this out of curiosity.
You can do this, for example:
MATCH (f:Film)
WHERE size((f)<-[:WATCHED]-()) = 4
RETURN f, [(f)<-[:WATCHED]-(u:User) | u] as watchers
Here I assume that there is only one relationship of type WATCHED between a user and a movie, even if the user has watched the movie many times.
To avoid having to hardcode a User node count, this query efficiently gets the count using the DB's internal statistics:
MATCH (u:User)
WITH COUNT(u) AS userCount
MATCH (f:Film)
WHERE SIZE((f)<-[:WATCHED]-()) = userCount
RETURN f;
This query does not return the users that watched the film, since that is literally all the users in the DB, and with a sufficiently large number of them your query can run out of memory -- or it can take a very long time for a client (like the neo4j Browser) to receive and process the results. I think the main point of a query like this is to find the films, not the users. If you really want to get all the users, a separate query will do: MATCH (u:User) RETURN u.
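One caveat for newer installations: Neo4j 5 no longer accepts size() applied directly to a pattern, and a COUNT subquery is the replacement. A hedged sketch of the same query in that style, keeping the assumption stated above that each user has at most one WATCHED relationship per film:

```cypher
// Count all users once, then keep the films whose number of watchers equals that count.
MATCH (u:User)
WITH count(u) AS userCount
MATCH (f:Film)
// COUNT { ... } counts the matches of the pattern for the current film.
WHERE COUNT { (f)<-[:WATCHED]-(:User) } = userCount
RETURN f
```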
You can use all:
https://neo4j.com/docs/developer-manual/current/cypher/functions/predicate/
This checks if a predicate is true for all elements.
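As an illustration of that suggestion, a minimal sketch using all() with the labels and relationship type from the question. Collecting every user into one list is an assumption that only makes sense for a small number of users, as in the four-user database described.

```cypher
// Gather all users into a single list.
MATCH (u:User)
WITH collect(u) AS users
MATCH (f:Film)
// Keep a film only if every user in the list has a WATCHED relationship to it.
WHERE all(u IN users WHERE (u)-[:WATCHED]->(f))
RETURN f
```

Unlike the counting approach, this does not depend on there being only one WATCHED relationship per user and film.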

neo4j find who has acted in all of the movies that someone acts in

I use the cineasts database (actors and movies). It has the relationship (:ACTOR)-[:ACTED_IN]->(:Movie). Now I want to find the actors who have acted in all of the movies that actor "abc" acts in.
My idea is first to get the movie collection of "abc" using WITH COLLECT, then use ALL() to find the required actors. But I am not sure how to write the filter in ALL(). How should I write it?
Take a look at this Neo4j knowledge base article on performing match intersection.
The kind of queries you're looking for are going to be similar.
For example, using the first technique mentioned, we can do something like this:
MATCH (abc:Actor{name:'abc'})-[:ACTED_IN]->(m:Movie)
WITH abc, collect(distinct m) as movies
WITH abc, movies, size(movies) as movieCnt
UNWIND movies as m
MATCH (m)<-[:ACTED_IN]-(a:Actor)
WHERE abc <> a
WITH a, collect(distinct m) as commonMovies, movieCnt
WHERE size(commonMovies) = movieCnt
RETURN a
If you wanted to use the alternate approach with ALL(), it might look like this:
MATCH (abc:Actor{name:'abc'})-[:ACTED_IN]->(m:Movie)
WITH abc, collect(distinct m) as movies
WITH abc, movies, head(movies) as first
MATCH (first)<-[:ACTED_IN]-(a:Actor)
WHERE abc <> a AND ALL(m in movies WHERE (m)<-[:ACTED_IN]-(a))
RETURN a
We start the match from the first of the movies collection so we start from a relevant set of :Actors instead of having to filter starting from all :Actor nodes. That can be improved further if we sort the movies by the number of actors ascending first, since that will lead to the narrowest starting pool of coactors.
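The sorting idea in the last sentence could look roughly like the following. It is an untested sketch: castSize is a name introduced here for illustration, the COUNT { ... } subquery assumes Neo4j 5 (on older versions size((m)<-[:ACTED_IN]-()) plays the same role), and it assumes collect() keeps the row order produced by ORDER BY.

```cypher
MATCH (abc:Actor {name: 'abc'})-[:ACTED_IN]->(m:Movie)
WITH DISTINCT abc, m
// Number of actors per movie, used to put the smallest cast first.
WITH abc, m, COUNT { (m)<-[:ACTED_IN]-(:Actor) } AS castSize
ORDER BY castSize ASC
WITH abc, collect(m) AS movies
// head(movies) is now the movie with the fewest actors, i.e. the narrowest starting pool.
WITH abc, movies, head(movies) AS first
MATCH (first)<-[:ACTED_IN]-(a:Actor)
WHERE abc <> a AND all(m IN movies WHERE (m)<-[:ACTED_IN]-(a))
RETURN a
```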

Neo4j: multiple counts from multiple matches

Given a neo4j schema similar to
(:Person)-[:OWNS]-(:Book)-[:CATEGORIZED_AS]-(:Category)
I'm trying to write a query to get the count of books owned by each person as well as the count of books in each category so that I can calculate the percentage of books in each category for each person.
I've tried queries along the lines of
match (p:Person)-[:OWNS]-(b:Book)-[:CATEGORIZED_AS]-(c:Category)
where p.name in []
with p, b, c
match (p)-[:OWNS]-(b2:Book)-[:CATEGORIZED_AS]-(c2:Category)
with p, b, c, b2
return p.name, b.name, c.name,
count(distinct b) as count_books_in_category,
count(distinct b2) as count_books_total
But the query plan is absolutely horrible when trying to do the second match. I've tried to figure out different ways to write the query so that I can do the two different counts, but haven't figured out anything other than doing two matches. My schema isn't really about people and books. The :CATEGORIZED_AS relationship in my example is actually a few different relationship options, specified as [:option1|option2|option3]. So in my 2nd match I repeat the relationship options so that my total count is constrained by them.
Ideas? This feels similar to Neo4j - apply match to each result of previous match but there didn't seem to be a good answer for that one.
UNWIND is your friend here. First, calculate the total books per person, collecting them as you go.
Then unwind them so you can match which categories they belong to.
Aggregate by category and person, and you should get the number of books in each category for a person:
match (p:Person)-[:OWNS]->(b:Book)
with p,collect(b) as books, count(b) as total
with p,total,books
unwind books as book
match (book)-[:CATEGORIZED_AS]->(c)
return p,c, count(book) as subtotal, total
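To get from those two counts to the percentage the question asks for, the division can be done in the same query. A sketch under the assumption that Person, Book and Category nodes carry a name property, as in the question's own attempt:

```cypher
MATCH (p:Person)-[:OWNS]->(b:Book)
WITH p, collect(b) AS books, count(b) AS total
UNWIND books AS book
MATCH (book)-[:CATEGORIZED_AS]->(c:Category)
// Group by person and category; total is carried along as a grouping key.
WITH p, c, count(book) AS subtotal, total
RETURN p.name AS person, c.name AS category, subtotal, total,
       100.0 * subtotal / total AS percentage  // 100.0 forces floating-point division
```

A book that belongs to several categories is counted once in each of them, so the percentages for one person can add up to more than 100.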

Select nodes that has all relationships in Neo4j

Suppose I have two kinds of nodes, Person and Competency. They are related by a KNOWS relationship. For example:
(:Person {id: 'thiago'})-[:KNOWS]->(:Competency {id: 'neo4j'})
How do I query this schema to find every Person that knows all the nodes in a set of Competency nodes?
Suppose that I need to find every Person that knows both "java" and "haskell"; I'm only interested in the nodes that know all of the listed Competency nodes.
I've tried this query:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id;
But I get back a list of every Person that knows either "java" or "haskell", with duplicated entries for those who know both.
Adding a count(c) at the end of the query eliminates the duplicates:
match (p:Person)-[:KNOWS]->(c:Competency) where c.id in ['java','haskell'] return p.id, count(c);
Then, in this particular case, I can iterate over the result and filter out rows whose count is less than two to get the nodes I want.
I've found out that I could do it appending consecutive match clauses to keep filtering the nodes to get the result I want, in this case:
match (p:Person)-[:KNOWS]->(:Competency {id:'haskell'})
match (p)-[:KNOWS]->(:Competency {id:'java'})
return p.id;
Is this the only way to express this query? I mean, I need to create a query by concatenating strings? I'm looking for a solution to a fixed query with parameters.
with ['java','haskell'] as skills
match (p:Person)-[:KNOWS]->(c:Competency)
where c.id in skills
with p, count(*) as c1, size(skills) as c2
where c1 = c2
return p.id
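Since the question asks for a fixed query with parameters, the same counting technique can take the list as a query parameter. A minimal sketch, assuming a parameter named skills (a name chosen here for illustration) that holds a list of competency ids without duplicates, for example ['java', 'haskell']:

```cypher
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.id IN $skills
// DISTINCT guards against a person having several KNOWS relationships to the same competency.
WITH p, count(DISTINCT c) AS matched
WHERE matched = size($skills)
RETURN p.id
```

The query text stays constant no matter how many competencies are requested; only the parameter value changes.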
One thing you can do is count all the skills, then find the users whose number of skill relationships equals that count:
MATCH (n:Skill) WITH count(n) as skillMax
MATCH (u:Person)-[:HAS]->(s:Skill)
WITH u, count(s) as skillsCount, skillMax
WHERE skillsCount = skillMax
RETURN u, skillsCount
Chris
Untested, but this might do the trick:
match (p:Person)-[:KNOWS]->(c:Competency)
with p, collect(c.id) as cs
where all(x in ['java', 'haskell'] where x in cs)
return p.id;
How about this...
WITH ['java','haskell'] AS comp_col
MATCH (p:Person)-[:KNOWS]->(c:Competency)
WHERE c.name in comp_col
WITH comp_col
, p
, count(*) AS total
WHERE total = size(comp_col)
RETURN p.name, total
Put the competencies you want in a collection.
Match all the people that have either of those competencies
Get the count of competencies per person, keeping those where it equals the number of competencies in the collection from the start
I think this will work for what you need, but if you are building these queries programmatically, the best performance you get might be with successive match clauses. Especially if you knew which competencies were most/least common when building your queries, you could order the matches such that the least common were first and the most common were last. I think that would chunk down to your desired persons the fastest.
It would be interesting to see what the plan analyzer in the shell says about the different approaches.

Neo4j cypher query with variable relationship path length

I'm moving my complex user database where users can be on one of many teams, be friends with each other and more to Neo4j. Doing this in a RDBMS was painful and slow, but is simple and blazing with Neo4j. :)
I was hoping there is a way to query for
a relationship that is 1 hop away and
another relationship that is 2 hops away
from the same query.
START n=node:myIndex(user='345')
MATCH n-[:IS_FRIEND|ON_TEAM*2]-m
RETURN DISTINCT m;
The reason is that users that are friends are one edge from each other, but users linked by teams are linked through that team node, so they are two edges away. This query does IS_FRIEND*2 and ON_TEAM*2, which gets teammates (yeah) and friends of friends (boo).
Is there a succinct way in Cypher to get both differing length relations in a single query?
I rewrote it to return a collection:
start person=node(1)
match person-[:IS_FRIEND]-friend
with person, collect(distinct friend) as friends
match person-[:ON_TEAM*2]-teammate
with person, friends, collect(distinct teammate) as teammates
return person, friends + filter(dupcheck in teammates: not(dupcheck in friends)) as teammates_and_friends
http://console.neo4j.org/r/oo4dvx
Thanks for putting together the sample db, Werner.
I have created a small test database at http://console.neo4j.org/?id=sqyz7i
I have also created a query which will work as you described:
START n=node(1)
MATCH n-[:IS_FRIEND]-m
WITH collect(distinct id(m)) as a, n
MATCH n-[:ON_TEAM*2]-m
WITH collect(distinct id(m)) as b, a
START n=node(*)
WHERE id(n) in a + b
RETURN n
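Both answers above use legacy syntax (START, patterns without parentheses, filter()) that current Neo4j releases no longer accept. A rough modern equivalent is sketched below. The :User label and the user property are assumptions standing in for the legacy index lookup myIndex(user='345') in the question.

```cypher
// Friends: one hop over IS_FRIEND.
MATCH (n:User {user: '345'})-[:IS_FRIEND]-(m)
RETURN m
// UNION (without ALL) removes duplicates, so a friend who is also a teammate appears once.
UNION
// Teammates: two hops over ON_TEAM, through the shared team node.
MATCH (n:User {user: '345'})-[:ON_TEAM*2]-(m)
RETURN m
```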
