Looking at the example from the GrandStack movies workshop: https://github.com/grand-stack/grand-stack-movies-workshop/blob/master/neo4j-database/answers.md
The query proposed there for recommended movies is:
MATCH (m:Movie) WHERE m.movieId = $movieId
MATCH (m)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(movie:Movie)
WITH m, movie, COUNT(*) AS genreOverlap
MATCH (m)<-[:RATED]-(:User)-[:RATED]->(movie:Movie)
WITH movie,genreOverlap, COUNT(*) AS userRatedScore
RETURN movie ORDER BY (0.9 * genreOverlap) + (0.1 * userRatedScore) DESC LIMIT 3
Wouldn't this query be biased, in the sense that it only calculates userRatedScore for movies that share at least one genre with the movie whose movieId is $movieId?
What would a rewritten query look like that computes both scores independently, so that userRatedScore is still calculated for a movie even if it shares no genres with the movie whose movieId is $movieId?
If you'd like to ignore the weighting provided by Genre, then you could drop the part of the query that seeks it, something like:
MATCH (m:Movie {movieId: $movieId})<-[:RATED]-(:User)-[:RATED]->(movie:Movie)
WITH movie, COUNT(*) AS userRatedScore
RETURN movie ORDER BY (0.1 * userRatedScore) DESC LIMIT 3
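If you want to keep both scores rather than drop one, the thing to notice is that a second MATCH only sees rows that survived the first. Below is a minimal Python model of that row filtering, not the workshop's code. The movie names and counts are made up, and `chained`, `independent` and `top` are hypothetical helpers. It contrasts chaining the two matches with scoring each candidate independently, treating a missing score as 0.

```python
# Toy model of the two scores; all data below is made up for illustration.
genre_overlap = {"B": 2, "C": 1}        # movies sharing at least one genre with the target
user_rated_score = {"B": 3, "D": 40}    # movies rated by users who also rated the target

def chained(genre, rated):
    """Mimics MATCH ... WITH ... MATCH: a movie must appear in both maps."""
    return {m: 0.9 * genre[m] + 0.1 * rated[m] for m in genre if m in rated}

def independent(genre, rated):
    """Scores every candidate from either map; a missing score counts as 0."""
    candidates = set(genre) | set(rated)
    return {m: 0.9 * genre.get(m, 0) + 0.1 * rated.get(m, 0) for m in candidates}

def top(scores, n=3):
    """Highest score first, ties broken by name so the order is stable."""
    return sorted(scores, key=lambda m: (-scores[m], m))[:n]

print(top(chained(genre_overlap, user_rated_score)))      # ['B']
print(top(independent(genre_overlap, user_rated_score)))  # ['D', 'B', 'C']
```

In the chained version, movie D (heavily co-rated, no shared genre) and movie C (shared genre, no co-rating) both drop out. In Cypher, the usual way to get the second behaviour is OPTIONAL MATCH, which keeps the row when the pattern finds nothing.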
The query below is taken from the Neo4j movie review dataset sandbox:
MATCH (u:User {name: "Some User"})-[r:RATED]->(m:Movie)
WITH u, avg(r.rating) AS mean
MATCH (u)-[r:RATED]->(m:Movie)-[:IN_GENRE]->(g:Genre)
WHERE r.rating > mean
WITH u, g, COUNT(*) AS score
MATCH (g)<-[:IN_GENRE]-(rec:Movie)
WHERE NOT EXISTS((u)-[:RATED]->(rec))
RETURN rec.title AS recommendation, rec.year AS year, COLLECT(DISTINCT g.name) AS genres, SUM(score) AS sscore
ORDER BY sscore DESC LIMIT 10
What I cannot understand is why the DISTINCT keyword is required in the query's RETURN clause. The expected result of the last MATCH clause is something like this:
g1,x
g1,y
...
g2,z
g2,v
g2,m
...
gn,m
gn,b
gn,x
where g1, g2, ..., gn are the genres and x, y, z, v, m, b, ... are movies (the user and score columns are omitted for readability).
So, as I understand it, the query returns, for each movie, its genres and the sum of their scores.
Assumptions:
Every Movie has a unique title. (This is required for the query to work as is.)
Every Genre has a unique name.
Every Movie has at most one IN_GENRE relationship to each distinct Genre.
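Given the above assumptions, you are correct that the DISTINCT is not necessary. That is because the RETURN clause uses rec.title as one of the aggregation grouping keys, so each group holds the rows of a single movie, and by the third assumption each genre appears in that group at most once.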
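To make the role of the third assumption concrete, here is a small Python model of the RETURN clause's grouping. It is only a sketch: the rows are made-up data and `aggregate` is a hypothetical helper standing in for COLLECT / COLLECT(DISTINCT ...) and SUM grouped by title.

```python
from collections import defaultdict

# Rows as produced by the last MATCH: (genre name, score, movie title).
# Made-up data; each (movie, genre) pair appears once, as the assumptions require.
rows = [("Action", 5, "X"), ("Comedy", 2, "X"), ("Action", 5, "Y")]

def aggregate(rows, distinct):
    """Group by title; collect genre names (optionally de-duplicated) and sum scores."""
    genres, total = defaultdict(list), defaultdict(int)
    for genre, score, title in rows:
        if not (distinct and genre in genres[title]):
            genres[title].append(genre)
        total[title] += score  # SUM(score) is not affected by DISTINCT on the genres
    return {t: (genres[t], total[t]) for t in total}

# With unique (movie, genre) pairs, DISTINCT changes nothing.
print(aggregate(rows, True) == aggregate(rows, False))  # True

# If a movie had two IN_GENRE relationships to the same genre, it would matter.
dup_rows = rows + [("Action", 5, "X")]
print(aggregate(dup_rows, True)["X"])   # (['Action', 'Comedy'], 12)
print(aggregate(dup_rows, False)["X"])  # (['Action', 'Comedy', 'Action'], 12)
```

Note that in the duplicated case the summed score is inflated either way, so DISTINCT only hides the symptom in the genre list.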
I've got a Neo4j database in which hashtags and tweets are stored.
Every tweet has a topic property, which defines the topic it belongs to.
If I run the following query, I get the most popular hashtags in the db, no matter the topic:
MATCH (h:Hashtag)
RETURN h.text AS hashtag, size( (h)<--() ) AS degree ORDER BY degree DESC
I'd like to get the most popular tags for a single topic.
I tried this:
MATCH (h:Hashtag)<--(t:Tweet{topic:'test'})
RETURN h.text AS hashtag, size( (h)<--(t) ) AS degree ORDER BY degree DESC
and this:
MATCH (h:Hashtag)
RETURN h.text AS hashtag, size( (h)<--(t:Tweet{topic:'test'}) ) AS degree ORDER BY degree DESC
while the next one takes forever to run
MATCH (h:Hashtag), (t:Tweet)
WHERE t.topic='test'
RETURN h.text AS hashtag, size( (h)<--(t) ) AS degree ORDER BY degree DESC
What should I do? Thanks.
In Cypher, when you return the results of an aggregation function you get an implicit "group by" with whatever you are returning alongside the aggregation function. SIZE() is not an aggregation (so you'll get the size of the pattern for each row without the group by/aggregation), but COUNT() is:
MATCH (t:Tweet {topic:'test'})-->(h:Hashtag)
RETURN h, COUNT(*) AS num ORDER BY num DESC LIMIT 10
This query counts Tweet nodes, grouped by Hashtag.
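The implicit grouping can be pictured as a counter keyed by hashtag. The following Python sketch uses made-up rows, and `top_hashtags` is a hypothetical helper. It filters the matched rows by topic and then counts them per hashtag, which is what the COUNT(*) query does.

```python
from collections import Counter

# Each tuple is one matched row: (tweet id, tweet topic, hashtag text). Made-up data.
rows = [
    (1, "test", "#neo4j"),
    (2, "test", "#neo4j"),
    (3, "test", "#graph"),
    (4, "other", "#neo4j"),
]

def top_hashtags(rows, topic, limit=10):
    """Count rows per hashtag for one topic, most frequent first."""
    counts = Counter(tag for _tweet, t, tag in rows if t == topic)
    return counts.most_common(limit)

print(top_hashtags(rows, "test"))  # [('#neo4j', 2), ('#graph', 1)]
```

The tweet with topic "other" is filtered out before counting, so it does not contribute to the degree of "#neo4j".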
Suppose that I have the default Movies database and I want to find the total number of people who have participated in each movie, regardless of their role (i.e. including actors, producers, directors, etc.).
I have already done that using the query:
MATCH (m:Movie)<-[r]-(n:Person)
WITH m, COUNT(n) as count_people
RETURN m, count_people
ORDER BY count_people DESC
LIMIT 3
OK, I have included some extra clauses, but they don't matter for my actual question. From the above query, I get 3 movies.
Q. How can I extend the above query so that I get a graph including all the relationships of these 3 movies (i.e. DIRECTED, ACTED_IN, PRODUCED, etc.)?
I know that I can expand all the relationships of each movie through the buttons on each movie node, but I would like to know whether I can do so through Cypher.
Use an additional OPTIONAL MATCH:
MATCH (m:Movie)<--(n:Person)
WITH m,
COUNT(n) as count_people
ORDER BY count_people DESC
LIMIT 3
OPTIONAL MATCH p = (m)-[r]-(RN) WHERE type(r) IN ['DIRECTED', 'ACTED_IN', 'PRODUCED']
RETURN m,
collect(p) as graphPaths,
count_people
ORDER BY count_people DESC
I have a movie database with users rating movies. I want to find the top 5 most similar users to user 1 (first MATCH which works fine) and recommend him the top rated movies watched by those similar users but not watched by user 1. I get the same movie multiple times even though I have "distinct" in my query. What am I doing wrong?
MATCH (target_user:User {id : 1})-[:RATED]->(m:Movie)
<-[:RATED]-(other_user:User)
WITH other_user, count(distinct m.title) AS num_common_movies, target_user
ORDER BY num_common_movies DESC
LIMIT 5
MATCH (other_user)-[rat_other_user:RATED]->(m2:Movie)
WHERE NOT (target_user)-[:RATED]->(m2)
WITH distinct m2.title as movietitle, rat_other_user.note AS rating,
other_user.id AS watched_by
RETURN movietitle, rating, watched_by
ORDER BY rating DESC
Your dataset probably has many users who have watched and rated the same movies. When you use DISTINCT there, it returns distinct rows, not distinct movie titles. Different users will have rated the unwatched movies differently, and each user has a different id, so rows with the same title still differ in the other columns.
You will have to tune this for your particular use case but you can start from:
MATCH (target_user:User { uid : 1 })-[:RATED]->(m:Movie)<-[:RATED]-(other_user:User)
WITH other_user, count(DISTINCT m.title) AS num_common_movies, target_user
ORDER BY num_common_movies DESC
LIMIT 5
MATCH (other_user)-[rat_other_user:RATED]->(m2:Movie)
WHERE NOT (target_user)-[:RATED]->(m2)
RETURN DISTINCT m2.name AS movietitle, COLLECT(rat_other_user.note) AS ratings,
MAX(rat_other_user.note) AS maxi, AVG(rat_other_user.note) as aver, COLLECT(other_user.name) AS users
ORDER BY aver DESC
I added a console demo here.
Importantly, you are now aggregating your results per movie title.
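The difference between de-duplicating rows and aggregating per title can be seen in a few lines of Python. This is only a model with made-up rows (title, rating, user), not the console demo.

```python
from collections import defaultdict

# One row per (movie, rating, similar user), as the second MATCH produces them.
rows = [("Alien", 5, "ann"), ("Alien", 3, "bob"), ("Heat", 4, "ann")]

# DISTINCT compares whole rows, so the two "Alien" rows are both kept.
distinct_rows = sorted(set(rows))
print(len(distinct_rows))  # 3

# Aggregating groups by title and folds the other columns into each group.
ratings_by_title = defaultdict(list)
for title, rating, _user in rows:
    ratings_by_title[title].append(rating)

summary = {
    title: {"max": max(r), "avg": sum(r) / len(r)}
    for title, r in ratings_by_title.items()
}
print(summary["Alien"])  # {'max': 5, 'avg': 4.0}
print(len(summary))      # 2
```

Each title now appears once, with the individual ratings folded into a maximum and an average.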
I have an embedded neo4j server with ruby on rails.
These are the configurations:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=240M
neostore.propertystore.db.mapped_memory=230M
neostore.propertystore.db.strings.mapped_memory=1200M
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048
There are around 1.5 million (15 lakh) movie nodes. The query below takes around 5 seconds to execute.
MATCH (movie:Movie)
WITH movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
RETURN movie, weight as weight
ORDER BY weight DESC
SKIP skip_count
LIMIT 10
Here skip_count varies as the user scrolls through the results.
Another query, which gets the movies from a particular director, takes around 9 seconds:
MATCH (movie:Movie) , (director:Director)-[:Directed]->(movie)
WHERE director.name =~ '(?i)DIRECTOR_NAME'
WITH movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
RETURN movie, total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10
How can I reduce the query execution time?
Regarding the first query:
You might make the weight ordering explicit in the graph by connecting all movie nodes with a :NEXT_WEIGHT relationship in descending weight order, so that the movies form a linked list.
Your query would look like:
MATCH p=(:Movie {name:'<name of movie with highest weight>'})-[:NEXT_WEIGHT*0..1000]->()
WHERE length(p) >= skip_count AND length(p) < skip_count + limit
WITH p
ORDER BY length(p)
WITH last(nodes(p)) as movie
RETURN movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
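The idea behind the linked list is that the sort is done once, when the list is built, so reading a page only costs about skip_count + limit hops instead of sorting every movie. A minimal Python sketch of that walk follows. The dictionary `next_weight` and the function `page` are hypothetical stand-ins for the :NEXT_WEIGHT relationships and the query.

```python
# Made-up chain, heaviest movie first: A -> B -> C -> D -> E
next_weight = {"A": "B", "B": "C", "C": "D", "D": "E"}

def page(head, next_weight, skip_count, limit):
    """Walk the chain from the heaviest movie and return one page of results."""
    result, node, position = [], head, 0
    while node is not None and len(result) < limit:
        if position >= skip_count:
            result.append(node)
        node = next_weight.get(node)  # None once the end of the chain is reached
        position += 1
    return result

print(page("A", next_weight, 0, 2))   # ['A', 'B']
print(page("A", next_weight, 3, 10))  # ['D', 'E']
print(page("A", next_weight, 5, 2))   # []
```

The trade-off is that the list has to be maintained: whenever a movie's reviews_count or ratings_count changes, its position in the chain may need to be updated.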
Regarding the second query:
You should use an index to speed up the director lookup. Unfortunately, indexes are currently only used for exact lookups, not for regular expression matches. So either make sure the search string has the correct upper/lower case, or store a normalized version in another property:
MATCH (d:Director) SET d.lowerName = LOWER(d.name)
Make sure to have an index on label Director and property lowerName:
CREATE INDEX ON :Director(lowerName)
And your query should look like:
MATCH (director:Director)-[:Directed]->(movie)
WHERE director.lowerName = {directorName} // pass the search string lower-cased
RETURN movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10
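Why the normalized property helps can be sketched in Python, with a dict standing in for the index. This is a model under stated assumptions, not how Neo4j implements indexes. The names are made up, and `regex_scan` and `index_lookup` are hypothetical helpers.

```python
import re

directors = ["Christopher Nolan", "Sofia Coppola", "Christopher Nolan Jr"]

def regex_scan(names, query):
    """Like =~ '(?i)...': every name is tested, and the whole string must match."""
    pattern = re.compile(query, re.IGNORECASE)
    return [n for n in names if pattern.fullmatch(n)]

# Built once, like SET d.lowerName = LOWER(d.name) plus an index on lowerName.
lower_index = {}
for name in directors:
    lower_index.setdefault(name.lower(), []).append(name)

def index_lookup(index, query):
    """Exact lookup on the normalized key; no scan over all names."""
    return index.get(query.lower(), [])

print(regex_scan(directors, "christopher NOLAN"))      # ['Christopher Nolan']
print(index_lookup(lower_index, "christopher NOLAN"))  # ['Christopher Nolan']
```

Both return the same director, but the scan touches every name while the lookup goes straight to the normalized key. The search string has to be normalized the same way as the stored property.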