Speeding up neo4j cypher query - neo4j

I have an embedded neo4j server with ruby on rails.
These are the configurations:
neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=240M
neostore.propertystore.db.mapped_memory=230M
neostore.propertystore.db.strings.mapped_memory=1200M
neostore.propertystore.db.arrays.mapped_memory=130M
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048
There are around 15 lakh (1.5 million) movie nodes. The query below takes around 5 seconds to execute.
MATCH (movie:Movie)
WITH movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
RETURN movie, weight as weight
ORDER BY weight DESC
SKIP skip_count
LIMIT 10
Here skip_count varies as the user scrolls through the results.
Another query, which fetches the movies by a particular director, takes around 9 seconds:
MATCH (movie:Movie) , (director:Director)-[:Directed]->(movie)
WHERE director.name =~ '(?i)DIRECTOR_NAME'
WITH movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
RETURN movie, total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10
How can I reduce the query execution time?

Regarding the first query:
You could make the weight ordering explicit in the graph by connecting all movie nodes with :NEXT_WEIGHT relationships in descending weight order, so that the movies form a linked list.
Your query would then look like:
MATCH p=(:Movie {name:'<name of movie with highest weight>'})-[:NEXT_WEIGHT*0..1000]-()
WHERE length(p) >= skip_count AND length(p) < skip_count + limit
WITH p
ORDER BY length(p)
WITH last(nodes(p)) as movie
RETURN movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
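The linked list has to be built before the query above can use it. A minimal sketch of how that might be done, assuming the weight is the same sum used in the question, that UNWIND and size() are available in your Neo4j version, and that collecting all movies in one transaction fits in memory (with 1.5 million nodes it may need batching). The variable names prev and succ are only illustrative:

```cypher
// Sort all movies by weight, heaviest first, and collect them into one list.
MATCH (movie:Movie)
WITH movie, toInt(movie.reviews_count) + toInt(movie.ratings_count) AS weight
ORDER BY weight DESC
WITH collect(movie) AS movies
// Link each movie to the one that follows it in the sorted list.
UNWIND range(0, size(movies) - 2) AS i
WITH movies[i] AS prev, movies[i + 1] AS succ
CREATE (prev)-[:NEXT_WEIGHT]->(succ)
```

The list has to be maintained whenever reviews_count or ratings_count changes, which is the cost of this approach.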
Regarding the second query:
You should use an index to speed up the director lookup. Unfortunately, index lookups are currently only supported for exact matches. So either make sure the search string has the correct upper/lower case, or store a normalized version in another property:
MATCH (d:Director) SET d.lowerName = LOWER(d.name)
Make sure to have an index on label Director and property lowerName:
CREATE INDEX ON :Director(lowerName)
And your query should look like:
MATCH (director:Director)-[:Directed]->(movie)
WHERE director.lowerName = {directorName} // pass the search string already lower-cased
RETURN movie, toInt(movie.ratings_count) * toInt(movie.reviews_count) * toInt(movie.rating) AS total_weight
ORDER BY total_weight DESC, movie.rating DESC
LIMIT 10

Related

Cypher recommendation score

Looking at the example from GrandStack movies workshop https://github.com/grand-stack/grand-stack-movies-workshop/blob/master/neo4j-database/answers.md
The query proposed for recommended movies here
MATCH (m:Movie) WHERE m.movieId = $movieId
MATCH (m)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(movie:Movie)
WITH m, movie, COUNT(*) AS genreOverlap
MATCH (m)<-[:RATED]-(:User)-[:RATED]->(movie:Movie)
WITH movie,genreOverlap, COUNT(*) AS userRatedScore
RETURN movie ORDER BY (0.9 * genreOverlap) + (0.1 * userRatedScore) DESC LIMIT 3
Wouldn't this query be biased, in the sense that it only calculates userRatedScore for movies that share at least one genre with the movie with ID $movieId?
What would a rewritten query look like that computes both scores independently, so that userRatedScore is still calculated for a movie even if it shares no genres with the movie with ID $movieId?
If you'd like to ignore the weighting provided by Genre, then you could drop the part of the query that seeks it, something like:
MATCH (m:Movie) WHERE m.movieId = $movieId
MATCH (m)<-[:RATED]-(:User)-[:RATED]->(movie:Movie)
WITH movie, COUNT(*) AS userRatedScore
RETURN movie ORDER BY (0.1 * userRatedScore) DESC LIMIT 3
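If both scores should be kept but computed independently, one possible sketch uses OPTIONAL MATCH so that a missing genre overlap or a missing shared rating counts as 0 instead of dropping the row. It assumes the same labels, relationship types and weights as the query in the question. Note that it considers every other :Movie node as a candidate, which may be slow on a large graph:

```cypher
MATCH (m:Movie) WHERE m.movieId = $movieId
// Candidates: every other movie (an assumption; this scans all :Movie nodes).
MATCH (movie:Movie) WHERE movie <> m
// Genre overlap, 0 when no genre is shared.
OPTIONAL MATCH (m)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(movie)
WITH m, movie, COUNT(g) AS genreOverlap
// Users who rated both movies, 0 when there are none.
OPTIONAL MATCH (m)<-[:RATED]-(u:User)-[:RATED]->(movie)
WITH movie, genreOverlap, COUNT(u) AS userRatedScore
// Drop movies that score 0 on both measures.
WHERE genreOverlap > 0 OR userRatedScore > 0
RETURN movie ORDER BY (0.9 * genreOverlap) + (0.1 * userRatedScore) DESC LIMIT 3
```

COUNT(g) and COUNT(u) skip the null values produced by OPTIONAL MATCH, which is why COUNT(*) is not used here.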

Neo4j pipe data

Hi there, I am using Neo4j and am having some trouble. I have a query that should return the cuisine with the highest percentage, like so:
// 1. Find the most_popular_cuisine
MATCH (n:restaurants)
WITH COUNT(n.cuisine) as total
MATCH (r:restaurants)
RETURN r.cuisine , 100 * count(*)/total as percentage
order by percentage desc
limit 1
I am trying to extend this further by taking the top result and matching on it, to get only the nodes with that property, like so:
WITH COUNT(n.cuisine) as total
MATCH (r:restaurants)
WITH r.cuisine as cuisine , count(*) as cnt
MATCH (t:restaurants)
WHERE t.cuisine = cuisine AND count(*) = MAX(cnt)
RETURN t
I think you might be better off refactoring your model a little bit such that a :Cuisine is a label and each cuisine has its own node.
(:Restaurant)-[:OFFERS]->(:Cuisine)
or
(:Restaurant)-[:SPECIALIZES_IN]->(:Cuisine)
Then your query can look like this
MATCH (cuisine:Cuisine)
RETURN cuisine, size((cuisine)<-[:OFFERS]-()) AS number_of_restaurants
ORDER BY number_of_restaurants DESC
I wasn't able to use r.cuisine AS cuisine, count(*) AS cnt in a WITH rather than a RETURN clause, so I had to resort to a slightly more long-winded approach.
There might be a more optimized way to do this, but this works too:
// Get all unique cuisines in a list
MATCH (n:Restaurants)
WITH COUNT(n.Cuisine) as total, COLLECT(DISTINCT(n.Cuisine)) as cuisineList
// Go through each cuisine and find the number of restaurants associated with each
UNWIND cuisineList as c
MATCH (r:Restaurants{Cuisine:c})
WITH total, r.Cuisine as c, count(r) as cnt
ORDER BY cnt DESC
WITH COLLECT({Cuisine: c, Count:cnt}) as list
// For the most popular cuisine, find all the restaurants offering it
MATCH (t:Restaurants{Cuisine:list[0].Cuisine})
RETURN t
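For comparison, a shorter sketch that keeps the property-based model. It assumes the label and property names from the question (restaurants, cuisine) and a Neo4j version that accepts aggregation, ORDER BY and LIMIT in a WITH clause:

```cypher
// Count restaurants per cuisine and keep only the most common one.
MATCH (r:restaurants)
WITH r.cuisine AS cuisine, count(*) AS cnt
ORDER BY cnt DESC
LIMIT 1
// Return every restaurant that offers that cuisine.
MATCH (t:restaurants)
WHERE t.cuisine = cuisine
RETURN t
```

If two cuisines tie for first place, LIMIT 1 keeps only one of them.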

Delete all but the top-k nodes of some query

I try to replicate the behaviour of the following SQL query in neo4j
DELETE FROM history
WHERE history.name = $modelName AND id NOT IN (
SELECT history.id
FROM history
JOIN model ON model.id = history.model_id
ORDER BY created DESC
LIMIT 10
)
I tried a lot of different queries, but basically I'm always struggling to incorporate finding the TOP-k elements. That's the closest I got to a solution.
MATCH (h:HISTORY)-[:HISTORY]-(m:MODEL)
WHERE h.name = $modelName
WITH h
MATCH (t:HISTORY)-[:HISTORY]-(m:MODEL)
WITH t ORDER BY t.created DESC LIMIT 10
WHERE NOT h IN t
DELETE h
With that query I get the error "expected List<T> but was Node" for the line WITH t ORDER BY t.created DESC LIMIT 10.
I tried changing it to COLLECT(t) AS t, but then the error is "expected Any, Map, Node or Relationship but was List<Node>".
So I'm pretty much stuck. Any idea how to write this query in Cypher?
Following that approach, you should reverse the order: first match your top-k nodes and collect them, then match the candidate nodes and keep only those that aren't in the collection.
MATCH (t:HISTORY)-[:HISTORY]-(:MODEL)
WITH t ORDER BY t.created DESC LIMIT 10
WITH collect(t) as saved
MATCH (h:HISTORY)-[:HISTORY]-(:MODEL)
WHERE h.name = $modelName
AND NOT h in saved
DETACH DELETE h

Get the full graph of a query in Neo4j

Suppose that I have the default Movies database and I want to find the total number of people who have participated in each movie, regardless of their role (i.e. including actors, producers, directors, etc.).
I have already done that using the query:
MATCH (m:Movie)<-[r]-(n:Person)
WITH m, COUNT(n) as count_people
RETURN m, count_people
ORDER BY count_people DESC
LIMIT 3
OK, I have included some extra clauses, but they don't matter for my actual question. The above query gives me 3 movies.
Q. How can I extend the above query so that I get a graph including all the relationships of these 3 movies (i.e. DIRECTED, ACTED_IN, PRODUCED, etc.)?
I know that I can expand all the relationships of each movie through the buttons on each movie node, but I would like to know whether I can do so through Cypher.
Use an additional OPTIONAL MATCH:
MATCH (m:Movie)<--(n:Person)
WITH m,
COUNT(n) as count_people
ORDER BY count_people DESC
LIMIT 3
OPTIONAL MATCH p = (m)-[r]-(RN) WHERE type(r) IN ['DIRECTED', 'ACTED_IN', 'PRODUCED']
RETURN m,
collect(p) as graphPaths,
count_people
ORDER BY count_people DESC

Neo4j / Cypher query syntax feedback

I'm developing a kind of reddit service to learn Neo4j.
Everything works fine, I just want to get some feedback on the Cypher query to get the most recent news stories, the author and number of comments, likes and dislikes.
I'm using Neo4j 2.0.
MATCH comments = (n:news)-[:COMMENT]-(o)
MATCH likes = (n:news)-[:LIKES]-(p)
MATCH dislikes = (n:news)-[:DISLIKES]-(q)
MATCH (n:news)-[:POSTED_BY]-(r)
WITH n, r, count(comments) AS num_comments, count(likes) AS num_likes, count(dislikes) AS num_dislikes
ORDER BY n.post_date
LIMIT 20
RETURN *
o, p, q, r are all nodes with the label user. Should the label be added to the query to speed it up?
Is there anything else you see that I could optimize?
I think you're going to want to get rid of the multiple matches. Each MATCH further filters the rows produced by the previous ones, rather than gathering all the information independently.
I would also avoid the paths like comments, and rather do the count on the nodes you are saving. When you do MATCH xyz = (a)-[:COMMENT]-(b) then xyz is a path, which contains the source, relationship and destination node.
MATCH (news:news)-[:COMMENT]-(comment),(news:news)-[:LIKES]-(like),(news:news)-[:DISLIKES]-(dislike),(news:news)-[:POSTED_BY]-(posted_by)
WHERE news.post_date > 0
WITH news, posted_by, count(comment) AS num_comments, count(like) AS num_likes, count(dislike) AS num_dislikes
ORDER BY news.post_date
LIMIT 20
RETURN *
I would do something like this.
MATCH (n:news)-[:POSTED_BY]->(r)
WHERE n.post_date > {recent_start_time}
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
To speed it up and keep Neo4j from searching over all your posts, I would probably index the post_date property (assuming it contains only the date, with no time information), and then send this query for today, yesterday, etc. until you have your 20 posts.
MATCH (n:news {post_date: {day}})-[:POSTED_BY]->(r)
RETURN n, r,
length((n)<-[:COMMENT]-()) AS num_comments,
length((n)<-[:LIKES]-()) AS num_likes,
length((n)<-[:DISLIKES]-()) AS num_dislikes
ORDER BY n.post_date DESC
LIMIT 20
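The index on post_date suggested in this answer could be created as follows. This is a sketch in the Neo4j 2.x schema index syntax, matching the version the question mentions; newer releases use a different CREATE INDEX form:

```cypher
// Schema index on the post_date property of :news nodes (Neo4j 2.x syntax).
CREATE INDEX ON :news(post_date)
```

With this index in place, the exact-match lookup on post_date in the query above can be answered from the index rather than by scanning every :news node.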

Resources