I am seeing extremely high query times and I'm unable to pinpoint the issue.
I have a graph database with 6685 nodes, 26407 properties and 22921 relationships, running on an Amazon EC2 instance with 1.7GB RAM.
My use case is to map people to their various interest points and, for a given user, find the people who have common interests with him.
I have data about 500 people in my db, and each person has on average a little more than 100 different interest points related to him.
1) When I run this cypher query:
START u=node(5) MATCH (u)-[:interests]->(i)<-[:interests]-(o) RETURN o;
Here node(5) is a user node. So, I am trying to find all users who have the same ":interests" relation as user (u).
This query returns 2557 rows and takes about 350ms.
2) When I sprinkle in a few extra MATCH conditions, the query time degrades exponentially.
For example, if I want to find all users who have common interests with user (u) = node(5), and also share the same hometown, I write:
START u=node(5)
MATCH (u)-[:interests]->(i)<-[:interests]-(o)
WITH u,o,i
MATCH (u)-[:hometown]->(h)<-[:hometown]-(o)
RETURN u, o, i, h;
This query returns 755 rows and takes about 2500ms!
3) If I add more constraints to the MATCH, like same gender, same alma mater etc., query times progressively worsen to >10,000 ms.
What am I doing wrong here?
Could you try stating the pattern as a whole in your first MATCH clause, i.e. MATCH (u)-[:interests]->(i)<-[:interests]-(o)-[:hometown]->(h)<-[:hometown]-(u)?
I have a table on my website that shows a list of links and the number of times they have been visited. Here's the Cypher query I use to get that data:
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
OPTIONAL MATCH (l)<-[v:VISITED]-(:VISIT)
RETURN l, COUNT(v) AS count
LIMIT 10
I create a VISIT node for each visit to a URL in order to store analytics data for that visit. So in the above code, I grab the links that a user has created and count the visits for each one.
The problem is that the above query is not performant. Now that the data has grown large, it takes at least 8 seconds to resolve.
Any ways to improve this query?
For the :VISITED relationships, if those only connect :VISIT nodes to :URL nodes, then you can use the size() function on the pattern, excluding the node label, which will get the degree information from the :URL node itself without having to expand out (you can confirm this by doing a PROFILE or EXPLAIN of the plan and expand all elements, look for GetDegreePrimitive in the Projection operation).
Also, since you're using LIMIT 10 without any kind of ordering, it's better to do the LIMIT earlier so you only perform subsequent operations with the limited set of nodes rather than doing all the work for all the nodes then only keeping 10.
MATCH (u:USER {email: $email})-[:CREATED]->(l:URL)
WITH l
LIMIT 10
RETURN l, size((l)<-[:VISITED]-()) as count
Also, as noted by cybersam, you'll absolutely want an index on :USER(email) so lookup to your specific :USER node is fast.
In addition to @InverseFalcon's suggestions, you should either create an index or uniqueness constraint on :USER(email), to avoid having to scan through all USER nodes to find the one of interest.
Suppose I have a million users and I search them using the IN operator with more than 1000 custom ids, which have a unique index.
For example, in the movie database provided by Neo4j:
Let's say I need to get all movies that my list of actors (> 1000) acted in, as distinct movie results ordered by release date.
Is it really a good idea to run that operation on the database, and what are the time complexities if I execute it on a single-node instance versus an HA cluster?
This will give you a rough guide on the computational complexity involved in your calculation.
For each of your Actors, Neo will look for all the Acted_In relationships going from that node. Let's assume that the average number of Acted_In relationships is 4 per Actor.
Therefore Neo will require 4 traversals per Actor.
Therefore for 1000 Actors that will be 4000 traversals.
Which for Neo is not a lot (they claim to do about 1 million a second, but of course this depends upon hardware).
Then, the Distinct aspect of the query is trivial for Neo as it knows which Nodes it has visited, so Neo would automatically have the unique list of Movie nodes, so this would be very quick.
If the Release date of the movie is indexed in Neo, the ordering of the results would also be very quick.
So theoretically this query should run quickly (well under a second) and have minimal impact on the database.
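The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the average of 4 relationships per actor and the rate of 1 million traversals per second are the assumptions quoted in the answer, not measured values.

```python
# Back-of-the-envelope estimate; every figure here is an assumption from the text.
actors = 1000                        # size of the IN list
avg_acted_in_per_actor = 4           # assumed average degree per actor
traversals_per_second = 1_000_000    # claimed ballpark, hardware dependent

traversals = actors * avg_acted_in_per_actor
estimated_ms = traversals * 1000 / traversals_per_second

print(traversals)    # 4000
print(estimated_ms)  # 4.0
```

Even if the real average degree were ten times higher, the estimate would stay in the tens of milliseconds for the traversal part alone.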
Here is what I'd do, I would start traversing from the actor with the lowest degree, i.e. the highest selectivity of your dataset. Then find the movies he acted in and check those movies against the rest of the actors.
The second option might be more efficient implementation wise. (There is also another trick that can speed up that one even more, let me know via email when you have the dataset to test it on).
MATCH (n:Actor) WHERE n.id IN {ids}
WITH n, SIZE( (n)-[:ACTED_IN]->() ) as degree
ORDER BY degree ASC
WITH collect(n) as actors
WITH head(actors) as first, tail(actors) as rest, size(actors)-1 as number
// either
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number AND ALL(a in rest WHERE (a)-[:ACTED_IN]->(m))
RETURN m;
// or
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number
MATCH (m)<-[:ACTED_IN]-(a)
WHERE a IN rest
WITH m,count(*) as c, number
WHERE c = number
RETURN m;
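The idea behind both variants (start from the most selective actor, then check his movies against the remaining actors) can be illustrated outside the database. The sketch below is plain Python over a hypothetical in-memory dict called acted_in; it is not how Neo4j executes the query, only the same strategy in miniature.

```python
def movies_with_all_actors(acted_in, actor_ids):
    """Return the movies in which every actor in actor_ids appears.

    acted_in: hypothetical dict mapping actor id -> set of movie ids.
    """
    # Start from the actor with the fewest movies (lowest degree).
    ordered = sorted(actor_ids, key=lambda a: len(acted_in.get(a, ())))
    if not ordered:
        return set()
    first, rest = ordered[0], ordered[1:]
    candidates = set(acted_in.get(first, ()))
    # Check the candidate movies against each remaining actor.
    for actor in rest:
        candidates &= acted_in.get(actor, set())
        if not candidates:
            break  # no movie can satisfy all actors any more
    return candidates


acted_in = {
    "a1": {"m1", "m2", "m3"},
    "a2": {"m2", "m3"},
    "a3": {"m2", "m3", "m4"},
}
print(sorted(movies_with_all_actors(acted_in, ["a1", "a2", "a3"])))  # ['m2', 'm3']
```

Starting from the smallest set keeps the candidate list short from the first step, which is the same reason the query orders the actors by degree.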
I don't know if this makes sense using Cypher or graph traversal, but I am trying to do a sort of "shortest path" query based not on weighted relationships but on aggregated properties.
Assume I have nodes labeled People, and they all visit different homepages through a VISITS relationship to the homepage node. Each homepage node has hit stats depending on its popularity. Now I would like to match people that have a VISITS relationship to a homepage until I reach a maximum of X exposures (hits).
Why? Because then I know an "expected" exposure strategy for a certain group of people.
Something like
Do
MATCH (n:People)-[:VISITS]-(sites)
while (reduce (x)<100000)
Of course, this "do while" is not something I have seen in the Cypher syntax, but wouldn't it be useful? Or should this be done at the app level, by returning a list sorted DESC and doing the math in the application? Maybe it should also be combined with some fallback case if the loop can't be satisfied.
MATCH (n:People)-[:VISITS]-(sites)
WITH reduce(hits=0, x IN collect(sites.dailyhits)| hits + x) AS totalhits
RETURN totalhits;
This returns the correct aggregated hits value (for all sites), but I would like the function to run over each matched pattern until it satisfies a value and then return the match. (Of course I would miss other possible combinations of pages because the match never traverses the entire graph, but at least I would get a list of pages that meets the requirement, if that makes sense.)
Thanks!
Not sure how you'd aggregate, but there are several aggregation functions (avg, sum, etc). And... you can pass these to a 2nd part of the cypher query, with a WITH clause.
That said: Cypher also supports the ability to sort a result (ORDER BY), and the ability to limit the number of results given (LIMIT). I don't know what you'd sort by, but... just for fun, let's sort it arbitrarily on something:
MATCH (n:People)-[v:VISITS]->(site:Site)
WHERE site.url= "http://somename.com"
RETURN n
ORDER BY v.VisitCount DESC
LIMIT 1000
This would cap your return set at 1,000 people, for people who visit a given site.
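For the application-side option raised in the question (return the sites sorted by hits and do the math in the app), a minimal sketch in Python might look like the following. The function name take_until_cap and the (name, daily_hits) row shape are made up for illustration; the rows would come from whatever query returns the sites and their dailyhits.

```python
def take_until_cap(sites, max_hits):
    """Pick sites in descending order of hits until the cap would be exceeded.

    sites: list of (name, daily_hits) tuples, e.g. rows returned by a query.
    Returns (selected_names, total_hits).
    """
    selected, total = [], 0
    for name, hits in sorted(sites, key=lambda s: s[1], reverse=True):
        if total + hits > max_hits:
            break  # the "while" condition no longer holds
        selected.append(name)
        total += hits
    return selected, total


rows = [("a.example", 60000), ("b.example", 30000),
        ("c.example", 20000), ("d.example", 5000)]
print(take_until_cap(rows, 100000))  # (['a.example', 'b.example'], 90000)
```

This stops at the first site that would overshoot the cap, so it can miss smaller sites that would still fit (d.example here). Whether that matters depends on the exposure strategy you are after.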
neo4j noob here, on Neo4j 2.0.0 Community
I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.
Let's say that I've got a specific movie (movie1) in mind.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;
I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies as the people who liked movie1. I call these generation 2 users.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;
This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".
I was hoping someone could help me out and explain to me the issue.
Is Neo4j not meant to handle a query of this depth in a performant manner?
Is there something wrong with my Cypher query?
Thanks for taking the time to read.
That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. @Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.
The branching factor of your tree can be estimated with two values:
u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)
For an estimate, your first step will return u users. On the next step, for each of those users you get all the movies they liked, followed by all the users that liked each of those movies:
gen(1) => u
gen(2) => u * (m * u)
For each generation you'll tack on another m*u, so your third generation is:
gen(3) => u * (m * u) * (m * u)
Or more generically:
gen(n) => u^n * m^(n-1)
You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.
To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.
To keep things simple (and assuming your data set is much smaller), let's use:
m = 20 movie ratings/user
u = 100 users have rated/movie
Your query in gen-3 (without distincts) will return:
gen(3) = 100^3 * 20^2
= 400,000,000
400 million nodes (users)
Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).
Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:
400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 KB
= 22,888 MB
= 22.35 GB
So by my conservative estimates, your query requires 22 gigabytes of storage. It seems quite reasonable that Neo4j would run out of memory.
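The estimate above can be reproduced in a few lines of Python. The branching factors (u = 100, m = 20) and the 60 bytes per node are the assumptions stated in this answer, not measurements from the actual graph.

```python
def gen(n, u, m):
    """Estimated number of rows at generation n: u^n * m^(n-1)."""
    return u ** n * m ** (n - 1)


likes, users, movies = 60_000, 2_700, 24_000
print(round(likes / users, 1))    # 22.2 likes per user (plain average)
print(round(likes / movies, 1))   # 2.5 likes per movie (plain average)

rows = gen(3, u=100, m=20)        # assumed branching factors
print(rows)                       # 400000000
print(round(rows / users))        # 148148 copies of each user on average

size_bytes = rows * 60            # assumed 60 bytes per node
print(round(size_bytes / 1024 ** 3, 2))  # 22.35 (GB)
```

Changing u or m in the call to gen shows how quickly the result grows: each extra generation multiplies the row count by another m * u.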
My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:
what users rate movies most like me?
what users rated most of the same movies as I rated?
what movies have users that have rated similar movies to me watched that I haven't watched yet?
Cheers,
cm
To minimize the explosion that @cod3monk3y talks about, I'd limit the number of intermediate results.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
or even like this
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;
If you want to, you can use "profile start ..." in the neo4j shell to see how many hits / db-rows you create in between, starting with your query and then these two.
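To see why the intermediate DISTINCT steps help, here is a small Python sketch on hypothetical toy data (3 users who each like the same 3 movies). It compares enumerating every path with deduplicating after each hop. As a simplification it ignores Cypher's rule that a relationship is not traversed twice within one pattern, so the path count is illustrative only.

```python
from collections import defaultdict

# Hypothetical toy data: every user likes every movie.
likes = [(u, m) for u in ("u1", "u2", "u3") for m in ("m1", "m2", "m3")]

movies_of = defaultdict(set)   # user  -> movies the user likes
users_of = defaultdict(set)    # movie -> users who like it
for user, movie in likes:
    movies_of[user].add(movie)
    users_of[movie].add(user)


def expand_paths(start_movie):
    """One row per path movie <- user -> movie <- user -> movie (no DISTINCT)."""
    rows = []
    for u1 in users_of[start_movie]:
        for m1 in movies_of[u1]:
            for u2 in users_of[m1]:
                for m2 in movies_of[u2]:
                    rows.append(m2)
    return rows


def expand_distinct(start_movie):
    """Same hops, but deduplicated after each step."""
    users1 = users_of[start_movie]
    movies1 = {m for u in users1 for m in movies_of[u]}
    users2 = {u for m in movies1 for u in users_of[m]}
    return {m for u in users2 for m in movies_of[u]}


print(len(expand_paths("m1")))         # 81
print(sorted(expand_distinct("m1")))   # ['m1', 'm2', 'm3']
```

Both functions reach the same three movies, but the first one builds 81 rows to get there while the second never holds more than three items per step.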
Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.
The problem with the MATCH clause you are using is that sometimes Cypher will find different patterns where 'usersGen2' is the same as 'usersLikingMovie1' and where 'movie1' is the same as 'moviesGen1' across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the Graph, holds it in memory for the duration of the query, and then returns all the moviesGen2 nodes, which could actually be the same node n number of times.
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
If you explicitly tell Cypher that the movies and users should be different for each match pattern, it should solve the issue. Try this? Additionally, the DISTINCT keyword will make sure you only grab each 'moviesGen2' node once.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;
Additionally, in 2.0, the START clause is not required, so you can leave it out altogether (however, only if you are NOT using a legacy index and use labels instead)...
Hope this works... Please correct my answer if there are syntax errors...
I'm trying to understand efficient usage patterns with neo4j, specifically in reference to high degree nodes. To give an idea of what I'm talking about, I have User nodes that have attributes which I have modeled as nodes. So there are relationships in my table such as
(:User)-[:HAS_ATTRIB]->(:AgeCategory)
and so on and so forth. The problem is that some of these AgeCategory nodes are very high degree, on the order of 100k, and queries such as
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
(matching all users that share the same age category and favorite language as AAA111) are very, very slow, since you have to run over the FavoriteLanguage linked list once for every element in the AgeCategory linked list (or at least that's how I understand it).
I think it's pretty clear from the fact that this query takes minutes to resolve that I'm doing something wrong, but I am curious what the right procedure for dealing with queries like this is. Should I pull down the matching users from each query individually and compare them with an in-memory hash? Is there a way to put an index on the relationships on a node? Is this even a good idea for a schema to begin with?
My intuition is it would be more efficient to first retrieve the two end points (AgeCategory and FavoriteLanguage) for the given node u, and then query the middle node v for a path with these two fixed end points.
To prove that, I created a test graph with the following components,
A node u:User with u.uid = 'AAA111'
A node c:AgeCategory
A node l:FavoriteLanguage
A relationship between u and c, u-[:HAS_AGE]->c
A relationship between u and l, u-[:LIKE_LANGUAGE]->l
100,000 nodes v, each of which shares the same c:AgeCategory and l:FavoriteLanguage with the node u, that is, each v connects to l and c, v-[:HAS_AGE]->c, v-[:LIKE_LANGUAGE]->l
I ran the following query 10 times and got an average running time of 10500 millis.
MATCH (l:FavoriteLanguage)<-[:LIKE_LANGUAGE]-(u:User)-[:HAS_AGE]->(c:AgeCategory)
WHERE u.uid = 'AAA111'
WITH l, c
MATCH (l)<-[:LIKE_LANGUAGE]-(v:User)-[:HAS_AGE]->(c)
WHERE v.uid <> 'AAA111'
RETURN v.uid
With 10,000 v nodes, this query takes around 2000 millis, while your query takes about 27000 millis.
With 100,000 v nodes, this query takes around 10500 millis, while your original query seems to take forever.
So you might give this query a try and see if it can improve the performance with your graph.
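As for the in-memory alternative raised in the question (pull the matching users from each query separately and compare them in a hash), a rough Python sketch could look like this. The function name users_sharing_all and the sample uid lists are made up for illustration; each list stands for the result of one single-attribute query.

```python
def users_sharing_all(result_sets, exclude=None):
    """Intersect several lists of user ids, starting from the smallest one."""
    sets = sorted((set(r) for r in result_sets), key=len)
    if not sets:
        return set()
    common = set(sets[0])
    for s in sets[1:]:
        common &= s
    common.discard(exclude)  # drop the user we started from, if present
    return common


# Hypothetical results of two separate queries, one per attribute.
same_age = ["AAA111", "BBB222", "CCC333", "DDD444"]
same_language = ["AAA111", "CCC333", "EEE555"]
print(sorted(users_sharing_all([same_age, same_language], exclude="AAA111")))  # ['CCC333']
```

This trades one expensive pattern match for several cheap ones plus a set intersection, at the cost of shipping every intermediate uid list to the client.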