Neo4j Performance - IN Operator Cypher Query - neo4j

If I had a million users and if I search them using IN Operator with more than 1000 custom ids which are unique indexed.
For example,in movie database given by neo4j
Let's say I need to get all movies where my list of actors ( > 1000) should acted in that movie and ordered by movie released date and distinct movie results.
Is that really good to have that operation on database and what are the time complexities if I execute that in single node instance and ha cluster.

This will give you a rough guide on the computational complexity involved in your calculation.
For each of your Actors Neo will look for all the Acted_In relationships going from that node. Lets assume that the average number of Acted_In relationships is 4 per Actor.
Therefore Neo will require 4 traversals per Actor.
Therefore for 1000 Actors that will be 4000 traversals.
Which for Neo is not a lot (they claim to do about 1 million a second, but of course this depends upon hardware)
Then, the Distinct aspect of the query is trivial for Neo as it knows which Nodes it has visited, so Neo would automatically have the unique list of Movie nodes, so this would be very quick.
If the Release date of the movie is indexed in Neo the ordering of the results would also be very quick.
So theoretically this query should run quickly (well under a second) and have minimal impact on the database

Here is what I'd do, I would start traversing from the actor with the lowest degree, i.e. the highest selectivity of your dataset. Then find the movies he acted in and check those movies against the rest of the actors.
The second option might be more efficient implementation wise. (There is also another trick that can speed up that one even more, let me know via email when you have the dataset to test it on).
MATCH (n:Actor) WHERE n.id IN {ids}
WITH n, SIZE( (n)-[:ACTED_IN]->() ) as degree
ORDER BY degree ASC
WITH collect(n) as actors WITH head(actors) as first, tail(actors) as rest, size(actors)-1 as number
// either
MATCH (n)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]->() ) > number AND ALL(a in rest WHERE (a)-[:ACTED_IN]->(m))
RETURN m;
// or
MATCH (n)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]->() ) > number
MATCH (m)<-[:ACTED_IN]-(a)
WHERE a IN rest
WITH m,count(*) as c, number
WHERE c = number
RETURN m;

Related

How to calculate custom degree based on the node label or other conditions?

I have a scenario where I need to calcula a custom degree between the first node (:employee) where it should only be incremented to another node when this node's label is :natural or :relative, but not when it is :legal.
Example:
The thing is I'm having trouble generating this custom degree property as I needed it.
So far I've tried playing with FOREACH and CASE but had no luck. The closest I got to getting some sort of calculated custom degree is this:
match p = (:employee)-[*5..5]-()
WITH distinct nodes(p) AS nodes
FOREACH(i IN RANGE(0, size(nodes)) |
FOREACH(node IN [nodes[i]] |
SET node.degree = i
))
return *
limit 1
But even this isn't right, as despite having 5 distinct nodes, I get SIZE(nodes) = 6, as the :legal node is accounted for twice for some reason.
Does anyone know how to achieve my goal within a single cypher query?
Also, if you know why the :legal node is account for twice, please let me know. I suspect it is because it has 2 :natural nodes related to it, but don't know the inner workings that make it appear twice.
More context:
:employee nodes are, well, employees of an organization
:relative nodes are relatives to an employee
:natural nodes are natural persons that may or may not be related to a :legal
:legal nodes are companies (legal persons) that may, or may not, be related to an :employee, :relative, :natural or another :legal on an IS_PARTNER relationship when, in real life, they are part of the board of directors or are shareholders of that company (:legal).
custom degree is what I aim to create and will define how close one node is to another given some conditions to this project (specified below).
All nodes have a total_contracts property that are the total amount of money received through contracts.
The objective is to find any employees with relationships to another node that has total_contracts > 0 and are up to custom degree <= 3, as employees may be receiving money from external sources, when they shouldn't.
As for why I need this custom degree ignoring the distance when it is a :legal node, is because we threat companies as the same distance as the natural person that is a partner.
On the illustrated example above, the employee has a son, DIEGO, that is a shareholder of a company (ALLURE) and has 2 other business partners (JOSE and ROSIEL). When I ask what's the degree of the son to the employee, I should get 1, as they are directly related; when I ask whats the degree of JOSE to the employee I should get 2, as JOSE is related to DIEGO through ALLURE and we shouldn't increment the custom degree when it is a company, only when its a person.
The trick with this type of graph is making sure we avoid paths that loop back to the same nodes (which is definitely going to happen quite a lot because you're using multiple relationships between nodes instead of just one...you may want to make sure this is necessary in your model).
The easiest way to do that is via APOC Procedures, as you can adjust the uniqueness of traversals so that nodes are unique in each path.
So for example, for a specific start node (let's say the :employee has empId:1 just for the sake of mocking up a lookup of the node, we'll calculate a degree for all nodes within 5 hops of the starting node. The idea here is that we'll take the length of the path (the number of hops) - the number of :legal nodes in the path (by filtering the nodes in the path for just :legal nodes, then getting the size of that filtered list).
MATCH (e:employee {empId:1})
CALL apoc.path.expandConfig(e, {minLevel:1, maxLevel:5, uniqueness:'NODE_PATH'}) YIELD path
WITH e, last(nodes(path)) as endNode,
length(path) - size([x in nodes(path) WHERE x:legal]) as customDegree
RETURN e, endNode, customDegree

Finding actors not connected to Kevin Bacon, efficiently

Using neo4j cypher, what query would efficiently find actors who are not connected to Kevin Bacon? We can say that 'not connected' means that an actor is not connected to Kevin Bacon by at least 10 hops for simplicity.
Here is what I have attempted:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*1..10]-(h:Actor) with h
MATCH (a)-[:ACTS_IN]->(m)
WHERE a <> h
RETURN DISTINCT h.name
However, this query runs for 3 days. How can I do this more efficiently?
(A) Your first MATCH finds every actor that is connected within 10 hops to Kevin Bacon. The result of this clause is a number (M) of rows (and if an actor is connected in, say, 7 different ways to Kevin, then that actor is represented in 7 rows).
(B) Your second MATCH finds every actor that has acted in a movie. If this MATCH clause were standalone, then it would require N rows, where N is the number of ACTS_IN relationships (and if an actor acted in, say, 9 movies, then that actor would be represented in 9 rows). However, since the clause comes right after another MATCH clause, you get a cartesian product and the actual number of result rows is M*N.
So, your query requires a lot of storage and performs a (potentially large) number of redundant comparisons, and your results can contain duplicate names. To reduce the storage requirements and the number of actor comparisons (in your WHERE clause): you should cause the results of A and B to have distinct actors, and eliminate the cartesian product.
The following query should do that. It first collects a single list (in a single row) of every distinct actor that is connected within 10 hops to Kevin Bacon (as hs), and then finds all (distinct) actors not in that collection:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*..10]-(h:Actor)
WITH COLLECT(DISTINCT h) AS hs
MATCH (a:Actor)
WHERE NOT a IN hs
RETURN a.name;
(This query also saves even more time by not bothering to test whether an actor has acted in a movie.)
The performance would still depend on how long it takes to perform the variable length path search in the first MATCH, however.

Cypher query to get subsets of different node labels, with relations

Let's assume this use case;
We have few nodes (labeled Big) and each having a simple integer ID property.
Each Big node has a relation with millions of (labeled Small) nodes.
such as :
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node in the range of ids between 4-7, get me 10 of Small nodes that belongs to it.
The supposed result would give 2 Big nodes, 20 Small nodes, and 20 Relations
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of Small nodes that belongs to them
What I've tried but failed (it only shows 1 big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire previous flow.
You need to use aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big) Where 4<b.bigID<7
With b,
collect(distinct s)[..10] as smalls
return b,
smalls

Neo4j crashes on 4th degree Cypher query

neo4j noob here, on Neo4j 2.0.0 Community
I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.
Let's say that I've got a specific movie (movie1) in mind.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;
I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies that as the people who liked movie1. I call these generation 2 users
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;
This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".
I was hoping someone could help me out and explain to me the issue.
Is Neo4j not meant to handle a query of this depth in a performant manner?
Is there something wrong with my Cypher query?
Thanks for taking the time to read.
That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. #Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.
The branching factor of your tree can be estimated with two values:
u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)
For an estimate, your first step will return m users. On the next step, for each user you get all the movies each of them liked followed by all the users that liked all of those movies:
gen(1) => u
gen(2) => u * (m * u)
For each generation you'll tack on another m*u, so your third generation is:
gen(3) => u * (m * u) * (m * u)
Or more generically:
gen(n) => u^n * m^(n-1)
You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.
To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.
To keep things simple (and assuming your data set is much smaller), let's use:
m = 20 movie ratings/user
u = 100 users have rated/movie
Your query in gen-3 (without distincts) will return:
gen(3) = 100^3 * 20^2
= 400,000,000
400 million nodes (users)
Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).
Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:
400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 kb
= 22,888 Mb
= 22.35 Gb
So by my conservative estimates, your query requires 22 Gigabytes of storage. This seems quite reasonable that Neo4j would run out of memory.
My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:
what users rate movies most like me?
what users rated most of the same movies as I rated
what movies have users that have rated similar movies to me watched that I haven't watched yet?
Cheers,
cm
To minimize the explosion that #cod3monk3y talks about, I'd limit the number of intermediate results.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
or even like this
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;
if you want to, you can use "profile start ..." in the neo4j shell to see how many hits / db-rows you create in between, starting with your query and then these two.
Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.
The problem with the MATCH clause you are using is that sometimes Cypher will find different patterns where 'usersGen2' is the same as 'usersLikingMovie1' and where 'movie1' is the same as 'movieGen1' across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the Graph, is holding it in memory for the duration of the query, and then returning all the moviesGen2 nodes, which could actually be the same node n number of times.
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
If you explicitly tell Cypher that the movies and users should be different for each match pattern it should solve the issue. Try this? Additionally, The DISTINCT parameter will make sure you only grab each 'moviesGen2' node once.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;
Additionally, in 2.0, the start clause is not required. So you can actually leave out the START clause all together (However - only if you are NOT using a legacy index and use labels)...
Hope this works... Please correct my answer if there are syntax errors...

High degree nodes in neo4j

I'm trying to understand efficient usage patterns with neo4j, specifically in reference to high degree nodes. To give an idea of what I'm talking about, I have User nodes that have attributes which I have modeled as nodes. So there are relationships in my table such as
(:User)-[:HAS_ATTRIB]->(:AgeCategory)
and so on and so forth. The problem is that some of these AgeCategory nodes are very high degree, on the order of 100k, and queries such as
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
(matching all users that share the same age category and favorite language as AAA111) are very, very slow, since you have to run over the FavoriteLanguage linked list once for every element in the AgeCategory linked list (or at least that's how I understand it).
I think it's pretty clear from the fact that this query takes minutes to resolve that I'm doing something wrong, but I am curious what the right procedure for dealing with queries like this is. Should I pull down the matching users from each query individually and compare them with an in-memory hash? Is there a way to put an index on the relationships on a node? Is this even a good idea for a schema to begin with?
My intuition is it would be more efficient to first retrieve the two end points (AgeCategory and FavoriteLanguage) for the given node u, and then query the middle node v for a path with these two fixed end points.
To prove that, I created a test graph with the following components,
A node u:User with u.uid = 'AAA111'
A node c:AgeCategory
A node l:FavoriteLanguage
A relationship between u and c, u-[:HAS_AGE]->c
A relationship between u and l, u-[:LIKE_LANGUAGE]->l
100,000 nodes v, each of which shar the same c:AgeCategory and l:FavoriteLanguage with the ndoe u, that is each v connects to l and c, v-[:HAS_AGE]->c, v-[:LIKE_LANGUAGE]->l
I run the following query 10 times, and got the average running time 10500 millis.
Match l:FavoriteLanguage<-[:LIKE_LANGUAGE]-u:User-[:HAS_AGE]->c:AgeCategory
Where u.uid = 'AAA111'
With l,c
Match l<-[:LIKE_LANGUAGE]-v:User-[:HAS_AGE]->c
Where v.uid <> 'AAA111'
Return v.uid
With 10,000 v nodes, this query takes around 2000 millis, your query takes about 27000 millis.
With 100,000 v node, this query takes around 10500 millis, it seems to take forever with your original query.
So you might give this query a try and see if it can improve the performance with your graph.

Resources