Finding actors not connected to Kevin Bacon, efficiently - neo4j

Using neo4j cypher, what query would efficiently find actors who are not connected to Kevin Bacon? For simplicity, say that 'not connected' means an actor cannot be reached from Kevin Bacon within 10 hops.
Here is what I have attempted:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*1..10]-(h:Actor) with h
MATCH (a)-[:ACTS_IN]->(m)
WHERE a <> h
RETURN DISTINCT h.name
However, this query runs for 3 days. How can I do this more efficiently?

(A) Your first MATCH finds every actor that is connected within 10 hops to Kevin Bacon. The result of this clause is a number (M) of rows (and if an actor is connected in, say, 7 different ways to Kevin, then that actor is represented in 7 rows).
(B) Your second MATCH finds every actor that has acted in a movie. If this MATCH clause were standalone, it would produce N rows, where N is the number of ACTS_IN relationships (and if an actor acted in, say, 9 movies, then that actor would be represented in 9 rows). However, since the clause comes right after another MATCH clause, you get a Cartesian product, and the actual number of result rows is M*N.
So, your query requires a lot of storage and performs a (potentially large) number of redundant comparisons, and your results can contain duplicate names. To reduce the storage requirements and the number of actor comparisons (in your WHERE clause): you should cause the results of A and B to have distinct actors, and eliminate the cartesian product.
The following query should do that. It first collects a single list (in a single row) of every distinct actor that is connected within 10 hops to Kevin Bacon (as hs), and then finds all (distinct) actors not in that collection:
MATCH (kb:Actor {name:'Kevin Bacon'})-[*..10]-(h:Actor)
WITH COLLECT(DISTINCT h) AS hs
MATCH (a:Actor)
WHERE NOT a IN hs
RETURN a.name;
(This query also saves even more time by not bothering to test whether an actor has acted in a movie.)
The performance would still depend on how long it takes to perform the variable length path search in the first MATCH, however.
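To illustrate the "collect the reachable set once, then exclude it" idea outside of Cypher, here is a minimal Python sketch. It is not how Neo4j evaluates the query (Cypher enumerates paths, while this breadth-first search visits each node once), and the toy graph, the `within_hops` helper, and the actor names other than Kevin Bacon are assumptions made up for the example.

```python
from collections import deque

def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return seen

# Hypothetical toy data: undirected actor/movie adjacency lists.
graph = {
    "Kevin Bacon": ["Footloose"],
    "Footloose": ["Kevin Bacon", "Lori Singer"],
    "Lori Singer": ["Footloose"],
    "Isolated Actor": ["Obscure Film"],
    "Obscure Film": ["Isolated Actor"],
}
actors = {"Kevin Bacon", "Lori Singer", "Isolated Actor"}

# Step 1: one set of everything connected (the role of COLLECT(DISTINCT h)).
connected = within_hops(graph, "Kevin Bacon", 10)
# Step 2: a single pass over all actors (the role of WHERE NOT a IN hs).
not_connected = sorted(actors - connected)
print(not_connected)
```

Because the connected set is built once and each actor is checked against it once, there is no M*N blow-up of the kind described in (B).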

Related

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has a relation with millions of nodes (labeled Small),
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I phrase a Cypher query to represent the following in natural language:
For each Big node in the range of ids between 4 and 7, get me 10 of the Small nodes that belong to it.
The expected result would give 2 Big nodes, 20 Small nodes, and 20 relations.
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of the Small nodes that belong to them
What I've tried but failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return you the right answer, you'll still pay the cost for the expansion to those millions of nodes. You need a solution that will lazily get the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The query will be run per-row, so if you limit the rows first, you can call this and use LIMIT within the subquery, and it will properly limit to 10 results per row, lazily expanding so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire previous flow.
You need to use aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
WHERE 4 < b.bigID < 7
WITH b, collect(DISTINCT s)[..10] AS smalls
RETURN b, smalls
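The difference between a limit on the whole result and a limit per Big node can be modelled in a few lines of Python. This is only an analogy for the Cypher behaviour discussed in this thread; the `small_nodes` generator and the id values are hypothetical stand-ins for the real data.

```python
from itertools import islice

def small_nodes(big_id):
    """Hypothetical lazy source of the Small nodes belonging to one Big node."""
    n = 0
    while True:  # stands in for "millions" of Small nodes
        yield f"small-{big_id}-{n}"
        n += 1

big_ids = [3, 5, 6, 8]
selected = [b for b in big_ids if 4 < b < 7]

# Global limit: the first 10 rows of the flattened stream all belong to id 5.
global_limited = list(
    islice(((b, s) for b in selected for s in small_nodes(b)), 10)
)

# Per-group limit: take 10 per Big node; each generator is only advanced 10 times.
per_group = {b: list(islice(small_nodes(b), 10)) for b in selected}

print({b for b, _ in global_limited})
print({b: len(s) for b, s in per_group.items()})
```

The per-group version never walks the full list of Small nodes, which is the laziness the subquery approach aims for; a collect-then-slice approach would correspond to building the whole list first and then keeping 10.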

Neo4j Performance - IN Operator Cypher Query

Suppose I have a million users and I search them using the IN operator with more than 1000 custom ids, on a property that has a unique index.
For example, in the movie database provided by Neo4j:
Let's say I need all the movies in which the actors in my list (> 1000) acted, returned as distinct movies ordered by release date.
Is it a good idea to run that operation on the database, and what are the time complexities if I execute it on a single-node instance and on an HA cluster?
This will give you a rough guide to the computational complexity involved in your calculation.
For each of your actors, Neo will look for all the ACTED_IN relationships going from that node. Let's assume that the average number of ACTED_IN relationships is 4 per actor.
Therefore Neo will require 4 traversals per actor.
Therefore for 1000 actors that will be 4000 traversals,
which for Neo is not a lot (they claim to do about 1 million a second, but of course this depends upon hardware).
Then, the DISTINCT aspect of the query is trivial for Neo as it knows which nodes it has visited, so Neo would automatically have the unique list of Movie nodes, so this would be very quick.
If the release date of the movie is indexed in Neo, the ordering of the results would also be very quick.
So theoretically this query should run quickly (well under a second) and have minimal impact on the database.
Here is what I'd do: I would start traversing from the actor with the lowest degree, i.e. the highest selectivity in your dataset. Then find the movies that actor acted in and check those movies against the rest of the actors.
The second option might be more efficient implementation wise. (There is also another trick that can speed up that one even more, let me know via email when you have the dataset to test it on).
// shared prefix; follow it with exactly one of the two endings below
MATCH (n:Actor) WHERE n.id IN {ids}
WITH n, SIZE( (n)-[:ACTED_IN]->() ) AS degree
ORDER BY degree ASC
WITH collect(n) AS actors
WITH head(actors) AS first, tail(actors) AS rest, size(actors)-1 AS number
// either: start from the lowest-degree actor and check every other actor per movie
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number AND ALL(a IN rest WHERE (a)-[:ACTED_IN]->(m))
RETURN m;
// or: expand the cast of each candidate movie and count the matches against rest
MATCH (first)-[:ACTED_IN]->(m)
WHERE size( (m)<-[:ACTED_IN]-() ) > number
MATCH (m)<-[:ACTED_IN]-(a)
WHERE a IN rest
WITH m, count(*) AS c, number
WHERE c = number
RETURN m;
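The "start from the most selective actor" strategy can be sketched in Python as a set intersection. This is a simplified model, not the Cypher execution plan; `movies_with_all` and the sample `acted_in` mapping are hypothetical names and data chosen for the example.

```python
def movies_with_all(acted_in, actor_ids):
    """Movies in which every actor in actor_ids appears.

    acted_in maps an actor id to the set of movie ids they acted in.
    """
    if not actor_ids:
        return set()
    # Most selective actor first: fewest movies means fewest candidates.
    ordered = sorted(actor_ids, key=lambda a: len(acted_in.get(a, ())))
    first, rest = ordered[0], ordered[1:]
    candidates = set(acted_in.get(first, ()))
    for actor in rest:
        candidates &= acted_in.get(actor, set())
        if not candidates:
            break  # no movie can satisfy all actors any more
    return candidates

# Hypothetical sample data.
acted_in = {1: {"A", "B", "C"}, 2: {"B"}, 3: {"B", "C"}}
print(movies_with_all(acted_in, [1, 2, 3]))
```

Starting from the actor with the fewest movies keeps the candidate set small from the first step, so the later membership checks touch as few movies as possible.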

Efficient duplicate node finding in neo4j

A feature request for the next Neo4j version: Neo4j already supports indices that keep properties in a sorted order, allowing fast lookups. Eg. for a person's first name, one might have an index that looks like:
Alice
Bob
Carol
Dave
Emily
(....)
so one can look up "Dave" with binary search (O(log n)) instead of linear scanning (O(n)).
However, one can also use an index to efficiently find duplicates (nodes which have the same value for some property). Eg., if one wants a list of every group of "person" nodes sharing the same first name, what Neo4j 2.3 seems to do (via EXPLAIN in Cypher) is run a comparison of each node's first name against every other first name, which is O(N^2). Eg. this query:
EXPLAIN MATCH (a:person) WITH a MATCH (b:person) WHERE a.name = b.name RETURN a, b LIMIT 5
shows a CartesianProduct step followed by a Filter step. But with an index on first names, one can do a linear scan over a list like:
Alice
Alice
Alice
Bob
Carol
Carol
Dave
Emily
Frank
Frank
Frank
(....)
comparing item #1 to #2, #2 to #3, and so on, to build an ordered list of all the duplicates in O(n) time per scan. Neo4j doesn't seem to support that, but it would be very useful for my application, so I'd like to put in a request.
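The adjacent-comparison scan described above might look like the following in Python. It assumes the names already arrive in sorted order, as they would from a sorted index; `duplicate_runs` is a hypothetical helper, not a Neo4j feature.

```python
def duplicate_runs(sorted_names):
    """Yield (name, count) for each value that occurs more than once.

    Assumes sorted_names is already in sorted order.
    """
    run_value, run_length = None, 0
    for name in sorted_names:
        if name == run_value:
            run_length += 1  # same as the previous item: extend the run
            continue
        if run_length > 1:
            yield run_value, run_length
        run_value, run_length = name, 1  # start a new run
    if run_length > 1:
        yield run_value, run_length  # flush the final run

names = ["Alice", "Alice", "Alice", "Bob", "Carol", "Carol",
         "Dave", "Emily", "Frank", "Frank", "Frank"]
print(list(duplicate_runs(names)))
```

Each item is compared only with its predecessor, so the scan is a single O(n) pass over the sorted list.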
I have a couple of suggestions for what you might try, but if you find them insufficient (and nobody else has any better ideas), I would suggest submitting new feature ideas to the Neo4j GitHub issues list.
So I was wondering if maybe Neo4j considers properties special. If you have an index on a label/property (which you can create with CREATE INDEX ON :person(name)), then comparing a property with a string should be pretty efficient. I tried passing the name through as just a variable and it seems to have fewer DB hits in my small test DB:
MATCH (a:person)
WITH a, a.name AS name
MATCH (b:person)
WHERE name = b.name
RETURN a, b LIMIT 5
That seems to give me fewer DB hits when I PROFILE it.
Another way to go about it, since you're talking about the same set of objects, is to group the nodes by name and then pull out the pairs for each group. Like so:
MATCH (a:person)
WITH a.name AS name, collect(a) AS people
UNWIND people AS a
UNWIND people AS b
WITH name, a, b
WHERE a <> b
RETURN a, b LIMIT 50
Here we collect up an array for each unique name (we could also lower/upper if we wanted to be case-insensitive) and then UNWIND twice to get a cartesian product of the array. Since we're working on a group-by-group basis, this should be much faster than comparing every node to every other node.
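As a rough Python analogue of the collect-then-UNWIND query, the sketch below groups node ids by name and then pairs them within each group only. The `duplicate_pairs` function and the sample ids are assumptions for illustration.

```python
from collections import defaultdict
from itertools import permutations

def duplicate_pairs(people):
    """people: iterable of (node_id, name). Returns ordered pairs sharing a name."""
    groups = defaultdict(list)
    for node_id, name in people:
        groups[name].append(node_id)  # one pass to group by name
    pairs = []
    for ids in groups.values():
        # Ordered pairs within one group: (a, b) and (b, a), never (a, a).
        pairs.extend(permutations(ids, 2))
    return pairs

people = [(1, "Alice"), (2, "Bob"), (3, "Alice"), (4, "Alice")]
print(sorted(duplicate_pairs(people)))
```

The pairing cost is quadratic only in the size of each group, not in the total number of nodes, which is where the saving over an all-against-all comparison comes from.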

Neo4j crashes on 4th degree Cypher query

neo4j noob here, on Neo4j 2.0.0 Community
I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.
Let's say that I've got a specific movie (movie1) in mind.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;
I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies as the people who liked movie1. I call these generation 2 users.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;
This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
This query causes neo4j to spin for several minutes at 100% cpu utilization and using 4GB of RAM. Then it sends back an exception "OutOfMemoryError: GC overhead limit exceeded".
I was hoping someone could help me out and explain to me the issue.
Is Neo4j not meant to handle a query of this depth in a performant manner?
Is there something wrong with my Cypher query?
Thanks for taking the time to read.
That's a pretty intense query, and the deeper you go the closer you're probably getting to a set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form starting with your given movie. @Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.
The branching factor of your tree can be estimated with two values:
u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)
For an estimate, your first step will return u users. On the next step, for each of those users you get all the movies they liked (m each), followed by all the users that liked each of those movies (u each):
gen(1) => u
gen(2) => u * (m * u)
For each generation you'll tack on another m*u, so your third generation is:
gen(3) => u * (m * u) * (m * u)
Or more generically:
gen(n) => u^n * m^(n-1)
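The closed form can be checked against the step-by-step expansion with a short Python sketch. The function names are made up for the example, and the values u = 100 and m = 20 are just sample branching factors.

```python
def gen(n, u, m):
    """Estimated rows at generation n, closed form: u^n * m^(n-1)."""
    return u**n * m**(n - 1)

def gen_iterative(n, u, m):
    """Same estimate built hop by hop: start with u, then multiply by m*u per generation."""
    rows = u
    for _ in range(n - 1):
        rows *= m * u
    return rows

for n in (1, 2, 3):
    print(n, gen(n, u=100, m=20), gen_iterative(n, u=100, m=20))
```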
You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.
To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.
To keep things simple (and assuming your data set is much smaller), let's use:
m = 20 movie ratings/user
u = 100 users have rated/movie
Your query in gen-3 (without distincts) will return:
gen(3) = 100^3 * 20^2
= 400,000,000
400 million nodes (users)
Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).
Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes, your storage requirement for this resultant set is:
400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 KB
= 22,888 MB
= 22.35 GB
So by my conservative estimates, your query requires 22 gigabytes of storage. It seems quite reasonable that Neo4j would run out of memory.
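The arithmetic above can be reproduced in Python. The 60 bytes per node and the branching factors are the same rough assumptions used in the estimate, not measured values, and the variable names are arbitrary.

```python
rows = 100**3 * 20**2            # gen(3) with u = 100, m = 20
bytes_per_node = 60              # rough per-node size assumed above
total_bytes = rows * bytes_per_node
gigabytes = total_bytes / 1024**3  # 1024-based, matching the steps above
copies_per_user = rows / 2700    # 2700 users in the data set

print(rows, total_bytes, round(gigabytes, 2), int(copies_per_user))
```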
My guess is that you're trying to find similarities in the patterns of users, but the query you're using is returning all the users in your dataset duplicated a bunch of times. Maybe you want to be asking questions of your data more like:
what users rate movies most like me?
what users rated most of the same movies as I rated
what movies have users that have rated similar movies to me watched that I haven't watched yet?
Cheers,
cm
To minimize the explosion that @cod3monk3y talks about, I'd limit the number of intermediate results.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
or even like this
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;
If you want to, you can use "profile start ..." in the neo4j shell to see how many hits / db-rows you create in between, starting with your query and then these two.
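To see why a DISTINCT between hops matters, here is a small Python model of the four-hop expansion. It is a simplification: it ignores Cypher's rule that a relationship cannot be reused within one pattern, and the three-user, three-movie data set and the helper names are hypothetical.

```python
def expand(frontier, neighbours, distinct):
    """One hop. frontier is a list of nodes (with repeats unless distinct)."""
    out = [n for node in frontier for n in neighbours[node]]
    return sorted(set(out)) if distinct else out

# Hypothetical data: 3 users who each like the same 3 movies.
likes = {f"u{i}": [f"m{j}" for j in range(3)] for i in range(3)}
liked_by = {f"m{j}": [f"u{i}" for i in range(3)] for j in range(3)}

def movies_gen2(start_movie, distinct):
    users1 = expand([start_movie], liked_by, distinct)
    movies1 = expand(users1, likes, distinct)
    users2 = expand(movies1, liked_by, distinct)
    return expand(users2, likes, distinct)

print(len(movies_gen2("m0", distinct=False)))  # rows without pruning
print(len(movies_gen2("m0", distinct=True)))   # rows with DISTINCT per hop
```

Even on this tiny graph the unpruned version multiplies the row count by 3 on every hop, while the pruned version never holds more rows than there are distinct nodes.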
Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.
The problem with the MATCH clause you are using is that sometimes Cypher will find different patterns where 'usersGen2' is the same as 'usersLikingMovie1' and where 'movie1' is the same as 'moviesGen1' across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the graph, holds it in memory for the duration of the query, and then returns all the moviesGen2 nodes, which could actually be the same node many times over.
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
If you explicitly tell Cypher that the movies and users should be different for each match pattern, it should solve the issue. Try the query below. Additionally, the DISTINCT keyword will make sure you only grab each 'moviesGen2' node once.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;
Additionally, in 2.0, the START clause is not required, so you can leave it out altogether (but only if you are NOT using a legacy index and use labels instead)...
Hope this works... Please correct my answer if there are syntax errors...

High degree nodes in neo4j

I'm trying to understand efficient usage patterns with neo4j, specifically in reference to high degree nodes. To give an idea of what I'm talking about, I have User nodes that have attributes which I have modeled as nodes. So there are relationships in my table such as
(:User)-[:HAS_ATTRIB]->(:AgeCategory)
and so on and so forth. The problem is that some of these AgeCategory nodes are very high degree, on the order of 100k, and queries such as
MATCH (u:User)-->(:AgeCategory)<--(v:User), (u)-->(:FavoriteLanguage)<--(v)
WHERE u.uid = "AAA111" AND v.uid <> u.uid
RETURN v.uid
(matching all users that share the same age category and favorite language as AAA111) are very, very slow, since you have to run over the FavoriteLanguage linked list once for every element in the AgeCategory linked list (or at least that's how I understand it).
I think it's pretty clear from the fact that this query takes minutes to resolve that I'm doing something wrong, but I am curious what the right procedure for dealing with queries like this is. Should I pull down the matching users from each query individually and compare them with an in-memory hash? Is there a way to put an index on the relationships on a node? Is this even a good idea for a schema to begin with?
My intuition is it would be more efficient to first retrieve the two end points (AgeCategory and FavoriteLanguage) for the given node u, and then query the middle node v for a path with these two fixed end points.
To prove that, I created a test graph with the following components,
A node u:User with u.uid = 'AAA111'
A node c:AgeCategory
A node l:FavoriteLanguage
A relationship between u and c, (u)-[:HAS_AGE]->(c)
A relationship between u and l, (u)-[:LIKE_LANGUAGE]->(l)
100,000 nodes v, each of which shares the same c:AgeCategory and l:FavoriteLanguage with the node u; that is, each v connects to l and c: (v)-[:HAS_AGE]->(c), (v)-[:LIKE_LANGUAGE]->(l)
I ran the following query 10 times and got an average running time of 10500 millis.
MATCH (l:FavoriteLanguage)<-[:LIKE_LANGUAGE]-(u:User)-[:HAS_AGE]->(c:AgeCategory)
WHERE u.uid = 'AAA111'
WITH l, c
MATCH (l)<-[:LIKE_LANGUAGE]-(v:User)-[:HAS_AGE]->(c)
WHERE v.uid <> 'AAA111'
RETURN v.uid
With 10,000 v nodes, this query takes around 2000 millis, while your query takes about 27000 millis.
With 100,000 v nodes, this query takes around 10500 millis, while your original query seems to take forever.
So you might give this query a try and see if it can improve the performance with your graph.
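The idea of fixing the two end points first and then looking for users attached to both can be modelled in Python with set lookups. This is only a sketch of the reasoning, not of Neo4j's internals; the `similar_users` helper, the lookup dictionaries, and the sample uids other than AAA111 are hypothetical.

```python
def similar_users(uid, age_of, language_of, users_by_age, users_by_language):
    """Users sharing both the age category and favourite language of uid."""
    c = age_of[uid]        # fixed end point 1: the user's AgeCategory
    l = language_of[uid]   # fixed end point 2: the user's FavoriteLanguage
    # Iterate the smaller membership set and probe the larger one.
    small, large = sorted((users_by_age[c], users_by_language[l]), key=len)
    return {v for v in small if v in large and v != uid}

# Hypothetical sample data.
age_of = {"AAA111": "18-25", "B": "18-25", "C": "26-35", "D": "18-25"}
language_of = {"AAA111": "python", "B": "python", "C": "python", "D": "go"}
users_by_age = {"18-25": {"AAA111", "B", "D"}, "26-35": {"C"}}
users_by_language = {"python": {"AAA111", "B", "C"}, "go": {"D"}}

print(similar_users("AAA111", age_of, language_of, users_by_age, users_by_language))
```

With both end points known up front, each candidate user is checked once against a fixed target instead of re-walking one high-degree node's relationships for every relationship of the other.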
