Optimize neo4j cypher query with very large dataset - neo4j

I'm trying to figure out how to optimize a Cypher query on a very large dataset. I'm trying to find 2nd or 3rd degree friends in the same city. My current Cypher query, which takes over 1 minute to run, is:
match (n:User {id: 123})-[:LIVES_IN]->()<-[:LIVES_IN]-(u:User), (n)-[:FRIENDS_WITH*2..3]-(u) WHERE u.age >= 20 AND u.age <= 36 return u limit 100
There are approximately 500K User nodes and 500M FRIENDS_WITH relationships. I already have indexes on the id and age properties. The query seems to be choking on the FRIENDS_WITH requirement. Is there any way to think about this in a different way or optimize the cypher to make it real-time (i.e., max time 1-2 seconds)?
Here's the profile of the query:
Thanks.

Create index on id property for label User:
CREATE INDEX ON :User(id)
See the documentation on schema indexes for more information: http://neo4j.com/docs/stable/query-schema-index.html
If that doesn't help, add the output of a PROFILE query and we might be able to help you more:
PROFILE MATCH ... rest of your query
It might also be worth rewriting the query as follows:
MATCH (n:User {id: 123})-[:LIVES_IN]->()<-[:LIVES_IN]-(u:User),
(n)-[:FRIENDS_WITH*2..3]-(u)
WHERE u.age >= 20 AND u.age <= 36
return u limit 100

Related

How can I optimise my neo4j cypher query?

Please check my Cypher below. The query returns results quickly with a small number of records, but as the number of records increases it takes a long time, about 1601152 ms.
I found a suggestion to add USING INDEX, so I applied USING INDEX in the query:
PROFILE MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person)-[:WATCHED]->(ma:Movie)-[:HAS_TAG]->(t:Tag)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
USING INDEX a:App(app_id) WHERE p.person_id= '1'
AND NOT (p:Person)-[:WATCHED]-(mb)
RETURN DISTINCT(mb.movie_id) , mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(DISTINCT(t.tag_id)) as Tag, count(DISTINCT(t.tag_id)) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50
Can you help me figure out what I can do?
I am trying to find 100 movies to recommend on the basis of tags, that is, 100 movies that I have not watched and that share tags with movies I have watched.
The following query may work better for you [assuming you have indexes on both :App(app_id) and :Person(person_id)]. By the way, I presumed that in your query the identifier ma should have been m (or vice versa).
MATCH (m:Movie)-[:IN_APP]->(a:App {app_id: '1'})<-[:USER_IN]-(p:Person {person_id: '1'})-[:WATCHED]->(m)
WITH a, p, COLLECT(m) AS movies
UNWIND movies AS movie
MATCH (movie)-[:HAS_TAG]->(t)<-[:HAS_TAG]-(mb:Movie)-[:IN_APP]->(a)
WHERE NOT mb IN movies
WITH DISTINCT mb, t
RETURN mb.movie_id, mb.title, mb.imdb_rating, mb.runtime, mb.award, mb.watch_count, COLLECT(t.tag_id) as Tag, COUNT(t.tag_id) as matched_tags
ORDER BY matched_tags DESC SKIP 0 LIMIT 50;
If you PROFILE this query, you should see that it performs NodeIndexSeek operations (instead of the much slower NodeByLabelScan) to quickly execute the first MATCH. The query also collects all the movies watched by the specified person and uses that collection later to speed up the WHERE clause (which no longer needs to hit the DB). In addition, the query removes labels from some of the node patterns (where doing so seemed likely to be unambiguous) to speed up processing further.
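To make the logic of the rewritten query easier to follow, here is a minimal in-memory Python sketch of the same idea: gather the watched movies once, then count the distinct tags each unwatched movie shares with them. The function `recommend` and the inputs `watched` and `tags` are hypothetical stand-ins for the graph, not part of Neo4j or of the answer above, and the sketch leaves out the :IN_APP filter.

```python
def recommend(watched, tags, limit=50):
    """Rank unwatched movies by the number of distinct tags shared with watched ones.

    watched: set of movie ids the person has watched (hypothetical input)
    tags: dict mapping movie id -> set of tag ids (hypothetical input)
    """
    # Tags reachable from any watched movie (the (movie)-[:HAS_TAG]->(t) step).
    watched_tags = set()
    for movie in watched:
        watched_tags |= tags.get(movie, set())

    # For each candidate, keep the distinct tags it shares (WITH DISTINCT mb, t).
    shared = {}
    for movie, movie_tags in tags.items():
        if movie in watched:  # WHERE NOT mb IN movies
            continue
        common = movie_tags & watched_tags
        if common:
            shared[movie] = common

    # ORDER BY matched_tags DESC LIMIT n; ties are broken by movie id.
    ranked = sorted(shared.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(movie, sorted(common)) for movie, common in ranked[:limit]]


tags = {
    "m1": {"action", "space"},
    "m2": {"action", "space", "robots"},
    "m3": {"romance"},
    "m4": {"space"},
}
result = recommend({"m1"}, tags)
print(result)
```

The membership test against the in-memory set corresponds to `WHERE NOT mb IN movies`, which is why that check no longer needs a separate pattern lookup per candidate.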

Optimizing Cypher Query - Neo4j

I have the following query
MATCH (User1 )-[:VIEWED]->(page)<-[:VIEWED]- (User2 )
RETURN User1.userId,User2.userId, count(page) as cnt
It's a relatively simple query to find co-page view counts between users.
It's just too slow, and I have to terminate it after some time.
Details
User consists of about 150k Nodes
Page consists of about 180k Nodes
User -VIEWED-> Page has about 380k Relationships
User has 7 attributes, and Page has about 5 attributes.
Both User and Page are indexed on UserId and PageId respectively.
Heap Size is 512mb (tried to run on 1g too)
What would be some ways to optimize this query? I don't think the number of nodes and relationships is that large.
Use Labels
Always use Node labels in your patterns.
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Don't match on duplicate pairs of users
This query will be executed for all pairs of users (that share a viewed page) twice. Each user will be mapped to User1 and then each user will also be mapped to User2. To limit this:
MATCH (u1:User)-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE id(u1) > id(u2)
RETURN u1.userId, u2.userId, count(p) AS cnt;
Query for a specific user
If you can bind either side of the pattern the query will be much faster. Do you need to execute this query for all pairs of users? Would it make sense to execute it relative to a single user only? For example:
MATCH (u1:User {name: "Bob"})-[:VIEWED]->(p:Page)<-[:VIEWED]-(u2:User)
WHERE NOT u1=u2
RETURN u1.userId, u2.userId, count(p) AS cnt;
As you try different queries, you can prepend EXPLAIN or PROFILE to the Cypher query to see the execution plan and the number of db hits.
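The effect of the duplicate-pair filter can be illustrated outside the database. The following is a minimal Python sketch, assuming the VIEWED relationships are available as (user, page) tuples; `co_view_counts` and `views` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import combinations

def co_view_counts(views):
    """Count shared pages per unordered user pair.

    views: iterable of (user_id, page_id) tuples (hypothetical input).
    """
    # Group viewers by page, mirroring (u1)-[:VIEWED]->(p)<-[:VIEWED]-(u2).
    viewers = {}
    for user, page in views:
        viewers.setdefault(page, set()).add(user)

    counts = Counter()
    for users in viewers.values():
        # combinations() over a sorted list yields each pair exactly once,
        # which plays the role of WHERE id(u1) > id(u2).
        for a, b in combinations(sorted(users), 2):
            counts[(a, b)] += 1
    return counts


views = [(1, "home"), (2, "home"), (3, "home"), (1, "faq"), (2, "faq")]
counts = co_view_counts(views)
print(dict(counts))
```

Without the ordering constraint every pair would appear twice, once as (a, b) and once as (b, a), so the filter roughly halves the rows that have to be aggregated.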

Efficiently Exporting Relationships From Neo4J

I have a relatively small but growing database (2M nodes, 5M relationships). Relationships often change. I periodically need to export the list of relationships for some other computations.
At present, I use a paginated query, but it gets slow as the value of SKIP increases:
MATCH (a)-[r]->(b) RETURN ID(a) AS id1, ID(b) AS id2, TYPE(r) AS r_type
SKIP %d LIMIT 1000
I am using py2neo. The relevant bit of code:
while (count <= num_records):
    for record in graph.cypher.stream(cq % (skip, limit)):
        id1 = record["id1"]
        id2 = record["id2"]
        r_type = record["r_type"]
Is there a better / more efficient way to do this?
Thanks in advance.
You don't have to use SKIP / LIMIT in the first place.
Neo4j can easily output gigabytes of data.
See this blog post for another way of doing that: http://neo4j.com/blog/export-csv-from-neo4j-curl-cypher-jq/
You can also use Save as CSV in the Neo4j Browser after running a query.
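A rough back-of-the-envelope sketch in Python shows why paging with SKIP degrades. It assumes that each paged query has to produce and discard the skipped rows before returning its page, which matches the slowdown described in the question; the function names are hypothetical and the numbers are a model, not measurements.

```python
def rows_touched_paginated(total_rows, page_size):
    """Rows produced across all pages if every query re-produces the skipped rows."""
    touched = 0
    for skip in range(0, total_rows, page_size):
        # A page at offset `skip` walks past `skip` rows, then returns up to page_size.
        touched += skip + min(page_size, total_rows - skip)
    return touched

def rows_touched_streamed(total_rows):
    """Rows produced when a single query streams the whole result."""
    return total_rows


total, page = 5_000_000, 1000  # 5M relationships, pages of 1000 as in the question
print(rows_touched_paginated(total, page))
print(rows_touched_streamed(total))
```

Under this assumption the paginated export produces about 12.5 billion rows in total to deliver 5 million, which is why a single streamed query is the better fit.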

Cypher performance in graph with large number of relationships from one node

I have a Neo4j graph (ver. 2.2.2) with a large number of relationships. For example: 1 "Group" node, 300000 "Data" nodes, and 300000 relationships from the "Group" node to all existing "Data" nodes. I need to check whether a relationship exists between a set of Data nodes (for example, 200 nodes) and a specific Group node. But the Cypher query I used is very slow. I tried many modifications of this query, with no result.
Cypher to create graph:
FOREACH (r IN range(1,300000) | CREATE (:Data {id:r}));
CREATE (:Group);
MATCH (g:Group),(d:Data) create (g)-[:READ]->(d);
Query 1: COST. 600003 total db hits in 730 ms.
Acceptable, but I asked for only 1 node.
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000] AND id(g)=300000 RETURN id(d);
Query 2: COST. 600003 total db hits in 25793 ms.
Not acceptable.
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] AND id(g)=300000 RETURN id(d);
Query 3: COST. 1000 total db hits in 309 ms.
This is the only solution I found that makes the query acceptable. I returned the ids of all "Group" nodes and manually filtered the result in my code to keep only relationships to the node with id 300000.
You need to replace "..." with real numbers of nodes from 10000 to 10199
PROFILE MATCH (d:Data)<-[:READ]-(g:Group) WHERE id(d) IN [10000,10001,10002 " ..." ,10198,10199] RETURN id(d), id(g);
Question 1: The total number of db hits in query 1 is surprising, but I accept that the physical model of Neo4j determines how this query is executed: it needs to look at every existing relationship of the "Group" node. But why is there such a big difference in execution time between query 1 and query 2 if the number of db hits is the same (and the execution plan is the same)? I'm only returning the id of the node, not a large set of properties.
Question 2: Is query 3 the only way to optimize this query?
Apparently there is an issue with Cypher in 2.2.x with the seekById.
You can prefix your query with PLANNER RULE to use the previous Cypher planner, but you'll have to split your pattern in two to make it really fast. Tested with, e.g.:
PLANNER RULE
MATCH (d:Data) WHERE id(d) IN [30]
MATCH (g:Group) WHERE id(g) = 300992
MATCH (d)<-[:READ]-(g)
RETURN id(d)

Neo4j query for getting first few nodes with highest degree

I am a total beginner with Neo4j and need help. Is there a query for getting the first few nodes with highest degree?
I have nodes called P and nodes called A. There are only links between P and A nodes. I want to have the first 10 nodes P which have the most links to nodes A.
My idea was the following query, but it took so much time!
MATCH (P1:P)-[r]->(A1:A)
RETURN P1.name AS P_name, COUNT(A1) AS A_no
ORDER BY A_no DESC
LIMIT 10
Is there something wrong with my query?
Best,
Mowi
How many nodes do you have in your db?
I'd probably not use Cypher for that; the Java API has a node.getDegree() method, which is much faster.
Your query could be sped up a bit by:
MATCH (P1:P)-->()
RETURN id(P1),count(*) as degree
ORDER BY degree DESC LIMIT 10
You could also try:
MATCH (P1:P)
RETURN id(P1),size((P1)-->()) as degree
ORDER BY degree DESC LIMIT 10
To limit the nodes considered:
MATCH (P1:P)
WHERE P1.foo = "bar"
WITH P1 limit 10000
MATCH (P1)-->()
RETURN id(P1),count(*) as degree
ORDER BY degree DESC LIMIT 10
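For readers who end up computing this outside the database, the same "count, then keep the top few" logic can be sketched in plain Python. This is only an illustration, assuming the P to A links have been exported as (source, target) pairs; `top_degree` and `edges` are hypothetical names.

```python
import heapq
from collections import Counter

def top_degree(edges, k=10):
    """Return the k source nodes with the most outgoing edges.

    edges: iterable of (source, target) pairs (hypothetical input).
    """
    # Counting rows per source corresponds to the count(*) aggregation.
    degree = Counter(source for source, _ in edges)
    # nlargest keeps only k candidates at a time instead of sorting every node,
    # similar in spirit to ORDER BY degree DESC LIMIT k.
    return heapq.nlargest(k, degree.items(), key=lambda kv: kv[1])


edges = [
    ("p1", "a1"), ("p1", "a2"), ("p1", "a3"),
    ("p2", "a1"),
    ("p3", "a1"), ("p3", "a2"),
]
print(top_degree(edges, k=2))
```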

Resources