Neo4j Recommendation Cypher Query Optimization - neo4j

I am using Neo4j community edition embedded in java application for recommendation purpose. I made a custom function which contains a complex logic of comparing two entities, namely product and users. Both entities are present as nodes in graph and has more than 20 properties each for comparison purpose. For eg. I am calling this function in following format:
match (e:User {user_id:"some-id"}) with e
match (f:Product {product_id:"some-id"}) with e,f
return e,f,findComparisonValue(e,f) as pref_value;
This function call on an average takes about 4-5 ms to run. Now, to recommend best product to a particular user, I wrote a cypher query which iterates over all products, calculate the pref_value and rank them. My cypher query looks like this:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN distinct reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above cypher query is taking more than 10 seconds as it is iterating over all the products. On top of this cypher query, I am using graphaware-reco module for my recommendation needs (Using precompute, filteing, post processing etc). I thought of parallelising this but community edition does not support clustering. Now, as number of users in system is increasing day by day, I need to think of a scalable solution.
Can anyone help me out here, on how to optimize the query.

As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and does not take advantage of neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes.
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products. (By the way, if that property is actually supposed to be a boolean, then consider making it a boolean rather than a string.)
The RETURN clause should not need the DISTINCT operator, since all the result rows should be distinct anyway. This is because every reco value is already distinct. Removing that keyword should improve performance.

Related

Performance Issue with neo4j

There is DataSet at my Notebook’s Virtual Machine:
2 million unique Customers [:VISITED] 40000 unique Merchants.
Every [:VISIT] has properties: amount (double) and dt (date).
Every Customer has property “pty_id” (Integer).
And every Merchant has mcht_id (String) property.
One Customer may visit one Merchant for more than one time. And of course, one Customer may visit many Merchants. So there are 43 978 539 relationships in my graph between Customers and Merchants.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX  on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with 2 core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get top 10 Customers ordered by total_amount who spent their money at NOT specified Merchant(M) but visit that Merchants which have been visited by Customers who visit this specified Merchant(M)
My Solution is:
Let’s M will have mcht_id = "0000000DA5"
Then the CYPHER query will be:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
Result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 -
327962.95 and so on.
The problem is that this result I have received after 437613 ms! It’s about 7 minutes!!! My estimated time for this query was about 10-20 seconds….
My Question is: What am I doing wrong???
There's a few things to improve here.
First, for graph-wide queries in a graph with millions of nodes and 50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (this is your "scratch space" memory as a query executes), and to try to get as much of the graph size as possible in pagecache if you can to minimize cache misses as you traverse the graph. Neo4j likes memory. Memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation of how to configure your memory settings, but in general you need to run this on a machine with more memory.
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million to only 153k rows, that's significant. Your most expensive step here (WHERE
NOT (c)-[:VISITED]->(m)) is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. This is happening too early in the query - you should be doing this AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve upon this so you don't even have to hit the graph to do that step of the filtering. Instead of using that WHERE NOT <pattern> approach, you can pre-match to the customers who visited the first merchant, gather them into a list, and keep them around, and instead of using negation of the pattern (where it has to actually expand out all :VISITED relationships of those customers and see if any was the original merchant), we instead do a list membership check, and ensure they aren't one of the 1k or so customers who visited the original merchant. That will happen in memory, since we already collected that list, so it shouldn't hit the graph. In any case you should do DISTINCT before this check.
In your RETURN you're performing an aggregation with respect to a node's unique property, so you're paying the cost of projecting that property across 4 million rows BEFORE the cardinality drops from the aggregation to 153k rows, meaning you're projecting out that property redundantly across a great many duplicate :Customer nodes before they become distinct from the aggregation. That's redundant and expensive property access you can avoid by aggregating with respect to the node instead, and then do your property access after the aggregation, and also after your sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great deal of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (as it may need this info for doing aggregations such as counting of occurrences), and this query is highly dependent on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some expansion procs which let us change how Cypher expands, only visiting each distinct node once across all paths. That may improve the timing here. At the least it will show us if the slowdown we're seeing is related to deduplication of nodes during expansion, or if it's something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m) since this will naturally be ruled out; we would have already visited every visitor of m, and since they've already been visited, we cannot visit them again, so none of them will appear in the final result set at 3 hops.
Give this a try and run it a couple times to get that into pagecache (or as much as possible...with 512MB pagecache you may not be able to get all of the traversed structure into memory).
I have tested all optimised query on Neo4j and on Oracle. Results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there is more complimentared for Neo4j case at http://homme.io/41721.

cypher performance for multiple hops /

I'm running my cypher queryies on a very large social network (over 1B records). I'm trying to get all paths between two person with variable relationship lengths. I get a reasonable response time running a query for a single relationship length (between 0.5 -2 seconds) [the person ids are index].
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However when I run the query with multiple lengths (i.e. 2 or more) my response time goes up to several minutes. Assuming that each person has in average the same number of connection I should be running my queries for 2-3 minutes Max (but I get up to 5+ min).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
pr1.id='123456'
RETURN paths
I tried to use the EXPLAIN did not show extreme values for the VarLengthExpand(All) .
Maybe the traversing is not using the index for the pr2.
Is there anyway to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can adding a LIMIT clause after your original RETURN clause to limit the number of returned paths.
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using the explain, try the "profile" instead.

Relationship between two nodes not being formed in Neo4j [duplicate]

I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
The browser is telling you that:
It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons" -- and those comparisons would actually be quick index lookups! This is approach should be much faster.
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generate the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you would MATCH e.g. genes on proteins the query might blow.
I think the warning is intended for other problematic queries:
if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
In case your query takes too long or does not finish, here is another question giving some hints how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

Why does neo4j warn: "This query builds a cartesian product between disconnected patterns"?

I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
The browser is telling you that:
It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons" -- and those comparisons would actually be quick index lookups! This is approach should be much faster.
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generate the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you would MATCH e.g. genes on proteins the query might blow.
I think the warning is intended for other problematic queries:
if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
In case your query takes too long or does not finish, here is another question giving some hints how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

Neo4j - Performance of lookups on edge properties in WHERE clause

I've got a data model that I want to roughly model after the article posted in this Graphgist.
I'm curious as to the performance that I can expect on the WHERE clause in the case that a given set of 2 nodes has a large number of relationships between them with 'from' and 'to' parameters defined on each edge. When you do a match query like this where you have let's say 100 SELLS relationships, how does Neo4J handle performance of filtering down the edges to just the one(s) that matter based on the WHERE criteria:
MATCH (s:Shop{shop_id:1})-[r1:SELLS]->(p:Product)
WHERE (r1.from <= 1391558400000 AND r1.to > 1391558400000)
MATCH (p)-[r2:STATE]->(ps:ProductState)
WHERE (r2.from <= 1391558400000 AND r2.to > 1391558400000)
RETURN p.product_id AS productId,
ps.name AS product,
ps.price AS price
ORDER BY price DESC
I haven't found a way to index properties on an edge directly so I'm assuming that either the query optimizer can take care of something like this or it just literally traverses the array of edges and finds the one(s) that match.
Neo4j will just traverse all relationships and read the property value. There are by default no indexes on relationships properties (this can be achieved with the legacy indexes : check documentation).
Concerning performance, bear in mind that Neo4j is very fast at traversing relationships so while your query is "very expensive", Neo4j can traverse 2 to 4 million relationships per second and per core depending on your hardware configuration.
So, to summarize, for 100 relationships it will run like a flash, but it is not optimized at all currently, so you'll see some drawbacks if you need to run the same operations on 1million relationships for example.

Resources