Cypher performance for multiple hops - Neo4j

I'm running my Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two people with variable relationship lengths. I get a reasonable response time when running a query for a single relationship length (between 0.5 and 2 seconds); the person ids are indexed.
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), my response time goes up to several minutes. Assuming each person has on average the same number of connections, I would expect my queries to run for 2-3 minutes max (but I get 5+ minutes).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN, but it did not show extreme values for the VarLengthExpand(All) operator.
Maybe the traversal is not using the index for pr2.
Is there any way to improve the performance of my query?

Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
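To get a feel for the growth: if each person has on average 100 relationships (a number chosen purely for illustration), a *0..2 expansion from one node can reach on the order of 100^2 = 10,000 end nodes, and the number of distinct paths to them can be larger still; every extra hop multiplies the work by roughly another factor of 100.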
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to limit the number of returned paths.
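For instance, here is a sketch of the original query capped at 100 paths (the limit value is arbitrary):
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
LIMIT 100;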
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
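For example, to profile the counting query above (remember that this actually executes the expansion):
PROFILE
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);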

Rather than using EXPLAIN, try PROFILE instead.

Get start and end nodes of specific path in a large graph

I have a large graph (1,068,029 nodes and 2,602,897 relationships), and I work with it via the Python API, making requests to the graph in my program flow.
I have the following queries -
First query
MATCH
(start_node)--(o:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
Second query
MATCH
(start_node)--(o1:observed_data)--(h:MD5)--(o2:observed_data)--(i:indicator)--(m:malware)--(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name
When I perform the first query with an id_list of size 75,000, it passes OK and returns the wanted output, but when I try to perform the second query, the graph gets stuck, even when I decrease the id_list to 20,000.
The id_list is actually even larger than 75,000, but I split it into chunks to make the graph's response time faster. If I split it into too many chunks, though, I increase the number of requests to the graph and therefore the program's run time.
My question is: is there a library function of some sort (APOC or something like that) that performs the same action in less time? Or maybe you have another solution that solves this problem without decreasing the id_list below 50,000?
The (start_node) in your MATCH patterns should specify a label (like (start_node:Foo)), to avoid having to scan every node in the DB. Also, you should create an index (or uniqueness constraint) for that start node.
You should make all the relationships in your MATCH patterns directional, if appropriate. That is, put an arrow on one end of each relationship (like -->).
You should specify the relationship types in your patterns as well (like ()-[:BAR]->()), so that the query would not be forced to evaluate all relationship types.
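Putting those suggestions together, here is a sketch of the first query (the start label Foo and the single relationship type BAR are placeholders, since the question does not name the real ones; the directions are likewise assumptions):
CREATE INDEX ON :Foo(id)

MATCH (start_node:Foo)-[:BAR]->(o:observed_data)-[:BAR]->(i:indicator)-[:BAR]->(m:malware)-[:BAR]->(end_node:attack_pattern)
WHERE start_node.id in [id_list]
RETURN start_node.id, end_node.name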

Performance Issue with Neo4j

There is a dataset on my notebook's virtual machine:
2 million unique Customers [:VISITED] 40,000 unique Merchants.
Every [:VISITED] relationship has properties: amount (double) and dt (date).
Every Customer has a property “pty_id” (integer).
And every Merchant has an “mcht_id” (string) property.
One Customer may visit one Merchant more than once, and of course one Customer may visit many Merchants. So there are 43,978,539 relationships in my graph between Customers and Merchants.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with 2 core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get the top 10 Customers, ordered by total amount, who did NOT spend their money at a specified Merchant (M), but who visited Merchants that have been visited by Customers who visited that specified Merchant (M).
My Solution is:
Let M have mcht_id = "0000000DA5".
Then the Cypher query is:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
The result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 - 327962.95; and so on.
The problem is that I received this result after 437,613 ms. That's about 7 minutes! My estimated time for this query was about 10-20 seconds.
My question is: what am I doing wrong?
There are a few things to improve here.
First, for graph-wide queries in a graph with millions of nodes and 50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (this is your "scratch space" memory as a query executes), and to try to get as much of the graph size as possible in pagecache if you can to minimize cache misses as you traverse the graph. Neo4j likes memory. Memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation of how to configure your memory settings, but in general you need to run this on a machine with more memory.
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million to only 153k rows; that's significant. Your most expensive step here (WHERE NOT (c)-[:VISITED]->(m)) is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. This is happening too early in the query: you should be doing this AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve on this so you don't even have to hit the graph for that filtering step. Instead of the WHERE NOT <pattern> approach, you can pre-match the customers who visited the first merchant, gather them into a list, and keep that list around. Then, instead of negating the pattern (which has to actually expand out all :VISITED relationships of those customers and check whether any leads back to the original merchant), we do a list membership check, ensuring each customer isn't one of the 1k or so customers who visited the original merchant. That check happens in memory, since we already collected the list, so it shouldn't hit the graph. In any case, you should do the DISTINCT before this check.
In your RETURN you're performing an aggregation keyed on a node's unique property, so you're paying the cost of projecting that property across 4 million rows BEFORE the aggregation drops the cardinality to 153k rows. In other words, you're projecting that property redundantly across a great many duplicate :Customer nodes before they become distinct through the aggregation. You can avoid that redundant and expensive property access by aggregating with respect to the node instead, then doing the property access after the aggregation, and also after your sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great deal of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (as it may need this info for doing aggregations such as counting of occurrences), and this query is highly dependent on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some expansion procs which let us change how Cypher expands, only visiting each distinct node once across all paths. That may improve the timing here. At the least it will show us if the slowdown we're seeing is related to deduplication of nodes during expansion, or if it's something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m), since this is naturally ruled out: we will already have visited every visitor of m, and since they've already been visited, we cannot visit them again, so none of them can appear in the final result set at 3 hops.
Give this a try and run it a couple times to get that into pagecache (or as much as possible...with 512MB pagecache you may not be able to get all of the traversed structure into memory).
I have tested all the optimised queries on Neo4j and on Oracle. The results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there is more material for the Neo4j case at http://homme.io/41721.

Why does this Cypher query never finish

This is the query:
MATCH (t:Table)-[*]-(a:Attribute) RETURN t,a
(Screenshots of the complete graph and of the query execution were attached to the original question.)
The reason is that you are performing a variable-length relationship match without an upper bound. Cypher will attempt to find every possible path, no matter how long, provided that the path begins with a :Table node and ends with an :Attribute node. While a relationship will only be traversed once per path, there's no restriction against using a different relationship to return to a previously traversed node and then leaving it via yet another as-of-yet-untraversed relationship to continue the path.
Even on a small graph, the number of possible paths explodes. You can see for yourself how the number of paths grows, and how the db will get slower as the number of possible paths to explore explodes.
MATCH (:Table)-[*..6]-(:Attribute)
RETURN count(*) as pathsFound
Now if that finishes quickly, increase the upper bound and run it again, and keep doing that, and see how high you can go, and how large the paths-found count gets, before the db starts running into trouble.
I'll save you some time, though. I recreated your graph, and you hit the maximum number of possible paths at an upper bound of 23 hops, returning a count of 1,371,112 total distinct paths matching that pattern. The browser alone won't be able to cope with that many rows of data.
Here are two queries you can run to verify it (provided that this is your entire graph):
MATCH (:Table)-[*..23]-(:Attribute)
RETURN count(*) as totalPathsFound
and
MATCH path = (:Table)-[*..23]-(:Attribute)
RETURN length(path) as pathLength, count(*) as pathsFound
ORDER BY pathLength DESC
Note that expanding out and counting the number of possible paths isn't too strenuous; we can get that in a few seconds. But property access or additional computations that multiplicatively increase the number of paths can be a problem, and streaming back this many rows of data, especially to a browser app, can be a problem.
More to the point, I don't think you really want to process over a million results anyway. What the query is actually doing is likely completely different than what you really want. So you may want to clarify what exactly you want the query to do, because the current approach isn't feasible.

Neo4j count increases execution time substantially and runs out of memory

I am using Neo4j to store data regarding movie ratings. I would like to count the number of movies that two users have both rated. When running the query
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,b,c limit 1000
it completes in less than a second; however, running
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,count(b),c limit 1000
the database can't finish the query, as the heap runs out of memory, which I have set to 4 GB. Am I using the count function properly? I don't understand how the performance of these two queries can differ so significantly.
MissingNumber has a good explanation of what's going on. When you do aggregations, the whole set has to be considered to do the aggregations correctly, and that must happen before the LIMIT, and this is taking a huge toll on your heap space.
As an alternate in your case, you can try the following:
match (a:User)-[:RATED]->()<-[:RATED]-(c:User)
with DISTINCT a, c
where id(a) < id(c)
limit 1000
match (a)-[:RATED]->(m:Movie)<-[:RATED]-(c)
with a, c, count(m) as moviesRated
return a, moviesRated, c
By moving the LIMIT up before the aggregation, using DISTINCT to ensure we only deal with each pair of nodes in this pattern once, and applying a predicate based on graph ids so we never deal with mirrored results, we get a more efficient query. Then for each of those 1000 pairs of a and c, we expand out the pattern again and get the actual counts.
I have run into a similar situation and solved it using the following approach, which should be applicable to you as well.
I used a data set having:
(TYPE_S) - 380 nodes
(TYPE_N) - 800,000 nodes
[:S_realation_N] - 5,600,000 relationships
Query one:
match (s:TYPE_S)-[]-(n:TYPE_N) return s, n limit 10
This took 2 milliseconds.
As soon as 10 matching paths (relationships) are found in the db, Neo4j just returns the result.
Query two:
match (s:TYPE_S)-[]-(n:TYPE_N) return s, sum(n.value) limit 10
This took ~4,000 milliseconds.
This might look like it should be as fast as the previous query, but it won't be, because of the aggregation involved.
Reason:
To aggregate over the pattern, Neo4j has to load every path that matches it (far more than 10; 5,600,000 in my dataset) into RAM before performing the aggregation. The aggregation is then performed, and only the 10 TYPE_S records needed to satisfy the given LIMIT are kept in the specified return format; the rest of the data in RAM is then flushed. This means that, for a moment, RAM holds a lot of data that will later be ignored because of the LIMIT.
So to optimize run time and memory usage here, you have to avoid the part of the query that loads data which will later be ignored.
This is how I optimized it:
match (s:TYPE_S) where ((s)-[]-(:TYPE_N))
with collect(s)[0..10] as s_list
unwind s_list as s
match (s)-[]-(n:TYPE_N) return s, sum(n.value)
This took 64 milli-seconds.
Now Neo4j first shortlists 10 nodes of type TYPE_S that have relationships to TYPE_N nodes, and then matches the pattern against these nodes and gets their data.
This should work and run better than query two, since you are loading a limited set of records into RAM.
You could build your query in a similar way, by shortlisting 1000 distinct (a, c) user pairs and then performing the aggregation on them, as sketched below.
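A sketch of that adaptation (essentially the shape of the query from the answer above, with the LIMIT applied to distinct pairs before the aggregation):
match (a:User)-[:RATED]->()<-[:RATED]-(c:User)
where id(a) < id(c)
with distinct a, c limit 1000
match (a)-[:RATED]->(m:Movie)<-[:RATED]-(c)
return a, count(m) as moviesRated, c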
But this approach will fail in cases where you need to order by the aggregated value.
The reason your query runs out of memory is that you are using 4 GB of RAM while running a query that may load a huge amount of combinational data into RAM (this may sometimes be more than the size of your db, due to the multiplicity of the combinations your patterns define: even with just 50 unique users, there are 50*49 = 2,450 possible user-pair combinations that can be loaded into RAM). Other transactions and queries running in parallel could also have an impact.

Neo4j Recommendation Cypher Query Optimization

I am using Neo4j Community Edition embedded in a Java application for recommendation purposes. I made a custom function which contains complex logic for comparing two entities, namely products and users. Both entities are present as nodes in the graph, and each has more than 20 properties used for comparison. For example, I am calling this function in the following format:
match (e:User {user_id:"some-id"}) with e
match (f:Product {product_id:"some-id"}) with e,f
return e,f,findComparisonValue(e,f) as pref_value;
This function call takes about 4-5 ms to run on average. Now, to recommend the best product to a particular user, I wrote a Cypher query which iterates over all products, calculates the pref_value, and ranks them. My Cypher query looks like this:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN distinct reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above Cypher query takes more than 10 seconds, as it iterates over all the products. On top of this query, I am using the graphaware-reco module for my recommendation needs (precompute, filtering, post-processing, etc.). I thought of parallelising this, but Community Edition does not support clustering. Now, as the number of users in the system increases day by day, I need to think of a scalable solution.
Can anyone help me out here with how to optimize the query?
As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and does not take advantage of neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes.
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products; see the snippet below. (By the way, if that property is actually supposed to be a boolean, then consider storing it as a boolean rather than a string.)
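For example, using the Neo4j 3.x index syntax seen elsewhere in this thread:
CREATE INDEX ON :Product(is_active)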
The RETURN clause should not need the DISTINCT operator, since all the result rows should be distinct anyway (every reco value is already distinct). Removing that keyword should improve performance, as shown below.
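That is, the second query unchanged except that DISTINCT is dropped:
MATCH (source:User) WHERE id(source)={id} with source
MATCH (reco:Product) WHERE reco.is_active='t'
with reco, source, findComparisonValue(source, reco) as score_result
RETURN reco, score_result.score as score, score_result.params as params, score_result.matched_keywords as matched_keywords
order by score desc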
