Neo4j count increases execution time substantially and runs out of memory

I am using Neo4j to store data regarding movie ratings. I would like to count the number of movies that two users both rated. When running the query
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,b,c limit 1000
it completes in less than a second. However, when running
match (a:User)-[:RATED]->(b:Movie)<-[:RATED]-(c:User) return a,count(b),c limit 1000
the database can't finish the query because the heap, which I have set to 4 GB, runs out of memory. Am I using the count function properly? I don't understand how the performance of these two queries can differ so significantly.

MissingNumber has a good explanation of what's going on. When you do aggregations, the whole set has to be considered to do the aggregations correctly, and that must happen before the LIMIT, and this is taking a huge toll on your heap space.
As an alternate in your case, you can try the following:
match (a:User)-[:RATED]->()<-[:RATED]-(c:User)
where id(a) < id(c)
with DISTINCT a, c
limit 1000
match (a)-[:RATED]->(m:Movie)<-[:RATED]-(c)
with a, c, count(m) as moviesRated
return a, moviesRated, c
By moving the LIMIT up before the aggregation, using DISTINCT to ensure we only deal with each pair of nodes in this pattern once (and applying a predicate on internal node ids so we never deal with mirrored results), we get a more efficient query. Then, for each of those 1000 pairs of a and c, we expand the pattern again and get the actual counts.

I ran into a similar situation and solved it using the following approach, which should be applicable to you.
I used a data set having:
(TYPE_S) - 380 nodes
(TYPE_N) - 800000 nodes
[:S_realation_N] - 5600000 relations
Query one:
match (s:TYPE_S)-[]-(n:TYPE_N) return s, n limit 10
This took 2 milliseconds.
As soon as 10 patterns (relationships) are found in the db, Neo4j just returns the result.
Query two:
match (s:TYPE_S)-[]-(n:TYPE_N) return s, sum(n.value) limit 10
This took ~4000 milliseconds.
This might look like it should be as fast as the previous one, but it won't be, because of the aggregation involved.
Reason:
To aggregate over the pattern, Neo4j has to load every path that matches the given pattern into RAM before performing the aggregation (that is far more than 10 or whatever limit is given; it is 5600000 paths for my dataset). Only then is the aggregation reduced to 10 rows of TYPE_S nodes, so that the result fits the specified return format and limit, and the rest of the data in RAM is discarded. This means that, for a moment, RAM holds a lot of data that will later be ignored because of the limit.
So to optimize runtime and memory usage, you have to avoid the parts of the query that load data which will later be ignored.
This is how I optimized it:
match (s:TYPE_S) where ((s)-[]-(:TYPE_N))
with collect(s)[0..10] as s_list
unwind s_list as s
match (s)-[]-(n:TYPE_N) return s, sum(n.value)
This took 64 milliseconds.
Now Neo4j first shortlists 10 nodes of type TYPE_S that have relationships with TYPE_N nodes, and then matches the pattern with these nodes and gets their data.
This should work and run better than query two, since you are loading a limited set of records into RAM.
You could build your query in a similar way, by shortlisting 1000 distinct (a, c) user pairs and then performing the aggregation on them, as in the sketch below.
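A rough, untested sketch of that idea using the labels and relationship type from the original question (the 1000-pair cutoff and the id(a) < id(c) trick for skipping mirrored pairs are assumptions, not part of the original query):
match (a:User)-[:RATED]->()<-[:RATED]-(c:User)
where id(a) < id(c)                              // skip mirrored (c, a) duplicates
with collect(distinct [a, c])[0..1000] as pairs  // shortlist 1000 distinct user pairs
unwind pairs as pair
with pair[0] as a, pair[1] as c
match (a)-[:RATED]->(m:Movie)<-[:RATED]-(c)      // expand only for the shortlisted pairs
return a, c, count(m) as moviesRated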
But this approach will not work in cases where you need to order by the aggregated value.
The reason your query runs out of memory is that you are using 4 GB of RAM while running a query that may load a lot of combinational data into RAM (this can sometimes be more than the size of your db, due to the multiplicity of combinations defined by your patterns; even if you have only 50 unique users, there are 50*49 possible unique pairs whose patterns can be loaded into RAM). Other transactions and queries running in parallel can also have an impact.

Related

Performance Issue with neo4j

There is a dataset on my notebook's virtual machine:
2 million unique Customers [:VISITED] 40000 unique Merchants.
Every [:VISITED] relationship has properties: amount (double) and dt (date).
Every Customer has a property "pty_id" (Integer).
And every Merchant has an mcht_id (String) property.
One Customer may visit one Merchant more than once, and of course one Customer may visit many Merchants. So there are 43 978 539 relationships in my graph between Customers and Merchants.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX  on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with 2 core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get the top 10 Customers, ordered by total amount spent, who did NOT visit the specified Merchant (M) but did visit Merchants that have been visited by Customers who visited the specified Merchant (M).
My Solution is:
Let's say M has mcht_id = "0000000DA5".
Then the CYPHER query will be:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
Result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 - 327962.95 and so on.
The problem is that I received this result after 437613 ms! That's about 7 minutes! My expected time for this query was about 10-20 seconds.
My Question is: What am I doing wrong?
There are a few things to improve here.
First, for graph-wide queries on a graph with millions of nodes and 50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (this is your "scratch space" memory as a query executes), and getting as much of the graph as possible into pagecache to minimize cache misses as you traverse it. Neo4j likes memory, and memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation for your memory settings, but in general you need to run this on a machine with more memory.
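For illustration only (these values are assumptions for a machine with more RAM, not a recommendation derived from your dataset; run neo4j-admin memrec against your own hardware), the relevant neo4j.conf settings would look something like:
# heap is the query "scratch space"
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
# pagecache holds the graph data itself
dbms.memory.pagecache.size=6g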
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million to only 153k rows; that's significant. Your most expensive step here, the WHERE NOT (c)-[:VISITED]->(m) filter, is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. This is happening too early in the query: you should be doing it AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve on this so you don't have to hit the graph at all for that filtering step. Instead of the WHERE NOT <pattern> approach, you can pre-match the customers who visited the first merchant and gather them into a list. Then, instead of negating the pattern (which has to expand all :VISITED relationships of those customers and check whether any of them leads to the original merchant), we do a list membership check, ensuring they aren't one of the 1k or so customers who visited the original merchant. That check happens in memory, since we already collected the list, so it shouldn't hit the graph. In any case, you should do the DISTINCT before this check.
In your RETURN you're aggregating with respect to a node's unique property, so you're paying the cost of projecting that property across 4 million rows BEFORE the aggregation drops the cardinality to 153k rows; you're projecting that property redundantly across a great many duplicate :Customer nodes before they become distinct through the aggregation. That's redundant and expensive property access you can avoid by aggregating with respect to the node instead, then doing the property access after the aggregation, and also after the sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great deal of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (as it may need this info for doing aggregations such as counting of occurrences), and this query is highly dependent on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some expansion procs which let us change how Cypher expands, only visiting each distinct node once across all paths. That may improve the timing here. At the least it will show us if the slowdown we're seeing is related to deduplication of nodes during expansion, or if it's something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m) since this will naturally be ruled out; we would have already visited every visitor of m, and since they've already been visited, we cannot visit them again, so none of them will appear in the final result set at 3 hops.
Give this a try and run it a couple times to get that into pagecache (or as much as possible...with 512MB pagecache you may not be able to get all of the traversed structure into memory).
I have tested all the optimized queries on Neo4j and on Oracle. The results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there is more commentary on the Neo4j case at http://homme.io/41721.

Despite creating appropriate indexes, queries take a long time, although Neo4j writes are fast enough

I am using neo4j-community-3.5.3 server in a system having 64 GB RAM and 32 cores.
My database size is currently 160 GB and it is growing by about 1.5 GB every day. I keep 12 GB in page cache and 8 GB in heap.
Apart from uniqueness constraints, I also create indexes on some of my node properties. Since lucene_native-1.0 indexing is deprecated in the current Neo4j version, I am using the default native-btree-1.0.
The problem I am facing is that my write performance is very good, but reading results back takes around 1 minute, even for queries that use indexes.
My index size is almost 21 GB. My database size is continuously growing, but I am not getting the query performance I expected.
Please give me some suggestions so that I can tune my queries. Thanks in advance.
Here is a sample of my query with indexing, and some profiles:
PROFILE
OPTIONAL MATCH (u1:USER)<-[p:MENTIONS]-(tw:TWEET)<-[m:POST]-(u2:USER)
USING INDEX tw:TWEET(date)
WHERE tw.date='2019-03-03' AND u1.author_screen_name='xxx'
RETURN
u1.author_screen_name as mentioned_author,
u2.author_name as mentioned_by_author,
count(*) AS weight
ORDER BY weight DESC LIMIT 20
Query_profile1_using_indexing
Query_profile2_using_indexing
Query_profile3_using_indexing
And here is a query without indexing, and some profiles:
PROFILE
OPTIONAL MATCH (u1:USER)<-[p:MENTIONS]-(tw:TWEET)<-[m:POST]-(u2:USER)
WHERE tw.date='2019-03-03' AND u1.author_screen_name='xxx'
RETURN
u1.author_screen_name as mentioned_author,
u2.author_name as mentioned_by_author,
count(*) AS weight
ORDER BY weight DESC LIMIT 20
Query_profile1_without_using_indexing
Query_profile2_without_using_indexing
Query_profile3_without_using_indexing
Without indexing, the query takes 880572 ms.
With indexing, the same query takes 57674 ms.
In either case you're doing your projections at the same time as your aggregation, which isn't efficient. First of all, since there's only a single u1, project its author_screen_name at the beginning, while your cardinality is still a single row.
Then, after your match, do your aggregation, ordering, and limiting based on the nodes themselves. Once your results are aggregated, THEN do the projections, so you're doing a minimal amount of work; you don't want to do property access for a ton of rows that you're only going to discard once you have the limited result set:
MATCH (u1:USER)
WHERE u1.author_screen_name = 'xxx'
WITH u1, u1.author_screen_name as mentioned_author
OPTIONAL MATCH ...
...
WITH mentioned_author, u2, count(*) AS weight
ORDER BY weight DESC
LIMIT 20
RETURN mentioned_author, u2.author_name as mentioned_by_author, weight
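For reference, a filled-in version of that sketch; this is an untested sketch assuming the labels, relationship types, and the date and screen-name values from the question:
MATCH (u1:USER)
WHERE u1.author_screen_name = 'xxx'
WITH u1, u1.author_screen_name AS mentioned_author   // project while cardinality is 1 row
OPTIONAL MATCH (u1)<-[:MENTIONS]-(tw:TWEET)<-[:POST]-(u2:USER)
WHERE tw.date = '2019-03-03'
WITH mentioned_author, u2, count(*) AS weight        // aggregate on the node, not its properties
ORDER BY weight DESC
LIMIT 20
RETURN mentioned_author, u2.author_name AS mentioned_by_author, weight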

Complexity of a neo4j query

I need to measure the performance of any query.
For example:
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried the PROFILE and EXPLAIN clauses, but they only return the number of db hits. Is it possible to get big O notation for a Neo4j query, i.e. can we measure performance in terms of big O notation? Are there any other ways to check query performance apart from using PROFILE and EXPLAIN?
No, you cannot convert a Cypher query to big O notation.
Cypher does not describe how to fetch information, only what kind of information you want returned. It is up to the Cypher planner in the Neo4j database to convert a Cypher query into an executable query, using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher query.
A very simple example of this is the Cypher 3.1 query MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. Run against a fairly average connected graph with cycles, Neo4j 3.1.1 will time out for being too complex (because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (because the planner recognizes it only needs to do a graph scan, like a depth-first search, to find all connected nodes).
Side note: you can argue about big O notation on the returned results. For example, MATCH (a), (b) must have a minimum complexity of n^2, because the result is a Cartesian product and execution can't be less complex than the answer. This understanding of how complexity affects row counts can help you write Cypher that reduces the amount of work the planner ends up planning.
For example, using WITH COLLECT(n) as data MATCH (c:M) reduces the number of rows the planner works against before the next part of the query from n*m (first match count times second match count) to m (1 times second match count).
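A minimal sketch of that row-reduction pattern, using hypothetical labels :N and :M:
MATCH (n:N)
WITH COLLECT(n) AS data   // collapses n rows into a single row holding a list
MATCH (c:M)               // the planner now works against 1*m rows instead of n*m
RETURN c, data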
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cypher queries that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate whether the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this sort is readily available, but it can be derived or approximated with some additional effort.
On profiling a query, we get a list of operators that Neo4j will run to achieve the desired result.
Each of these operators is associated, in theory, with worst-case to best-case complexities. Some of them will also run in parallel, which impacts runtimes depending on the cores your server has.
For example, match (a:A) match (b:B) results in a Cartesian product, which is O(count(a)*count(b)).
Similarly, each operator in your query plan has such a time complexity.
So aggregating the individual time complexities of these operators will give you an overall approximation of the time complexity of the query.
But this will change over time with each version of Neo4j, since the community can always change the implementation of a query to achieve better runtimes, structural changes, parallelization, or lower RAM usage.
If what you are looking for is an indication of how well a Neo4j query is optimized, db hits are a good indicator.

Cypher performance for multiple hops

I'm running my Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two people with variable relationship lengths. I get a reasonable response time running a query for a single relationship length (between 0.5-2 seconds) [the person ids are indexed].
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), my response time goes up to several minutes. Assuming that each person has on average the same number of connections, I would expect my queries to run for 2-3 minutes max (but I get up to 5+ min).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN; it did not show extreme values for the VarLengthExpand(All) step.
Maybe the traversal is not using the index for pr2.
Is there anyway to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to limit the number of returned paths, as in the sketch below.
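For instance (the limit of 100 is an arbitrary illustrative value):
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
LIMIT 100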
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using EXPLAIN, try PROFILE instead.

Cypher query to get subsets of different node labels, with relations

Let's assume this use case:
We have a few nodes (labeled Big), each having a simple integer ID property.
Each Big node has relationships with millions of (labeled Small) nodes,
such as:
(Small)-[:BELONGS_TO]->(Big)
How can I write a Cypher query for the following requirement, expressed in natural language:
For each Big node with an id in the range 4-7, get me 10 of the Small nodes that belong to it.
The expected result would give 2 Big nodes, 20 Small nodes, and 20 relationships.
The needed result would be represented by this graph:
2 Big nodes, each with a subset of 10 of the Small nodes that belong to them.
What I've tried but failed (it only shows 1 Big node (id=5) along with 10 of its related Small nodes, but doesn't show the second node (id=6)):
MATCH (s:Small)-[:BELONGS_TO]->(b:Big)
Where 4<b.bigID<7
return b,s limit 10
I guess I need a more complex compound query.
Hope I could phrase my question in an understandable way!
As stdob-- says, you can't use limit here, at least not in this way, as it limits the entire result set.
While the aggregation solution will return the right answer, you'll still pay the cost of the expansion to those millions of nodes. You need a solution that lazily gets just the first ten for each.
Using APOC Procedures, you can use apoc.cypher.run() to effectively perform a subquery. The subquery is run per row, so if you limit the rows first, you can use LIMIT within the subquery and it will properly limit to 10 results per row, expanding lazily so you don't pay for an expansion to millions of nodes.
MATCH (b:Big)
WHERE 4 < b.bigID < 7
CALL apoc.cypher.run('
WITH $b AS b
MATCH (s:Small)-[:BELONGS_TO]->(b)
RETURN s LIMIT 10',
{b:b}) YIELD value
RETURN b, value.s
Your query does not work because the limit applies to the entire preceding result stream.
You need to use the aggregation function collect:
MATCH (s:Small)-[:BELONGS_TO]->(b:Big) WHERE 4 < b.bigID < 7
WITH b, collect(distinct s)[..10] as smalls
RETURN b, smalls
