Complexity of a neo4j query

I need to measure the performance of an arbitrary query.
For example:
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried the PROFILE and EXPLAIN clauses, but they only return the number of db hits. Is it possible to get big O notation for a neo4j query, i.e. can we measure performance in terms of big O notation? Are there any other ways to check query performance apart from using PROFILE and EXPLAIN?

No, you cannot convert a Cypher query to big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher into an executable query, using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher.
A very simple example of this is the query Cypher 3.1 MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. On a fairly average connected graph with cycles, running it against Neo4j 3.1.1 will time out for being too complex (because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (because the planner recognizes it only needs to do a graph scan, like a depth-first search, to find all connected nodes).
Side note: you can argue for big O notation on the returned results. For example, MATCH (a), (b) must have a minimum complexity of n^2 because the result is a Cartesian product, and execution can't be less complex than the answer. This understanding of how complexity affects row counts can help you write Cyphers that reduce the amount of work the planner ends up planning.
For example, using WITH COLLECT(n) AS data MATCH (c:M) reduces the number of rows the planner has to work against before the next part of a Cypher from n*m (first match count times second match count) to m (1 times second match count).
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cyphers that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate if the planner is able to find a relatively optimal solution.
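The row-count argument above can be checked with a small simulation. This is only a sketch in plain Python, not Neo4j code: the two lists stand in for the rows produced by two MATCH clauses, and the names and sizes (1000 and 50) are made up for illustration.

```python
from itertools import product


def rows_without_collect(first_match, second_match):
    # MATCH (n:N) MATCH (c:M): every n row is paired with every c row.
    return [(n, c) for n, c in product(first_match, second_match)]


def rows_with_collect(first_match, second_match):
    # MATCH (n:N) WITH collect(n) AS data MATCH (c:M):
    # collect() folds all n rows into one row holding a list, so the
    # second MATCH is applied to 1 row instead of len(first_match) rows.
    data = list(first_match)
    return [(data, c) for c in second_match]


# Hypothetical data: 1000 nodes for the first match, 50 for the second.
n_nodes = [f"n{i}" for i in range(1000)]
m_nodes = [f"c{i}" for i in range(50)]

print(len(rows_without_collect(n_nodes, m_nodes)))  # n*m rows
print(len(rows_with_collect(n_nodes, m_nodes)))     # m rows
```

The result rows carry the same information in both cases; only the number of rows flowing into the next clause differs.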

The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.

Nothing of this sort is readily available, but it can be derived or approximated with some additional effort.
On profiling a query, we get the list of functions (plan operators) that neo4j will run to achieve the desired result.
Each of these functions has, in theory, its own best-case and worst-case complexity. Some of them will also run in parallel, which affects runtimes depending on how many cores your server has.
For example, MATCH (a:A) MATCH (b:B) results in a Cartesian product, which is O(count(a)*count(b)).
Similarly, every function in your query plan has such a time complexity.
So aggregating the individual time complexities of these functions gives an overall approximation of the time complexity of the query.
But this will change from time to time with each version of neo4j, since the community can always change the implementation of a query to achieve better runtimes, structural changes, parallelization or lower RAM usage.
If what you are looking for is an indication of how well optimized a neo4j query is, db hits are a good indicator.
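As a rough sketch of the aggregation described above, the plan can be treated as a tree of operators and the per-operator db hits summed up. Everything here is an assumption for illustration: the `Operator` class is a hypothetical stand-in for whatever plan representation you extract from PROFILE, and the db hit numbers are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Operator:
    # Hypothetical stand-in for one node of a PROFILE plan tree.
    name: str
    db_hits: int
    children: List["Operator"] = field(default_factory=list)


def total_db_hits(op):
    # Sum the db hits of this operator and everything below it.
    return op.db_hits + sum(total_db_hits(child) for child in op.children)


def most_expensive(op):
    # Return the single operator with the highest db hit count.
    best = op
    for child in op.children:
        candidate = most_expensive(child)
        if candidate.db_hits > best.db_hits:
            best = candidate
    return best


# Invented numbers for a plan shaped like MATCH (a:A) MATCH (b:B).
plan = Operator("ProduceResults", 0, [
    Operator("CartesianProduct", 0, [
        Operator("NodeByLabelScan", 1001),
        Operator("NodeByLabelScan", 51),
    ]),
])

print(total_db_hits(plan))
print(most_expensive(plan).name)
```

Comparing these totals before and after a rewrite gives a relative measure on one dataset; it does not yield a big O bound.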

Related

Query optimization that collects and orders nodes on very large graph

I have a decently large graph (1.8 billion nodes and roughly the same number of relationships) where I am performing the following query:
MATCH (n:Article)
WHERE n.id IN $pmids
MATCH (n)-[:HAS_MENTION]->(m:Mention)
WITH n, collect(m) as mentions
RETURN n.id as pmid, mentions
ORDER BY pmid
where $pmids is a list of strings, e.g. ["1234", "4567"], whose length varies from 100 to 500.
I am currently holding the data in a neo4j docker community instance with the following conf modifications: NEO4J_dbms_memory_pagecache_size=32G, NEO4J_dbms_memory_heap_max__size=32G. An index has been created on Article.id.
This query has been quite slow to run (roughly 5 seconds) and I would like to optimize it for a faster runtime. As part of my work, I have access to neo4j enterprise, so one approach would be to ingest this data as part of a neo4j enterprise account where I can tweak advanced configuration settings.
In general, does anyone have any tips on how I may improve performance, whether by optimizing the cypher query itself, increasing workers or changing other settings in neo4j.conf?
Thanks in advance.
For anyone interested: I posed this question in the neo4j forums as well, and there have already been some interesting optimization suggestions (especially around the "type hint" to trigger backward-indexing, and using pattern comprehension instead of collect()).
Initial thoughts
You are using a string field to store the PMID, but PMIDs are numeric. Storing them as int (and indexing and searching them as int) might reduce the database size and possibly perform better.
If the PMID list is usually large and the server has more than half a dozen cores, it might be worth looking into the apoc parallel cypher functions.
Do you really need every property from the Mention nodes? If not, try gathering just what you need.
What is the size of the database in GB? (Some context is required in terms of memory settings.) And what did neo4j-admin memrec recommend?
If this is how the db is always used, all the time, a SQL database might be better; when building that SQL db, collect the mentions into one field (once and done).
Note: Go PubMed!

Cypher does not use NodeIndexSeek without hint

I add relationships with an UNWIND query (neo4j 3.4.7, 30 GB heap, 30 GB page cache):
UNWIND { rels } AS rel
MATCH (a:Locus), (b:Snp)
WHERE a.chr = rel.start_chr AND a.start = rel.start_start AND a.end = rel.start_end AND a.ref = rel.start_ref AND b.sid = rel.end_sid
CREATE (a)-[r:TEST_MAPS]->(b)
SET r = rel.properties
Here are example parameters:
:param rels => [{start_chr: '6', start_start: 93922926, start_end: 93922926, start_ref: 'h37', end_sid: 'rs782706', properties: {source: 'binder_immuno', uuid: 'e2ee1287-9894-4eb4-8ba8-d8adc4959e50'}}]
The properties are indexed with :Snp(sid) and :Locus(chr, start, end, ref).
Problem: Adding relationships is very slow.
When I create the relationships, the query planner uses a fast NodeIndexSeek on a:Locus but uses a much slower NodeIndexScan on b:Snp (at least one order of magnitude slower).
The selection of the planner seems to depend on the Labels which are used, i.e. adding relationships the same way with other labels was fast and used NodeIndexSeek only.
I know that I can force the planner to use a seek on b:Snp. However, is there a way to tell Cypher to always do a seek when an index is available without changing the query?
Cypher makes no guarantees about how information will be retrieved. The executed plan will vary based on what version of Neo4j (Planner) you are running, and the internal DB statistics at the time of planning.
This is the reason Cypher has hints at all. Sometimes the internal statistics will deceive the planner into deciding on a less optimal plan.
One way you might be able to get the results you want is to inline property matches where you can, for example MATCH (a:Locus), (b:Snp{sid:rel.end_sid}). This isn't guaranteed to change the final plan, but moving as much of the WHERE into the MATCH part as you can seems to usually get better plans. (This applies to more complex queries; for simpler ones, there will be no difference. Mileage will vary based on what version of Neo4j you are running.)

Neo4j Isochrones improvement

I am currently working with isochrones on Neo4j and PostGIS.
My problem in neo4j is that my query for calculating isochrones is not really efficient.
match (n:node) where n.id_gis='155'
with n
match path=(n)-[*0..15]-(e)
with e, min(reduce(cost=0.0, r IN rels(path) | cost + toFloat(r.cost2)*3600)) as cost
where cost < 30
return cost, collect(e) as isochrones
order by cost
As you can see in the code above, I currently have a limit on the maximum number of hops, because otherwise it will search all possible paths in my database before calculating the max cost.
Does anyone have an idea how I can change or improve my query so that it executes in a "normal" time without limiting the number of relationships?
I'm afraid you can't do better using Cypher.
You can however do much better using the traversal framework in Java. See this 2-part blog post comparing the Cypher implementation of Dijkstra's algorithm (similar problem, just with fixed start and end nodes instead of open ended) with APOC's: part 1 and part 2.
Using a traversal, you can compute the cost while traversing, which allows you to:
prune the branch as soon as the maximum cost is reached
collect the best cost so far for an end node, and prune the branch if the cost of the current branch is worse than the best cost for the same end node
You can use a Map<Node, Double> to collect the costs, though if your graph is large enough, using a primitive map (such as fastutil, Trove, HPPC, Koloboke, etc.) will use far less memory and be generally faster (more compact, less boxing and unboxing; just use the node's long id as the key).
Once the traversal has completed, you just have to invert the map into a multimap of costs and related end nodes to get your result.
It's more complex than executing a Cypher query, but it'll be much, much faster.
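The pruning idea described above can be sketched outside Neo4j. The following is a minimal illustration in Python on a plain in-memory adjacency dict, not the Java traversal framework itself; the graph, node names and costs are invented, and edges are listed in both directions to mimic the undirected match in the question.

```python
import heapq
from collections import defaultdict


def isochrone(graph, start, max_cost):
    # best[node] is the cheapest cost found so far to reach node.
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue  # stale entry: a cheaper route to node is already known
        for neighbour, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost >= max_cost:
                continue  # prune: this branch has reached the maximum cost
            if new_cost < best.get(neighbour, float("inf")):
                best[neighbour] = new_cost
                heapq.heappush(queue, (new_cost, neighbour))
    return best


def invert(best):
    # Turn {node: cost} into {cost: [nodes]}, ordered by cost.
    by_cost = defaultdict(list)
    for node, cost in best.items():
        by_cost[cost].append(node)
    return dict(sorted(by_cost.items()))


# Hypothetical road graph: node -> [(neighbour, cost), ...]
graph = {
    "a": [("b", 10.0), ("c", 25.0)],
    "b": [("a", 10.0), ("c", 10.0), ("d", 15.0)],
    "c": [("a", 25.0), ("b", 10.0)],
    "d": [("b", 15.0), ("e", 10.0)],
    "e": [("d", 10.0)],
}

reachable = isochrone(graph, "a", 30.0)
print(invert(reachable))
```

Because a branch is dropped as soon as its cost reaches the limit, no hop limit is needed: the cost budget itself bounds the search.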

Better Way to remove cycles from a path in neo4j graph

I am using neo4j graph database version 2.1.7. Brief Details around data:
2 million nodes with 6 different type of nodes, 5 million relationships with only 5 different type of relationships and mostly connected graph but contains a few isolated subgraphs.
While resolving paths, I get cycles in the path. To prevent that, I used the solution shared in the question below:
Returning only simple paths in Neo4j Cypher query
Here is the query I am using:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(EXTRACT (p in NODES(path)| p.key)) > 1)
and ((exists ((c)-[:rel5]->(b)) and (not exists((b)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (b)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not exists ((c)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
The above query meets my requirement but is not cost effective, and it keeps running when run against a huge subgraph. I have used the PROFILE command to improve query performance from what I started with, but I am now stuck at this point. The performance has improved, but it is not what I expected from neo4j :(
I don't know that I have a solution, but I have a number of suggestions. Some might speed things up, some might just make the query easier to read.
Firstly, rather than putting exists ((c)-[:rel5]->(b)) in your WHERE, I believe you can put it in your MATCH like this:
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA), (c)-[:rel5]->(b)
I don't think you need the exists keyword. I think you can just say, for example, (NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA))
I'd also suggest thinking about the WITH clause for potential performance improvements.
A couple of notes about your variable paths: In *0.. the 0 means that you're potentially looking for a self-reference. That may or may not be what you want. Also, leaving the variable path open ended can often cause performance problems (as I think you're seeing). If you can possibly cap it, that may help.
Also, if you upgrade to 2.2.1, there are a number of built-in performance improvements with the 2.2.x line, and you also get visual PROFILEing in the console and a new EXPLAIN command, which shows the planned execution without running the query, whereas PROFILE runs the query and reports what it actually did.
One thing to consider too is that I don't think you're hitting performance boundaries of Neo4j but rather, perhaps, some boundaries of Cypher. If so, I might suggest you do your querying with the Java APIs that Neo4j provides for better performance and more control. This can either be via embedding your database if you're using a JVM-compatible language, or by writing an unmanaged extension, which lets you do your own querying in Java but provide a custom REST API from the server.
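To illustrate the kind of control a hand-written traversal gives, here is a minimal sketch in Python on an in-memory adjacency dict (not the Neo4j Java API, and not a description of how Cypher evaluates the query). It refuses to step onto a node that is already on the current path, so cycles are never built in the first place, and it caps the depth. The graph and node names are invented.

```python
def simple_paths(graph, start, max_depth):
    """Yield every cycle-free path from start with 1..max_depth relationships."""

    def walk(path, on_path):
        if len(path) - 1 == max_depth:
            return  # depth cap reached: do not expand further
        for nxt in graph.get(path[-1], []):
            if nxt in on_path:
                continue  # stepping here would revisit a node, i.e. a cycle
            path.append(nxt)
            on_path.add(nxt)
            yield list(path)
            yield from walk(path, on_path)
            path.pop()
            on_path.remove(nxt)

    yield from walk([start], {start})


# Hypothetical directed graph containing the cycle a -> b -> c -> a.
graph = {
    "a": ["b", "c"],
    "b": ["c", "a"],
    "c": ["a", "d"],
    "d": [],
}

for path in simple_paths(graph, "a", 3):
    print(path)
```

Checking membership in a set while walking costs constant time per step, whereas filtering finished paths for duplicate nodes, as the WHERE ALL(...) clause does, compares every node of a path against every other node.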
I made a couple more tweaks to my query as suggested above by Brian and found an improvement in query response time. It now takes almost 20% of the execution time of my original query, and the current query makes almost 60% fewer db hits during execution than the query I shared earlier. Please find the updated query below:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*1..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(path) > 0)
and ((exists ((c)-[:rel5]->(b)) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
I also observed a dramatic improvement when I capped the path from *1.. to *1..15. I also removed one filter from the query, which was taking a long time as well.
However, the query response time increased when querying nodes that have relationships more than 18-20 levels deep.
I would advise using the PROFILE command often to find pain points in your query. That will help you resolve the issues faster.
Thanks Brian.

neo4j cypher performance with multiple start nodes

http://console.neo4j.org/r/8mkc4z
In the graph above, the purpose of the query
start n=node(1) match n-[:KNOWS]->m-[:KNOWS]->p where p.name='Cypher' return n, m, p
is to find m, such that Neo knows m and m knows Cypher.
The same could be achieved by the following query too -
start n=node(1), p=node(4) match n-[:KNOWS]->m-[:KNOWS]->p return n, m, p
The first one uses where condition and second one uses multiple start nodes.
From a performance perspective, which one should run faster, and in what scenarios?
I have faced performance issues with multiple start nodes, whereas I think that, logically, having it as a start node rather than a where condition should be faster.
Are there any rules on which approach to use in different scenarios?
So far we've worked on cypher the language, adding updating features in 1.8.
In Neo4j 1.9 we will focus on cypher performance.
So far, pattern matchers with a single start point are faster than ones with multiple start points. Still, if the filtering is done only after the fact (like in your first query), they may perform slower (depending on the result volume).
But that will change in the course of the next release. I think the best tip I can give you so far is to profile the queries with your realistic datasets (write data generators if you don't have the expected data yet).

Resources