Is size() still faster than EXISTS in neo4j? - neo4j

I found this guide on query tuning for Neo4j 2.2, and one of the tips in the guide is that when finding whether a relationship exists, this query:
size((n)-[:DIRECTED]->()) <> 0
is faster than this query:
EXISTS((n)-[:DIRECTED]->())
To me, it seems counter-intuitive that finding the total number of relationships is faster than determining whether the relationship exists at all. My question is- has EXISTS been optimized in later versions of Neo4j so that this tip is no longer necessary? And if not, what is the difference between these two functions that enables size() to be so much faster?

The Cypher query planners undergo continuous improvement, and the cited performance difference (that existed in neo4j 2.2) no longer exists.
For example, using PROFILE in neo4j 3.4.1, these 2 queries now produce essentially the same efficient execution plan (using the degree count):
PROFILE
MATCH (n:Person) WHERE SIZE((n)-[:DIRECTED]->()) > 0
RETURN count(*);
PROFILE
MATCH (n:Person) WHERE EXISTS((n)-[:DIRECTED]->())
RETURN count(*);

Related

Complexity of a neo4j query

I need to measure the performance of any query.
for example :
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried profile & explain clauses but they only return number of db hits. Is it possible to get big notations for a neo4j query ie cn we measure performance in terms of big O notations ? Are there any other ways to check query performance apart from using profile & explain ?
No, you cannot convert a Cypher to Big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher into an executable query (using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher.)
A very simple example of this is the Cypher Cypher 3.1 MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. Using a fairly average connected graph with cycles, running against Neo4j 3.1.1 will time out for being too complex (Because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (Because the Planner recognizes it only needs to do a graph scan like depth first search to find all connected nodes).
Side note, you can argue for BIG O notation on the return results. For example MATCH (a), (b) must have a minimum complexity of n^2 because the result is a Cartesian product, and execution can't be less complex then the answer. This understanding of how complexity affects row counts can help you write Cyphers that reduce the amount of work the Planner ends up planning.
For example, using WITH COLLECT(n) as data MATCH (c:M) to reduce the number of rows the Planner ends up doing work against before the next part of a Cypher from nm (first match count times second match count) to m (1 times second match count).
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cyphers that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate if the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this is sort is readily available. But it can be derived/approximated with some additional effort.
On profiling a query, we get a list of functions that neo4j will run to achieve the desired result.
Each of this function will be associated with the worst to best case complexities in theory. And some of them will run in parallel too. This will impact runtimes, depending on the cores that your server has.
For example match (a:A) match (a:B) results in Cartesian product. And this will be of O(count(a)*count(b))
Similarly each function of in your query-plan does have such time complexities.
So aggregations of this individual time complexities of these functions will give you an overall approximation of time-complexity of the query.
But this will change from time to time with each version of neo4j since they community can always change the implantation of a query or to achieve better runtimes / structural changes / parallelization/ less usage of ram.
If what you are looking for is an indication of the optimization of neo4j query db-hits is a good indicator.

Neo4j Cypher find all paths exploring sorted relationships

I'm struggling for days to find a way for finding all paths (to a maximum length) between two nodes while controlling the path exploration by Neo4j by sorting the relationships that are going to be explored (by one of their properties).
So to be clear, lets say I want to find K best paths between two nodes until a maximum length M. The query will be like:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[*..M]->(target)
return p order by length(p) limit K;
So far so good. But lets say the relationships of the path have a property called "priority". What I want is to write a query that tells Neo4j on each step of path exploration which relationships should be explored first.
I know that can be possible when I use the java libraries and an embedded database (By implementing PathExpander interface and giving it as input to the GraphAlgoFactory.allSimplePaths() function in Java).
But now I'm trying to find a way doing this in a server mode database access using Bolt or REST api.
Is there any way to do this in the server mode? Or maybe using Java libraries functions while accessing the graph in server mode?
use labels and an index to find your two start-nodes
perhaps consider allShortestPaths to make it faster
try this:
match (source{name:"source"}), (target{name:"target"}),
p = (source)-[rels:*..20]->(target)
return p, reduce(prio=0, r IN rels | prio + r.priority) as priority
order by priority ASC, length(p)
limit 100;
I had a very similar problem. I was trying to find the shortest path from one node to all other nodes. I had written a query similar to the one in the answer above (https://stackoverflow.com/a/38030536/783836) and couldn't get it to perform in any reasonable time.
Asking Can Graph DBs perform well with unspecified end nodes? pointed me to the solution: the Single Shortest Path algorithm.
In Neo4j you need to install the Graph Data Science Library and make use of this function: gds.alpha.shortestPath.deltaStepping.stream

Better Way to remove cycles from a path in neo4j graph

I am using neo4j graph database version 2.1.7. Brief Details around data:
2 million nodes with 6 different type of nodes, 5 million relationships with only 5 different type of relationships and mostly connected graph but contains a few isolated subgraphs.
While resolving paths, i get cycles in path. And to restrict that, i used the solution shared in below:
Returning only simple paths in Neo4j Cypher query
Here is the Query, i am using:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(EXTRACT (p in NODES(path)| p.key)) > 1)
and ((exists ((c)-[:rel5]->(b)) and (not exists((b)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (b)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not exists ((c)-[:rel1|rel2|rel3|rel4]->(:nodeA)) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
The above query solves mine requirement but is not cost effective and keeps running if is run for huge subgraph. I have used 'Profile' command to improve query performance from what i started with. But, now stuck at this point. The performance has improved but, not what i expected from neo4j :(
I don't know that I have a solution, but I have a number of suggestions. Some might speed things up, some might just make the query easier to read.
Firstly, rather than putting exists ((c)-[:rel5]->(b)) in your WHERE, I believe you can put it in your MATCH like this:
MATCH path = n-[:rel1|rel2|rel3|rel4*0..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA), (c)-[:rel5]->(b)
I don't think you need the exists keyword. I think you can just say, for example, (NOT (b)-[:rel1|rel2|rel3|rel4]->(:nodeA))
I'd also suggest thinking about the WITH clause for potential performance improvements.
A couple of notes about your variable paths: In *0.. the 0 means that your potentially looking for a self-reference. That may or may not be what you want. Also, leaving the variable path open ended can often cause performance problems (as I think you're seeing). If you can possibly cap it that may help.
Also, if you upgrade to 2.2.1, there are a number of built-in performance improvements with the 2.2.x line, but you also get visual PROFILEing in the console and a new EXPLAIN command which both profiles and tells you the real performance of the query after running it.
One thing to consider too is that I don't think you're hitting performance boundaries of Neo4j but rather, perhaps, you're potentially hitting some boundaries of Cypher. If so, I might suggest you do your querying with the Java APIs that Neo4j provides for better performance and more control. This can either be via embedding your database if you're using a JVM-compatible language or by writing an unmanaged extension which lets you do your own querying in java but provide a custom REST API from the server
Did a couple of more tweaks to my query as suggested above by Brian. And found improvement in query response time. Now, It takes almost 20% of time in execution compared to my original query and the current query makes almost 60% less db hits, compared to the query i shared earlier, during query execution. PFB the updated query:
MATCH (n:nodeA{key:905728})
MATCH path = n-[:rel1|rel2|rel3|rel4*1..]->(c:nodeA)-[:rel5*0..1]->(b:nodeA)
WHERE ALL(a in nodes(path) where 1=length (filter (m in nodes(path) where m=a)))
and (length(path) > 0)
and ((exists ((c)-[:rel5]->(b)) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x))))
OR (not exists ((c)-[:rel5]->()) and (not ((c)-[:rel1|rel2|rel3|rel4]->()) OR ANY (x in nodes(path) where (c)-[]->(x)))))
RETURN distinct EXTRACT (rp in Rels(path)| type(rp)), EXTRACT (p in NODES(path)| p.key);
And observed dramatic improvement when capped the path from *1.. to *1..15. Also, removed one filter from query which too was taking longer time.
But, the query response time increased when queried on nodes having relationships more than 18-20 depths.
I would advise to use profile command oftenly to find pain points in your query. That would help you resolve the issues faster.
Thanks Brian.

How to get node's id with cypher request?

I'm using neo4j and making executing this query:
MATCH (n:Person) RETURN n.name LIMIT 5
I'm getting the names but i need the ids too.
Please help!
Since ID isn't a property, it's returned using the ID function.
MATCH (n:Person) RETURN ID(n) LIMIT 5
Not sure how helpful or relevant this is, but when I'm using the NodeJS API the record objects returned from Cypher queries have an identity field on the same level as the properties object (e.g record.get(0).properties, record.get(0).identity). I'm assuming you aren't just doing plain Cypher queries and actually using a driver to send the queries - so you might not have to run another MATCH statement.
I'm aware that the OP is asking about Cypher specifically - but it might be helpful to other users that stumble upon this question.
Or you can take a look on the Neo4j Cypher Refcard
You can get a short look to a lots of functions and patterns you can write.
And more about functions on The Neo4j Developer Manual - Chapter 3. Cypher - 3.4. Functions

Neo4j why picking a single node and a single edge take so long?

I am trying to test the speed of Neo4j, thus I created an empty database and then populate it with 10,000 users.
Now I run the following query
MATCH (n) RETURN id(n) LIMIT 1;
Surprisingly, it takes 1069 ms!
Then I run the following query (note: I haven't created any edges)
MATCH ()-[r]-() RETURN id(r) LIMIT 1;
which takes 1153ms!
Then I run
MATCH (n) RETURN id(n) SKIP 9900 LIMIT 100
which takes 10427ms.
Is it normal? I think those operations, at least the last one, is quite frequent in an app. I am using a Macbook Air with 1.7GHz Core i5
What version of Neo4j are you using?
How do you measure? The Neo4j browser measures multiple roundtrips for additional data.
Also is that the first or a subsequent query?
None of those queries should be that slow. Perhaps you can share your Neo4j configuration?
that one should be really fast
this one goes over all the nodes (or even over the cross product) in your graph and tries to find a relationship at the end it doesn't find any
that one should also be really fat.
Regarding your comment, if you know the first node, your search will be anchored and you don't have to scan all rels in the database.
MATCH (:User {name:"Han"})-[:FRIEND]->(friend)
RETURN friend

Resources