I am currently working with isochrones on Neo4j and PostGIS.
My problem in Neo4j is that my query for calculating isochrones is not very efficient:
MATCH (n:node) WHERE n.id_gis = '155'
WITH n
MATCH path = (n)-[*0..15]-(e)
WITH e, min(reduce(cost = 0.0, r IN rels(path) | cost + toFloat(r.cost2) * 3600)) AS cost
WHERE cost < 30
RETURN cost, collect(e) AS isochrones
ORDER BY cost
As you can see in the query above, I currently limit the maximum number of hops, because otherwise it searches all possible paths in my database before applying the cost limit.
Does anyone have an idea how I can change/improve my query so that it executes in a "normal" time, without limiting the number of relationships?
I'm afraid you can't do better using Cypher.
You can however do much better using the traversal framework in Java. See this 2-part blog post comparing the Cypher implementation of Dijkstra's algorithm (similar problem, just with fixed start and end nodes instead of open ended) with APOC's: part 1 and part 2.
Using a traversal, you can compute the cost while traversing, which allows you to:
prune the branch as soon as the maximum cost is reached
collect the best cost so far for an end node, and prune the branch if the cost of the current branch is worse than the best cost for the same end node
You can use a Map<Node, Double> to collect the costs, though if your graph is large enough, using a primitive map (such as fastutil, Trove, HPPC, Koloboke, etc.) will use far less memory and be generally faster (more compact, less boxing and unboxing; just use the node's long id as the key).
Once the traversal has completed, you just have to invert the map into a multimap of costs to end nodes to get your result.
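To make that bookkeeping concrete, here is a minimal Python sketch of the logic (this is not the Neo4j Java traversal API; the graph, edge costs, and node ids are made up): a uniform-cost traversal that prunes any branch once the maximum cost is reached, keeps the best cost per end node, and finally inverts the map into a cost -> nodes multimap.

```python
import heapq
from collections import defaultdict

def isochrone_costs(adj, start, max_cost):
    """Uniform-cost traversal that prunes any branch whose accumulated
    cost reaches max_cost, keeping the cheapest cost per reached node
    (the role of the Map<Node, Double> described above)."""
    best = {start: 0.0}            # node -> cheapest cost found so far
    queue = [(0.0, start)]
    while queue:
        cost, node = heapq.heappop(queue)
        if cost > best.get(node, float("inf")):
            continue               # a cheaper path to this node was already found
        for nxt, edge_cost in adj.get(node, []):
            new_cost = cost + edge_cost
            if new_cost >= max_cost:
                continue           # prune: maximum cost reached on this branch
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt))
    return best

def invert_to_multimap(best):
    """Invert node -> cost into cost -> [nodes] for the final result."""
    by_cost = defaultdict(list)
    for node, cost in best.items():
        by_cost[cost].append(node)
    return dict(by_cost)
```

In Java you would do the same with the traversal framework's evaluators and, as noted above, a primitive long-keyed map instead of the Python dict.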
It's more complex than executing a Cypher query, but it'll be much, much faster.
Related
I add relationships with an UNWIND query (neo4j 3.4.7, 30 GB heap, 30 GB page cache):
UNWIND { rels } AS rel
MATCH (a:Locus), (b:Snp)
WHERE a.chr = rel.start_chr AND a.start = rel.start_start AND a.end = rel.start_end AND a.ref = rel.start_ref AND b.sid = rel.end_sid
CREATE (a)-[r:TEST_MAPS]->(b)
SET r = rel.properties
Here are example parameters:
:param rels => [{start_chr: '6', start_start: 93922926, start_end: 93922926, start_ref: 'h37', end_sid: 'rs782706', properties: {source: 'binder_immuno', uuid: 'e2ee1287-9894-4eb4-8ba8-d8adc4959e50'}}]
The properties are indexed with :Snp(sid) and :Locus(chr, start, end, ref).
Problem: Adding relationships is very slow.
When I create the relationships, the query planner uses a fast NodeIndexSeek on a:Locus but uses a much slower NodeIndexScan on b:Snp (at least one order of magnitude slower).
The planner's choice seems to depend on the labels used, i.e. adding relationships the same way with other labels was fast and used NodeIndexSeek only.
I know that I can force the planner to use a seek on b:Snp. However, is there a way to tell Cypher to always do a seek when an index is available without changing the query?
Cypher makes no guarantees about how information will be retrieved. The executed plan will vary based on what version of Neo4j (Planner) you are running, and the internal DB statistics at the time of planning.
This is the reason Cypher has hints at all. Sometimes the internal statistics will deceive the planner into deciding on a less optimal plan.
One way you might be able to get the plan you want is to inline property matches where you can, e.g. MATCH (a:Locus), (b:Snp {sid: rel.end_sid}). This isn't guaranteed to change the final plan, but moving as much of the WHERE into the MATCH as you can usually seems to produce better plans (for more complex queries; for simpler ones there will be no difference, and mileage will vary with the version of Neo4j you are running).
I need to measure the performance of any query, for example:
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried the PROFILE and EXPLAIN clauses, but they only return the number of db hits. Is it possible to get Big O notation for a Neo4j query, i.e. can we measure performance in terms of Big O notation? Are there any other ways to check query performance apart from using PROFILE and EXPLAIN?
No, you cannot convert a Cypher query to Big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher query into an executable plan, using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher query.
A very simple example of this is the Cypher query MATCH (a {id:1})-[*0..25]->(b) RETURN DISTINCT b. On a fairly average connected graph with cycles, running it against Neo4j 3.1.1 will time out for being too complex (because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (because the planner recognizes it only needs to do a graph scan, like a depth-first search, to find all connected nodes).
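The difference between those two strategies can be illustrated outside Neo4j with a toy Python sketch (a hypothetical 4-node graph containing a cycle): enumerating every cycle-free path grows combinatorially, while a depth-first scan visits each node at most once.

```python
def all_paths(adj, node, visited=()):
    """Enumerate every cycle-free path from node
    (roughly what the 3.1 plan was doing)."""
    yield visited + (node,)
    for nxt in adj.get(node, []):
        if nxt not in visited + (node,):
            yield from all_paths(adj, nxt, visited + (node,))

def reachable(adj, start):
    """Depth-first scan: each node is expanded once
    (roughly the 3.2 behaviour for DISTINCT reachability)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, []))
    return seen
```

On larger cyclic graphs the number of paths explodes exponentially while the number of reachable nodes stays bounded by the graph size, which is why the newer plan returns quickly on the same data.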
As a side note, you can argue for Big O notation on the returned results. For example, MATCH (a), (b) must have a minimum complexity of n·m, because the result is a Cartesian product and the execution can't be less complex than the answer. This understanding of how complexity affects row counts can help you write Cypher queries that reduce the amount of work the planner ends up planning.
For example, using WITH COLLECT(n) AS data MATCH (c:M) reduces the number of rows the planner works against before the next part of the query from n·m (first match count times second match count) to m (1 times second match count).
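A rough Python sketch of that row arithmetic (the row contents are made up; only the counts matter):

```python
# Rows as the planner sees them: each MATCH multiplies the current
# row count by its own match count.
nodes_n = ["n1", "n2", "n3"]        # first MATCH: 3 rows
nodes_m = ["m1", "m2"]              # second MATCH: 2 matches

# Without collecting: every n-row is combined with every m-row -> n*m rows.
rows_without = [(n, m) for n in nodes_n for m in nodes_m]

# WITH COLLECT(n) AS data squashes the n-rows into a single row first,
# so the second MATCH only multiplies 1 * m.
collected = [nodes_n]               # one row holding the whole collection
rows_with = [(data, m) for data in collected for m in nodes_m]
```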
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cypher queries that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate whether the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this sort is readily available, but it can be derived or approximated with some additional effort.
On profiling a query, we get the list of operators that Neo4j will run to achieve the desired result.
Each of these operators is associated with worst- to best-case complexities in theory, and some of them can run in parallel, which affects runtimes depending on the number of cores your server has.
For example, MATCH (a:A) MATCH (b:B) results in a Cartesian product, which is O(count(a) * count(b)).
Similarly, each operator in your query plan has such a time complexity.
Aggregating the individual time complexities of these operators gives you an overall approximation of the time complexity of the query.
But this will change over time with each version of Neo4j, since the implementation of an operator can always be changed to achieve better runtimes, structural changes, parallelization, or lower RAM usage.
If what you are looking for is an indication of how well a Neo4j query is optimized, db hits are a good indicator.
I am currently trying to use Neo4j to perform a complex query (similar to a shortest-path search, except that I have very unusual conditions applied to the search, such as a minimum path length in terms of the number of nodes traversed).
My dataset contains around 2.5M nodes of a single type and around 1.5 billion edges (also of a single type). Each node has on average 1000 directed relationships to a "next" node.
So far, I have a query that retrieves this shortest path given all of my conditions, but the only way I found to get a decent response time (under one second) is to limit the number of results after each new node is added to the path, filter them, order them, and then proceed to the next node (a kind of greedy algorithm, I suppose).
I'd like to limit them much less than I do in order to yield more paths as a result, but the problem is the exponential complexity of this search, which makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
That being said, I am currently evaluating several solutions to speed up the query, but I'm quite unsure of the results they will yield, as I'm not sure how Neo4j really stores my data internally.
The solution I am currently considering is to add a property to my relationships: an integer between 1 and 15, because I will usually query only relationships that have one or two specific values for this property (e.g. only relationships with this property equal to 8 or 9).
As far as I can tell, for each relationship Neo4j currently has to fetch the end node's properties and use them to apply my filters, which takes a very long time when crossing 4-node-long paths with 1000 relationships each (I guess O(1000^4)). Am I right?
With relationship properties, will it have direct access to them without further data fetching? Is there any chance this will make my queries faster? How are Neo4j edge properties stored?
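As a back-of-the-envelope check (hypothetically assuming the property values are uniformly distributed over 1..15), filtering relationships by such a property shrinks the search space dramatically at every expansion step:

```python
branching = 1000       # average out-degree from the question
depth = 4              # path length being crossed

# Unfiltered: every relationship is expanded at every step.
paths_unfiltered = branching ** depth            # 1000^4 = 10^12 candidates

# Keeping only relationships whose property is one of 2 values out of 15
# (uniformity is an assumption) keeps ~2/15 of the edges at each step.
keep_fraction = 2 / 15
paths_filtered = (branching * keep_fraction) ** depth
```

Whether Neo4j can apply that filter without touching the node store is exactly the storage question asked above; the arithmetic only shows why the question is worth asking.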
UPDATE
Following @logisima's advice, I wrote a procedure directly with the Java traversal API of Neo4j. I then switched to the raw Java procedure API of Neo4j to leverage even more power and flexibility, as my use case required it.
The results are really good: the lower-bound complexity is overall a little less than it was before, but the upper bound is about ten times faster, and when at least some of the nodes used in the traversal are in Neo4j's cache, the performance becomes astonishing (depth 20 in less than a second for one of my tests, when I usually only need depth 4).
But that's not all. The procedure is very easily customisable while keeping performance at its best and optimizing every single operation. The result is that I can use far more powerful filters in far less computing time, and can easily update my procedure to add new features. Last but not least, procedures are very easy to plug into spring-data for Neo4j (which I use to connect Neo4j to my HTTP API). Whereas with Cypher I would have had to auto-generate the queries (being very complex, it took about 30 Java classes to do properly) and use JDBC for Neo4j while handling a separate connection pool just for this request. I cannot recommend the awesome Neo4j Java API enough.
Thanks again @logisima.
If you're trying to implement a custom shortest-path algorithm, then you should write a Cypher procedure with the traversal API.
The principle of Cypher is pattern matching, whereas you want to traverse the graph in a specific way to find your solution.
The response time should be much faster for your use case!
I have a GraphAware time tree and spatial r tree set up to reference a large set of nodes in my graph. I am trying to search these records by time and space.
Individually I can gather results from these queries in about 5 seconds:
WITH
({start:1300542000000,end:1350543000000}) as tr
CALL ga.timetree.events.range(tr) YIELD node as n
RETURN count(n);
> ~ 500000 results
WITH
({lon:120.0,lat:20.0}) as smin, ({lon:122.0,lat:21.0}) as smax
CALL spatial.bbox('spatial_records', smin, smax) YIELD node as n
RETURN count(n);
> ~ 30000 results
When I try to filter these results the performance drops drastically. Neo4j is already using a large amount of memory on my system, so I am under the impression that the memory footprint of this command is too much for my system and that the query will never finish. (I am using the neo4j-shell to run these commands.)
WITH
({start:1300542000000,end:1350543000000}) as tr,
({lon:120.0,lat:20.0}) as smin, ({lon:122.0,lat:21.0}) as smax
CALL ga.timetree.events.range(tr) YIELD node as n
CALL spatial.bbox('spatial_records', smin, smax) YIELD node as m
WITH COLLECT(n) as nn, COLLECT(m) as mm
RETURN FILTER(x IN nn WHERE x IN mm);
I am wondering what the best way to efficiently filter the results of these two statement calls is. I attempted to use the REDUCE clause, but couldn't quite figure out the syntax.
As a side question, given that this is the most common type of query that I will issue to my database, is this a good way to do things (as in using the time tree and r tree referencing the same set of nodes)? I haven't found any other tools in neo4j that support indexing both space and time in a single structure, so this is my current implementation.
The first procedure returns you 500k nodes, and collecting is a costly operation, so yeah this would be very memory heavy.
I would start from whichever call returns the fewest nodes, and then use Cypher rather than a procedure, so here I would replace the call to the time tree procedure with a range filter in Cypher.
Assuming you have an indexed timestamp property on your nodes:
CALL spatial.bbox('spatial_records', smin, smax) YIELD node as m
WITH m
WHERE m.timestamp > 1300542000000 and m.timestamp < 1350543000000
RETURN m
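The shape of that strategy, sketched in Python with hypothetical nodes (only the ~30k spatial hits are materialized, then filtered by the timestamp range, instead of collecting both result sets):

```python
# Hypothetical stand-ins for the bbox result: each node carries the
# indexed timestamp property the Cypher filter would use.
spatial_hits = [
    {"id": 1, "timestamp": 1300542000500},
    {"id": 2, "timestamp": 1299000000000},   # outside the time range
    {"id": 3, "timestamp": 1350000000000},
]

start, end = 1300542000000, 1350543000000

# One pass over the smaller set: no 500k-node collection, no
# quadratic membership test between two collections.
in_range = [n for n in spatial_hits if start < n["timestamp"] < end]
```

Compared to COLLECTing both result sets and intersecting them, this does O(m) work on the ~30k spatial hits instead of materializing 500k+ nodes in memory.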
I wouldn't recommend removing the time tree (otherwise I would be fired <- joke). In some time-query cases the time tree will outperform range queries, especially when the resolution is high (milliseconds) and you have a lot of very close consecutive timestamps.
Otherwise you seem to have a very good use case. It would be nice if you could share more details on the Neo4j Slack or privately (christophe at graphaware dot com); this could help Neo4j and GraphAware support more of this via procedures (such as passing a collection of nodes and filtering out those not in the range, or a smoother combination with spatial), as long as it is generic enough.
In the meantime, since you are using open-source products, you could easily create a procedure that combines the two procedures for your specific use case.
http://console.neo4j.org/r/8mkc4z
In the graph above, the purpose of the query
start n=node(1) match n-[:KNOWS]->m-[:KNOWS]->p where p.name='Cypher' return n, m, p
Is to find m, such that Neo knows m and m knows Cypher.
The same could be achieved by the following query too -
start n=node(1), p=node(4) match n-[:KNOWS]->m-[:KNOWS]->p return n, m, p
The first one uses a WHERE condition and the second one uses multiple start nodes.
From a performance perspective, which one should run faster, and in what scenarios?
I have faced performance issues with multiple start nodes, whereas I would think that, logically, having it as a start node rather than a WHERE condition should be faster.
Are there any rules on which approach to use in different scenarios?
So far we've worked on Cypher the language, adding updating features in 1.8.
In Neo4j 1.9 we will focus on Cypher performance.
So far, pattern matches with a single start point are faster than ones with multiple start points. Still, if the filtering is done only after the fact (as in your first query), they may perform slower (it depends on the result volume).
But that will change in the course of the next release. I think the best tip I can give you so far is to profile the queries with your realistic datasets (write data generators if you don't have the expected data yet).