Cypher does not use NodeIndexSeek without hint - neo4j

I add relationships with an UNWIND query (neo4j 3.4.7, 30 GB heap, 30 GB page cache):
UNWIND { rels } AS rel
MATCH (a:Locus), (b:Snp)
WHERE a.chr = rel.start_chr AND a.start = rel.start_start AND a.end = rel.start_end AND a.ref = rel.start_ref AND b.sid = rel.end_sid
CREATE (a)-[r:TEST_MAPS]->(b)
SET r = rel.properties
Here are example parameters:
:param rels => [{start_chr: '6', start_start: 93922926, start_end: 93922926, start_ref: 'h37', end_sid: 'rs782706', properties: {source: 'binder_immuno', uuid: 'e2ee1287-9894-4eb4-8ba8-d8adc4959e50'}}]
The properties are indexed with :Snp(sid) and :Locus(chr, start, end, ref).
Problem: Adding relationships is very slow.
When I create the relationships, the query planner uses a fast NodeIndexSeek on a:Locus but uses a much slower NodeIndexScan on b:Snp (at least one order of magnitude slower).
The selection of the planner seems to depend on the Labels which are used, i.e. adding relationships the same way with other labels was fast and used NodeIndexSeek only.
I know that I can force the planner to use a seek on b:Snp. However, is there a way to tell Cypher to always do a seek when an index is available without changing the query?

Cypher makes no guarantees about how information will be retrieved. The executed plan will vary based on what version of Neo4j (Planner) you are running, and the internal DB statistics at the time of planning.
This is the reason Cypher has hints at all. Sometimes the internal statistics will deceive the planner into deciding on a less optimal plan.
One way you might be able to get the results you want is to inline property matches where you can. Like doing MATCH (a:Locus), (b:Snp{sid:rel.end_sid}). This isn't guaranteed to change the final plan, but moving as much of the WHERE into the MATCH part as you can seems to usually get better plans. (For more complex queries. For simpler ones, there will be no difference. Mileage will vary based on what version of Neo4j you are running.)

Related

Complexity of a neo4j query

I need to measure the performance of any query.
for example :
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried profile & explain clauses but they only return number of db hits. Is it possible to get big notations for a neo4j query ie cn we measure performance in terms of big O notations ? Are there any other ways to check query performance apart from using profile & explain ?
No, you cannot convert a Cypher to Big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher into an executable query (using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher.)
A very simple example of this is the Cypher Cypher 3.1 MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. Using a fairly average connected graph with cycles, running against Neo4j 3.1.1 will time out for being too complex (Because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (Because the Planner recognizes it only needs to do a graph scan like depth first search to find all connected nodes).
Side note, you can argue for BIG O notation on the return results. For example MATCH (a), (b) must have a minimum complexity of n^2 because the result is a Cartesian product, and execution can't be less complex then the answer. This understanding of how complexity affects row counts can help you write Cyphers that reduce the amount of work the Planner ends up planning.
For example, using WITH COLLECT(n) as data MATCH (c:M) to reduce the number of rows the Planner ends up doing work against before the next part of a Cypher from nm (first match count times second match count) to m (1 times second match count).
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cyphers that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate if the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this is sort is readily available. But it can be derived/approximated with some additional effort.
On profiling a query, we get a list of functions that neo4j will run to achieve the desired result.
Each of this function will be associated with the worst to best case complexities in theory. And some of them will run in parallel too. This will impact runtimes, depending on the cores that your server has.
For example match (a:A) match (a:B) results in Cartesian product. And this will be of O(count(a)*count(b))
Similarly each function of in your query-plan does have such time complexities.
So aggregations of this individual time complexities of these functions will give you an overall approximation of time-complexity of the query.
But this will change from time to time with each version of neo4j since they community can always change the implantation of a query or to achieve better runtimes / structural changes / parallelization/ less usage of ram.
If what you are looking for is an indication of the optimization of neo4j query db-hits is a good indicator.

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Cypher query slow when intermediate node labels are specified

I have the following Cypher query
MATCH (p1:`Article` {article_id:'1234'})--(a1:`Author` {name:'Jones, P'})
MATCH (p2:`Article` {article_id:'5678'})--(a2:`Author` {name:'Jones, P'})
MATCH (p1)-[:WRITTEN_BY]->(c1:`Author`)-[h1:HAS_NAME]->(l1)
MATCH (p2)-[:WRITTEN_BY]->(c2:`Author`)-[h2:HAS_NAME]->(l2)
WHERE l1=l2 AND c1<>a1 AND c2<>a2
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
On my local Neo4j server, running this query takes ~4 seconds and PROFILE shows >3 million db hits. If I don't specify the Author label on c1 and c2 (it's redundant thanks to the relationship labels), the same query returns the same output in 33ms, and PROFILE shows <200 db hits.
When I run the same two queries on a larger version of the same database that's hosted on a remote server, this difference in performance vanishes.
Both dbs have the same constraints and indexes. Any ideas what else might be going wrong?
Your query has a lot of unnecessary stuff in it, so first off, here's a cleaner version of it that is less likely to get misinterpreted by the planner:
MATCH (name:Name) WHERE NOT name.name = 'Jones, P'
WITH name
MATCH (:`Article` {article_id:'1234'})-[:WRITTEN_BY]->()-[h1:HAS_NAME]->(name)<-[h2:HAS_NAME]-()<-[:WRITTEN_BY]-(:`Article` {article_id:'5678'})
RETURN name.name, h1.distance + h2.distance
There's really only one path you want to find, and you want to find it for any author whose name is not Jones, P. Take advantage of your shared :Name nodes to start your query with the smallest set of definite points and expand paths from there. You are generating a massive cartesian product by stacking all those MATCH statements and then filtering them out.
As for the difference in query performance, it appears that the query planner is trying to use the Author label to build your 3rd and 4th paths, whereas if you leave it out, the planner will only touch the much narrower set of :Articles (fixed by indexed property), then expand relationships through the (incidentally very small) set of nodes that have -[:WRITTEN_BY]-> relationships, and then the (also incidentally very small) set of those nodes that have a -[:HAS_NAME]-> relationship. That decision is based partly on the predictable size of the various sets, so if you have a different number of :Author nodes on the server, the planner will make a smarter choice and not use them.

Neo4j and Cypher - How can I create/merge chained sequential node relationships (and even better time-series)?

To keep things simple, as part of the ETL on my time-series data, I added a sequence number property to each row corresponding to 0..370365 (370,366 nodes, 5,555,490 properties - not that big). I later added a second property and named it "outeseq" (original) and "ineseq" (second) to see if an outright equivalence to base the relationship on might speed things up a bit.
I can get both of the following queries to run properly on up to ~30k nodes (LIMIT 30000) but past that, its just an endless wait. My JVM has 16g max (if it can even use it on a windows box):
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.outeseq-1
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
or
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq=b.ineseq
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
I also added these in hopes of speeding things up:
CREATE CONSTRAINT ON (a:BOOK)
ASSERT a.outeseq IS UNIQUE
CREATE CONSTRAINT ON (b:BOOK)
ASSERT b.ineseq IS UNIQUE
I can't get the relationships created for the entire data set! Help!
Alternatively, I can also get bits of the relationships built with parameters, but haven't figured out how to parameterize the sequence over all of the node-to-node sequential relationships, at least not in a semantically general enough way to do this.
I profiled the query, but did't see any reason for it to "blow-up".
Another question: I would like each relationship to have a property to represent the difference in the time-stamps of each node or delta-t. Is there a way to take the difference between the two values in two sequential nodes, and assign it to the relationship?....for all of the relationships at the same time?
The last Q, if you have the time - I'd really like to use the raw data and just chain the directed relationships from one nodes'stamp to the next nearest node with the minimum delta, but didn't run right at this for fear that it cause scanning of all the nodes in order to build each relationship.
Before anyone suggests that I look to KDB or other db's for time series, let me say I have a very specific reason to want to use a DAG representation.
It seems like this should be so easy...it probably is and I'm blind. Thanks!
Creating Relationships
Since your queries work on 30k nodes, I'd suggest to run them page by page over all the nodes. It seems feasible because outeseq and ineseq are unique and numeric so you can sort nodes by that properties and run query against one slice at time.
MATCH (a:BOOK),(b:BOOK)
WHERE a.outeseq = b.outeseq-1
WITH a, b ORDER BY a.outeseq SKIP {offset} LIMIT 30000
MERGE (a)-[s:FORWARD_SEQ]->(b)
RETURN s;
It will take about 13 times to run the query changing {offset} to cover all the data. It would be nice to write a script on any language which has a neo4j client.
Updating Relationship's Properties
You can assign timestamp delta to relationships using SET clause following the MATCH. Assuming that a timestamp is a long:
MATCH (a:BOOK)-[s:FORWARD_SEQ]->(b:BOOK)
SET s.delta = abs(b.timestamp - a.timestamp);
Chaining Nodes With Minimal Delta
When relationships have the delta property inside, the graph becomes a weighted graph. So we can apply this approach to calculate the shortest path using deltas. Then we just save the length of the shortest path (summ of deltas) into the relation between the first and the last node.
MATCH p=(a:BOOK)-[:FORWARD_SEQ*1..]->(b:BOOK)
WITH p AS shortestPath, a, b,
reduce(weight=0, r in relationships(p) : weight+r.delta) AS totalDelta
ORDER BY totalDelta ASC
LIMIT 1
MERGE (a)-[nearest:NEAREST {delta: totalDelta}]->(b)
RETURN nearest;
Disclaimer: queries above are not supposed to be totally working, they just hint possible approaches to the problem.

neo4j cypher query optimization

With 500,000 nodes I'm getting 10-15 seconds, any idea how I can optimize this?
start n=node(*) WHERE HAS(n.score) RETURN n, n.score ORDER BY n.score DESC Limit 5;
from looking around I get the sense that the WHERE clause is slowing it down but I'm not sure how I can use a MATCH on a property of a node.
As Luanne says it takes time because your are searching in all the nodes of your graph.
You could search only in the nodes that has a score property (by indexing them, by searching them from a common node, or - if you're using Neo4j 2 - by labeling them)
See http://docs.neo4j.org/chunked/milestone/indexing.html for further explanations on indexes (which seems to be the more common solution).
With node(*) you're effectively touching your entire graph of 500,000 nodes to check the presence of a property, and the ordering the results. How many rows do you get back?
If you drop your order clause is it any faster?
And what's your use case? Wondering if you can model this differently to avoid a global graph operation. For example, index nodes with the score property, or create a relation from all nodes with the score property to some sort of reference node. Depends on your use case really

Resources