Neo4j index faster than relationships

I am using Neo4j 3.2.3 inside a Docker instance and need help interpreting the profiler.
I am importing relational data into Neo4j. Initially I imported a status column as a relationship to the Respondent node, i.e. (r:Respondent)-[:FROM]->(s:Status), where Status has a unique id corresponding to the status column in the flat file. Respondent nodes have a property visit_date, which is an integer in YYYYMMDD form. Using Status as a relationship, the query
MATCH (s:Status {id: 1})<-[:FROM]-(r:Respondent)
WHERE r.visit_date >= 20160101 and r.visit_date <= 20161231
RETURN COUNT(r)
first does a node index seek by range on the Respondent node's visit_date property (yielding 101,057 db hits, identical to the number of respondents with status equal to 1). There is an index on the visit_date property. Next, Neo4j expands all against the Status. This expansion does 303,168 db hits, equal to the number of all respondents, and a filter is applied to each. I would have expected the number of hits to be lower than for the first range seek; instead Neo4j fans out.
If I put the status as a property in Respondent and query:
MATCH (r:Respondent)
WHERE r.visit_date >= 20160101 and r.visit_date <= 20161231 and r.status =1
RETURN COUNT(r);
the range seek is done first on visit_date and then a filter is applied on status (only to the 101,057 respondents in the range). The query is faster and has fewer total db hits.
I am surprised by the results in the profiler (in particular the fan-out in the expand-all step after the Respondents have been narrowed by range). One caveat is that I am profiling on my laptop with 8 GB of DDR3 RAM, so only 4 GB is dedicated to the Docker container and 3072 MB to the heap, which doesn't leave much for page caching (500 MB). The speed difference between the queries might be due to a poor configuration, but the profile (i.e. expanding or not) shouldn't be affected by the Neo4j configuration. The question is: why is there a fan-out when Status is a relationship, when the range seek has already reduced the result set?
Update 1 (with explain images)
Added explain images (note: I couldn't figure out how people dump explains/profiles as text). It is odd that the query with an index hint on Status as a relationship is just as fast despite the actual hits being higher.
Matching Status relationship
Indexing status as property of Respondent
Matching Status relationship using index hint which is just as fast as indexing on status as a property

Related

Performance Issue with neo4j

I have a dataset on a virtual machine on my notebook:
2 million unique Customers [:VISITED] 40,000 unique Merchants.
Every [:VISITED] relationship has the properties amount (double) and dt (date).
Every Customer has the property "pty_id" (Integer).
Every Merchant has the property mcht_id (String).
One Customer may visit the same Merchant more than once and, of course, may visit many Merchants. So there are 43,978,539 relationships between Customers and Merchants in my graph.
I have created Indexes:
CREATE INDEX on :Customer(pty_id)
CREATE INDEX  on :Merchant(mcht_id)
Parameters of my VM are:
Oracle (RedHat) Linux 7 with 2 core i7, 2 GB RAM
Parameters of my Neo4j 3.5.7 config:
- dbms.memory.heap.max_size=1024m
- dbms.memory.pagecache.size=512m
My task is:
Get the top 10 Customers, ordered by total_amount, who did NOT spend money at a specified Merchant (M) but who visited Merchants that were also visited by Customers of the specified Merchant (M).
My solution is:
Let M have mcht_id = "0000000DA5".
Then the Cypher query will be:
MATCH
(c:Customer)-[r:VISITED]->(mm:Merchant)<-[:VISITED]-(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WHERE
NOT (c)-[:VISITED]->(m)
WITH
DISTINCT c as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
RETURN
uc.pty_id
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10;
The result is OK. I receive my answer:
uc.pty_id - v_amt: 1433798 - 348925.94; 739510 - 339169.83; 374933 - 327962.95 and so on.
The problem is that I received this result after 437613 ms! That's about 7 minutes! My estimate for this query was about 10-20 seconds.
My question is: what am I doing wrong?
There are a few things to improve here.
First, for graph-wide queries in a graph with millions of nodes and 50 million relationships, 1G of heap and 512M of pagecache is far too low. We usually recommend around 8-10G of heap minimum for medium to large graphs (this is your "scratch space" memory as a query executes), and to try to get as much of the graph size as possible in pagecache if you can to minimize cache misses as you traverse the graph. Neo4j likes memory. Memory is relatively cheap. You can use neo4j-admin memrec to get a recommendation of how to configure your memory settings, but in general you need to run this on a machine with more memory.
And if we're talking about hardware recommendations, usage of SSDs is highly recommended, for when you do need to hit the disk.
As for the query itself, notice in the query plan you posted that your DISTINCT operation drops the number of rows from the neighborhood of 26-35 million to only 153k rows. That's significant. Your most expensive step here (WHERE NOT (c)-[:VISITED]->(m)) is the Expand(Into) operation on the right side of the plan, with nearly 1 billion db hits. This is happening too early in the query: you should be doing this AFTER your DISTINCT operation, so it operates on only 153k rows instead of 35 million.
You can also improve upon this so you don't even have to hit the graph to do that step of the filtering. Instead of using that WHERE NOT <pattern> approach, you can pre-match to the customers who visited the first merchant, gather them into a list, and keep them around, and instead of using negation of the pattern (where it has to actually expand out all :VISITED relationships of those customers and see if any was the original merchant), we instead do a list membership check, and ensure they aren't one of the 1k or so customers who visited the original merchant. That will happen in memory, since we already collected that list, so it shouldn't hit the graph. In any case you should do DISTINCT before this check.
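To make the ordering argument concrete, here is a minimal Python sketch of the same idea on toy data. It is not Neo4j code: the visit pairs, the function name `candidates`, and the merchant id "M" are all made up for illustration. It only shows that deduplicating first and then testing membership against an already collected set of visitors touches far fewer rows than filtering every expanded row.

```python
# Toy (customer, merchant) visit pairs; every name here is hypothetical.
visits = [
    ("c1", "M"), ("c2", "M"),
    ("c1", "m1"), ("c2", "m1"), ("c2", "m2"),
    ("c3", "m1"), ("c3", "m2"), ("c4", "m2"), ("c4", "m2"),
]

def candidates(visits, target):
    # Step 1: collect the visitors of the target merchant once.
    visitors = {c for c, m in visits if m == target}
    # Step 2: other merchants that those visitors went to.
    shared = {m for c, m in visits if c in visitors and m != target}
    # Step 3: expand to all customers of those merchants (contains duplicates).
    expanded = [c for c, m in visits if m in shared]
    # Step 4: deduplicate first, then exclude visitors with an in-memory
    # membership test instead of re-checking the visit data per row.
    distinct = set(expanded)
    return sorted(distinct - visitors), len(expanded), len(distinct)

result, rows_before, rows_after = candidates(visits, "M")
print(result, rows_before, rows_after)
```

On this toy data the expansion yields 7 rows but only 4 distinct customers, so the exclusion check runs 4 times rather than 7. In the real graph the same ordering is what brings the check down from tens of millions of rows to about 153k.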
In your RETURN you're performing an aggregation with respect to a node's unique property, so you're paying the cost of projecting that property across 4 million rows BEFORE the cardinality drops from the aggregation to 153k rows, meaning you're projecting out that property redundantly across a great many duplicate :Customer nodes before they become distinct from the aggregation. That's redundant and expensive property access you can avoid by aggregating with respect to the node instead, and then do your property access after the aggregation, and also after your sort and limit, so you only have to project out 10 properties.
So putting that all together, try this out:
MATCH
(cc:Customer)-[:VISITED]->(m:Merchant {mcht_id: "0000000DA5"})
WITH m, collect(DISTINCT cc) as visitors
UNWIND visitors as cc
MATCH (uc:Customer)-[:VISITED]->(mm:Merchant)<-[:VISITED]-(cc)
WHERE
mm <> m
WITH
DISTINCT visitors, uc
WHERE NOT uc IN visitors
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc, round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
EDIT
Okay, let's try something else. I suspect that what we're encountering here is a great deal of duplicates during expansion (many visitors may have visited the same merchants). Cypher won't eliminate duplicates during traversal unless you explicitly ask for it (as it may need this info for doing aggregations such as counting of occurrences), and this query is highly dependent on getting distinct nodes during expansion.
If you can install APOC Procedures, we can make use of some expansion procs which let us change how Cypher expands, only visiting each distinct node once across all paths. That may improve the timing here. At the least it will show us if the slowdown we're seeing is related to deduplication of nodes during expansion, or if it's something else.
MATCH (m:Merchant {mcht_id: "0000000DA5"})
CALL apoc.path.expandConfig(m, {uniqueness:'NODE_GLOBAL', relationshipFilter:'VISITED', minLevel:3, maxLevel:3}) YIELD path
WITH last(nodes(path)) as uc
MATCH
(uc:Customer)-[rr:VISITED]->()
WITH
uc
,round(100*sum(rr.amount))/100 as v_amt
ORDER BY v_amt DESC
LIMIT 10
RETURN uc.pty_id, v_amt;
While this is a more complicated approach, one neat thing is that with NODE_GLOBAL uniqueness (ensuring we only visit each node once across all expanded paths) and bfs expansion, we don't need to include WHERE NOT (c)-[:VISITED]->(m) since this will naturally be ruled out; we would have already visited every visitor of m, and since they've already been visited, we cannot visit them again, so none of them will appear in the final result set at 3 hops.
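The effect of global node uniqueness with breadth-first expansion can be sketched in plain Python. This is only a conceptual model, not how APOC is implemented: the function `nodes_at_depth` and the toy adjacency map are assumptions made for illustration, and relationships are treated as undirected.

```python
from collections import deque

def nodes_at_depth(adjacency, start, depth):
    """Breadth-first expansion that visits each node at most once overall
    and returns the nodes first reached at exactly `depth` hops."""
    seen = {start}
    frontier = deque([start])
    for _ in range(depth):
        next_frontier = deque()
        for node in frontier:
            for neighbour in adjacency.get(node, ()):
                if neighbour not in seen:  # global uniqueness check
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return sorted(frontier)

# Hypothetical toy graph: "M" is the merchant, c* are customers, m* merchants.
adjacency = {
    "M": ["c1", "c2"],
    "c1": ["M", "m1"],
    "c2": ["M", "m1", "m2"],
    "m1": ["c1", "c2", "c3"],
    "m2": ["c2", "c4"],
    "c3": ["m1"],
    "c4": ["m2"],
}
third_hop = nodes_at_depth(adjacency, "M", 3)
print(third_hop)
```

The direct visitors c1 and c2 are consumed at hop 1, so they cannot reappear at hop 3. Only c3 and c4 remain, which is why the explicit negated pattern is no longer needed.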
Give this a try and run it a couple times to get that into pagecache (or as much as possible...with 512MB pagecache you may not be able to get all of the traversed structure into memory).
I have tested all the optimised queries on Neo4j and on Oracle. The results are:
Oracle - 2.197 sec
Neo4j - 5.326 sec
You can see details here: http://homme.io/41163#run
And there are more details on the Neo4j case at http://homme.io/41721.

Cypher performance for multiple hops

I'm running my Cypher queries on a very large social network (over 1B records). I'm trying to get all paths between two persons with variable relationship lengths. I get a reasonable response time when running a query for a single relationship length (between 0.5 and 2 seconds); the person ids are indexed.
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However, when I run the query with multiple lengths (i.e. 2 or more), my response time goes up to several minutes. Assuming that each person has on average the same number of connections, I would expect my queries to run for 2-3 minutes at most (but I get 5+ minutes).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
I tried EXPLAIN, but it did not show extreme values for VarLengthExpand(All).
Maybe the traversal is not using the index for pr2.
Is there any way to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
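A rough back-of-the-envelope sketch in Python shows how quickly the path count grows with the hop limit. It rests on simplifying assumptions that are not from the question: every person has the same number of connections, the neighbourhood is tree-like, and a path may not reuse a relationship (so each later hop has one fewer choice). The degree of 200 is a made-up figure.

```python
def estimated_paths(avg_degree, max_hops):
    """Estimate the number of paths of 0..max_hops relationships from one
    start node, under the simplifying assumptions stated above."""
    total = 1          # the single zero-length path
    paths_at_hop = 1
    for hop in range(1, max_hops + 1):
        # First hop: avg_degree choices; later hops cannot go straight back.
        paths_at_hop *= avg_degree if hop == 1 else (avg_degree - 1)
        total += paths_at_hop
    return total

for hops in (1, 2, 3):
    print(hops, estimated_paths(200, hops))
```

With 200 connections per person this gives 201 paths for *0..1, 40,001 for *0..2 and 7,960,201 for *0..3. Each extra hop multiplies the result size by roughly the average degree, not by a constant amount of time.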
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can add a LIMIT clause after your original RETURN clause to limit the number of returned paths.
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using EXPLAIN, try PROFILE instead.

Cypher query slow when intermediate node labels are specified

I have the following Cypher query
MATCH (p1:`Article` {article_id:'1234'})--(a1:`Author` {name:'Jones, P'})
MATCH (p2:`Article` {article_id:'5678'})--(a2:`Author` {name:'Jones, P'})
MATCH (p1)-[:WRITTEN_BY]->(c1:`Author`)-[h1:HAS_NAME]->(l1)
MATCH (p2)-[:WRITTEN_BY]->(c2:`Author`)-[h2:HAS_NAME]->(l2)
WHERE l1=l2 AND c1<>a1 AND c2<>a2
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
On my local Neo4j server, running this query takes ~4 seconds and PROFILE shows >3 million db hits. If I don't specify the Author label on c1 and c2 (it's redundant thanks to the relationship labels), the same query returns the same output in 33ms, and PROFILE shows <200 db hits.
When I run the same two queries on a larger version of the same database that's hosted on a remote server, this difference in performance vanishes.
Both dbs have the same constraints and indexes. Any ideas what else might be going wrong?
Your query has a lot of unnecessary stuff in it, so first off, here's a cleaner version of it that is less likely to get misinterpreted by the planner:
MATCH (name:Name) WHERE NOT name.name = 'Jones, P'
WITH name
MATCH (:`Article` {article_id:'1234'})-[:WRITTEN_BY]->()-[h1:HAS_NAME]->(name)<-[h2:HAS_NAME]-()<-[:WRITTEN_BY]-(:`Article` {article_id:'5678'})
RETURN name.name, h1.distance + h2.distance
There's really only one path you want to find, and you want to find it for any author whose name is not Jones, P. Take advantage of your shared :Name nodes to start your query with the smallest set of definite points and expand paths from there. You are generating a massive cartesian product by stacking all those MATCH statements and then filtering them out.
As for the difference in query performance, it appears that the query planner is trying to use the Author label to build your 3rd and 4th paths, whereas if you leave it out, the planner will only touch the much narrower set of :Articles (fixed by indexed property), then expand relationships through the (incidentally very small) set of nodes that have -[:WRITTEN_BY]-> relationships, and then the (also incidentally very small) set of those nodes that have a -[:HAS_NAME]-> relationship. That decision is based partly on the predictable size of the various sets, so if you have a different number of :Author nodes on the server, the planner will make a smarter choice and not use them.

Graph database performance

I was reading a book recommended on Neo4j site: http://neo4j.com/books/graph-databases/ about graph database performance and it said:
"In contrast to relational databases, where join-intensive query performance deteriorates
as the dataset gets bigger, with a graph database performance tends to remain
relatively constant, even as the dataset grows. This is because queries are localized to a
portion of the graph. As a result, the execution time for each query is proportional only
to the size of the part of the graph traversed to satisfy that query, rather than the size of
the overall graph."
So, e.g., I want to return only nodes with the label "Doctor"; that's localized to a portion of the graph. But my question is: how does the database itself know where those nodes are? In other words, does it not need to traverse all nodes to find out whether or not they satisfy the query, and make a decision based on that?
Neo4j has a special indexing for node labels so that it can find all nodes for a label without searching all nodes. Beyond that you can:
Create your own indexes based on node properties (either schema indexes or legacy indexes) in order to find nodes as starting points
Query by node IDs to find a starting point (though I'd suggest using your own property with an index if you need to identify nodes more permanently)
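As a conceptual illustration of why a label lookup does not need to scan every node, here is a tiny Python model of a label index. It is only a sketch of the general idea (a map from label to node ids maintained at write time); the class `TinyGraph` and its methods are hypothetical and say nothing about how Neo4j actually stores its label index.

```python
from collections import defaultdict

class TinyGraph:
    """Toy node store that keeps a label -> node ids map up to date."""

    def __init__(self):
        self.labels_of = {}                      # node id -> set of labels
        self.nodes_by_label = defaultdict(set)   # label -> set of node ids

    def add_node(self, node_id, *labels):
        self.labels_of[node_id] = set(labels)
        for label in labels:
            # The index is updated when the node is written, not at query time.
            self.nodes_by_label[label].add(node_id)

    def find_by_label(self, label):
        # Direct lookup: cost depends on the number of matches,
        # not on the total number of nodes in the graph.
        return sorted(self.nodes_by_label.get(label, ()))

graph = TinyGraph()
graph.add_node(1, "Doctor")
graph.add_node(2, "Patient")
graph.add_node(3, "Doctor", "Person")
doctors = graph.find_by_label("Doctor")
print(doctors)
```

Finding the doctors here reads one map entry and never looks at node 2.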
In general localized searches mean: you start from a smallish set of starting points which can be people, products, places, orders etc.
A portion of the graph that is annotated with a label, often doesn't fall into that category, i.e. all doctors are not a smallish set of starting points.
Your query would probably touch a large portion of the graph if you traverse out from all doctors to their neighborhoods.
A query like this would be a graph local one:
MATCH (:City {name:"SFO"})<-[:RESIDES_IN]-(d:Doctor)-[presc:PRESCRIBES]->(m:Medicine)
RETURN d.name, m.name, sum(presc.amount) as amount

neo4j - find child nodes with property value is slow

I have a graph with approximately 1 million nodes.
The graph represents a catalog tree (spare parts). The maximum depth is about 6.
A node has a filter property that can have any value, or be empty. This filter property is used to filter the catalog for the user.
What I want is to ask a question like this when I click a node (at any level):
"for each child node, tell me if any of its children (any level) has a filter attribute with a value of ...".
With my query it takes about 12 sec for each child to get the result. Shouldn't this scenario be an ideal use case for Neo4j? Shouldn't it be way faster?
I can send the nodes and relations as text files if you want the data.
my query is something like this:
start n=node(3)
match n-[:PARENT_ITEM*1..6]->x
where x.filter="something"
return count(x)
I'm running on a Windows Azure Large server (4 cores, 7 GB RAM) and I haven't done any configuration after installing Neo4j.
