Make my Neo4j queries faster - neo4j

I'm evaluating Neo4j for our application, and now am at a point where performance is an issue. I've created a lot of nodes and edges that I'm doing some queries against. The following is a detail of the nodes and edges data in this database:
I am trying to do a search that traverses the yellow arrows of this diagram. What I have so far is the following query:
MATCH (n:LABEL_TYPE_Project {id:'14'})
-[:RELATIONSHIP_scopes*1]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy*1]->(o:LABEL_TYPE_Item)
WHERE m.id IN ['1', '2', '6', '12', '12064', '19614', '19742', '19863', '21453', '21454', '21457', '21657', '21658', '31123', '31127', '31130', '47691', '55603', '55650', '56026', '56028', '56029', '56050', '56052', '85383', '85406', '85615', '105665', '1035242', '1035243']
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
(The variable paths above have been updated, see "Update 2")
The above query takes a barely-acceptable 1200ms. It only returns the requested 20 items. If I want a count of the same, this takes forever:
MATCH ... more of the same ...
RETURN count(o)
The above query takes many minutes. This is Neo4j 2.2.0-M03 Community running on CentOS. There are around 385,000 nodes, 170,000 of them of type Item.
I have created indices on all id fields (programmatically, index().forNodes(...).add(...)), also on the content field (CREATE INDEX ... statement).
Are there fundamental improvements yet to be made to my queries? Things I can try?
Much appreciated.
This question was moved over from Neo4j discussion group on Google per their suggestions.
Update 1
As requested:
:schema
Gives:
Indexes
ON :LABEL_TYPE_Item(id) ONLINE
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_PermissionNode(id) ONLINE
ON :LABEL_TYPE_Project(id) ONLINE
No constraints
(This is updated, see "Update 2")
Update 2
I have made the following noteworthy improvements to the query:
Shame on me, I did have super-nodes for all TYPE_Projects (not by design, I just messed up the importing algorithm that I was using) and I have removed them now
I had a lot of "strings" that could have been proper data types, such as integers, booleans and I am now importing them as such (you'll see in the updated queries below that I removed a lot of quotes)
As pointed out, I had variable length paths and I fixed those
As pointed out, I should have had uniqueness indices instead of regular indices and I fixed that
As a consequence:
:schema
Now gives:
Indexes
ON :LABEL_TYPE_Item(active) ONLINE
ON :LABEL_TYPE_Item(content) ONLINE
ON :LABEL_TYPE_Item(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_PermissionNode(id) ONLINE (for uniqueness constraint)
ON :LABEL_TYPE_Project(id) ONLINE (for uniqueness constraint)
Constraints
ON (label_type_item:LABEL_TYPE_Item) ASSERT label_type_item.id IS UNIQUE
ON (label_type_project:LABEL_TYPE_Project) ASSERT label_type_project.id IS UNIQUE
ON (label_type_permissionnode:LABEL_TYPE_PermissionNode) ASSERT label_type_permissionnode.id IS UNIQUE
The query now looks like this:
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
AND o.content =~ '.*some string.*'
RETURN o
LIMIT 20
The above query now takes approx. 350ms.
I still want a count of the same:
MATCH ...
RETURN count(o)
The above query now takes approx. 1100ms. Although that's much better, and barely acceptable for this particular query, I've already found some more-complex queries that inherently take longer. So a further improvement on this query here would be great.
As requested here is the PROFILE for RETURN o query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Projection
|
+Limit
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Projection | 1900 | 20 | 0 | m, n, o | o |
| Limit | 1900 | 20 | 0 | m, n, o | { AUTOINT32} |
| Filter(0) | 1900 | 20 | 131925 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 43975 | 43993 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 18 | 614 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 18 | 19 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+-------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
And here is the PROFILE for RETURN count(o) query (for the improved query):
Compiler CYPHER 2.2
Planner COST
Limit
|
+EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+Filter(1)
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Limit | 44 | 1 | 0 | count(o) | { AUTOINT32} |
| EagerAggregation | 44 | 1 | 0 | count(o) | |
| Filter(0) | 1900 | 101 | 440565 | m, n, o | (hasLabel(o:LABEL_TYPE_Item) AND Property(o,content(23)) ~= /{ AUTOSTRING31}/) |
| Expand(All)(0) | 4993 | 146855 | 146881 | m, n, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| Filter(1) | 2 | 26 | 850 | m, n | (hasLabel(m:LABEL_TYPE_PermissionNode) AND any(-_-INNER-_- in Collection(List({ AUTOINT1}, { AUTOINT2}, { AUTOINT3}, { AUTOINT4}, { AUTOINT5}, { AUTOINT6}, { AUTOINT7}, { AUTOINT8}, { AUTOINT9}, { AUTOINT10}, { AUTOINT11}, { AUTOINT12}, { AUTOINT13}, { AUTOINT14}, { AUTOINT15}, { AUTOINT16}, { AUTOINT17}, { AUTOINT18}, { AUTOINT19}, { AUTOINT20}, { AUTOINT21}, { AUTOINT22}, { AUTOINT23}, { AUTOINT24}, { AUTOINT25}, { AUTOINT26}, { AUTOINT27}, { AUTOINT28}, { AUTOINT29}, { AUTOINT30})) where Property(m,id(0)) == -_-INNER-_-)) |
| Expand(All)(1) | 11 | 26 | 27 | m, n | (n)-[:RELATIONSHIP_scopes]->(m) |
| NodeUniqueIndexSeek | 1 | 1 | 1 | n | :LABEL_TYPE_Project(id) |
+---------------------+---------------+--------+--------+-------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Remaining suggestions:
Use MATCH ... WITH x MATCH ...->(x) syntax: this did not help me at all, so far
Use Lucene indexes: see results in "Update 3"
Use precomputation: this will not help me, since the queries are going to be rather variant
Update 3
I've been playing with full-text search, and indexed the content property as follows:
IndexManager indexManager = getGraphDb().index();
Map<String, String> customConfiguration = MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "fulltext");
Index<Node> index = indexManager.forNodes("INDEX_FULL_TEXT_content_Item", customConfiguration);
index.add(node, "content", value);
When I run the following query this takes approx. 1200ms:
START o=node:INDEX_FULL_TEXT_content_Item("content:*some string*")
MATCH (n:LABEL_TYPE_Project {id:14})
-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id IN [1, 2, 6, 12, 12064, 19614, 19742, 19863, 21453, 21454, 21457, 21657, 21658, 31123, 31127, 31130, 47691, 55603, 55650, 56026, 56028, 56029, 56050, 56052, 85383, 85406, 85615, 105665, 1035242, 1035243]
RETURN count(o);
Here is the PROFILE for this query:
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter(0)
|
+Expand(All)(0)
|
+NodeHashJoin
|
+Filter(1)
| |
| +NodeByIndexQuery
|
+Expand(All)(1)
|
+NodeUniqueIndexSeek
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+
| EagerAggregation | 50 | 1 | 0 | count(o) | |
| Filter(0) | 2533 | 166 | 498 | m, n, o | (Property(n,id(0)) == { AUTOINT0} AND hasLabel(n:LABEL_TYPE_Project)) |
| Expand(All)(0) | 32933 | 166 | 332 | m, n, o | (m)<-[:RELATIONSHIP_scopes]-(n) |
| NodeHashJoin | 32933 | 166 | 0 | m, o | o |
| Filter(1) | 1 | 553 | 553 | o | hasLabel(o:LABEL_TYPE_Item) |
| NodeByIndexQuery | 1 | 553 | 554 | o | Literal(content:*itzndby*); INDEX_FULL_TEXT_content_Item |
| Expand(All)(1) | 64914 | 146855 | 146881 | m, o | (m)-[:RELATIONSHIP_observedBy]->(o) |
| NodeUniqueIndexSeek | 27 | 26 | 30 | m | :LABEL_TYPE_PermissionNode(id) |
+---------------------+---------------+--------+--------+-------------+------------------------------------------------------------------------+

Things to think about/try. In general with query optimization, the #1 name of the game is to find ways to consider less data in the first place when answering the query. It's far less fruitful to consider the same data faster than it is to consider less data.
Lucene indexes on your content fields. My understanding is that the regex you're using isn't narrowing Cypher's search path at all, so it basically has to look at every o:LABEL_TYPE_Item and run the regex against that field. Your regex is only looking for a substring though, so Lucene may help cut down the number of nodes Cypher has to consider before it can give you a result.
Your relationship paths are variable length, (-[:RELATIONSHIP_scopes*1]->) yet the image you give us suggests you only ever need one hop. On both relationship hops, depending on how your graph is structured (and how much data you have) you might be looking through way more information than you need to there. Consider those relationship hops and your data model carefully; can you replace with -[:RELATIONSHIP_scopes]-> instead? Note that you have a WHERE clause on m nodes, you may be traversing more of those than required.
Check the query plan (via PROFILE, google for docs). One trick I see a lot of people using is pushing the most restrictive part of their query to the top, in front of a WITH block. This reduces the number of "starting points".
What I mean is taking this query...
MATCH (foo)-[:stuff*]->(bar) // (bunch of other complex stuff)
WHERE bar.id = 5
RETURN foo
And turning it into this:
MATCH (bar)
WHERE bar.id = 5
WITH bar
MATCH (foo)-[:stuff*]->(bar)
RETURN foo;
(Check output via PROFILE, this trick can be used to force the query execution plan to do the most selective thing first, drastically reducing the amount of the graph that cypher considers/traverses...better performance)
Precompute; if you have a particular set of nodes that you use all the time (those with the IDs you identify) you can create a custom index node of your own. Let's call it (foo:SpecialIndex { label: "My Nifty Index" }). This is akin to a "view" in a relational database. You link the stuff you want to access quickly to foo. Then your query, instead of having that big WHERE id IN [blah blah] clause, it simply looks up foo:SpecialIndex, traverses to the hit points, then goes from there. This trick works well when the list of entry points in your list of IDs is large, rapidly growing, or both. This keeps all the same computation you'd do normally, but shifts some of it to be done ahead of time so you don't do it every time you run the query.
Got any supernodes in that graph? (A supernode is an extremely densely connected node, i.e. one with a million outbound relationships) -- don't do that. Try to arrange your data model such that you don't have supernodes, if at all possible.
JVM/Node Cache tweaks. Sometimes you can get an advantage by changing your node caching strategy, or available memory to do the caching. The idea here is that instead of hitting data on disk, if you warm your cache up then you get at least some of the I/O out of the way. This one can help in some cases, but it wouldn't be my first stop unless the way you've configured the JVM or neo4j is already somewhat memory-poor. This one probably also helps you a little less because it tries to make your current access pattern faster, rather than improving your actual access pattern.

Can you share your output of :schema in the browser?
If you don't have these constraints yet, create them:
create constraint on (p:LABEL_TYPE_Project) assert p.id is unique;
create constraint on (m:LABEL_TYPE_PermissionNode) assert m.id is unique;
The manual indexes you created only help for Item.content if you index it with FULLTEXT_CONFIG and then use START o=node:items("content:(some string)") MATCH ...
As in Neo4j you can always traverse relationships in both directions, you don't need the inverse relationships, it only hurts performance because queries then tend to check one cycle more.
You don't need variable length paths [*1] in your query, change it to:
MATCH (n:LABEL_TYPE_Project {id:'14'})-[:RELATIONSHIP_scopes]->
(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in ['1', '2', ... '1035242', '1035243']
AND o.content =~ '.*itzndby.*' RETURN o LIMIT 20
For real queries, use parameters for the project id and the permission ids:
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE m.id in {m_ids} AND o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
Remember that realistic query performance only shows up on a warmed-up system, so run the query at least twice.
You might also want to split up your query:
MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->(m:LABEL_TYPE_PermissionNode)
WHERE m.id in {m_ids}
WITH distinct m
MATCH (m)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item)
WHERE o.content =~ '.*'+{item_content}+'.*'
RETURN o LIMIT 20
Also learn about PROFILE; you can prefix your query with it in the old webadmin: http://localhost:7474/webadmin/#/console/
If you use Neo4j 2.2-M03 there is built in support for query plan visualization with EXPLAIN and PROFILE prefixes.
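As a rough illustration of the parameter advice above, the sketch below keeps the Cypher text constant and varies only the parameter map between calls. The `build_params` helper and the sample values are hypothetical, and how the map is sent to the server (REST, embedded API, a driver) is deliberately left out.

```python
# Constant query text, using the {name} parameter syntax from the answer above.
QUERY = (
    "MATCH (n:LABEL_TYPE_Project {id: {p_id}})-[:RELATIONSHIP_scopes]->"
    "(m:LABEL_TYPE_PermissionNode)-[:RELATIONSHIP_observedBy]->(o:LABEL_TYPE_Item) "
    "WHERE m.id IN {m_ids} AND o.content =~ '.*'+{item_content}+'.*' "
    "RETURN o LIMIT 20"
)


def build_params(project_id, permission_ids, item_content):
    """Hypothetical helper: returns the parameter map that goes with QUERY."""
    return {
        "p_id": project_id,
        "m_ids": list(permission_ids),
        "item_content": item_content,
    }


# Two different requests share the same query text; only the map differs.
params_a = build_params(14, [1, 2, 6, 12], "some string")
params_b = build_params(15, [7, 8], "other string")
```

Because the query string never changes, the server can reuse a cached plan instead of compiling a new one for every combination of literal values.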

Related

How can I calculate the path between two nodes with hops in range (1,5) (neo4j) ?

I'm trying to find all possible paths between two nodes. I've used a few Cypher queries which do the required job, but they take a lot of time as the number of hops increases. This is the query:
match p = (n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"}) return p
Also if I use shortestpath it limits the result if a path with minimum hop is found. So I don't get the results with 2 or more than two hops if a direct connection (1 hop) is found between the nodes.
match p = shortestpath((n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"})) return p
and if I increase the hop to 2 or more it throws an exception.
shortestPath(...) does not support a minimal length different from 0 or 1
Is there any alternative framework or algorithm to get all paths in minimum time?
P.S. I'm looking for something in the order of ms. Currently all queries with more than 3 hops take a few seconds to complete.
I gather you are trying to speed up your original query involving variable-length paths. The shortestpath function is not appropriate for your query, as it literally tries to find a shortest path -- not all paths up to a certain length.
The execution plan for your original query (using sample data) looks like this:
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| +ProduceResults | 0 | 1 | 0 | p | p |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Projection | 0 | 1 | 0 | anon[30], b, n, p | ProjectedPath(Set(anon[30], n),) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 2 | anon[30], b, n | n.name == { AUTOSTRING0} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +VarLengthExpand(All) | 0 | 2 | 7 | anon[30], b, n | (b)<-[:Route*]-(n) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 3 | b | b.name == { AUTOSTRING1} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +AllNodesScan | 3 | 3 | 4 | b | |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
So, your original query is scanning through every node to find the node(s) that match the b pattern. Then, it expands all variable-length paths starting at b. And then it filters those paths to find the one(s) that end with a node that matches the pattern for n.
Here are a few suggestions that should speed up your query, although you'll have to test it on your data to see by how much:
Give each node a label. For example, Foo.
Create an index that can speed up the search for your end nodes. For example:
CREATE INDEX ON :Foo(name);
Modify your query to force the use of the index on both end nodes. For example:
MATCH p =(n:Foo { name:"Node1" })-[:Route*1..5]-(b:Foo { name:"Node2" })
USING INDEX n:Foo(name)
USING INDEX b:Foo(name)
RETURN p;
After the above changes, the execution plan is:
+-----------------+------+---------+-----------------------------+-----------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+-----------------+------+---------+-----------------------------+-----------------------------+
| +ColumnFilter | 1 | 0 | p | keep columns p |
| | +------+---------+-----------------------------+-----------------------------+
| +ExtractPath | 1 | 0 | anon[33], anon[34], b, n, p | |
| | +------+---------+-----------------------------+-----------------------------+
| +PatternMatcher | 1 | 3 | anon[33], anon[34], b, n | |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | b, n | { AUTOSTRING1}; :Foo(name) |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | n | { AUTOSTRING0}; :Foo(name) |
+-----------------+------+---------+-----------------------------+-----------------------------+
This query plan uses the index to directly get the b and n nodes -- without scanning. This, by itself, should provide a speed improvement. And then this plan uses the "PatternMatcher" to find the variable-length paths between those end nodes. You will have to try this query out to see how efficient the "PatternMatcher" is in doing that.
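To get a feel for why the bounded variable-length match gets expensive as the hop limit grows, here is a small sketch in plain Python (not Neo4j code). It enumerates every path between two nodes that uses between 1 and `max_hops` relationships, never reusing a relationship within one path, which is my understanding of how Cypher treats a pattern like `-[:Route*1..5]-`. The function name and the toy graph are made up for illustration.

```python
def paths_up_to(edges, start, end, max_hops):
    """List node sequences from start to end using 1..max_hops relationships.

    edges: list of (a, b) pairs, treated as undirected.
    Each relationship is used at most once per path.
    """
    adjacency = {}
    for index, (a, b) in enumerate(edges):
        adjacency.setdefault(a, []).append((index, b))
        adjacency.setdefault(b, []).append((index, a))

    found = []

    def walk(node, used, path):
        # Record the path whenever we stand on the end node after >= 1 hop.
        if used and node == end:
            found.append(path)
        if len(used) == max_hops:
            return
        for index, neighbour in adjacency.get(node, []):
            if index not in used:
                walk(neighbour, used | {index}, path + [neighbour])

    walk(start, frozenset(), [start])
    return found


# Toy graph (made up): a direct route plus a detour through X.
triangle = [("Node1", "Node2"), ("Node1", "X"), ("X", "Node2")]
paths = paths_up_to(triangle, "Node1", "Node2", 5)
```

Every extra hop multiplies the number of partial paths by roughly the average node degree, so on a densely connected graph the work grows very quickly even when both end nodes are found through an index.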
From your description I assume that you want to get a shortest path based on some weight like a duration property on the :Route relationships.
If that is true using shortestPath in cypher is not helpful since it just takes into account the number of hops. Weighted shortest paths are not yet available in Cypher in an efficient way.
The Java API has support for weighted shortest paths via Dijkstra or A* through the GraphAlgoFactory class. For the simple case that your cost function is just the value of a relationship property (as mentioned above) you can also use an existing REST endpoint.
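To make the difference between hop count and weight concrete, here is a minimal, self-contained Dijkstra sketch in plain Python. It is not the Neo4j Java API; the `dijkstra` function, the `routes` data and the cost values are invented for illustration, and it assumes non-negative weights.

```python
import heapq


def dijkstra(graph, source, target):
    """Return (total_cost, path) of the cheapest route, or (inf, []) if none.

    graph: dict mapping node -> list of (neighbour, cost) pairs.
    """
    queue = [(0, source, [source])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already settled this node at an equal or lower cost
        best[node] = cost
        for neighbour, weight in graph.get(node, []):
            heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Toy data (made up): the direct hop is slower than the two-hop route.
routes = {
    "Node1": [("Node2", 10), ("Hub", 2)],
    "Hub": [("Node2", 3)],
    "Node2": [],
}
cost, path = dijkstra(routes, "Node1", "Node2")
```

Here a hop-count shortest path would pick the direct Node1 to Node2 relationship, while the weighted search prefers the route through Hub because its summed cost is lower.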

Neo4J search greater than on Index

I have a Neo4J database with a large number of datasets (~15M), where I want to perform a greater than search on one of its properties. I have the corresponding property indexed. The property is a float value.
When I do an exact match like MATCH (i:Label) WHERE i.property = $value RETURN count(i) I get the result within a very short time. But when I do the same search with greater than, i.e. MATCH (i:Label) WHERE i.property > $value RETURN count(i) it just takes forever. What is the correct way to do this in Cypher?
Edit: Execution plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
74 ms
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter
|
+NodeByLabelScan
+------------------+---------------+-------------+------------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+------------------+---------------+-------------+------------------------------+
| EagerAggregation | 2064 | count(r) | |
| Filter | 4260557 | r | r.date > Subtract(Divide( |
| | | | TimestampFunction(),{ |
| | | | AUTOINT0}),Literal(86400)) |
| NodeByLabelScan | 14201858 | r | :Request |
+------------------+---------------+-------------+------------------------------+
Total database accesses: ?
Another approach is to create additional aggregation nodes for that property and search through those nodes.
Example
Let's say the property is a value from 0 to 100.
Create the following nodes:
* 0to30
* 31to60
* 61to100
Create relationships from your nodes to these 'aggregate' nodes.
Then search through those nodes:
MATCH (l:Label)-[i:IN]->(a:`0to30`)
RETURN l
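A small sketch of the bookkeeping behind this bucket idea, in plain Python. It assumes integer values from 0 to 100 and the three bucket names from the example; the helper names are hypothetical. For a float property like the one in the question, the ranges would need to be half-open so that a value such as 30.5 still lands in a bucket.

```python
# Bucket boundaries taken from the example above (inclusive integer ranges).
BUCKETS = [("0to30", 0, 30), ("31to60", 31, 60), ("61to100", 61, 100)]


def bucket_for(value):
    """Name of the bucket node a value should be linked to."""
    for name, low, high in BUCKETS:
        if low <= value <= high:
            return name
    raise ValueError("value outside 0-100: %r" % (value,))


def buckets_for_greater_than(threshold):
    """Split buckets into those fully above the threshold and the one
    boundary bucket whose members still need an individual comparison."""
    whole = [name for name, low, _ in BUCKETS if low > threshold]
    partial = [name for name, low, high in BUCKETS if low <= threshold < high]
    return whole, partial


whole, partial = buckets_for_greater_than(45)
```

A "greater than 45" search can then take every node linked to the buckets in `whole` without looking at the property, and only has to compare the property for nodes linked to the bucket in `partial`.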
Unfortunately Neo4j doesn't use its indexes for inequalities in 2.2.x. In the upcoming 2.3.x this should be supported.

Profiling neo4j query: filter to db hits

I'm curious how filters work in neo4j queries. They result in db hits (according to PROFILE), and it seems that they shouldn't.
An example query:
PROFILE MATCH (a:act)<-[r:relationship]-(n)
WHERE a.chapter='13' and a.year='2009'
RETURN r, n
NodeIndexSeek: (I created an index on the label act for the chapter property) returns 6 rows.
Filter: a.year == {AUTOSTRING1} which results in 12 db hits.
Why does it need to do any db hits if it's already fetched the 6 matching instances of a in earlier db reads, shouldn't it just filter them down without going back to do more db reads?
I realise I'm equating 'db hits' with 'db reads' here, which may not be accurate. If not, what exactly are 'db hits'?
Lastly, the number of db hits incurred by a filter appears to approximately match:
<number of filtering elements> * 2 * <number of already queried nodes to filter on>
where 'number of filtering elements' is the number of filters provided, i.e.
WHERE a.year='2009' and a.property_x='thing'
is two elements.
Thanks for any help.
EDIT:
Here are the results of PROFILE and EXPLAIN on the query.
This is just an example query. I've found the behaviour of
filter db hits = <number of filtering elements> * 2 * <number of already queried nodes to filter on>
to be generally true in queries I've run.
PROFILE MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
8 rows
55 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+------+--------+-------------+---------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------+---------------+------+--------+-------------+---------------------------+
| Projection | 1 | 8 | 0 | a, n, r | r; n |
| Expand(All) | 1 | 8 | 9 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | 1 | 12 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | 6 | 7 | a | :act(chapter) |
+---------------+---------------+------+--------+-------------+---------------------------+
Total database accesses: 28
EXPLAIN MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
4 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+-------------+---------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------+---------------+-------------+---------------------------+
| Projection | 1 | a, n, r | r; n |
| Expand(All) | 1 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | a | :act(chapter) |
+---------------+---------------+-------------+---------------------------+
Total database accesses: ?
Because reading a node (record) and reading property (records) is not the same db-operation.
You are right that the filter hits should be at most 6 though.
Usually Neo4j pulls filters and predicates to the earliest possible moment, so it should filter directly after the index lookup.
In some situations though (due to the predicate) it can only filter after finding the paths, then the number of db-hits might equal the number of checked paths.
Which Neo4j version are you using? Can you share your full query plan?
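For reference, the numbers in the PROFILE output above can be checked against the asker's rule of thumb with a few lines of Python. The formula is the asker's empirical observation, not a documented Neo4j guarantee, and the function name is made up.

```python
def estimated_filter_db_hits(predicates, incoming_rows):
    """The asker's empirical rule of thumb, not a documented Neo4j formula."""
    return predicates * 2 * incoming_rows


# DbHits per operator, copied from the PROFILE table above.
profile_db_hits = {
    "Projection": 0,
    "Expand(All)": 9,
    "Filter": 12,
    "NodeIndexSeek": 7,
}

# One predicate (a.year) applied to the 6 rows returned by NodeIndexSeek.
estimate = estimated_filter_db_hits(predicates=1, incoming_rows=6)
total = sum(profile_db_hits.values())
```

The estimate matches the 12 hits reported for the Filter operator, and the per-operator values add up to the reported total of 28 database accesses.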

Cypher / Should I use the WITH clause to pass values to next MATCH?

Using Neo4j 2.1.X, let's suppose this query, returning the user 123's friends that bought a Car:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
In this article, it is written regarding the WITH clause:
So, how does it work? Well, with is basically just a stream, as lazy
as it can be (as lazy as return can be), passing results on to the
next query.
So it seems I should transform the query like this:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
Should I? Or does the current version of Cypher already handle MATCH chaining while passing values through them?
The more precise the starting point you give at the beginning of your query, the more efficient it will be.
Your first match is not very precise; it will use the traversal matcher to match all possible relationships.
Take the following Neo4j console example: http://console.neo4j.org/r/jsx71g
Your first query would look like this in the example:
MATCH (n:User { login: 'nash99' })-[:KNOWS]->(friend)
RETURN count(*)
You can see the number of db hits up front:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| Filter | 8 | 320 | | Property(n,login(2)) == { AUTOSTRING0} |
| TraversalMatcher | 160 | 201 | | friend, UNNAMED32, friend |
+------------------+------+--------+-------------+-----------------------------------------+
Total database accesses: 521
If you use a more accurate starting point, you're the king of the road from that point on. Look at this example query and see the difference in db hits:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+SimplePatternMatcher
|
+Filter
|
+NodeByLabel
+----------------------+------+--------+------------------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+------------------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| SimplePatternMatcher | 8 | 0 | n, friend, UNNAMED51 | |
| Filter | 1 | 40 | | Property(n,login(2)) == { AUTOSTRING0} |
| NodeByLabel | 20 | 21 | n, n | :User |
+----------------------+------+--------+------------------------+-----------------------------------------+
Total database accesses: 61
So to finish your query, I would do something like this:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
RETURN friend
You can also specify that the friends can not be the same as the user :
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
WHERE NOT friend.id = n.id
RETURN friend
Note that there is no difference between the above query and the following in terms of db hits:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN (friend)
I recommend that you use the Neo4j console to look at the result details, which show you the above information.
If you need to quickly prototype a graph for testing, you can use Graphgen, export the graph as Cypher statements and load these statements into the Neo4j console.
Here is the link to the graphgen generation I used for the console http://graphgen.neoxygen.io/?graph=29l9XJ0HxJ2pyQ
Chris

neo4j benchmark, multiple queries, measure time

Is there any way that I can perform my benchmarks for multiple queries in neo4j?
Assuming that I have loaded my graph, I want to initiate 10000 distinct shortest path queries in the database, without loading the data to a client. Is there a way that I can do this in a batch and get the execution times?
Try using the profile keyword inside the neo4j-shell. This will give you some basic facts about how a query executes and how quickly.
Here's a simple example:
neo4j-sh (?)$ CREATE (a {label:"foo"})-[:bar]->(b {label: "bar"})-[:bar]->(c {label: "baz"});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 3
Relationships created: 2
Properties set: 3
1180 ms
neo4j-sh (?)$ profile match (a {label: "foo"}), (c {label: "baz"}), p=shortestPath(a-[*]-c) return p;
+--------------------------------------------------------------------------------------+
| p |
+--------------------------------------------------------------------------------------+
| [Node[0]{label:"foo"},:bar[0]{},Node[1]{label:"bar"},:bar[1]{},Node[2]{label:"baz"}] |
+--------------------------------------------------------------------------------------+
1 row
ColumnFilter
|
+ShortestPath
|
+Filter(0)
|
+AllNodes(0)
|
+Filter(1)
|
+AllNodes(1)
+--------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns p |
| ShortestPath | 1 | 0 | p | |
| Filter(0) | 1 | 6 | | Property(c,label(0)) == { AUTOSTRING1} |
| AllNodes(0) | 3 | 4 | c, c | |
| Filter(1) | 1 | 6 | | Property(a,label(0)) == { AUTOSTRING0} |
| AllNodes(1) | 3 | 4 | a, a | |
+--------------+------+--------+-------------+-----------------------------------------+
This other answer indicates that you're usually looking for lower DbHits values to be indicative of better performance, since those are expensive.
The WebAdmin tools (usually at http://localhost:7474/webadmin/ for a local Neo4j installation) have Data browser and Console tabs that allow you to enter your query, see the results, and also see the actual time it took to perform the query.
Interestingly, from my limited testing of the Data browser and Console tabs, the latter seems to report faster query times for the same queries. So, the Console probably has less overhead, possibly making its timing results a bit more accurate.
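If you want timings for a whole batch of queries rather than one at a time, a small client-side harness is one option. The sketch below is plain Python and makes several assumptions: `run_query` is a hypothetical callable that executes one query and consumes its result (a thin wrapper around whatever client you use), the measurement is wall-clock time on the client and so includes any network overhead, and each query gets one unmeasured warm-up run first.

```python
import statistics
import time


def benchmark(run_query, queries, warmup_runs=1):
    """Time each query after its warm-up runs; return per-query seconds."""
    timings = []
    for query in queries:
        for _ in range(warmup_runs):
            run_query(query)  # warms caches; not measured
        started = time.perf_counter()
        run_query(query)
        timings.append(time.perf_counter() - started)
    return timings


def summarize(timings):
    """Aggregate the per-query timings into a few headline numbers."""
    return {
        "count": len(timings),
        "total": sum(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }


# Stand-in executor so the sketch runs without a database.
executed = []
timings = benchmark(executed.append, ["q1", "q2", "q3"])
summary = summarize(timings)
```

Swapping `executed.append` for a real executor would give one timing per shortest-path query, with the median usually being a more stable figure than the mean when a few queries hit cold data.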

Resources