Neo4J search greater than on Index - neo4j

I have a Neo4j database with a large number of nodes (~15M), and I want to perform a greater-than search on one of their properties. The property is a float value and is indexed.
When I do an exact match like MATCH (i:Label) WHERE i.property = $value RETURN count(i), I get the result within a very short time. But when I do the same search with greater than, i.e. MATCH (i:Label) WHERE i.property > $value RETURN count(i), it takes forever. What is the correct way to do this in Cypher?
Edit: Execution plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
74 ms
Compiler CYPHER 2.2
Planner COST
EagerAggregation
|
+Filter
|
+NodeByLabelScan
+------------------+---------------+-------------+------------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+------------------+---------------+-------------+------------------------------+
| EagerAggregation | 2064 | count(r) | |
| Filter | 4260557 | r | r.date > Subtract(Divide( |
| | | | TimestampFunction(),{ |
| | | | AUTOINT0}),Literal(86400)) |
| NodeByLabelScan | 14201858 | r | :Request |
+------------------+---------------+-------------+------------------------------+
Total database accesses: ?

Another approach is to create additional 'aggregate' nodes for that property and search through those nodes.
Example
Let's say the property is a value from 0 to 100.
Create the following nodes:
* 0to30
* 31to60
* 61to100
Create a relationship from each of your nodes to the matching 'aggregate' node.
Then search through those nodes (the label is backtick-quoted because it starts with a digit):
MATCH (l:Label)-[i:IN]->(a:`0to30`)
RETURN l
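To make the bucketing idea concrete, here is a minimal Python sketch of how a value could be assigned to one of the ranges listed above, and how a greater-than search could pick its buckets. It assumes integer values (the ranges as given leave gaps for floats such as 30.5), and the function names are hypothetical, not part of Neo4j.

```python
# Inclusive bucket ranges taken from the example above.
BUCKETS = [(0, 30), (31, 60), (61, 100)]


def bucket_name(value):
    """Return the name of the bucket containing an integer value, e.g. '31to60'."""
    for low, high in BUCKETS:
        if low <= value <= high:
            return f"{low}to{high}"
    raise ValueError(f"value {value} is outside 0-100")


def buckets_for_greater_than(threshold):
    """Split buckets for a 'value > threshold' search.

    Returns (full, partial): buckets lying entirely above the threshold,
    and buckets that straddle it and still need a per-node filter.
    """
    full = [f"{lo}to{hi}" for lo, hi in BUCKETS if lo > threshold]
    partial = [f"{lo}to{hi}" for lo, hi in BUCKETS if lo <= threshold < hi]
    return full, partial
```

For a threshold of 45, only the nodes attached to 31to60 would still need their property checked. Every node attached to 61to100 qualifies without reading the property.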

Unfortunately, Neo4j 2.2.x doesn't use its indexes for inequality predicates. This should be supported in the upcoming 2.3.x.

Related

How can I calculate the path between two nodes with hops in range (1,5) (neo4j) ?

I'm trying to find all possible paths between two nodes. I've used a few Cypher queries that do the required job, but they take a lot of time as the number of hops increases. This is the query:
match p = (n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"}) return p
Also, if I use shortestPath, it limits the result as soon as a path with the minimum number of hops is found. So I don't get the results with two or more hops if a direct connection (1 hop) exists between the nodes.
match p = shortestpath((n{name:"Node1"})-[:Route*1..5]-(b{name:"Node2"})) return p
And if I increase the minimum hop count to 2 or more, it throws an exception:
shortestPath(...) does not support a minimal length different from 0 or 1
Is there any alternative framework or algorithm to get all paths in minimum time?
P.S. I'm looking for something in the order of ms. Currently, all queries with more than 3 hops take a few seconds to complete.
I gather you are trying to speed up your original query involving variable-length paths. The shortestpath function is not appropriate for your query, as it literally tries to find a shortest path -- not all paths up to a certain length.
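The difference between "a shortest path" and "all paths up to a certain length" can be illustrated with a small Python sketch. The three-relationship toy graph and the function names are assumptions made for illustration. The enumeration never reuses a relationship within one path, which mirrors how a Cypher pattern treats relationships.

```python
from collections import deque

# Toy undirected graph: each Route relationship is an (id, node_a, node_b) triple.
ROUTES = [(0, "Node1", "Node2"), (1, "Node1", "X"), (2, "X", "Node2")]


def neighbours(node):
    """Yield (relationship_id, other_node) for every relationship touching node."""
    for rel_id, a, b in ROUTES:
        if a == node:
            yield rel_id, b
        elif b == node:
            yield rel_id, a


def all_paths(start, end, max_hops):
    """Enumerate every path of 1..max_hops relationships from start to end."""
    paths = []

    def walk(node, used, trail):
        if node == end and len(trail) > 1:
            paths.append(list(trail))
        if len(trail) - 1 == max_hops:
            return
        for rel_id, other in neighbours(node):
            if rel_id not in used:  # never reuse a relationship in one path
                walk(other, used | {rel_id}, trail + [other])

    walk(start, frozenset(), [start])
    return paths


def shortest_path(start, end):
    """Breadth-first search: stops at the first (fewest-hops) path found."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        trail = queue.popleft()
        if trail[-1] == end:
            return trail
        for _, other in neighbours(trail[-1]):
            if other not in seen:
                seen.add(other)
                queue.append(trail + [other])
    return None
```

On this toy graph, all_paths("Node1", "Node2", 5) finds both the direct route and the one through X, while shortest_path returns only the direct route. The exhaustive enumeration is also what makes the variable-length query expensive as the hop limit grows.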
The execution plan for your original query (using sample data) looks like this:
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| Operator | Estimated Rows | Rows | DB Hits | Identifiers | Other |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
| +ProduceResults | 0 | 1 | 0 | p | p |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Projection | 0 | 1 | 0 | anon[30], b, n, p | ProjectedPath(Set(anon[30], n),) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 2 | anon[30], b, n | n.name == { AUTOSTRING0} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +VarLengthExpand(All) | 0 | 2 | 7 | anon[30], b, n | (b)<-[:Route*]-(n) |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +Filter | 0 | 1 | 3 | b | b.name == { AUTOSTRING1} |
| | +----------------+------+---------+-------------------+---------------------------------------------+
| +AllNodesScan | 3 | 3 | 4 | b | |
+-----------------------+----------------+------+---------+-------------------+---------------------------------------------+
So, your original query is scanning through every node to find the node(s) that match the b pattern. Then, it expands all variable-length paths starting at b. And then it filters those paths to find the one(s) that end with a node that matches the pattern for n.
Here are a few suggestions that should speed up your query, although you'll have to test it on your data to see by how much:
Give each node a label. For example, Foo.
Create an index that can speed up the search for your end nodes. For example:
CREATE INDEX ON :Foo(name);
Modify your query to force the use of the index on both end nodes. For example:
MATCH p =(n:Foo { name:"Node1" })-[:Route*1..5]-(b:Foo { name:"Node2" })
USING INDEX n:Foo(name)
USING INDEX b:Foo(name)
RETURN p;
After the above changes, the execution plan is:
+-----------------+------+---------+-----------------------------+-----------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+-----------------+------+---------+-----------------------------+-----------------------------+
| +ColumnFilter | 1 | 0 | p | keep columns p |
| | +------+---------+-----------------------------+-----------------------------+
| +ExtractPath | 1 | 0 | anon[33], anon[34], b, n, p | |
| | +------+---------+-----------------------------+-----------------------------+
| +PatternMatcher | 1 | 3 | anon[33], anon[34], b, n | |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | b, n | { AUTOSTRING1}; :Foo(name) |
| | +------+---------+-----------------------------+-----------------------------+
| +SchemaIndex | 1 | 2 | n | { AUTOSTRING0}; :Foo(name) |
+-----------------+------+---------+-----------------------------+-----------------------------+
This query plan uses the index to directly get the b and n nodes -- without scanning. This, by itself, should provide a speed improvement. And then this plan uses the "PatternMatcher" to find the variable-length paths between those end nodes. You will have to try this query out to see how efficient the "PatternMatcher" is in doing that.
From your description, I assume that you want a shortest path based on some weight, such as a duration property on the :Route relationships.
If that is true, shortestPath in Cypher is not helpful, since it only takes the number of hops into account. Weighted shortest paths are not yet available in Cypher in an efficient way.
The Java API supports weighted shortest paths via Dijkstra or A* through the GraphAlgoFactory class. For the simple case where your cost function is just the value of a relationship property (as mentioned above), you can also use an existing REST endpoint.
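For readers unfamiliar with weighted shortest paths, the following is a minimal, standalone Python sketch of Dijkstra's algorithm. It is not the Neo4j Java API. The graph, the duration values, and the function name are all hypothetical, chosen so that the cheapest path differs from the path with the fewest hops.

```python
import heapq

# Hypothetical undirected graph: node -> list of (neighbour, duration).
GRAPH = {
    "Node1": [("Node2", 10), ("X", 2)],
    "X": [("Node1", 2), ("Node2", 3)],
    "Node2": [("Node1", 10), ("X", 3)],
}


def dijkstra(graph, start, end):
    """Return (total_cost, path) for the cheapest path, or (None, []) if unreachable."""
    queue = [(0, start, [start])]  # priority queue ordered by accumulated cost
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == end:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbour, weight in graph.get(node, []):
            if neighbour not in settled:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return None, []
```

Here the direct Node1 to Node2 relationship costs 10, while the two-hop route through X costs 5, so a hop-counting shortestPath and a weighted search would return different paths.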

Profiling neo4j query: filter to db hits

I'm curious how filters work in neo4j queries. They result in db hits (according to PROFILE), and it seems that they shouldn't.
An example query:
PROFILE MATCH (a:act)<-[r:relationship]-(n)
WHERE a.chapter='13' and a.year='2009'
RETURN r, n
NodeIndexSeek: (I created an index on the chapter property for the label act) returns 6 rows.
Filter: a.year == {AUTOSTRING1} which results in 12 db hits.
Why does it need to do any db hits if it's already fetched the 6 matching instances of a in earlier db reads? Shouldn't it just filter them down without going back to do more db reads?
I realise I'm equating 'db hits' with 'db reads' here, which may not be accurate. If not, what exactly are 'db hits'?
Lastly, the number of db hits incurred by a filter appears to approximately match:
<number of filtering elements> * 2 * <number of already queried nodes to filter on>
where 'number of filtering elements' is the number of filters provided, i.e.
WHERE a.year='2009' and a.property_x='thing'
is two elements.
Thanks for any help.
EDIT:
Here are the results of PROFILE and EXPLAIN on the query.
This is just an example query. I've found the behaviour of
filter db hits = <number of filtering elements> * 2 * <number of already queried nodes to filter on>
to be generally true in queries I've run.
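As a quick check of that rule of thumb against the figures quoted earlier (one predicate on a.year, 6 rows from the index seek, 12 db hits on the Filter), the arithmetic can be sketched in Python. The function name is hypothetical, and the factor of 2 is only the observation made in this question, not documented Neo4j behaviour.

```python
def estimated_filter_db_hits(predicates, candidate_rows):
    """Observed rule of thumb: predicates * 2 * rows reaching the Filter operator."""
    return predicates * 2 * candidate_rows


# One predicate (a.year) applied to the 6 rows returned by NodeIndexSeek.
hits = estimated_filter_db_hits(predicates=1, candidate_rows=6)
```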
PROFILE MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
8 rows
55 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+------+--------+-------------+---------------------------+
| Operator | EstimatedRows | Rows | DbHits | Identifiers | Other |
+---------------+---------------+------+--------+-------------+---------------------------+
| Projection | 1 | 8 | 0 | a, n, r | r; n |
| Expand(All) | 1 | 8 | 9 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | 1 | 12 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | 6 | 7 | a | :act(chapter) |
+---------------+---------------+------+--------+-------------+---------------------------+
Total database accesses: 28
EXPLAIN MATCH (a:act)<-[r:CHILD_OF]-(n)
WHERE a.chapter='13' AND a.year='2009'
RETURN r, n
4 ms
Compiler CYPHER 2.2
Planner COST
Projection
|
+Expand(All)
|
+Filter
|
+NodeIndexSeek
+---------------+---------------+-------------+---------------------------+
| Operator | EstimatedRows | Identifiers | Other |
+---------------+---------------+-------------+---------------------------+
| Projection | 1 | a, n, r | r; n |
| Expand(All) | 1 | a, n, r | (a)<-[r:CHILD_OF]-(n) |
| Filter | 0 | a | a.year == { AUTOSTRING1} |
| NodeIndexSeek | 1 | a | :act(chapter) |
+---------------+---------------+-------------+---------------------------+
Total database accesses: ?
Because reading a node (record) and reading its property (records) are not the same db operation.
You are right that the filter hits should be at most 6, though.
Usually Neo4j pulls filters and predicates to the earliest possible moment, so it should filter directly after the index lookup.
In some situations, though (depending on the predicate), it can only filter after finding the paths; the number of db hits might then equal the number of checked paths.
Which Neo4j version are you using? Can you share your full query plan?

Cypher / Should I use the WITH clause to pass values to next MATCH?

Using Neo4j 2.1.X, consider this query, which returns the friends of user 123 who bought a Car:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
In this article, it is written regarding the WITH clause:
So, how does it work? Well, with is basically just a stream, as lazy
as it can be (as lazy as return can be), passing results on to the
next query.
So it seems I should transform the query like this:
MATCH (u1:User("123"))-[:KNOWS]-(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
Should I? Or does the current version of Cypher already handle MATCH chaining while passing values through them?
The more precise the starting point you give at the beginning of your query, the more efficient it will be.
Your first MATCH is not very precise; indeed, it will use the traversal matcher to match all possible relationships.
Take the following Neo4j console example: http://console.neo4j.org/r/jsx71g
Your first query would look like this in the example:
MATCH (n:User { login: 'nash99' })-[:KNOWS]->(friend)
RETURN count(*)
You can see the number of db hits up front:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+Filter
|
+TraversalMatcher
+------------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+------------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| Filter | 8 | 320 | | Property(n,login(2)) == { AUTOSTRING0} |
| TraversalMatcher | 160 | 201 | | friend, UNNAMED32, friend |
+------------------+------+--------+-------------+-----------------------------------------+
Total database accesses: 521
If you use a more precise starting point, the rest of the query is cheap once you start from that point. Look at the execution plan of such a query and compare the db hits:
Execution Plan
ColumnFilter
|
+EagerAggregation
|
+SimplePatternMatcher
|
+Filter
|
+NodeByLabel
+----------------------+------+--------+------------------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+------+--------+------------------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns count(*) |
| EagerAggregation | 1 | 0 | | |
| SimplePatternMatcher | 8 | 0 | n, friend, UNNAMED51 | |
| Filter | 1 | 40 | | Property(n,login(2)) == { AUTOSTRING0} |
| NodeByLabel | 20 | 21 | n, n | :User |
+----------------------+------+--------+------------------------+-----------------------------------------+
Total database accesses: 61
So, to complete your query, I would do something like this:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
RETURN friend
You can also specify that the friend cannot be the same as the user:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)-[:BUYS]->(c:Car)
WHERE NOT friend.id = n.id
RETURN friend
Note that there is no difference in db hits between the above query and the following one:
MATCH (n:User { login: 'nash99' })
WITH n
MATCH (n)-[:KNOWS]->(friend)
WITH friend
MATCH (friend)-[:BUYS]->(c:Car)
RETURN friend
I recommend that you use the Neo4j console to look at the result details, which show the information above.
If you need to quickly prototype a graph for testing, you can use Graphgen, export the graph as Cypher statements and load these statements into the Neo4j console.
Here is the link to the Graphgen generation I used for the console: http://graphgen.neoxygen.io/?graph=29l9XJ0HxJ2pyQ
Chris

neo4j benchmark, multiple queries, measure time

Is there any way that I can perform my benchmarks for multiple queries in neo4j?
Assuming that I have loaded my graph, I want to initiate 10000 distinct shortest path queries in the database, without loading the data to a client. Is there a way that I can do this in a batch and get the execution times?
Try using the profile keyword inside the neo4j-shell. It gives you some basic facts about how a query executes and how quickly.
Here's a simple example:
neo4j-sh (?)$ CREATE (a {label:"foo"})-[:bar]->(b {label: "bar"})-[:bar]->(c {label: "baz"});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 3
Relationships created: 2
Properties set: 3
1180 ms
neo4j-sh (?)$ profile match (a {label: "foo"}), (c {label: "baz"}), p=shortestPath(a-[*]-c) return p;
+--------------------------------------------------------------------------------------+
| p |
+--------------------------------------------------------------------------------------+
| [Node[0]{label:"foo"},:bar[0]{},Node[1]{label:"bar"},:bar[1]{},Node[2]{label:"baz"}] |
+--------------------------------------------------------------------------------------+
1 row
ColumnFilter
|
+ShortestPath
|
+Filter(0)
|
+AllNodes(0)
|
+Filter(1)
|
+AllNodes(1)
+--------------+------+--------+-------------+-----------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+--------------+------+--------+-------------+-----------------------------------------+
| ColumnFilter | 1 | 0 | | keep columns p |
| ShortestPath | 1 | 0 | p | |
| Filter(0) | 1 | 6 | | Property(c,label(0)) == { AUTOSTRING1} |
| AllNodes(0) | 3 | 4 | c, c | |
| Filter(1) | 1 | 6 | | Property(a,label(0)) == { AUTOSTRING0} |
| AllNodes(1) | 3 | 4 | a, a | |
+--------------+------+--------+-------------+-----------------------------------------+
Another answer indicates that lower DbHits values usually mean better performance, since db hits are expensive.
The WebAdmin tools (usually at http://localhost:7474/webadmin/ for a local neo4j installation) have Data browser and Console tabs that allow you to enter your query, see the results, and also see the actual time it took to perform the query.
Interestingly, from my limited testing of the Data browser and Console tabs, the latter seems to report faster query times for the same queries. So, the Console probably has less overhead, possibly making its timing results a bit more accurate.
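For timing many queries in one batch, a small client-side harness is one option. The sketch below is a minimal Python illustration under stated assumptions: run_query is a hypothetical callable standing in for whatever actually sends a query to the database, and fake_run_query exists only so the sketch runs without Neo4j. Because it measures wall-clock time on the client, any transport overhead is included in the numbers.

```python
import time


def benchmark(run_query, queries):
    """Run each query once and return a list of (query, elapsed_seconds) pairs."""
    timings = []
    for query in queries:
        started = time.perf_counter()
        run_query(query)  # the result is discarded; only the elapsed time is kept
        timings.append((query, time.perf_counter() - started))
    return timings


def fake_run_query(query):
    """Stand-in for a real database call (hypothetical)."""
    return len(query)


# Build distinct shortest-path queries from (start, end) label pairs.
TEMPLATE = (
    "MATCH (a {{label: '{0}'}}), (c {{label: '{1}'}}), "
    "p = shortestPath((a)-[*]-(c)) RETURN p"
)
pairs = [("foo", "baz"), ("foo", "bar"), ("bar", "baz")]
queries = [TEMPLATE.format(start, end) for start, end in pairs]
results = benchmark(fake_run_query, queries)
```

Replacing fake_run_query with a function that really executes the statement would give per-query timings that can then be sorted or averaged.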

Neo4j cypher query to return relationship property and sum of all matching relationship properties

I'm trying to return a relationship property (called proportion) plus the sum of that property for all relationships matched by a Cypher query in Neo4j. I've gotten this far:
START alice=node(3)
MATCH p=(alice)<-[r:SUPPORTED_BY]-(n)
RETURN reduce(total=0, rel in relationships(p): total + rel.proportion), sum(r.proportion) AS total;
This returns:
+-----------------+
| reduced | total |
+-----------------+
| 2 | 2 |
| 1 | 1 |
+-----------------+
where I was expecting:
+-----------------+
| reduced | total |
+-----------------+
| 2 | 3 |
| 1 | 3 |
+-----------------+
As a beginner user of Cypher, I'm not really sure how to approach this query; I'm clearly not using reduce correctly. Any advice would be appreciated.
You need to use WITH to split up the query into two parts:
* find the sum of all proportions, and pass that as a bound name to the next part
* find the individual proportions
START alice=node(3)
MATCH alice<-[r:SUPPORTED_BY]-()
WITH alice, sum(r.proportion) AS total
MATCH alice<-[r:SUPPORTED_BY]-(other)
RETURN other.name, r.proportion, total
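The two-part structure (aggregate first, then attach the aggregate to every individual row) can be mirrored in a few lines of Python. The proportions 2 and 1 come from the expected output in the question. The supporter names and the function name are hypothetical.

```python
# (supporter name, proportion) for each SUPPORTED_BY relationship pointing at alice.
rows = [("Bob", 2), ("Carol", 1)]


def with_total(rows):
    """First pass: sum all proportions. Second pass: emit each row with that sum."""
    total = sum(proportion for _, proportion in rows)
    return [(name, proportion, total) for name, proportion in rows]


result = with_total(rows)
```

Each output row carries its own proportion next to the overall total of 3, which is the shape of the table the question was expecting.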

Resources