Neo4j Cypher execution plan when query has WHERE and WITH clause - neo4j

I have a Neo4j graph database that stores the Staffing Relations and Nodes. I have to write a cypher that will find the home and office address
of a resource (or employee) along with their empId and name.
This is needed so that Staffing Solution can staff resources according to their home location as well as near to their office.
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WHERE employee.id = '70'
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
This cypher returns the desired results.
However, if I move the WHERE condition in the last, just before the RETURN clause.
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
It again gives me the same result.
So which one is more optimized as the query execution plan is same in both the cases?. I mean same number of DB hits and returned Records.
Now, if I remove the WITH clause,
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]-
(homeAddress:HomeAddress),
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
Then again the results is same, execution plan is also same.
Do I really need WITH in this case?
Any help would be greatly appreciated.

First, you can use Profile and Explain to get the performance of your query. Though, as long as you get the results you want in the time you want, the cypher doesn't matter too much, as the behavior will change depending on the Cypher Planner (version) running in the db. So as long as the cypher passes unit and load tests, the rest doesn't matter (assuming reasonably accurate tests).
Second, In general, less is more. Imagine you had to read your own cypher, and look up the info yourself on paper printouts. Isn't MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee {id:'70'})<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress) so much easier to tell what exactly you are looking for? The easier it is for the Cypher planner to read what you want, the more likely the Cypher planner will plan the most efficient lookup strategy. Also, keeping your WHERE clause close to the relevant match also helps the planner. So try to keep your cyphers as simple as possible, while still being accurate for what you want.
In your Cypher, the only part that really matters is the WITH. WITH creates a logical break in the cypher, and a scope change for variables, As you aren't doing anything with the with, it's better to drop it. The only side effect it can produce in this case, is tricking the Cypher to do more work than necessary for the first match, to filter it down later. If an Employee is expected to have more than 1 home address, than WITH employee, COLLECT(homeAddress) as homeAdress will reduce that match to 1 row per employee, making the next match cheaper, but since I'm sure both sides of the match should only yield 1 result, it doesn't matter what the planner does first. (In general, you use with to aggregate results down to less rows, to make the rest of the cypher cheaper. Which shouldn't apply in this context)

You should always put a WHERE clause as early as possible in a query. That will filter out data that the rest of the query will not have to deal with, avoiding possible unneeded work.
You should avoid writing a WITH clause that is just passing forward all the defined variables (and is not required syntactically), since it is essentially a no-op. It wastes (a little bit of) time for the planner to process, and makes the Cypher code a bit harder to understand.
This simpler version of your query should produce the same query plan:
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
And the following version (using the map projection syntax) is even simpler (with a similar query plan).
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee{.empId, .name},
homeAddress{.street, .area, .city},
officeAddress{.street, .area, .city}
The results of the above query have a different structure, though:
╒═══════════════════════════╤══════════════════════════════════════╤══════════════════════════════════════╕
│"employee" │"homeAddress" │"officeAddress" │
╞═══════════════════════════╪══════════════════════════════════════╪══════════════════════════════════════╡
│{"name":"sam","empId":"70"}│{"area":1,"city":"foo","street":"123"}│{"area":2,"city":"bar","street":"345"}│
└───────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘

Related

Adding a property filter to cypher query explodes memory, why?

I'm trying to write a query that explores a DAG-type graph (a bill of materials) for all construction paths leading down to a specific part number (second MATCH), among all the parts associated with a given product (first MATCH). There is a strange behavior I don't understand:
This query runs in a reasonable time using Neo4j community edition (~2 s):
WITH '12345' as snid, 'ABCDE' as pid
MATCH (m:Product {full_sn:snid})-[:uses]->(p:Part)
WITH snid, pid, collect(p) AS mparts
MATCH path=(anc:Part)-[:has*]->(child:Part)
WHERE ALL(node IN nodes(path) WHERE node IN mparts)
WITH snid, path, relationships(path)[-1] AS rel,
nodes(path)[-2] AS parent, nodes(path)[-1] AS child
RETURN stuff I want
However, to get the query I want, I must add a filter on the child using the part number pid in the second MATCH statement:
MATCH path=(anc:Part)-[:has*]->(child:Part {pn:pid})
And when I try to run the new query, neo4j browser compains that there is not enough memory. (Neo.TransientError.General.OutOfMemoryError). When I run it with EXPLAIN, the db hits are exploding into the 10s of billions, as if I'm asking it for a massive cartestian product: but all I have done is added a restriction on the child, so this should be reducing the search space, shouldn't it?
I also tried adding an index on :Part(pn). Now the profile shown by EXPLAIN looks very efficient, but I still have the same memory error.
If anyone can help me understand why this change between the two queries is causing problems, I'd greatly appreciate it!
Best wishes,
Ben
MATCH path=(anc:Part)-[:has*]->(child:Part)
The * is exploding to every downstream child node.
That's appropriate if that is what's desired. If you make this an optional match and limit to the collect items, this should restrict the return results.
OPTIONAL MATCH path=(anc:Part)-[:has*]->(child:Part)
This is conceptionally (& crudely) similar to an inner join in SQL.

Complexity of a neo4j query

I need to measure the performance of any query.
for example :
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried profile & explain clauses but they only return number of db hits. Is it possible to get big notations for a neo4j query ie cn we measure performance in terms of big O notations ? Are there any other ways to check query performance apart from using profile & explain ?
No, you cannot convert a Cypher to Big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher into an executable query (using heuristics about what info it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a Cypher.)
A very simple example of this is the Cypher Cypher 3.1 MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. Using a fairly average connected graph with cycles, running against Neo4j 3.1.1 will time out for being too complex (Because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (Because the Planner recognizes it only needs to do a graph scan like depth first search to find all connected nodes).
Side note, you can argue for BIG O notation on the return results. For example MATCH (a), (b) must have a minimum complexity of n^2 because the result is a Cartesian product, and execution can't be less complex then the answer. This understanding of how complexity affects row counts can help you write Cyphers that reduce the amount of work the Planner ends up planning.
For example, using WITH COLLECT(n) as data MATCH (c:M) to reduce the number of rows the Planner ends up doing work against before the next part of a Cypher from nm (first match count times second match count) to m (1 times second match count).
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cyphers that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate if the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this is sort is readily available. But it can be derived/approximated with some additional effort.
On profiling a query, we get a list of functions that neo4j will run to achieve the desired result.
Each of this function will be associated with the worst to best case complexities in theory. And some of them will run in parallel too. This will impact runtimes, depending on the cores that your server has.
For example match (a:A) match (a:B) results in Cartesian product. And this will be of O(count(a)*count(b))
Similarly each function of in your query-plan does have such time complexities.
So aggregations of this individual time complexities of these functions will give you an overall approximation of time-complexity of the query.
But this will change from time to time with each version of neo4j since they community can always change the implantation of a query or to achieve better runtimes / structural changes / parallelization/ less usage of ram.
If what you are looking for is an indication of the optimization of neo4j query db-hits is a good indicator.

Neo4j- incorrect count in multiple match query

When I am trying to execute this query
match(u:User)-[ro:OWNS]->(p:PushDevice) where p.type='gcm'
match(com:Comment)
return count(com) as total_comments,count(ro) as device
this is returning the same number in both total_comments and device which is the number of total comment.
I feel like your query should work, though I'm more confident that this will work:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice) WHERE p.type='gcm'
WITH count(ro) AS device
MATCH (com:Comment)
RETURN count(com) as total_comments, device
Your query is generating a row for every combination of your MATCH results. If you just returned the ro and com values, this would be more clear. See this console for an example. That console has 2 comments and a single OWNS relationship, but the result shows 2 rows (both rows have the same OWNS relationship). So, your query is essentially counting the number of rows -- not what you expected.
Here is an example of a query that would work as you you expected:
MATCH (u:User)-[ro:OWNS]->(p:PushDevice {type:'gcm'})
WITH COUNT(ro) AS device
MATCH (com:Comment)
RETURN count(com) AS total_comments, device;
[EDITED]
This would also work logically, but is less performant (as it takes a cartesian product and then filters out duplicates):
MATCH (u:User)-[ro:OWNS]->(p:PushDevice { type: 'gcm' })
MATCH (com:Comment)
RETURN COUNT(DISTINCT com), COUNT(DISTINCT ro);
Observation
The power of neo4j comes from its efficient handling of relationships. So, the most efficient queries tend to be for connected subgraphs (where all nodes are connected by relationships).
Since your query is not for a single connected subgraph, getting the answer you want is naturally going to be a bit more convoluted and can be inefficient.
If you determine that the suggested queries are too slow, you can try making 2 separate queries instead. That may also make make your code easier to understand.

Is it the optimal way of expressing "go through all nodes" queries in Cypher?

I have a quite large social graph in which I execute global queries like this one:
match (n:User)-[r:LIKES]->(k:User)
where not (k:User)-[]->(n:User)
return count(r);
They take a lot of time and memory, so I am curious if they are expressed in optimal way. I have felling that when I execute such query Cypher is firstly matching everything that fits the expression (and that takes a lot of memory) and then starts to count things. I would rather like to go through every node, check the pattern and update the counter if necessary. This way such queries would not require a lot of memory. So how in fact such query is executed? If it is not optimal, is there a way to make it better (in Cypher)?
If you used the query just as you wrote it, you may not be getting what you think you are. Putting labels on node "variables" can cause them to be treated as fresh (partial) patterns instead of bound nodes. Is your query any faster if you use
MATCH (n:User)-[r:LIKES]->(k:User)
WHERE NOT (n)<--(k)
RETURN count(r)
Here's how this works (not considering internal optimizations, which I don't begin to understand).
For each User node, every outgoing LIKES relationship is followed. If the other end of the LIKES relationship is a User node, the two nodes and the relationship are bound to the names n, k, and r and passed to the WHERE clause. Every outgoing relationship on the bound k node is then tested to see if it connects to the bound n node. If no such relationship is found, the match is considered successful. The count() function in the RETURN clause counts the resulting collection of relationships that were passed from the match.
If you have a densely connected graph, and particularly if there are many other relationships between nodes other than LIKES relationship, this can be quite an extensive search.
As a further experiment, you might try changing the WHERE clause to read
WHERE NOT (k)-->(n)
and see if it makes any difference. I don't think it will, but I could be wrong.

efficiency of where clause in cypher vs match

I'm trying to find 10 posts that were not LIKED by user "mike" using cypher. Will putting a where clause with a NOT relationship be efficient than matching with an optional relationship then checking if that relationship is null in the where clause? Specifically I want to make sure it won't do the equivalent of a full table scan and make sure that this is a scalable query.
Here's what I'm using
START user=node:node_auto_index(uname:"mike"),
posts=node:node_auto_index("postId:*")
WHERE not (user-[:LIKES]->posts)
RETURN posts SKIP 20 LIMIT 10;
Or can I do something where I filter on a MATCH optional relationship
START user=node:node_auto_index(uname="mike"),
posts=node:node_auto_index("postId:*")
MATCH user-[r?:LIKES]->posts
WHERE r IS NULL
RETURN posts SKIP 100 LIMIT 10;
Some quick tests on the console seem to show faster performance in the 2nd approach. Am I right to assume the 2nd query is faster? And, if so why?
i think in the first query the engine runs through all postID nodes and manually checks the condition of not (user-[:LIKES]->posts) for each post ID
whereas in the second example (assuming you use at least v1.9.02) the engine picks up only the post nodes, which actually aren't connected to the user. this is just optimalization where the engine does not go through all postIDs nodes.
if possible, always use the MATCH clause in your queries instead of WHERE, and try to omit the asterix in the declaration START n=node:index('name:*')

Resources