Is there a way to show cypher execution plan? - neo4j

I've seen a topic (Understanding Neo4j Cypher Profile keyword and execution plan) where the PROFILE keyword is mentioned.
I couldn't use it in Neo4j 2.0.0RC1 Community.
Peter wrote it's not fully implemented.
Will it ever be supported?
I mean, it would be interesting to watch the plan change as we tune the query...

The neo4j shell is still available, and you can run the profile command there. You can reach it either:
by connecting to the running server with bin/neo4j-shell, or
by switching to the old web UI: open the "(i)" Info menu on the left side and select the bottommost link, "webadmin" -> http://localhost:7474/webadmin
Profiling information will be added to the browser later on, once it is easier to read and understand.

As of Neo4j 2.2 there are additional profiling facilities available. Some features that were only available via neo4j-shell or the REST endpoints are now available also in the Neo4j-browser, and some features are new altogether.
You can now use the PROFILE command with your cypher query directly in the Neo4j-browser repl to execute the query and see a visualization of the execution plan.
PROFILE
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n)
-------------
n.bar ID(n)
Mary 951
Additionally, you can now inspect a query plan without having to actually execute it, for instance to verify a query that would alter the database. Do so with the EXPLAIN command prepended to the query. See 15.2 How do I profile a query? from the documentation.
EXPLAIN
MATCH (n:Peter {foo: 'Paul'})
SET n.foo = 'Mary', n.bar = 'Paul'
RETURN n.foo, ID(n)
------------------------------------------
// Nothing returned, query is not executed
A related new feature is also the new 'cost based' query planner, as well as the ability to force the use of either the 'cost based' or the 'rule based' query planner for all queries or for any particular query. The documentation notes that not all queries may be solvable by the 'cost based' query planner, in which case the setting will be ignored and the 'rule based' planner used. See 15.1 How are queries executed?
To force the use of either query planner for all queries, set the query.planner.version configuration setting in conf/neo4j.properties (Neo4j server) or by calling the .setConfig() method on your GraphDatabaseService object (Neo4j embedded). Set it to COST or RULE, and to give the decision on which query planner to use back to Neo4j, set it to default (or remove the setting altogether). See 24.5 Configuration Settings, Starting an embedded database with configuration settings.
To force the use of either query planner for a particular query, prepend your query with CYPHER planner=cost or CYPHER planner=rule. See 15.1 How are queries executed?
CYPHER planner=cost
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n)
You can PROFILE or EXPLAIN queries with either of the query planners and see any differences in how they implement your queries.
For help interpreting the execution plan, see the relevant chapter of the documentation, 16. Execution Plans.
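As a rough sketch of combining the two features, the query from above can be profiled under each planner in turn and the resulting plans compared. This assumes the CYPHER option prefix is accepted in front of the PROFILE keyword; if your version rejects the statement, check its documentation for the expected order.

```cypher
// Run each statement separately in the browser and compare the plans.

// 1) Same query, planned by the rule-based planner
CYPHER planner=rule
PROFILE
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n);

// 2) Same query, planned by the cost-based planner
CYPHER planner=cost
PROFILE
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n);
```

Replacing PROFILE with EXPLAIN shows the plans without executing the query.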

Related

Neo4j Cypher query optimization for better performance

I'm unable to test this on a big data set right now, but I'm wondering which Cypher query in Neo4j will run faster, in order to design the system properly:
Approach #1:
WHERE apoc.coll.containsAllSorted($profileDetailedCriterionIds, childD.mandatoryCriterionIds)
Approach #2:
(p:Profile {id: $profileId})-[:HAS_VOTE_ON]-(c:Criterion)<-[:HAS_VOTE_ON]-(childD)
WHERE c.id IN childD.mandatoryCriterionIds
WITH childD, COLLECT(c.id) AS cIds
WHERE size(cIds) >= size(childD.mandatoryCriterionIds)
where $profileDetailedCriterionIds is a set of ids provided via a query parameter.
Which approach should I choose for better performance?
Run both queries in the Neo4j Browser, with the keyword PROFILE at the start of each query. When a query finishes, the browser displays its execution plan, showing how the query was executed. Then go to the last tab on the left, find the part of the plan where the APOC function is used, and compare its db hits and page cache figures with those of the query that does not use the APOC function.
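A minimal sketch of what that looks like for Approach #1. The question only shows the WHERE clause, so the MATCH pattern and the :Decision label here are hypothetical placeholders; substitute your real pattern.

```cypher
// Hypothetical pattern: (childD:Decision) is an assumption, not from the question.
// $profileDetailedCriterionIds must be set as a parameter before running.
PROFILE
MATCH (childD:Decision)
WHERE apoc.coll.containsAllSorted($profileDetailedCriterionIds, childD.mandatoryCriterionIds)
RETURN count(childD)
```

Profile Approach #2 the same way, against the same data, and compare the total db hits of the two plans.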

Neo4j Cypher execution plan when query has WHERE and WITH clause

I have a Neo4j graph database that stores staffing relations and nodes. I have to write a Cypher query that finds the home and office addresses of a resource (or employee), along with their empId and name.
This is needed so that the staffing solution can staff resources according to their home location as well as proximity to their office.
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WHERE employee.id = '70'
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
This query returns the desired results.
However, if I move the WHERE condition to the end, just before the RETURN clause:
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
WITH employee, homeAddress
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
It again gives me the same result.
So which one is more optimized, given that the query execution plan is the same in both cases? I mean the same number of DB hits and returned records.
Now, if I remove the WITH clause,
MATCH (employee:Employee) <-[:ADDRESS_TO_EMPLOYEE]- (homeAddress:HomeAddress)
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
WHERE employee.id = '70'
RETURN employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
Then again the result is the same, and the execution plan is also the same.
Do I really need WITH in this case?
Any help would be greatly appreciated.
First, you can use PROFILE and EXPLAIN to inspect the performance of your query. That said, as long as you get the results you want in the time you want, the exact Cypher doesn't matter too much, because the behavior will change depending on the Cypher planner (version) running in the database. So as long as the query passes unit and load tests, the rest doesn't matter (assuming reasonably accurate tests).
Second, in general, less is more. Imagine you had to read your own Cypher and look up the info yourself on paper printouts. Isn't MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee {id:'70'})<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress) so much easier to read for what exactly you are looking for? The easier it is for the Cypher planner to read what you want, the more likely it is to plan the most efficient lookup strategy. Keeping your WHERE clause close to the relevant MATCH also helps the planner. So try to keep your queries as simple as possible, while still being accurate for what you want.
In your Cypher, the only part that really matters is the WITH. WITH creates a logical break in the query and a scope change for variables. As you aren't doing anything with the WITH, it's better to drop it. The only side effect it can produce in this case is tricking the planner into doing more work than necessary for the first match, only to filter it down later. If an Employee is expected to have more than one home address, then WITH employee, COLLECT(homeAddress) AS homeAddresses will reduce that match to one row per employee, making the next match cheaper. But since I'm sure both sides of the match should only yield one result, it doesn't matter what the planner does first. (In general, you use WITH to aggregate results down to fewer rows, to make the rest of the query cheaper, which shouldn't apply in this context.)
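For the case where an employee can have several home addresses, the aggregating WITH might look like the sketch below. It reuses the labels and relationship types from the question; the homeAddresses alias and the list comprehension in the RETURN are illustrative choices, not something the question requires.

```cypher
MATCH (employee:Employee {id: '70'})<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
// Aggregate here: one row per employee, carrying a list of home addresses
WITH employee, collect(homeAddress) AS homeAddresses
MATCH (employee)-[:EMPLOYEE_TO_OFFICEADDRESS]->(officeAddress:OfficeAddress)
RETURN employee.empId, employee.name,
       [a IN homeAddresses | a {.street, .area, .city}] AS homeAddresses,
       officeAddress.street, officeAddress.area, officeAddress.city
```

Here the WITH does real work (it changes the number of rows), unlike a WITH that only passes the variables through.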
You should always put a WHERE clause as early as possible in a query. That will filter out data that the rest of the query will not have to deal with, avoiding possible unneeded work.
You should avoid writing a WITH clause that is just passing forward all the defined variables (and is not required syntactically), since it is essentially a no-op. It wastes (a little bit of) time for the planner to process, and makes the Cypher code a bit harder to understand.
This simpler version of your query should produce the same query plan:
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee.empId, employee.name,
homeAddress.street, homeAddress.area, homeAddress.city,
officeAddress.street, officeAddress.area, officeAddress.city
And the following version (using the map projection syntax) is even simpler (with a similar query plan).
MATCH (officeAddress:OfficeAddress)<-[:EMPLOYEE_TO_OFFICEADDRESS]-(employee:Employee)<-[:ADDRESS_TO_EMPLOYEE]-(homeAddress:HomeAddress)
WHERE employee.id = '70'
RETURN
employee{.empId, .name},
homeAddress{.street, .area, .city},
officeAddress{.street, .area, .city}
The results of the above query have a different structure, though:
╒═══════════════════════════╤══════════════════════════════════════╤══════════════════════════════════════╕
│"employee" │"homeAddress" │"officeAddress" │
╞═══════════════════════════╪══════════════════════════════════════╪══════════════════════════════════════╡
│{"name":"sam","empId":"70"}│{"area":1,"city":"foo","street":"123"}│{"area":2,"city":"bar","street":"345"}│
└───────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘

Does neo4j MATCH then filter or filter then MATCH

When I run the query
MATCH paths=(l:Left)-[:CONNECTED_TO*..5]->(r:Right)
WHERE (l.id IN $left_ids) AND (r.id IN $right_ids)
RETURN paths
i.e., give me all paths from Left to Right, as long as the Left node's id is in left_ids and the Right node's id is in right_ids.
Should I expect neo4j to
perform the cartesian product of Left and Right, and then filter the results - or
does it only search for paths once it has worked out which nodes are allowed?
Also, is there any obvious way for me to find this out for myself, i.e. is there a query planner, or some good meaty documentation that I've missed?
Take a look in the Profiling chapter of Neo4j docs:
EXPLAIN
If you want to see the execution plan but not run the statement,
prepend your Cypher statement with EXPLAIN. The statement will always
return an empty result and make no changes to the database.
PROFILE
If you want to run the statement and see which operators are doing
most of the work, use PROFILE. This will run your statement and keep
track of how many rows pass through each operator, and how much each
operator needs to interact with the storage layer to retrieve the
necessary data. Please note that profiling your query uses more
resources, so you should not profile unless you are actively working
on a query.
So you can prepend your queries with PROFILE or EXPLAIN and see the execution plan generated by Neo4j. This way:
PROFILE MATCH paths=(l:Left)-[:CONNECTED_TO*..5]->(r:Right)
WHERE (l.id IN $left_ids) AND (r.id IN $right_ids)
RETURN paths

Can "DISTINCT" in a CYPHER query be responsible of a memory error when the query returns no result?

Working on a pretty small graph of 5000 nodes with low density (mean connectivity < 5), I get the following error, which I never got before upgrading to Neo4j 3.3.0. The graph contains 900 molecules and their scaffold hierarchy, down to 5 levels.
(:Molecule)<-[:substructureOf*1..5]-(:Scaffold)
Neo.TransientError.General.StackOverFlowError
There is not enough stack size to perform the current task. This is generally considered to be a database error, so please contact Neo4j support. You could try increasing the stack size: for example to set the stack size to 2M, add `dbms.jvm.additional=-Xss2M' to in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation just add -Xss2M as command line flag.
The query is actually very simple. I use DISTINCT because several paths may lead to a single scaffold.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return distinct s limit 20
This query returns the above error message whereas the next query does work.
match (m:Molecule) <-[:substructureOf*3]- (s:Scaffold) return s limit 20
Interestingly, the query works on a much larger database. In this small one, the deepest hierarchy happens to be 2, so the result of the last query is "(No changes, no records)".
How come adding DISTINCT to the query makes it fail with that memory error? Is there a way to avoid it? I cannot guess the depth of the hierarchy, which can be different for each molecule.
I tried the following values, as suggested in other posts.
#dbms.memory.heap.initial_size=512m
#dbms.memory.heap.max_size=512m
dbms.memory.heap.initial_size=512m
dbms.memory.heap.max_size=4096m
dbms.memory.heap.initial_size=4096m
dbms.memory.heap.max_size=4096m
None of these addressed the issue.
Thanks in advance for any help or clues.
Thanks for the additional info. I was able to replicate this on Neo4j 3.3.0 and 3.3.1. It likely has to do with the behavior of the pruning-var-expand operation (which is meant to help when using variable-length expansions and distinct results) that was introduced in 3.2.x, and it only occurs when using an exact number of expansions (not a range). Neo4j engineering will be looking into this.
In the meantime, your requirement is such that we can use a different kind of query to get the results you want, one that should avoid this operation. Give this one a try:
match (s:Scaffold)
where (s)-[:substructureOf*3]->(:Molecule)
return distinct s limit 20
And if you do need to run queries that may produce this error, you may be able to circumvent it by prepending your query with CYPHER 3.1, which executes the query with a plan produced by an older version of Cypher that doesn't use the pruning-var-expand operation.
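Applied to the original query from the question, that workaround would look like this (a sketch, assuming your server still ships the 3.1 compatibility planner, as 3.3.x does):

```cypher
// Plan and run with the Cypher 3.1 compiler instead of the default one
CYPHER 3.1
MATCH (m:Molecule)<-[:substructureOf*3]-(s:Scaffold)
RETURN DISTINCT s LIMIT 20
```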

How to get node's id with cypher request?

I'm using Neo4j and executing this query:
MATCH (n:Person) RETURN n.name LIMIT 5
I'm getting the names, but I need the ids too.
Please help!
Since ID isn't a property, it's returned using the ID function.
MATCH (n:Person) RETURN ID(n) LIMIT 5
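Since the question asks for the names as well as the ids, both can be returned in the same query. A small sketch; the id and name column aliases are just illustrative:

```cypher
MATCH (n:Person)
RETURN ID(n) AS id, n.name AS name
LIMIT 5
```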
Not sure how helpful or relevant this is, but when I'm using the NodeJS API, the record objects returned from Cypher queries have an identity field on the same level as the properties object (e.g. record.get(0).properties, record.get(0).identity). I'm assuming you aren't just running plain Cypher queries and are actually using a driver to send them, so you might not have to run another MATCH statement.
I'm aware that the OP is asking about Cypher specifically, but this might be helpful to other users who stumble upon this question.
Or you can take a look at the Neo4j Cypher Refcard.
It gives a quick overview of many of the functions and patterns you can write.
There is more about functions in The Neo4j Developer Manual - Chapter 3. Cypher - 3.4. Functions
