I'm unable to test this on the big data set right now, but I'm wondering which of the following Cypher queries will run faster in Neo4j, so that I can design the system properly:
Approach #1:
WHERE apoc.coll.containsAllSorted($profileDetailedCriterionIds, childD.mandatoryCriterionIds)
Approach #2:
(p:Profile {id: $profileId})-[:HAS_VOTE_ON]-(c:Criterion)<-[:HAS_VOTE_ON]-(childD)
WHERE c.id IN childD.mandatoryCriterionIds
WITH childD, COLLECT(c.id) as cIds
WHERE size(cIds) >= size(childD.mandatoryCriterionIds)
where $profileDetailedCriterionIds is a Set of ids provided via query parameter
Which approach should I choose for better performance?
Run both queries in the Neo4j Browser, but put the keyword PROFILE at the start of each query. When a query finishes, the browser displays a profile of how it was executed. Then go to the last tab on the left, find the part of the plan where the APOC function is used, and compare its db hits and page cache usage with those of the query that does not use the APOC function.
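For reference, the condition that Approach #1 tests (every mandatory id also appears in the profile's id list) can be modelled outside the database. The following is a minimal Python sketch, assuming both lists are sorted in ascending order. The function name `contains_all_sorted` and the sample ids are hypothetical, and this only illustrates the check itself; it is not APOC's implementation and says nothing about which approach is faster in Neo4j.

```python
def contains_all_sorted(haystack, needles):
    """Return True if every value in `needles` also occurs in `haystack`.

    Assumption of this sketch: both lists are sorted in ascending order,
    so a single forward pass over `haystack` is enough.
    """
    i = 0
    for value in needles:
        # Skip haystack entries that are smaller than the value we need.
        while i < len(haystack) and haystack[i] < value:
            i += 1
        # Either we ran out of entries or we stepped past the value.
        if i == len(haystack) or haystack[i] != value:
            return False
    return True


# Hypothetical ids standing in for $profileDetailedCriterionIds.
profile_ids = [1, 3, 5, 8, 13]
print(contains_all_sorted(profile_ids, [3, 8]))   # True
print(contains_all_sorted(profile_ids, [3, 4]))   # False
```

PROFILE remains the way to find out how the two Cypher variants actually behave on your data.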
I need to measure the performance of any query.
for example :
MATCH (n:StateNode)-[r:has_city]->(n1:CityNode)
WHERE n.shortName IN {0} and n1.name IN {1}
WITH n1
Match (aa:ActiveStatusNode{isActive:toBoolean('true')})--(n2:PannaResume)-[r1:has_location]->(n1)
WHERE (n2.firstName="master") OR (n2.lastName="grew" )
WITH n2
MATCH (o:PannaResumeOrganizationNode)<-[h:has_organization]-(n2)-[r2:has_skill]->(n3:Skill)
WHERE (0={3} OR o.organizationId={3}) AND (0={4} OR n3.name IN {2} OR n3.name IN {5})
WITH size(collect(n3)) as count, n2
MATCH (n2) where (0={4} OR count={4})
RETURN DISTINCT n2
I have tried the PROFILE and EXPLAIN clauses, but they only return the number of db hits. Is it possible to get Big O notation for a Neo4j query, i.e. can we measure performance in terms of Big O notation? Are there any other ways to check query performance apart from using PROFILE and EXPLAIN?
No, you cannot convert a Cypher query to Big O notation.
Cypher does not describe how to fetch information, only what kind of information you want to return. It is up to the Cypher planner in the Neo4j database to convert a Cypher statement into an executable query, using heuristics about what information it has to find, what indexes are available to it, and internal statistics about the dataset being queried. So simply changing the state of the database can change the complexity of a query.
A very simple example of this is the Cypher Cypher 3.1 MATCH (a{id:1})-[*0..25]->(b) RETURN DISTINCT b. On a fairly average connected graph with cycles, running it against Neo4j 3.1.1 will time out for being too complex (because the planner tries to find all paths, even though it doesn't need that redundant information), while Neo4j 3.2.3 will return very quickly (because the planner recognizes that it only needs a graph scan, such as a depth-first search, to find all connected nodes).
As a side note, you can argue for Big O notation on the returned results. For example, MATCH (a), (b) must have a minimum complexity of n^2 because the result is a Cartesian product, and execution can't be less complex than the answer. This understanding of how complexity affects row counts can help you write queries that reduce the amount of work the planner ends up planning.
For example, using WITH COLLECT(n) as data MATCH (c:M) reduces the number of rows the planner has to work against before the next part of the query from n*m (first match count times second match count) to m (1 times second match count).
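The row-count arithmetic above can be illustrated outside the database. This is a small Python sketch that only models how many rows flow into the next step; the function names and the sample node lists are hypothetical, and it does not reproduce how the Neo4j planner actually executes anything.

```python
from itertools import product


def rows_without_collect(first_match, second_match):
    # Each row of the first match is paired with each row of the second:
    # n * m rows in total.
    return list(product(first_match, second_match))


def rows_with_collect(first_match, second_match):
    # Collecting first folds the n rows into a single row holding a list,
    # so the second match is paired with just one row: 1 * m rows.
    collected = [list(first_match)]
    return list(product(collected, second_match))


n_nodes = ["n1", "n2", "n3"]   # stands in for the first MATCH (n = 3)
m_nodes = ["c1", "c2"]         # stands in for MATCH (c:M)  (m = 2)

print(len(rows_without_collect(n_nodes, m_nodes)))  # 6
print(len(rows_with_collect(n_nodes, m_nodes)))     # 2
```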
However, since Cypher makes no promises about how data is found, there is no way to guarantee the complexity of the execution. We can only try to write Cyphers that are more likely to get an optimal execution plan, and use EXPLAIN/PROFILE to evaluate if the planner is able to find a relatively optimal solution.
The PROFILE results show you how the neo4j server actually plans to process your Cypher query. You need to analyze the execution plan revealed by the PROFILE results to get the big O complexity. There are no tools to do that that I am aware of (although it would be a great idea for someone to create one).
You should also be aware that the execution plan for a query can change over time as the characteristics of the DB change, and also when changing to a different version of neo4j.
Nothing of this sort is readily available, but it can be derived or approximated with some additional effort.
On profiling a query, we get a list of the functions that neo4j will run to achieve the desired result.
Each of these functions has, in theory, its own best-case and worst-case complexity. Some of them will also run in parallel, which affects runtimes depending on how many cores your server has.
For example, MATCH (a:A) MATCH (b:B) results in a Cartesian product, and this will be O(count(a)*count(b)).
Similarly, each function in your query plan has a time complexity of its own.
So aggregating the individual time complexities of these functions will give you an overall approximation of the time complexity of the query.
But this can change with each version of neo4j, since the community can always change the implementation of a function to achieve better runtimes, structural changes, parallelization, or lower RAM usage.
If what you are looking for is an indication of how well a neo4j query is optimized, db hits is a good indicator.
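Aggregating a per-operator figure such as db hits over a whole plan can be sketched as follows. This is only an illustration in Python: the plan is a hand-written nested dictionary, and the key names (`operatorType`, `dbHits`, `children`) as well as the numbers are assumptions made for this example, not output from a real database.

```python
def total_db_hits(plan):
    """Sum db hits over an operator and all of its child operators."""
    hits = plan.get("dbHits", 0)
    for child in plan.get("children", []):
        hits += total_db_hits(child)
    return hits


def hits_per_operator(plan, acc=None):
    """Collect (operator name, db hits) pairs in top-down order."""
    if acc is None:
        acc = []
    acc.append((plan.get("operatorType", "?"), plan.get("dbHits", 0)))
    for child in plan.get("children", []):
        hits_per_operator(child, acc)
    return acc


# Hypothetical plan tree; shape and numbers are made up for illustration.
example_plan = {
    "operatorType": "ProduceResults",
    "dbHits": 0,
    "children": [
        {
            "operatorType": "Filter",
            "dbHits": 120,
            "children": [
                {"operatorType": "NodeByLabelScan", "dbHits": 101, "children": []}
            ],
        }
    ],
}

print(total_db_hits(example_plan))      # 221
print(hits_per_operator(example_plan))
```

Comparing the totals of two candidate queries on the same data gives a rough relative measure, not a Big O bound.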
When I run the query
MATCH paths=(l:Left)-[:CONNECTED_TO*..5]->(r:Right)
WHERE (l.id IN $left_ids) AND (r.id IN $right_ids)
RETURN paths
ie, give me all paths with Left connected to Right so long as left is in left_ids and right is in right_ids.
Should I expect neo4j to
perform the cartesian product of Left and Right, and then filter the results - or
does it only search for paths once it has worked out which nodes are allowed?
Also - is there any obvious way for me to search this out for myself - ie is there a query planner, or some good meaty documentation that I've missed?
Take a look in the Profiling chapter of Neo4j docs:
EXPLAIN
If you want to see the execution plan but not run the statement,
prepend your Cypher statement with EXPLAIN. The statement will always
return an empty result and make no changes to the database.
PROFILE
If you want to run the statement and see which operators are doing
most of the work, use PROFILE. This will run your statement and keep
track of how many rows pass through each operator, and how much each
operator needs to interact with the storage layer to retrieve the
necessary data. Please note that profiling your query uses more
resources, so you should not profile unless you are actively working
on a query.
So you can prepend your queries with PROFILE or EXPLAIN and see the execution plan generated by Neo4j. This way:
PROFILE MATCH paths=(l:Left)-[:CONNECTED_TO*..5]->(r:Right)
WHERE (l.id IN $left_ids) AND (r.id IN $right_ids)
RETURN paths
I've got my graph database, populated with nodes, relationships, properties etc. I'd like to see an overview of how the whole database is connected, each relationship to each node, properties of a node etc.
I don't mean view each individual node, but rather something like an ERD from a relational database, something like this, with the node labels. Is this possible?
You can use the metadata by running the command call db.schema().
In Neo4j v4 call db.schema() is deprecated, you can now use call db.schema.visualization()
As far as I know, there is no straight-forward way to get a nicely pictured diagram of a neo4j database structure.
There is a pre-defined query in the neo4j browser which finds all node types and their relationships. However, it traverses the complete graph and may fail due to memory errors if you have too much data.
Also, there is neoprofiler. It's a tool which claims to do what you ask. I never tried it, and it hasn't received many updates lately. Still worth a try: https://github.com/moxious/neoprofiler
Even though this is not a graphical representation, this query will give you an idea on what type of nodes are connected to other nodes with what type of relationship.
MATCH (n)
OPTIONAL MATCH (n)-[r]->(x)
WITH DISTINCT {l1: labels(n), r: type(r), l2: labels(x)}
AS `first degree connection`
RETURN `first degree connection`;
You could use this query to then unwind the labels to write that next cypher query dynamically (via a scripting language and using the REST API) and then paste that query back into the neo4j browser to get an example set of the data.
But this should be good enough to get an overview of your graph. Expand from here.
I've seen a topic (Understanding Neo4j Cypher Profile keyword and execution plan) where profile keyword is mentioned.
I couldn't use it in Neo4j 2.0.0RC1 Community.
Peter wrote it's not fully implemented.
Will it ever be supported?
I mean, it could be interesting to watch the plan changes as we tune the query...
You can still find the neo4j shell, where you can run the profile command.
Either by connecting to the running server by starting bin/neo4j-shell
Or by switching to the old web-ui in the "(i)" Info-menu on the left side and selecting the bottommost link "webadmin" -> http://localhost:7474/webadmin
Profiling information will be added to browser later on, when it is easier to read and understand.
As of Neo4j 2.2 there are additional profiling facilities available. Some features that were only available via neo4j-shell or the REST endpoints are now available also in the Neo4j-browser, and some features are new altogether.
You can now use the PROFILE command with your cypher query directly in the Neo4j-browser repl to execute the query and see a visualization of the execution plan.
PROFILE
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n)
-------------
n.bar ID(n)
Mary 951
Additionally, you can now inspect a query plan without having to actually execute it, for instance to verify a query that would alter the database. Do so with the EXPLAIN command prepended to the query. See 15.2 How do I profile a query? from the documentation.
EXPLAIN
MATCH (n:Peter {foo: 'Paul'})
SET n.foo = 'Mary', n.bar = 'Paul'
RETURN n.foo, ID(n)
------------------------------------------
// Nothing returned, query is not executed
A related new feature is also the new 'cost based' query planner, as well as the ability to force the use of either the 'cost based' or the 'rule based' query planner for all queries or for any particular query. The documentation notes that not all queries may be solvable by the 'cost based' query planner, in which case the setting will be ignored and the 'rule based' planner used. See 15.1 How are queries executed?
To force the use of either query planner for all queries, set the query.planner.version configuration setting in conf/neo4j.properties (Neo4j server) or by calling the .setConfig() method on your GraphDatabaseService object (Neo4j embedded). Set it to COST or RULE, and to give the decision on which query planner to use back to Neo4j, set it to default (or remove the setting altogether). See 24.5 Configuration Settings, Starting an embedded database with configuration settings.
To force the use of either query planner for a particular query, prepend your query with CYPHER planner=cost or CYPHER planner=rule. See 15.1 How are queries executed?
CYPHER planner=cost
MATCH (n:Peter {foo: 'Paul'})
RETURN n.bar, ID(n)
You can PROFILE or EXPLAIN queries with either of the query planners and see any differences in how they implement your queries.
For help interpreting the execution plan, see the relevant chapter of the documentation, 16. Execution Plans.
Is it possible to paginate a Cypher query? For instance, I have a list of products, but I don't want to display/retrieve/cache all the results, as I can have a lot of them.
I'm looking for something similar to the offset / limit in SQL.
Is cypher skip + limit + orderby a good option ? http://docs.neo4j.org/chunked/stable/query-skip.html
SKIP and LIMIT combined is indeed the way to go. Using ORDER BY inevitably makes cypher scan every node that is relevant to your query. Same thing for using a WHERE clause. Performance should not be that bad though.
It's like normal SQL; the syntax is as follows:
match (user:USER_PROFILE)-[:USAGE]->(uUsage)
where HAS(uUsage.impressionsPerHour) AND (uUsage.impressionsPerHour > 100)
RETURN user
ORDER BY user.hashID
SKIP 10
LIMIT 10;
This syntax works with the latest version (2.x).
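Turning a page number into SKIP and LIMIT values is simple arithmetic. Here is a minimal Python sketch, assuming 1-based page numbers. The helper names `page_to_skip_limit` and `paginate`, and the sample rows, are hypothetical, and the in-memory `paginate` only mimics what ORDER BY, SKIP and LIMIT do to an already fetched list.

```python
def page_to_skip_limit(page, page_size):
    """Translate a 1-based page number into SKIP/LIMIT values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"skip": (page - 1) * page_size, "limit": page_size}


def paginate(rows, key, page, page_size):
    """In-memory model of ORDER BY key SKIP skip LIMIT limit."""
    params = page_to_skip_limit(page, page_size)
    ordered = sorted(rows, key=key)
    start = params["skip"]
    return ordered[start:start + params["limit"]]


# Hypothetical rows standing in for the matched users.
users = [{"hashID": h} for h in ["d", "a", "c", "b", "e"]]

print(page_to_skip_limit(2, 10))   # {'skip': 10, 'limit': 10}
print(paginate(users, key=lambda u: u["hashID"], page=2, page_size=2))
# [{'hashID': 'c'}, {'hashID': 'd'}]
```

The computed values would then be passed to the query as parameters rather than concatenated into the query text.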
Neo4j apparently uses "index-backed ORDER BY" nowadays, which means that if you use an alphabetical ORDER BY on indexed node properties within your SKIP/LIMIT query, Neo4j will not perform a full scan of all "relevant nodes" as others have mentioned (their responses were written long ago, so keep that in mind). The index already stores the indexed property in sorted (alphabetical) order, so Neo4j can read the rows in ORDER BY order directly, and your pagination will be even faster than without the index.