When processing the same query repeatedly, can i assume that there is almost no I/O cost except the first query, since memory maybe holds the data that the following queries need. Assume that the query fits the memory.
If the page cache is sized reasonably well that assumption should be true. Note that the graph is cached, but not the query results. So every time you run the query the graph is traversed (but nodes/rels/properties are hopefully picked from page cache and not from ssd/hdd).
Related
I have a rather long and complex paginated query. I'm trying to optimize it. In the worst case - first, I have to execute the data query in a one call to Neo4j, and then I have to execute pretty much the same query for the count. Of course, I do everything in one transaction. Anyway, I don't like the overall execution time, so I extracted the most common part for both - data and count queries and execute it on the first call. This common query returns the IDs of nodes, which I then pass as parameters to the rest of data and count queries. Now, everything works much faster. One thing I don't like is that a common query can sometimes return quite a large set of IDs.. it can be 20k..50k Long IDs.
So my question is - because I'm doing this in a one transaction - is there a way to preserve such Set of IDs somewhere in Neo4j between common query and data/count query calls and just refer them somehow in the subsequent data/count queries without moving between app JVM and Neo4j?
Also, am I crazy for doing this, or is this a good approach to optimize a complex paginated query?
Only with a custom procedure.
Otherwise you'd need to return them.
But usually it's uncommon to both provide counts (even google doesn't provide "real" counts) and data.
One way is to just stream the results with the reactive driver as long as the user scrolls.
Otherwise I would just query for pageSize+1 and return "more than pageSize results".
If you just stream the id's back (and don't collect them as an aggregation) you can start using the id's received already to issue your new queries (even in parallel).
I'm running a production website for 4 years with azure SQL.
With help of 'Top Slow Request' query from alexsorokoletov on github I have 1 super slow query according to Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the linq query and the execution plans / live stats, I can't find the bottleneck yet.
And the live stats
The join from results to project is not directly, there is a projectsession table in between, not visible in the query, but maybe under the hood of entity framework.
Might I be affected by parameter sniffing? Can I reset a hash? Maybe the optimized query plan was used in 2014 and now result table is about 4Million rows and the query is far from optimal?
If I run this query in Management Studio its very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding option(hash join) at the end of the query, if possible. Once you start getting into large arity, loops join is not particularly efficient. That would prove out if there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, dbcc freeproccache, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if you have this particular query + parameter values being executed often enough to sniff that during plan compilation. Your other option is to add optimize for unknown hints which will disable sniffing and direct the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins since the cardinality estimates of the operators in the tree will likely increase.
I am yet trying to make use of neo4j to perform a complex query (similar to shortest path search except I have very strange conditions applied to this search like minimum path length in terms of nodes traversed count).
My dataset contains around 2.5M nodes of one single type and around 1.5 billion edges (One single type as well). Each given node has on average 1000 directional relation to a "next" node.
Yet, I have a query that allows me to retrieve this shortest path given all of my conditions but the only way I found to have decent response time (under one second) is to actually limit the number of results after each new node added to the path, filter it, order it and then pursue to the next node (This is kind of a greedy algorithm I suppose).
I'd like to limit them a lot less than I do in order to yield more path as a result, but the problem is the exponential complexity of this search that makes going from LIMIT 40 to LIMIT 60 usually a matter of x10 ~ x100 processing time.
This being said, I am yet evaluating several solutions to increase the speed of the request but I'm quite unsure of the result they will yield as I'm not sure about how neo4j really stores my data internally.
The solution I think about yet is to actually add a property to my relationships which would be an integer in between 1 and 15 because I usually will only query the relationships that have one or two max different values for this property. (like only relationships that have this property to 8 or 9 for example).
As I can guess yet, for each relationship, neo4j then have to gather the original node properties and use it to apply my further filters which takes a very long time when crossing 4 nodes long path with 1000 relationships each (I guess O(1000^4)). Am I right ?
With relationship properties, will it have direct access to it without further data fetching ? Is there any chance it will make my queries faster? How are neo4j edges properties stored ?
UPDATE
Following #logisima 's advice I've written a procedure directly with the Java traversal API of neo4j. I then switched to the raw Java procedure API of Neo4J to leverage even more power and flexibility as my use case required it.
The results are really good : the lower bound complexity is overall a little less thant it was before but the higher bound is like ten time faster and when at least some of the nodes that will be used for the traversal are in the cache of Neo4j, the performances just becomes astonishing (depth 20 in less than a second for one of my tests when I only need depth 4 usually).
But that's not all. The procedures makes it very very easily customisable while keeping the performances at their best and optimizing every single operation at its best. The results is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least Procedures are very easily pluggable with spring-data for neo4j (which I use to connect neo4j to my HTTP API). Where as with cypher, I would have to auto generate the queries (as being very complex, there was like 30 java classes to do the trick properly) and I should have used jdbc for neo4j while handling a separate connection pool only for this request. Cannot recommend more to use the awesome neo4j java API.
Thanks again #logisima
If you're trying to do a custom shortespath algo, then you should write a cypher procedure with the traversal API.
The principe of Cypher is to make pattern matching, and you want to traverse the graph in a specific way to find your good solution.
The response time should be really faster for your use-case !
I'm running my cypher queryies on a very large social network (over 1B records). I'm trying to get all paths between two person with variable relationship lengths. I get a reasonable response time running a query for a single relationship length (between 0.5 -2 seconds) [the person ids are index].
MATCH paths=( (pr1:person)-[*0..1]-(pr2:person) )
WHERE pr1.id='123456'
RETURN paths
However when I run the query with multiple lengths (i.e. 2 or more) my response time goes up to several minutes. Assuming that each person has in average the same number of connection I should be running my queries for 2-3 minutes Max (but I get up to 5+ min).
MATCH paths=( (pr1:person)-[*0..2]-(pr2:person) )
pr1.id='123456'
RETURN paths
I tried to use the EXPLAIN did not show extreme values for the VarLengthExpand(All) .
Maybe the traversing is not using the index for the pr2.
Is there anyway to improve the performance of my query?
Since variable-length relationship searches have exponential complexity, your *0..2 query might be generating a very large number of paths, which can cause the neo4j server (or your client code, like the neo4j browser) to run a long time or even run out of memory.
This query might be able to finish and show you how many matching paths there are:
MATCH (pr1:person)-[*0..2]-(:person)
WHERE pr1.id='123456'
RETURN COUNT(*);
If the returned number is very large, then you should modify your query to reduce the size of the result. For example, you can adding a LIMIT clause after your original RETURN clause to limit the number of returned paths.
By the way, the EXPLAIN clause just estimates the query cost, and can be way off. The PROFILE clause performs the actual query, and gives you an accurate accounting of the DB hits (however, if your query never finishes running, then a PROFILE of it will also never finish).
Rather than using the explain, try the "profile" instead.
In my scenario I have a few dozens of Cypher queries executed one after another. If any of them returns some data (reveals some knowledge), at the end of the loop the graph is changed accordingly and all the queries are executed again.
Currently I store all the queries as Strings. There are never more than 20 loops, but still having to parse all the queries every time seems a an overhead. Is there a way to optimize it, like by storing the queries in some precompiled state? Or there's nothing to worry about?
Any other hints that would make the above scenario work faster?
As others have pointed out in the comments, you should use query parameters where possible. This has two benefits:
You can reuse the queries in your code without having to parse / construct the strings given whatever values you want to include.
Performance. The cypher compiler caches the execution plan for Cypher queries (ie queries it has seen before). If you use query parameters you will not incur the overhead of generating the query plan when executing the Cypher query again.
http://neo4j.com/docs/stable/cypher-parameters.html
http://neo4j.com/docs/stable/tutorials-cypher-parameters-java.html