Jena Query Execution time

I wonder if, for SPARQL queries, there is some logging that can be activated to provide the query execution time, or is this just something that must be done as part of the code that calls the query?

Not as such.
It's also important to remember that Jena uses a streaming query engine, so when you call QueryExecution.execSelect() it does not execute the full query but rather prepares an iterator that can answer it. Only when you iterate the results does the query actually get executed, so calling code must take this into account when taking timings.
Exact behaviour differs with the query kind:
SELECT
execSelect() returns a ResultSet backed by an iterator that evaluates the query as it is iterated; you must exhaust the ResultSet to actually execute the query. So time the full exhaustion of the iterator to time the query execution.
ASK
execAsk() creates the iterator and calls hasNext() on it to get the boolean result, so you only need to time execAsk().
CONSTRUCT/DESCRIBE
execConstruct()/execDescribe() fully evaluates the query and returns the resulting model, so you can just time this call.
Alternatively, execConstructTriples()/execDescribeTriples() just prepares an iterator; exhaust the iterator to actually execute the query.
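The timing pattern can be illustrated without pulling in Jena: in this plain-Java sketch a lazily evaluated Iterator stands in for the ResultSet returned by execSelect() (TimingSketch and lazyResults are invented names for the example; with real Jena code you would time the exhaustion of the actual ResultSet in exactly the same way):

```java
import java.util.Iterator;
import java.util.List;

public class TimingSketch {
    // Stand-in for Jena's lazy ResultSet: creating the iterator is cheap,
    // the real work happens only as it is consumed.
    static Iterator<Integer> lazyResults(List<Integer> data) {
        return data.stream().map(x -> x * x).iterator(); // evaluation is deferred
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5);

        long start = System.nanoTime();
        Iterator<Integer> results = lazyResults(data); // analogous to execSelect()
        long afterCreate = System.nanoTime();

        int count = 0;
        while (results.hasNext()) {   // exhausting the iterator is the part
            results.next();           // that actually "runs the query"
            count++;
        }
        long afterExhaust = System.nanoTime();

        System.out.println("create (ns):  " + (afterCreate - start));
        System.out.println("exhaust (ns): " + (afterExhaust - afterCreate));
        System.out.println("rows: " + count);
    }
}
```

The key point is that only the second interval measures the query; timing just the creation call would report an almost-zero, meaningless number.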
You might want to take a look at a tool like SPARQL Query Benchmarker if you are just looking to benchmark specific queries on your data (or see examples of how to do this kind of timings).
Disclaimer - This was a tool developed and released as OSS as part of my $dayjob some years back. It's using quite outdated versions of Jena but the core techniques still apply.

Related

Neo4j Cypher optimization of complex paginated query

I have a rather long and complex paginated query that I'm trying to optimize. In the worst case I first have to execute the data query in one call to Neo4j, and then I have to execute pretty much the same query for the count. Of course, I do everything in one transaction. Anyway, I don't like the overall execution time, so I extracted the most common part of both the data and count queries and execute it on the first call. This common query returns the IDs of nodes, which I then pass as parameters to the rest of the data and count queries. Now everything works much faster. One thing I don't like is that the common query can sometimes return quite a large set of IDs; it can be 20,000 to 50,000 Long IDs.
So my question is: because I'm doing this in one transaction, is there a way to preserve such a set of IDs somewhere in Neo4j between the common query and the data/count query calls, and just refer to them somehow in the subsequent data/count queries without moving them between the app JVM and Neo4j?
Also, am I crazy for doing this, or is this a good approach to optimize a complex paginated query?
Only with a custom procedure.
Otherwise you'd need to return them.
But it's uncommon to provide both counts and data (even Google doesn't provide "real" counts).
One way is to just stream the results with the reactive driver as long as the user scrolls.
Otherwise I would just query for pageSize+1 and return "more than pageSize results".
If you just stream the IDs back (and don't collect them as an aggregation), you can start using the IDs already received to issue your new queries (even in parallel).
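The pageSize+1 trick can be sketched in plain Java; the Neo4j call itself is elided, and PageProbe/pageSummary are invented names for the example, assuming the Cypher query was issued with `LIMIT pageSize + 1`:

```java
import java.util.List;

public class PageProbe {
    // Fetch pageSize + 1 rows; the extra row only tells us whether more pages
    // exist, avoiding a separate (expensive) count query entirely.
    static String pageSummary(List<Long> fetched, int pageSize) {
        boolean hasMore = fetched.size() > pageSize;
        List<Long> page = hasMore ? fetched.subList(0, pageSize) : fetched;
        return page.size() + (hasMore ? "+ results" : " results");
    }

    public static void main(String[] args) {
        // Pretend Cypher returned these rows for pageSize = 3 (LIMIT 4)
        System.out.println(pageSummary(List.of(10L, 11L, 12L, 13L), 3)); // more pages exist
        System.out.println(pageSummary(List.of(10L, 11L), 3));           // last page
    }
}
```

The UI then shows "more than pageSize results" instead of an exact total, which is usually all the user needs.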

The way to check progress of Cypher query execution (Neo4j)

Are there any functions to know the progress of the query execution, or to estimate the time until the query result is returned?
Almost the same question was asked three years ago (Is there any way of checking progress of Cypher query execution?). At that time, there was no such function.
Sadly, there is no way to see the progress of a query.
Neo4j comes with the procedure CALL dbms.listQueries(), where you can see some information about your queries (execution time, CPU, locks, ...) but not their progress (you can also type :queries in the Neo4j Browser).
Generally, what I do to see the progress of a write query with periodic commit (e.g. a LOAD CSV) is to write a read query that counts the number of updated/created nodes.
Cheers.

Optimization: same Cypher query run multiple times

In my scenario I have a few dozen Cypher queries executed one after another. If any of them returns some data (reveals some knowledge), at the end of the loop the graph is changed accordingly and all the queries are executed again.
Currently I store all the queries as Strings. There are never more than 20 loops, but still, having to parse all the queries every time seems like an overhead. Is there a way to optimize it, such as by storing the queries in some precompiled state? Or is there nothing to worry about?
Any other hints that would make the above scenario work faster?
As others have pointed out in the comments, you should use query parameters where possible. This has two benefits:
You can reuse the queries in your code without having to parse/construct the strings for whatever values you want to include.
Performance. The Cypher compiler caches the execution plan for queries it has seen before. If you use query parameters, you will not incur the overhead of generating the query plan when executing the same query again.
http://neo4j.com/docs/stable/cypher-parameters.html
http://neo4j.com/docs/stable/tutorials-cypher-parameters-java.html
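As a minimal sketch of the parameter approach (the actual execution call is only shown in a comment since it needs a running Neo4j instance; `{name}` is the parameter placeholder syntax of that era's Cypher):

```java
import java.util.HashMap;
import java.util.Map;

public class ParamSketch {
    public static void main(String[] args) {
        // One query string, reused verbatim on every loop iteration, so the
        // Cypher compiler can cache its execution plan.
        String query = "MATCH (n:Person) WHERE n.name = {name} RETURN n";

        Map<String, Object> params = new HashMap<>();
        params.put("name", "Alice");

        // With the embedded API of that era this would be something like:
        //   ExecutionResult result = engine.execute(query, params);
        // Only `params` changes between runs; `query` never does.
        System.out.println(query);
        System.out.println(params);
    }
}
```

Contrast this with concatenating the value into the string: every distinct value would produce a distinct query text, defeating the plan cache.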

Is it possible to execute read only cypher queries from java?

I'd like to know just what the title says.
The reason I'd want this is to permit constrained read-only cypher queries to be executed; the data results would later be interpreted and serialized by a separate API layer.
I've seen code that makes basic assumptions in an attempt to mimic this behavior, e.g. the code might filter out any Cypher query that contains certain special words associated with write query structures (merge, create, delete, set, and so on).
This approach tends to be limited and naive, though; if it very simply looks for those tokens, it would prevent a query like MATCH n WHERE n.label =~ '.*create.*' RETURN n even though it's a read-only query.
I'd really prefer not to do a full parse on a candidate query and then descend through the AST trying to figure out whether something is read-only or not (although I would gladly accept an answer that shows how to do this easily in java)
EDIT - I'm aware it's possible to start the entire database in read-only mode via the configuration property read_only=true, but this would be undesirable; no other aspect of the java API would be able to change the database.
EDIT 2 - I found another possible strategy, but I'm not sure of its advisability. Comments welcome on this, and potential downsides:
try (Transaction ignore = graphDb.beginTx()) {
    ExecutionResult result = executionEngine.execute(query);
    // Do nifty stuff with result, then...
    // Force transaction to fail.
    ignore.failure();
}
The idea here is that if queries happen within transactions and the transaction is always force-failed, then nothing can ever be written to the DB no matter what the result.
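The discard-on-failure idea can be sketched in plain Java with a toy transaction (ToyTransaction and RollbackSketch are invented stand-ins that merely mimic the shape of Neo4j's embedded API, not Neo4j code):

```java
import java.util.ArrayList;
import java.util.List;

public class RollbackSketch {
    // Toy transaction: close() discards pending changes unless success()
    // was called, mirroring how the force-failed Neo4j transaction above
    // prevents any write from ever being committed.
    static class ToyTransaction implements AutoCloseable {
        private boolean success = false;
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        void write(String statement) { pending.add(statement); }
        void success() { success = true; }
        void failure() { success = false; }

        @Override public void close() {
            if (success) committed.addAll(pending);
            pending.clear();
        }
    }

    public static void main(String[] args) {
        ToyTransaction tx = new ToyTransaction();
        try (tx) {
            tx.write("CREATE (n)"); // even if the query sneaks in a write...
            tx.failure();           // ...forcing failure discards it on close()
        }
        System.out.println("committed: " + tx.committed);
    }
}
```

One downside worth noting: the write work is still performed and then rolled back, so a malicious or accidental write query still costs time and transaction-log activity even though it leaves no trace.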
Read-only Cypher is (not yet) directly supported. However I can think of two workarounds for that:
1) assuming you're running a Neo4j enterprise cluster: you can set read_only=true on one instance. That instance is then used for the read only queries where the other cluster instances are used for r/w. A load balancer in front of the cluster can be set up to send the requests to the right instance.
2) Use a TransactionEventHandler that vetoes a transaction if its TransactionData contains write operations. Just for fun I've invested some minutes to implement that, see https://github.com/sarmbruster/read-only-cypher - feedback is appreciated.

Breeze.js reverses the query order when executed locally

So a slightly weird one that I can't find any cause for really.
My app is set up to basically run almost all queries through one standard method that handles things like querying against the local cache etc. So essentially the queries are all pretty standardised.
Then I have just one query with a strange orderBy issue. The query includes a specific orderBy clause, and if I run it for the first time, the cache is checked, no results are found, it queries the remote data source, and gets the data, all correct and ordered.
When I return to the page, the query is executed again, this time against the local cache, where it does find the data and returns it... the weird part is that the order is reversed. Bear in mind the parameters going in are exactly the same; the only difference is that the query is executed with executeQueryLocally and results are found and returned (the first time, it is also executed with executeQueryLocally, it's just that no results are found and it goes on to execute remotely).
I really can't see any specific reason why the results are reversed (I say they are reversed, but I can't actually guarantee that; they might just be unordered and happen to come out in reversed order).
This isn't really causing a headache, it's just weird, especially as it appears to happen with only one query.
Thoughts?
Server-side queries and client-side queries are not guaranteed to return results in any specific order UNLESS you have an "orderBy" clause specified. The reason the order may differ without the "orderBy" clause is that the data is stored very differently on the server vs. the client, and unless a specific order is specified both will attempt to satisfy the query as efficiently as possible given the storage implementation.
One interesting side note is that, per the ANSI SQL-92 standard, even your SQL database is not required to return data in the same order for the same query (again, unless you have an ORDER BY clause). It's just that it's very rare to see it happen.