So I have this piece of code in Jena that measures the execution time of a SELECT query:
Timer timer = new Timer();
timer.startTimer();
Query query = QueryFactory.create(queryString);
QueryExecution qexec = QueryExecutionFactory.create(query,dataset);
ResultSet results = qexec.execSelect();
long endTime = timer.endTimer();
Now the issue is that this variable endTime shows a running time that is smaller than what the query execution time should have been. The dataset is a Jena TDB location.
To test this out, I ran the same query using Apache Jena's Fuseki on the same TDB store and found that the execution time is different (maybe the actual execution time). What is the right way to measure execution time using Jena? I don't want to execute everything through Fuseki just to find the answer.
QueryExecutionFactory.create(query,dataset);
All this does is create an execution that can execute your query; importantly, it does not execute your query.
To start execution you need to call one of the execX() methods, which one depends on the query type, e.g. execSelect() for SELECT queries.
Execution in Jena is lazy, so in order to time execution you need to actually enumerate the results; execution does not finish until the results are fully enumerated, e.g.
ResultSet results = qexec.execSelect();
long numResults = ResultSetFormatter.consume(results);
And at that point you can stop your timer
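To see why the timer must run until the results are exhausted, here is a minimal stdlib-only Java sketch (not Jena itself; the lazy iterator merely stands in for Jena's streaming ResultSet): creating the iterator is nearly free, while iterating it does the actual work.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class LazyTiming {
    // Stand-in for a streaming query engine: building the iterator is
    // cheap; each next() does the (simulated) work of producing a row.
    static Iterator<Integer> lazyResults(int rows) {
        return IntStream.range(0, rows).map(LazyTiming::slowRow).boxed().iterator();
    }

    static int slowRow(int i) {
        try { Thread.sleep(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return i;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        Iterator<Integer> results = lazyResults(100);           // analogous to execSelect()
        long created = System.nanoTime();

        long count = 0;
        while (results.hasNext()) { results.next(); count++; }  // analogous to ResultSetFormatter.consume(results)
        long consumed = System.nanoTime();

        System.out.println("create:  " + (created - start) / 1_000_000 + " ms");
        System.out.println("consume: " + (consumed - created) / 1_000_000 + " ms, rows: " + count);
    }
}
```

Stopping the timer right after the iterator is created corresponds to timing only execSelect(), which is why the question's numbers come out too small.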
Related
I wonder whether, for a SPARQL query, there is some logging that can be activated to report the query execution time, or is this just something that must be done as part of the code that calls the query?
Not as such.
It's also important to remember that Jena uses a streaming query engine, so when you call QueryExecution.execSelect() it is not executing the full query but rather preparing an iterator that can answer the query. Only when you iterate the results does the query actually get executed, so calling code must take this into account when taking timings.
Exact behaviour differs with the query kind:
SELECT
execSelect() returns a ResultSet backed by an iterator that evaluates the query as it is iterated; exhaust the ResultSet by iterating it to actually execute the query. So time the full exhaustion of the iterator to time the query execution.
ASK
execAsk() creates the iterator and calls hasNext() on it to get the boolean result, so you only need to time the execAsk() call.
CONSTRUCT/DESCRIBE
execConstruct()/execDescribe() fully evaluate the query and return the resulting model, so you can just time this call.
Alternatively, execConstructTriples()/execDescribeTriples() just prepare an iterator; exhaust the iterator to actually execute the query.
You might want to take a look at a tool like SPARQL Query Benchmarker if you are just looking to benchmark specific queries on your data (or see examples of how to do this kind of timings).
Disclaimer - This was a tool developed and released as OSS as part of my $dayjob some years back. It uses quite outdated versions of Jena, but the core techniques still apply.
I am referring to this GraphGist: https://neo4j.com/graphgist/project-management
I'm actually trying to update a project plan when a duration on one task changes.
In the GraphGist, the whole project is always calculated from the initial activity to the last activity. This doesn't work great for me in a multi-project environment where I don't really know what the starting point is, and I don't know either the end point. What I would like for now, is just to update the earliest start of any activity which depends on a task I just updated.
The latest I have is the following :
MATCH p1=(:Activity {description:'Perform needs analysis'})<-[:REQUIRES*]-(j:Activity)
UNWIND nodes(p1) as task
MATCH (pre:Activity)<-[:REQUIRES]-(task:Activity)
WITH MAX(pre.duration+pre.earliest_start) as updateEF,task
SET task.earliest_start = updateEF
The intent is to get all the paths in the projects which depend on the task I just updated (in this case: "Perform needs analysis"); at every step of the path I'm also checking whether there are other dependencies which would override my duration update.
So, of course, it only works on the direct connections.
If I have A<-[:REQUIRES]-B<-[:REQUIRES]-C and I increase the duration of A, I believe it updates B based on A, but then C is calculated with the duration of B from before B's duration was updated.
How can I make this recursive? maybe using REDUCE?
(still searching...)
This is a very interesting issue.
You want to update the nodes 1 step away from the originally updated node, and then update the nodes 2 steps away (incorporating the previously-updated values as appropriate), and then 3 steps away, and so on until every node reachable from the original node has been updated.
The Cypher planner does not generate code that performs this kind of query/update pattern, where new values are propagated step by step through paths.
However, there is a workaround using the APOC plugin. For example, using apoc.periodic.iterate:
CALL apoc.periodic.iterate(
"MATCH p=(:Activity {description:'Perform needs analysis'})<-[:REQUIRES*]-(task:Activity)
RETURN task ORDER BY LENGTH(p)",
"MATCH (pre:Activity)<-[:REQUIRES]-(task)
WITH MAX(pre.duration+pre.earliest_start) as updateEF, task
SET task.earliest_start = updateEF",
{batchSize:1})
The first Cypher statement passed to the procedure generates the task nodes, ordered by distance from the original node. The second Cypher statement gets the pre nodes for each task, and sets the appropriate earliest_start value for that task. The batchSize:1 option tells the procedure to perform every iteration of the second statement in its own transaction, so that subsequent iterations will see the updated values.
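Outside of Cypher, what this procedure computes is the classic forward pass of the critical-path method: process tasks in dependency order and take, for each task, the maximum of earliest_start + duration over its prerequisites. A minimal Java sketch of that pass, with hypothetical task names and durations (not tied to the question's graph):

```java
import java.util.*;

public class ForwardPass {
    // earliest_start(task) = max over prerequisites pre of
    //   earliest_start(pre) + duration(pre).
    // Processing tasks in topological order guarantees that updated
    // values propagate, which is what batchSize:1 achieves in APOC.
    static Map<String, Integer> earliestStarts(
            Map<String, Integer> duration,
            Map<String, List<String>> requires) {  // task -> its prerequisites
        Map<String, Integer> es = new HashMap<>();
        for (String t : topoOrder(duration.keySet(), requires)) {
            int start = 0;
            for (String pre : requires.getOrDefault(t, List.of())) {
                start = Math.max(start, es.get(pre) + duration.get(pre));
            }
            es.put(t, start);
        }
        return es;
    }

    // Depth-first topological sort (assumes the dependency graph is acyclic).
    static List<String> topoOrder(Set<String> tasks,
                                  Map<String, List<String>> requires) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        for (String t : tasks) visit(t, requires, done, order);
        return order;
    }

    static void visit(String t, Map<String, List<String>> requires,
                      Set<String> done, List<String> order) {
        if (!done.add(t)) return;  // already placed
        for (String pre : requires.getOrDefault(t, List.of()))
            visit(pre, requires, done, order);
        order.add(t);              // all prerequisites are already in the list
    }
}
```

For a chain A<-B<-C with durations 3, 2, 1 this yields earliest starts 0, 3, 5: exactly the propagation the question was missing.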
NOTE: If the same task can be encountered multiple times at different distances, you will have to determine if this approach is right for you. Also, you cannot have other operations writing to the DB at the same time, as that could lead to inconsistent results.
Is there any function to check the progress of query execution, or to estimate the time until the query result is returned?
Almost the same question was asked three years ago:
(Is there any way of checking progress of Cypher query execution?)
At that time, there was no such function.
Sadly, there is no way to see the progress of a query.
Neo4j comes with the procedure CALL dbms.listQueries(), where you can see some information about your queries (execution time, CPU, locks, ...) but not their progress (you can also type :queries in the Neo4j Browser).
Generally, what I do to see the progress of a write query with periodic commit (e.g. a LOAD CSV) is to run a separate read query that counts the number of updated/created nodes.
Cheers.
I have an API in Django and its structure is something like this:
FetchData():
run cypher query1
run cypher query2
run cypher query3
return
When I run these queries in the Neo4j query window, each takes around 100ms. But when I call this API, query1 takes 1s and the other two take the expected 100ms. This pattern is repeated every time I call the API.
Can anyone explain what should be done here to make the first query run in the expected time?
Neo4j tries to cache the graph in RAM. On first invocation the caches are not warmed up yet, so the IO operations take longer. Subsequent invocations don't hit IO and read directly from RAM.
That sounds weird. The cache should only need to be warmed if the server or db is shut down, not after each of your API calls. Are you using parameterized queries? The only thing I can think of is that maybe each set of queries is different, causing them to be re-parsed and re-planned.
I'm performing a query against a SQLite db where I pull out a quite large data set of call records. On the same page I want to show the breakdown of counts per day for the call records, so I perform about 30 COUNT queries on the database.
Is there a way I can filter the set that I retrieve initially and perform the counts on the in-memory set, so I don't have to run those repeated queries? I need those counts for graphing and display purposes, but even with an index on date it takes about 10 seconds to run the initial query plus all of the count queries.
What I'm basically asking is: is there a way to perform the counts on the records already returned, or a smarter way to cache this data?
@set = Record.get_records_for_range(date1, date2)
while date1 < date2
  @count = Record.count_records_for_date(date1)
  date1 = date1 + 1
end
is basically what I'm doing. Surely there's a simpler and faster way?
Using @set.length will get you the count of the in-memory set without querying the database, because it is performed by Ruby, not ActiveRecord (like .count is).
Read about it here: https://batsov.com/articles/2014/02/17/the-elements-of-style-in-ruby-number-13-length-vs-size-vs-count/
Here is a quote pulled out of that article
length is a method that’s not part of Enumerable - it’s part of a concrete class (like String or Array) and it’s usually running in O(1) (constant) time. That’s as fast as it gets, which means that using it is probably a good idea.
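For the per-day breakdown itself, the already-loaded set can be grouped in memory instead of issuing ~30 COUNT queries. A sketch in plain Ruby (Row and counts_per_day are hypothetical stand-ins for the question's Record objects; anything responding to a date attribute works the same way):

```ruby
require 'date'

# Hypothetical stand-in for the question's Record rows.
Row = Struct.new(:date)

# One pass over the in-memory set instead of one COUNT query per day.
def counts_per_day(records, from, to)
  by_day = records.group_by(&:date)
  (from...to).each_with_object({}) do |day, counts|
    counts[day] = (by_day[day] || []).length
  end
end
```

With this, the page needs only the single range query; the loop over dates then reads from the hash instead of the database.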