with gevent.Timeout(0.1) as tt:
    time.sleep(1)
The above does not raise an exception.
with gevent.Timeout(0.1) as tt:
    gevent.sleep(1)
This throws gevent.timeout.Timeout: 0.1 seconds.
The only difference is time.sleep() vs gevent.sleep()!
time.sleep() actually pauses all execution of code and does not allow any other code to run. Gevent is an event-loop, which means that it allows other "threads" (greenlets) to run while blocking.
Essentially gevent has a list of tasks it is executing. It only allows 1 task to run at a time. If you say time.sleep(1), that task is still running, but not doing anything. If you say gevent.sleep(1), it pauses the current task and allows the other tasks to run.
gevent.Timeout() actually launches a second task to monitor how much time has elapsed. Since time.sleep() never yields, that second task never gets a chance to throw the error.
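Below is a minimal, self-contained sketch of both behaviours (assuming gevent is installed); the final comment mentions gevent's monkey patching, which swaps time.sleep for a cooperative version so the timeout can fire:
import time
import gevent

# 1) Blocking sleep: the greenlet never yields, so the Timeout never gets to fire.
try:
    with gevent.Timeout(0.1):
        time.sleep(1)
    print("time.sleep(): no Timeout raised")
except gevent.Timeout:
    print("not reached")

# 2) Cooperative sleep: control returns to the event loop, so the Timeout fires.
try:
    with gevent.Timeout(0.1):
        gevent.sleep(1)
except gevent.Timeout:
    print("gevent.sleep(): Timeout raised")

# A common fix for blocking calls is monkey patching:
#   from gevent import monkey; monkey.patch_all()
# after which time.sleep() is replaced by a cooperative sleep.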
I have a Sidekiq job that runs every 4 minutes.
This job checks if the current code block is being executed before executing the code again
process = ProcessTime.where("name = 'ad_queue_process'").first
# Return if job is running
return if process.is_running == true
If Sidekiq restarts midway through the code block, code that updates the status of the job never runs
# Done running, update the process times and allow it to be ran again
process.update_attributes(is_running: false, last_execution_time: Time.now)
Which leads to the job never running unless I run an update statement to set is_running = false.
Is there any way to execute code before Sidekiq is restarted?
Update:
Thanks to @Aaron, and following our discussion (comments below): the ensure block (which is executed by the forked worker threads) can only run for a few unguaranteed milliseconds before the main thread forcefully terminates these worker threads, so that the main thread can do some "cleanup" up the exception stack and avoid getting SIGKILL-ed by Heroku. Therefore, make sure your ensure code is really fast!
TL;DR:
def perform(*args)
  # your code here
ensure
  process.update_attributes(is_running: false, last_execution_time: Time.now)
end
The ensure above is always called, regardless of whether the method "succeeded" or an Exception was raised. I tested this: see this repl code, and click "Run".
In other words, this is always called even on a SignalException, even if the signal is SIGTERM (the graceful shutdown signal), except on SIGKILL (forced, unrescuable shutdown). You can verify this behaviour with my repl code: change Process.kill('TERM', Process.pid) to Process.kill('KILL', Process.pid), click "Run" again, and you'll notice that the puts is not called.
Looking at Heroku docs, I quote:
When Heroku is going to shut down a dyno (for a restart or a new deploy, etc.), it first sends a SIGTERM signal to the processes in the dyno.
After Heroku sends SIGTERM to your application, it will wait a few seconds and then send SIGKILL to force it to shut down, even if it has not finished cleaning up. In this example, the ensure block does not get called at all, the program simply exits
... which means that the ensure block will be called because it's a SIGTERM and not a SIGKILL, except if the shutdown takes a looong time, which may be due to (some reasons I can think of ATM):
Something inside your perform code (or any Ruby code in the stack, even gems) that also rescues SignalException, or even rescues the root Exception class (because SignalException is a subclass of Exception), but takes a long time cleaning up (i.e. closing DB connections, or I/O stuff that hangs your application).
Or, your own ensure block above takes a looong time. I.e. when doing the process.update_attributes(...), the DB temporarily hangs, or there is a network delay or timeout; then that update might not succeed at all and will run out of time, because (from the quote above) a few seconds after the SIGTERM, Heroku forces the application to stop by sending a SIGKILL.
... which all means that my solution is still not fully reliable, but it should work under normal circumstances.
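(Side note, not Sidekiq's API: a rough Python sketch of the same idea, with time.sleep and print standing in for the job body and the status update. Unlike Ruby, Python only runs the finally block on SIGTERM if a handler turns the signal into an exception; nothing at all can run on SIGKILL.)
import signal
import time

def handle_sigterm(signum, frame):
    # Turn SIGTERM into an exception so the cleanup below gets a chance to run.
    # (Ruby raises SignalException automatically; Python needs this handler.)
    raise SystemExit(1)

signal.signal(signal.SIGTERM, handle_sigterm)

try:
    time.sleep(60)          # stand-in for the long-running job body
finally:
    print("cleaning up")    # stand-in for the status update; keep it fast, SIGKILL follows soon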
Handle sidekiq shutdown exception
class SomeWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default

  def perform(params)
    ...
  rescue Sidekiq::Shutdown
    SomeWorker.perform_async(params)
  end
end
We're using neo4j (3.1.5-enterprise) for one of our services. (Over HTTP)
We set dbms.transaction.timeout=150s in our Neo4j config file.
We have a scenario which may take more time than 150 seconds, but what we would like is for the transaction to be expired after 150 seconds anyway.
For some reason that's not happening: the transaction continues until it is fully executed and is not stopped after 150 seconds. Any guess why?
In our application logs I can see the following exception (more stacktrace details below):
neo.db.NeoHttpDriver - Errors in response:
[NeoResponseError{
code='Neo.DatabaseError.Statement.ExecutionFailed',
message='Transaction timeout. (Overtime: 23793 ms).',
stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
...
Also, our service's steps (in the specific scenario that may take a long time) are, in general: open a transaction, lock some common entity, and proceed. Since the transaction is not expired and released after 150 seconds (and therefore the common entity continues to be locked), other threads may also be blocked for a long time.
Thanks!
Orel
Exception stacktrace:
15:00:59.627 [DefaultThreadPool-7] DEBUG c.e.e.m.neo.db.NeoHttpDriver - Errors in response: [NeoResponseError{code='Neo.DatabaseError.Statement.ExecutionFailed', message='Transaction timeout. (Overtime: 23793 ms).', stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:71)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:57)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:49)
at org.neo4j.kernel.impl.api.GuardingStatementOperations.nodeCursorById(GuardingStatementOperations.java:300)
at org.neo4j.kernel.impl.api.OperationsFacade.nodeHasProperty(OperationsFacade.java:343)
at org.neo4j.cypher.internal.spi.v3_1.TransactionBoundQueryContext$NodeOperations.hasProperty(TransactionBoundQueryContext.scala:319)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply$mcZ$sp(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.spi.v3_1.ExceptionTranslationSupport$class.translateException(ExceptionTranslationSupport.scala:32)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1.translateException(ExceptionTranslatingQueryContextFor3_1.scala:34)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations.hasProperty(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compiler.v3_1.spi.DelegatingOperations.hasProperty(DelegatingQueryContext.scala:221)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.AbstractSetPropertyOperation.setProperty(SetOperation.scala:98)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetEntityPropertyOperation.set(SetOperation.scala:117)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:31)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:71)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$failIfThrows$1.apply(ResultIterator.scala:94)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.decoratedCypherException(ResultIterator.scala:103)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.failIfThrows(ResultIterator.scala:92)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:49)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.foreach(ResultIterator.scala:49)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.to(ResultIterator.scala:49)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toList(ResultIterator.scala:49)
at org.neo4j.cypher.internal.compiler.v3_1.EagerResultIterator.<init>(ResultIterator.scala:35)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toEager(ResultIterator.scala:53)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.buildResultIterator(DefaultExecutionResultBuilderFactory.scala:109)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.createResults(DefaultExecutionResultBuilderFactory.scala:99)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.build(DefaultExecutionResultBuilderFactory.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:164)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:148)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anon$1.run(ExecutionPlanBuilder.scala:123)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:275)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.compatibility.exceptionHandlerFor3_1$runSafely$.apply(CompatibilityFor3_1.scala:190)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper.run(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.PreparedPlanExecution.execute(PreparedPlanExecution.scala:26)
at org.neo4j.cypher.internal.ExecutionEngine.execute(ExecutionEngine.scala:107)
at org.neo4j.cypher.internal.javacompat.ExecutionEngine.executeQuery(ExecutionEngine.java:59)
at org.neo4j.server.rest.transactional.TransactionHandle.safelyExecute(TransactionHandle.java:371)
at org.neo4j.server.rest.transactional.TransactionHandle.executeStatements(TransactionHandle.java:323)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:230)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:119)
at org.neo4j.server.rest.web.TransactionalService.lambda$executeStatements$0(TransactionalService.java:203)
Most likely the problem is that the tx is waiting on a lock. Prior to Neo4j 3.2, dbms.transaction.timeout cannot cover the case of terminating a transaction that's waiting on a lock (or rather, it will mark it for termination, but the actual termination won't happen until the lock is acquired).
In Neo4j 3.2, dbms.lock.acquisition.timeout was introduced, which interrupts waiting on a lock and allows the thread to check if the tx has been terminated and take appropriate action.
The following is based on an answer provided by Neo4j Support:
dbms.lock.acquisition.timeout
As a starting point, dbms.lock.acquisition.timeout was only added in Neo4j 3.2; it does not exist in 3.1, where we don't yet have a lock acquisition timeout, hence wait times on locks can overrun the set limit. Things like GC can also extend the time. As you're currently on 3.1.5, dbms.lock.acquisition.timeout is not yet enforced.
dbms.transaction.timeout
dbms.transaction.timeout marks a transaction for termination, but the actual logic of checking this and performing the termination happens on a running thread, not one waiting on locks, and doesn't cause the thread to wake up and check. Presumably the logic for terminating a thread upon timeout is that some other thread periodically checks execution time for a transaction, and if it has exceeded the transaction timeout, sets a boolean variable on the transaction to indicate that it is marked for termination. The actual termination of the thread likely happens in an event loop for the transaction, where it checks that variable to see if it's marked for termination, then terminates and rolls back. A thread that attempts to acquire a lock enters a waiting state when the lock is already held by another thread. During this waiting state, the event loop is not being processed, so the thread never reaches the point in the event loop where it can check if it's been marked as terminated and take care of it.
Bottom line:
dbms.transaction.timeout does not cause a hard timeout; it only marks the transaction as timed out, which will cause it to roll back once the flag is checked.
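For reference, a minimal neo4j.conf sketch combining both settings (dbms.lock.acquisition.timeout only takes effect on Neo4j 3.2+; the values are just examples):
# neo4j.conf
dbms.transaction.timeout=150s
# Only honoured from Neo4j 3.2 onwards; interrupts threads waiting on locks
dbms.lock.acquisition.timeout=150s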
I have been seeing the below error message for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
On restarting the Dask cluster, it runs successfully.
If running a dask-cloudprovider ECSCluster or FargateCluster, the concurrent.futures._base.CancelledError can result from a long-running step in the computation during which there is no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError message is misleading, but if you look in the logs for the scheduler task itself, it will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing directly to the ECSCluster/FargateCluster constructor.
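For example, a minimal sketch passing it to the constructor (depending on your dask-cloudprovider version the import may be dask_cloudprovider.FargateCluster rather than dask_cloudprovider.aws.FargateCluster; pick a value that comfortably exceeds your longest quiet stretch):
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster  # ECSCluster takes the same argument

# Raise the idle timeout so a long, silent computation isn't mistaken for an idle scheduler
cluster = FargateCluster(scheduler_timeout="30 minutes")
client = Client(cluster)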
What is the right approach when writing Celery tasks that communicate with a service that has rate limits and is sometimes missing (not responding) for a long period of time?
Do I have to use task retries? What if the service is missing for too long? Is there a way to store these tasks for later execution after a long time period?
What if this is a subtask in a long task?
First, I suggest you set a socket timeout to avoid waiting a long time for a response.
Then you can catch the socket timeout exception and, in this particular case, do a retry with a big delay, like 15 minutes.
Anyway, normally I use an incremental retry with a percentage increment; this increases the delay every time the task retries, which is useful when you write tasks that depend on external services that can be unavailable for a long time.
You can set a high number of retries on the task, like 50, and then set the standard retry delay by using the variable:
#20 seconds
self.default_retry_delay = 20
After that you can implement a method like this for your task:
def incrementalRetry(self, exc, perc=20, args=None):
    """By default the retry delay is increased by 20 percent."""
    if args:
        self.request.args = args
    delay = self.default_retry_delay
    if 'retry_delay' in self.request.kwargs:
        delay = self.request.kwargs['retry_delay']
    retry_delay = delay + round((delay * perc) / 100, 2)
    # print("delay " + str(retry_delay))
    self.request.kwargs.update({'retry_delay': retry_delay})
    self.retry(self.request.args, self.request.kwargs,
               exc=exc, countdown=retry_delay, max_retries=self.max_retries)
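To tie it together, here is a minimal sketch of a task that retries with a long countdown on socket timeouts (call_external_service and do_request are hypothetical names; you could call the incrementalRetry helper above instead of using a fixed countdown):
import socket
from celery import shared_task

@shared_task(bind=True, max_retries=50, default_retry_delay=20)
def call_external_service(self, payload):
    try:
        return do_request(payload)  # hypothetical call to the rate-limited service
    except socket.timeout as exc:
        # Service not responding: retry much later, e.g. in 15 minutes
        raise self.retry(exc=exc, countdown=15 * 60)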
What if this is a subtask in a long task?
If you don't need the result, you can launch it in async mode by using task.delay(args=[]).
Another nice feature is the task group, which allows you to launch different tasks and, after all of them are finished, do something else in your workflow.
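For example, a small sketch of a group (reusing the hypothetical call_external_service task from the sketch above):
from celery import group

# Launch the tasks in parallel; the group result lets you wait for all of them
job = group(call_external_service.s(p) for p in ["a", "b", "c"])
result = job.apply_async()
result.join()  # blocks until every task in the group has finished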
Suppose I use QTP's recovery scenario manager to set the playback synchronization timeout to 0. The handler would return with "continue with next statement".
I'd do that to make sure that any following playback statements don't waste their time waiting for the next non-existing/non-matching step before failing:
I have a lot of GUI tests that kind of get stuck because, let's say, if 10 controls are missing, their (consecutive) playback steps produce 10 timeout waits before failing. If the playback timeout is 30 seconds, I lose 10x30 seconds = 5 minutes of execution time, while it really would be sufficient to wait for 30 seconds ONCE (because the app does not change anymore -- we waited a full timeout period already).
Now if I have 100 test cases (=action iterations), this possibly happens 100 times, wasting 500 minutes of my test exec time window.
That's why I came up with the idea of a recovery scenario function setting the timeout to 0 after/upon the first failed playback step. This would speed things up while skipping the rightly-FAILED step, yet would not compromise the precision/reliability of identifying the next matching GUI context (which creates a PASSED step).
Then of course upon the next passed playback step, I would want to restore the original timeout value. How could I do that? This is my question.
One cannot define a recovery scenario function that is called for PASSED steps.
I am currently thinking about setting a method function for Reporter.ReportEvent and "sniffing" for PASSED log entries there. I'd install that method function in the recovery scenario function that sets the timeout to 0. Then, when the "sniffer" function senses a ReportEvent call with PASSED status during one of the following playback steps, I'd reset everything (i.e. restore the original timeout and uninstall the method function). (I am 99% sure, however, that .Click and .Set methods do not call ReportEvent to write their result status... so this option probably won't work.)
Better ideas? This really bugs me.
It sounds to me like your tests aren't designed correctly; if you fail to find an object, why do you continue?
One possible (non-recovery-scenario) solution would be to use RegisterUserFunc to override the methods you are using, in order to do an obj.Exist(0) check before running the required method.
Function MyClick(obj)
    If obj.Exist(1) Then
        obj.Click
    Else
        Reporter.ReportEvent micFail, "Click failed, no object", "Object does not exist"
    End If
End Function

RegisterUserFunc "Link", "Click", "MyClick"
RegisterUserFunc "WebButton", "Click", "MyClick"
''# etc
If you have many controls, some of which may be missing, and you know that after the 10 seconds you mentioned (when the first timeout occurs) nothing more will show up, then you can use the Exist method with a timeout parameter.
Something like this:
timeout = 10
For Each control In controls
    If control.Exist(timeout) Then
        ' do something with the control
    Else
        timeout = 0
    End If
Next
Now only the first timeout will be 10 seconds; each subsequent timeout in your collection of controls will be 0, which will save you time.