What causes dask job failure with CancelledError exception - dask

I have been seeing the error message below for quite some time now but could not figure out what leads to the failure.
Error:
concurrent.futures._base.CancelledError: ('sort_index-f23b0553686b95f2d91d4a3fda85f229', 7)
After a restart of the dask cluster it runs successfully.

If you are running a dask-cloudprovider ECSCluster or FargateCluster, the concurrent.futures._base.CancelledError can result from a long-running step in the computation that produces no output (logging or otherwise) to the Client. Because of the lack of interaction with the client, the scheduler regards itself as "idle" and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError message is misleading, but if you look in the logs of the scheduler task itself, it will record the idle timeout.
The solution is to set scheduler_timeout to a higher value, either via config or by passing it directly to the ECSCluster/FargateCluster constructor.
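For example, a minimal sketch (the "2 hours" value is arbitrary, and the import path shown is for recent dask-cloudprovider releases; older ones expose FargateCluster at the package top level):

from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster

# Keep the scheduler alive through long, quiet computations (default is "5 minutes").
cluster = FargateCluster(scheduler_timeout="2 hours")
client = Client(cluster)

# Alternatively, set it via dask config instead of the constructor:
# import dask
# dask.config.set({"cloudprovider.ecs.scheduler_timeout": "2 hours"})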

Related

Is there a way to run code before Sidekiq is restarted in the middle of a job?

I have a Sidekiq job that runs every 4 minutes.
This job checks whether the current code block is already being executed before executing the code again:
process = ProcessTime.where("name = 'ad_queue_process'").first
# Return if job is running
return if process.is_running == true
If Sidekiq restarts midway through the code block, code that updates the status of the job never runs
# Done running, update the process times and allow it to be ran again
process.update_attributes(is_running: false, last_execution_time: Time.now)
Which leads to the job never running unless I run an update statement to set is_running = false.
Is there any way to execute code before Sidekiq is restarted?
Update:
Thanks to #Aaron, and following our discussion (comments below): the ensure block (which is executed by the forked worker-threads) only gets a few unguaranteed milliseconds to run before the main thread forcefully terminates those worker-threads, so that the main thread can do some "cleanup" up the exception stack and avoid getting SIGKILL-ed by Heroku. Therefore, make sure that your ensure code is really fast!
TL;DR:
def perform(*args)
  # your code here
ensure
  process.update_attributes(is_running: false, last_execution_time: Time.now)
end
The ensure above is always called regardless of whether the method "succeeded" or an Exception was raised. I tested this: see this repl code, and click "Run".
In other words, this is always called even on a SignalException, even if the signal is SIGTERM (the graceful-shutdown signal), except on SIGKILL (forced, unrescuable shutdown). You can verify this behaviour by checking my repl code, changing Process.kill('TERM', Process.pid) to Process.kill('KILL', Process.pid), and clicking "Run" again (you'll notice that the puts won't be called).
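For anyone who wants to reproduce that experiment outside Ruby, here is a rough Python equivalent (purely illustrative; note that unlike Ruby, which raises SignalException by default, Python only runs a finally block on SIGTERM if a handler turns the signal into an exception first):

import os
import signal
import time

def handle_term(signum, frame):
    # Without this handler Python would just die on SIGTERM and skip `finally`;
    # Ruby raises SignalException by default, which is why `ensure` runs there.
    raise SystemExit(1)

signal.signal(signal.SIGTERM, handle_term)

try:
    os.kill(os.getpid(), signal.SIGTERM)  # change SIGTERM to SIGKILL: cleanup never prints
    time.sleep(5)
finally:
    print("cleanup ran")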
Looking at Heroku docs, I quote:
When Heroku is going to shut down a dyno (for a restart or a new deploy, etc.), it first sends a SIGTERM signal to the processes in the dyno.
After Heroku sends SIGTERM to your application, it will wait a few seconds and then send SIGKILL to force it to shut down, even if it has not finished cleaning up. In this example, the ensure block does not get called at all, the program simply exits
... which means that the ensure block will be called because it's a SIGTERM and not a SIGKILL, except if shutting down takes a looong time, which may be due to (some reasons I can think of at the moment):
Something inside your perform code (or any Ruby code in the stack, even gems) that also rescues SignalException (or even the root Exception class, since SignalException is a subclass of Exception) but takes a long time cleaning up (i.e. closing connections to the DB, or I/O that hangs your application).
Or, your own ensure block above takes a looong time. I.e. when doing the process.update_attributes(...), the DB temporarily hangs or there is a network delay or timeout; then that update might not succeed at all and will run out of time, because, per the quote above, a few seconds after the SIGTERM the application will be forcibly stopped by Heroku sending a SIGKILL.
... which all means that my solution is still not fully reliable, but it should work under normal situations.
Handle sidekiq shutdown exception
class SomeWorker
  include Sidekiq::Worker
  sidekiq_options queue: :default

  def perform(params)
    ...
  rescue Sidekiq::Shutdown
    SomeWorker.perform_async(params)
  end
end

Transactions not expired after timeout expired

We're using neo4j (3.1.5-enterprise) for one of our services. (Over HTTP)
We set dbms.transaction.timeout=150s in our neo4j config file.
We have a scenario which may take more time than 150 seconds, but what we would like is for the transaction to be expired after 150 seconds anyway.
For some reason that's not happening; the transaction continues until it is fully executed and is not stopped after 150 seconds. Any guess why?
In our application logs I can see the following exception (more stacktrace details below):
neo.db.NeoHttpDriver - Errors in response:
[NeoResponseError{
code='Neo.DatabaseError.Statement.ExecutionFailed',
message='Transaction timeout. (Overtime: 23793 ms).',
stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
...
Also, our service's steps (in the specific scenario that may take a long time) are, in general: open a transaction, lock some common entity, and proceed. Since the transaction is not expired and released after 150 seconds (and therefore the common entity continues to be locked), other threads may also be blocked for a long time.
Thanks!
Orel
Exception stacktrace:
15:00:59.627 [DefaultThreadPool-7] DEBUG c.e.e.m.neo.db.NeoHttpDriver - Errors in response: [NeoResponseError{code='Neo.DatabaseError.Statement.ExecutionFailed', message='Transaction timeout. (Overtime: 23793 ms).', stackTrace='org.neo4j.kernel.guard.GuardTimeoutException: Transaction timeout. (Overtime: 23793 ms).
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:71)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:57)
at org.neo4j.kernel.guard.TimeoutGuard.check(TimeoutGuard.java:49)
at org.neo4j.kernel.impl.api.GuardingStatementOperations.nodeCursorById(GuardingStatementOperations.java:300)
at org.neo4j.kernel.impl.api.OperationsFacade.nodeHasProperty(OperationsFacade.java:343)
at org.neo4j.cypher.internal.spi.v3_1.TransactionBoundQueryContext$NodeOperations.hasProperty(TransactionBoundQueryContext.scala:319)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply$mcZ$sp(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations$$anonfun$hasProperty$1.apply(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.spi.v3_1.ExceptionTranslationSupport$class.translateException(ExceptionTranslationSupport.scala:32)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1.translateException(ExceptionTranslatingQueryContextFor3_1.scala:34)
at org.neo4j.cypher.internal.compatibility.ExceptionTranslatingQueryContextFor3_1$ExceptionTranslatingOperations.hasProperty(ExceptionTranslatingQueryContextFor3_1.scala:245)
at org.neo4j.cypher.internal.compiler.v3_1.spi.DelegatingOperations.hasProperty(DelegatingQueryContext.scala:221)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.AbstractSetPropertyOperation.setProperty(SetOperation.scala:98)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetEntityPropertyOperation.set(SetOperation.scala:117)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:31)
at org.neo4j.cypher.internal.compiler.v3_1.pipes.SetPipe$$anonfun$internalCreateResults$1.apply(SetPipe.scala:30)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:71)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$next$1.apply(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator$$anonfun$failIfThrows$1.apply(ResultIterator.scala:94)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.decoratedCypherException(ResultIterator.scala:103)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.failIfThrows(ResultIterator.scala:92)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.next(ResultIterator.scala:49)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.foreach(ResultIterator.scala:49)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.to(ResultIterator.scala:49)
at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toList(ResultIterator.scala:49)
at org.neo4j.cypher.internal.compiler.v3_1.EagerResultIterator.<init>(ResultIterator.scala:35)
at org.neo4j.cypher.internal.compiler.v3_1.ClosingIterator.toEager(ResultIterator.scala:53)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.buildResultIterator(DefaultExecutionResultBuilderFactory.scala:109)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.createResults(DefaultExecutionResultBuilderFactory.scala:99)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.DefaultExecutionResultBuilderFactory$ExecutionWorkflowBuilder.build(DefaultExecutionResultBuilderFactory.scala:68)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:164)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anonfun$getExecutionPlanFunction$1.apply(ExecutionPlanBuilder.scala:148)
at org.neo4j.cypher.internal.compiler.v3_1.executionplan.InterpretedExecutionPlanBuilder$$anon$1.run(ExecutionPlanBuilder.scala:123)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:275)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper$$anonfun$run$1.apply(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.compatibility.exceptionHandlerFor3_1$runSafely$.apply(CompatibilityFor3_1.scala:190)
at org.neo4j.cypher.internal.compatibility.CompatibilityFor3_1$ExecutionPlanWrapper.run(CompatibilityFor3_1.scala:273)
at org.neo4j.cypher.internal.PreparedPlanExecution.execute(PreparedPlanExecution.scala:26)
at org.neo4j.cypher.internal.ExecutionEngine.execute(ExecutionEngine.scala:107)
at org.neo4j.cypher.internal.javacompat.ExecutionEngine.executeQuery(ExecutionEngine.java:59)
at org.neo4j.server.rest.transactional.TransactionHandle.safelyExecute(TransactionHandle.java:371)
at org.neo4j.server.rest.transactional.TransactionHandle.executeStatements(TransactionHandle.java:323)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:230)
at org.neo4j.server.rest.transactional.TransactionHandle.execute(TransactionHandle.java:119)
at org.neo4j.server.rest.web.TransactionalService.lambda$executeStatements$0(TransactionalService.java:203)
Most likely the problem is that the tx is waiting on a lock. Prior to Neo4j 3.2, dbms.transaction.timeout cannot cover the case of terminating a transaction that's waiting on a lock (or rather, it will mark it for termination, but the actual termination won't happen until the lock is acquired).
In Neo4j 3.2, dbms.lock.acquisition.timeout was introduced, which interrupts waiting on a lock and allows the thread to check if the tx has been terminated and take appropriate action.
The following is based on an answer provided by Neo4j Support:
dbms.lock.acquisition.timeout
As a starting point, dbms.lock.acquisition.timeout was only added in Neo4j 3.2; it does not exist in 3.1, which has no lock acquisition timeout yet, so waits on locks can overrun the set limit. Things like GC can also extend the time. Since you're currently on 3.1.5, dbms.lock.acquisition.timeout would not yet be enforced.
dbms.transaction.timeout
dbms.transaction.timeout marks a transaction for termination, but the actual logic of checking this and performing the termination happens on a running thread, not one waiting on locks, and it does not cause the waiting thread to wake up and check. Presumably the logic for terminating a transaction upon timeout is that some other thread periodically checks the execution time of a transaction and, if it has exceeded the transaction timeout, sets a boolean flag on the transaction to mark it for termination. The actual termination likely happens in an event loop for the transaction, where it checks that flag, then terminates and rolls back. A thread that attempts to acquire a lock enters a waiting state when the lock is already held by another thread. During this waiting state the event loop is not being processed, so the thread never reaches the point in the event loop where it can check whether it has been marked as terminated and act on it.
Bottom line:
dbms.transaction.timeout does not cause a hard timeout; it only marks the transaction as timed out, which will cause it to roll back once the flag is checked.
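For reference, a sketch of the relevant neo4j.conf settings (the 150s lock-acquisition value here is just an assumption chosen to match the transaction timeout; the setting only exists from Neo4j 3.2 onwards):

dbms.transaction.timeout=150s
# Available only in Neo4j 3.2+; interrupts threads waiting on locks
dbms.lock.acquisition.timeout=150s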

Dataflow concurrency error with ValueState

The Beam 2.1 pipeline uses ValueState in a stateful DoFn. It runs fine with a single worker, but when scaling is enabled it fails with "Unable to read value from state" and the root exception below. Any ideas what could cause this?
Caused by: java.util.concurrent.ExecutionException: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:500)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:459)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at com.google.cloud.dataflow.worker.repackaged.com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:62)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:309)
at com.google.cloud.dataflow.worker.WindmillStateInternals$WindmillValue.read(WindmillStateInternals.java:384)
... 16 more
Caused by: com.google.cloud.dataflow.worker.KeyTokenInvalidException: Unable to fetch data due to token mismatch for key ��
at com.google.cloud.dataflow.worker.WindmillStateReader.consumeResponse(WindmillStateReader.java:469)
at com.google.cloud.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:411)
at com.google.cloud.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:306)
... 17 more
I believe that exception should just be rethrown. It is thrown by the state mechanism to indicate that additional work on that key should not be performed, and it will be automatically retried by the Dataflow runner.
These errors typically indicate that that particular piece of work should be performed on a different worker, so proceeding on the current worker wouldn't be helpful.
It may be possible that misusing state -- storing the state object from one key and attempting to use it on a different key -- could also lead to these errors. If that is the case, you may be able to see more diagnostic messages in either the worker or shuffler logs in Stackdriver logging.
If neither retrying nor looking at logging and how you use the state objects help, please provide a job ID demonstrating the problem.
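To illustrate the state-misuse point above: the state parameter a stateful DoFn receives is scoped to the current key and should never be cached and reused for another key. This is not the code from the question; it is a minimal sketch using the Beam Python SDK's rough equivalent of ValueState (ReadModifyWriteStateSpec), just to show the correct per-key pattern:

import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import ReadModifyWriteStateSpec

class CountPerKey(beam.DoFn):
    # Declared once on the DoFn; the runner scopes it to each key automatically.
    COUNT = ReadModifyWriteStateSpec('count', VarIntCoder())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, value = element              # must be applied to a keyed PCollection
        current = count.read() or 0       # the state handed in is for *this* key only
        count.write(current + 1)          # never cache it and reuse it for another key
        yield key, current + 1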

How to protect against Redis::TimeoutError: Connection timed out on Heroku

This might seem like a silly question, but I figured someone here on StackOverflow might have some ideas around this. What the hell, right?
I'm using Heroku workers with 1X Dynos to run Resque. Sometimes I get this error: Redis::TimeoutError: Connection timed out. It happens in the redis gem; here's the stacktrace:
Redis::TimeoutError Connection timed out
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/connection/ruby.rb:55 rescue in _read_from_socket
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/connection/ruby.rb:48 _read_from_socket
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/connection/ruby.rb:41 gets
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/connection/ruby.rb:273 read
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:245 block in read
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:233 io
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:244 read
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:175 block in call_pipelined
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:214 block (2 levels) in process
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:340 ensure_connected
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:204 block in process
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:286 logging
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:203 process
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:174 call_pipelined
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:146 block in call_pipeline
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:273 with_reconnect
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis/client.rb:144 call_pipeline
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis.rb:2101 block in pipelined
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis.rb:37 block in synchronize
vendor/ruby-2.1.5/lib/ruby/2.1.0/monitor.rb:211 mon_synchronize
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis.rb:37 synchronize
vendor/bundle/ruby/2.1.0/gems/redis-3.2.0/lib/redis.rb:2097 pipelined
vendor/bundle/ruby/2.1.0/gems/redis-namespace-1.5.1/lib/redis/namespace.rb:413 namespaced_block
vendor/bundle/ruby/2.1.0/gems/redis-namespace-1.5.1/lib/redis/namespace.rb:265 pipelined
vendor/bundle/ruby/2.1.0/gems/resque-1.25.2/lib/resque.rb:214 push
vendor/bundle/ruby/2.1.0/gems/resque-1.25.2/lib/resque/job.rb:153 create
vendor/bundle/ruby/2.1.0/gems/resque_solo-0.1.0/lib/resque_ext/job.rb:7 create_solo
vendor/bundle/ruby/2.1.0/gems/resque-1.25.2/lib/resque.rb:317 enqueue_to
vendor/bundle/ruby/2.1.0/gems/resque-1.25.2/lib/resque.rb:298 enqueue
We have resque-retry set up, but I guess it doesn't matter if the enqueue call can't even connect to redis.
My current solution is to wrap every Resque.enqueue call with begin/rescue so that we can re-try the enqueue call (as per https://github.com/resque/resque/issues/840). But isn't there a better way?
Heroku Redis allows you to change your instance's timeout and maxmemory-policy settings. These settings will be kept across upgrades and HA failover.
From documentation:
The timeout setting sets the number of seconds Redis waits before killing idle connections. A value of zero means that connections will not be closed. The default value is 300 seconds (5 minutes). You can change this value using the CLI:
$ heroku redis:timeout maturing-deeply-2628 --seconds 60
Timeout for maturing-deeply-2628 (REDIS_URL) set to 60 seconds.
Connections to the redis instance will be stopped after idling for 60 seconds.
It looks like, in the case of Resque, 0 (do not time out idle connections) would be the best choice (--seconds 0).
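If you also keep the retry-wrapper approach mentioned in the question, a minimal sketch of that idea looks like this (shown in Python with redis-py purely for illustration; the helper name, attempt counts, and queue key are made up, and the original setup is Ruby/Resque):

import time
import redis

redis_client = redis.from_url("redis://localhost:6379/0", socket_timeout=5)

def with_redis_retry(operation, attempts=3, delay=0.5):
    """Retry a Redis-backed enqueue call a few times on connection timeouts."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except redis.exceptions.TimeoutError:
            if attempt == attempts:
                raise                    # give up after the last attempt
            time.sleep(delay * attempt)  # simple linear backoff

# Usage: retry the push instead of failing the whole request on a transient timeout.
with_redis_retry(lambda: redis_client.lpush("queue:default", '{"class":"SomeJob"}'))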

Infinispan synchronization timeout exception

I'm afraid I'm a little bit of a noob at this, but here goes.
I have an issue with Infinispan and timeouts during synchronization.
I'm running two instances of JBoss AS7 with Infinispan 5.2.7-FINAL, using TCP for synchronization.
Sometimes when deleting an entry from a cache on one of the nodes, the synchronization seems to fail, but checking the log on both nodes tells me the entry is gone and all the required cleanup on the application side (tearing down Camel routes, etc.) has been done. The initiating node claims to time out when notifying its peer:
09:53:33,182 INFO [ConfigurationService] (http-/10.217.xx.yy:8080-3) ConfigurationService: deleteSmsAccountConfigurationData: got smsAccountID of SmsAccountIdentifier{smsAccountId=tel:46xxxxxx}
09:53:33,183 INFO [com.ex.control.throttling.controller.ThrottlingService] (http-/10.217.xx.yy:8080-3) ThrottlingService: deleteThrottleForResource: deleting throttle from resource tel:46xxxxxx
09:53:45,232 INFO [com.ex.control.routing.startup.DynamicRouteBootStrap] (OOB-21,smapexs01-54191) DynamnicRouteBootStrap: stopRoute: Attempting to stop sendSmppRoute: sendsmpproutetel:46xxxxxx recieveSmppRoute: recievesmpproutetel:46xxxxxx
09:53:45,233 INFO [org.apache.camel.impl.DefaultShutdownStrategy] (OOB-21,smapexs01-54191) Starting to graceful shutdown 1 routes (timeout 300 seconds)
09:53:45,235 INFO [org.apache.camel.impl.DefaultShutdownStrategy] (Camel (camel-2) thread #1 - ShutdownTask) Route: sendsmpproutetel:46xxxxxx shutdown complete, was consuming from: Endpoint[direct://tel:46xxxxxx]
09:53:45,235 INFO [org.apache.camel.impl.DefaultShutdownStrategy] (OOB-21,smapexs01-54191) Graceful shutdown of 1 routes completed in 0 seconds
09:53:48,190 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (http-/10.217.xx.yy:8080-3) ISPN000136: Execution error: org.infinispan.CacheException: org.jgroups.TimeoutException: timeout sending message to smapexs03-38104
On this node it takes less than a second (I guess) to tear it down; on the other node it takes 6 seconds. Is there a timeout of, say, 5 seconds somewhere that I have overlooked?
I've searched this forum and others, but have not been able to sort it out.
Like I said, I'm a bit of a noob, I hope this makes sense to you. Anything that points me in the right direction is very welcome!
Thanks,
Peter
