One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):
    hook.download(bucket=bucket, object=input_file)
In my tests the folder contains a single 20 MB JSON file.
The task normally takes 20-30 seconds, but in some cases it runs for 5 minutes, after which its state is updated to SCHEDULED and it stays stuck there (I have waited more than 6 hours). I suspect the 5 minutes come from the scheduler_zombie_task_threshold setting (300 seconds), but I am not sure.
If I clear the task manually in the Web UI, it is quickly queued and runs correctly. I am working around the issue by setting an execution_timeout, which correctly moves the task to the FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes. I'd like to fix the underlying issue, though, rather than rely on a fixed timeout threshold. Any suggestions?
There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.
Although Composer is working on a fix, if you want this to happen less frequently in the current version, you may consider reducing your parallelism Airflow configuration or creating a new environment with a larger machine-type.
So I'm using Heroku to host a simple script, which runs whenever a specific page is loaded. The script takes longer than 30 seconds to run, which Heroku reports as an H12 error - Request Timeout (https://devcenter.heroku.com/articles/limits#router). I can't use a background process for this task, as I'm using its run time as a loading screen for the user. I know the process will still complete, but I want a 200 code to be sent when the script finishes.
Is there a way to send a single byte every, say, 20 seconds, so that the request doesn't time out, and to stop when the script finishes? (A response from the Heroku page starts a rolling 55-second window that prevents the timeout.) Do I have to run another process simultaneously to check whether the longer process is finished, sending a kind of 'heartbeat' to the requesting page to let it know the process is still running and to prevent Heroku from timing out? I'm extremely new to Rails; any and all help is appreciated!
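The heartbeat idea described in the question can be sketched in plain Ruby without a second process: run the slow work in a thread and write one byte to the response stream each time a timed join expires. This is only a sketch under assumptions: with_heartbeat is a hypothetical helper name, and out stands in for whatever streaming object the framework provides (in Rails, the response stream of an ActionController::Live controller could play that role).

```ruby
require 'stringio'

# Runs the block in a background thread and writes `beat` to `out`
# every `interval` seconds until the block finishes.
# Returns the block's result.
# `with_heartbeat` is a hypothetical helper, not a Rails or Heroku API.
def with_heartbeat(out, interval: 20, beat: ' ')
  worker = Thread.new { yield }
  # Thread#join(timeout) returns nil while the thread is still running.
  until worker.join(interval)
    out.write(beat)
  end
  worker.value
end

# Usage: StringIO stands in for the streaming response body.
out = StringIO.new
result = with_heartbeat(out, interval: 0.05) do
  sleep 0.2 # stands in for the long-running script
  :done
end
out.write("finished: #{result}")
```

Note that with a streamed response the status line is sent along with the first byte, so the 200 is committed before the script finishes; a failure in the script would have to be reported in the body rather than in the status code.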
Whenever I deploy with capistrano or run cap production delayed_job:restart, I end up with the currently-running delayed_job row remaining locked.
The delayed_job process is successfully stopped, a new delayed_job process is started, and a new row is locked by the new process. The problem is that the last process' row is still sitting there & marked as locked. So I have to go into the database manually, delete the row, and then manually add that job back into the queue for the new delayed_job process to get to.
Is there a way for the database cleanup & re-queue of the previous job to happen automatically?
I have the same problem. This happens whenever a job is forcibly killed. Part of the problem is that worker processes are managed by the daemons gem rather than delayed_job itself. I'm currently investigating ways to fix this, such as:
Setting a longer timeout before daemons forcibly terminates the worker (there is nothing about this in the docs for delayed_job or daemons)
Clearing locks before starting delayed_job workers
I'll post back here when and if I come up with a solution.
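The "clearing locks before starting workers" idea can be sketched as follows. This is a minimal in-memory model, not delayed_job code: LockedJob is a hypothetical stand-in for a delayed_jobs row, and the sketch assumes locked_by contains "host:<hostname> pid:<pid>", so check the actual values in your own table first. In a real application the same check would run against the delayed_jobs table before the new workers start.

```ruby
require 'socket'

# Hypothetical stand-in for a delayed_jobs row; only the lock columns matter.
LockedJob = Struct.new(:id, :locked_by, :locked_at)

# True if a process with this pid exists on the local machine.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0 only checks for existence
  true
rescue Errno::ESRCH
  false
rescue Errno::EPERM
  true # the process exists but belongs to another user
end

# Unlocks jobs whose locking worker ran on this host and is gone.
# Assumes locked_by looks like "... host:<hostname> pid:<pid>".
def clear_stale_locks(jobs, hostname: Socket.gethostname)
  jobs.each do |job|
    match = /host:(\S+) pid:(\d+)/.match(job.locked_by.to_s)
    next unless match && match[1] == hostname # never touch other hosts' locks
    next if process_alive?(match[2].to_i)
    job.locked_by = nil
    job.locked_at = nil
  end
end
```

Locks held by workers on other hosts are deliberately left alone, because a local pid check says nothing about them.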
Adjust your Daemon wait time or raise an exception on SIGINT.
@John Carney is correct. In short, all delayed_job workers get sent something like a SIGINT (nice interrupt) on a redeploy. delayed_job workers, by default, will complete their current job (if they are working on one) and then gracefully terminate.
However, if the job that they are working on is a longer-running job, there's an amount of time the Daemon manager waits before it gets annoyed and sends a more serious interrupt signal, like a SIGTERM or SIGKILL. This wait time and what gets sent really depends on your setup and configuration.
When that happens, the delayed_job worker gets killed immediately without being able to finish the job it is working on or even cleanup after itself and mark the job as no longer locked.
This ends up in a "stranded" job that is marked as "locked" but locked to a process/worker that no longer exists. Not good.
That's the crux of the issue and what is happening. To get around this, you have two main options, depending on what your jobs look like (we use both):
1. Raise an exception when an interrupt is received.
You can do this by setting the raise_signal_exceptions configuration to either :term or true:
Delayed::Worker.raise_signal_exceptions = :term
This configuration option accepts :term, true, or false (the default). You can read more on the original commit here.
I would try first with :term and see if that solves your issue. If not, you may need to set it to true.
Setting it to :term or true will raise an exception inside the worker and unlock the job, so that another delayed_job worker can pick it up and start working on it.
Setting it to true means that your delayed_job workers won't even attempt to finish the current job that they are working on. They will just immediately raise an exception, unlock the job and terminate themselves.
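To illustrate why raising an exception avoids a stranded lock, here is a simplified model in plain Ruby. It is not delayed_job's implementation: QueuedJob and run_locked are hypothetical names, and the sketch returns a symbol where a real worker would let the exception propagate and shut down.

```ruby
# Hypothetical stand-in for a delayed_jobs row.
QueuedJob = Struct.new(:id, :locked_by)

# Runs the block while the job is locked by `worker_name`.
# If a signal exception interrupts the work, the lock is released
# so that another worker can pick the job up later.
def run_locked(job, worker_name)
  job.locked_by = worker_name
  yield job
  :completed # a real worker would now delete the finished job
rescue SignalException
  job.locked_by = nil # release the lock instead of stranding the job
  :interrupted
end

# Usage: the raise stands in for what a TERM handler would trigger.
job = QueuedJob.new(42, nil)
outcome = run_locked(job, 'worker-1') do
  raise SignalException, 'TERM'
end
```

A SIGKILL cannot be trapped at all, which is why a worker that is killed outright never reaches this cleanup.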
2. Adjust how your workers are interrupted/terminated/killed on a redeploy.
This really depends on your redeploy, etc. In our case, we are using Cloud66 to handle deploys so we just had to configure this with them. But this is what ours looks like:
stop_sequence: int, 172800, term, 90, kill # Allows long-running delayed jobs to finish before being killed (i.e. on redeploy). Sends SIGINT, waits 48 hours, sends SIGTERM, waits 90 seconds, sends SIGKILL.
On a redeploy, this tells the Daemon manager to follow these steps with each delayed_job worker:
Send a SIGINT.
Wait 172800 seconds (2 days) - we have very long-running jobs.
Send a SIGTERM, if the worker is still alive.
Wait 90 seconds.
Send a SIGKILL, if the worker is still alive.
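The escalation steps above can be sketched in plain Ruby. This is not how Cloud66 or the daemons gem implements it: stop_with_escalation is a hypothetical helper, the sketch assumes the worker is a child of the calling process (required for waitpid), and it gives the final KILL a short wait so the exit can be observed.

```ruby
require 'rbconfig'

# Sends each signal in turn, waiting up to `wait` seconds after each one
# for the child process to exit. Returns the name of the signal after
# which the process exited, or nil if it survived the whole sequence.
# Assumes `pid` is a child of the current process.
def stop_with_escalation(pid, steps, poll: 0.05)
  steps.each do |signal, wait|
    Process.kill(signal, pid)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + wait
    loop do
      # With WNOHANG, waitpid returns nil while the child is still running.
      return signal if Process.waitpid(pid, Process::WNOHANG)
      break if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      sleep poll
    end
  end
  nil
end

# Usage: a sleeping child process stands in for a delayed_job worker.
child = Process.spawn(RbConfig.ruby, '-e', 'sleep 30')
stopped_by = stop_with_escalation(child, [['INT', 5], ['TERM', 5], ['KILL', 5]])
```

The waits here are seconds for demonstration only; the configuration above uses 172800 and 90.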
Anyway, that should help get you on the right track to configuring this properly for yourself.
We use both methods by setting a lengthy timeout as well as raising an exception when a SIGTERM is received. This ensures that if there is a job that runs past the 2 day limit, it will at least raise an exception and unlock the job, allowing us to investigate instead of just leaving a stranded job that is locked to a process that no longer exists.
In our application we use a rake task to send emails to around 11,000 users. Each email is sent as a delayed job, as follows:
@Users.each do |a|
  a.delay.send_email(body, text)
end
It was working perfectly two weeks ago and then suddenly slowed down. It used to send all of the emails within a single day, but now it takes much longer.
We have tried to track down this performance issue but haven't found anything so far.
1. We investigated the code and tried running a single delayed job with the database-reading part commented out, but it took the same amount of time.
2. We tried commenting out the email-sending part, but the delayed job took the same amount of time to execute.
Later we noticed the Heroku worker dyno setting. We currently have 1 worker and 2 web dynos. Is that the reason for the delay? If so, how did it work previously? Would adding more workers improve performance?
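A back-of-the-envelope estimate can help judge whether the number of workers is the bottleneck. The sketch below assumes each worker processes one job at a time; the 5 seconds per job is a made-up figure for illustration (only the 11,000 comes from the question), and drain_time_hours is a hypothetical helper.

```ruby
# Rough time, in hours, to drain a queue when each worker
# processes one job at a time.
def drain_time_hours(jobs:, seconds_per_job:, workers:)
  (jobs * seconds_per_job.to_f) / workers / 3600
end

# 11,000 jobs is from the question; 5 seconds per job is an assumption.
one_worker  = drain_time_hours(jobs: 11_000, seconds_per_job: 5, workers: 1)
two_workers = drain_time_hours(jobs: 11_000, seconds_per_job: 5, workers: 2)
```

Measuring the real time per job and plugging it in shows whether a single worker can finish within a day; if jobs are independent, doubling the workers roughly halves the estimate.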
Let's see if I can explain this problem in a structured enough way.
I run a webservice that handles email sending asynchronously using RabbitMQ and a ruby lib called Minion. On certain models (like a comment) we have an after create hook that adds an email-event to Rabbit. This event is then processed through a worker that runs as a rake task, a gist here, that loads the appropriate user and sends the email.
This setup works in about 90% of cases, but every now and then the worker crashes with an ActiveRecord::RecordNotFound exception. How is this possible? I queue the event after the object is created, and it takes additional milliseconds for the event to pass through the Rabbit layer. Could the context of the rake task be causing the problem? Is it a bad choice to run long-running workers within rake with the environment flag? Help! :)
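One common cause of this symptom, offered as an assumption rather than a diagnosis of this setup: an after create hook runs inside the database transaction, so the event can reach the worker before the row is committed and visible to other connections. A worker-side mitigation is to retry the lookup a few times before giving up. In the sketch below, RecordNotVisible and with_lookup_retry are hypothetical names; in a Rails worker the on: argument would be ActiveRecord::RecordNotFound.

```ruby
# Raised by the hypothetical lookup when the row is not visible yet.
class RecordNotVisible < StandardError; end

# Retries the block up to `attempts` times, sleeping `delay` seconds
# between tries, and re-raises once the attempts are used up.
def with_lookup_retry(attempts: 3, delay: 0.1, on: RecordNotVisible)
  tries = 0
  begin
    tries += 1
    yield
  rescue on
    raise if tries >= attempts
    sleep delay
    retry
  end
end

# Usage: the first two lookups miss, the third one finds the record.
calls = 0
user = with_lookup_retry(delay: 0.01) do
  calls += 1
  raise RecordNotVisible if calls < 3
  { id: 7, email: 'user@example.com' }
end
```

Publishing the event from an after_commit callback instead addresses the same race on the producer side.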