I have a worker environment set up to run six cron jobs.
When checking CloudWatch I noticed that I'm receiving 57K empty receives per day.
I researched this and found that long polling can be used to counter this high number of empty receives.
So I tried to reduce it by setting Receive Message Wait Time to 20 seconds in the SQS console for the worker environment's SQS queue.
But I'm still getting 57K empty receives per day.
I sampled a 5-minute window and got 200 empty receives.
That means a request every 1.5 seconds, right? So the setting is obviously not working.
So is there any other setting I need to change before I can use long polling with the worker environment queue?
When I checked the tutorial, it says short polling occurs when:
The ReceiveMessage call sets WaitTimeSeconds to 0.
The ReceiveMessage call doesn’t set WaitTimeSeconds, but the queue attribute ReceiveMessageWaitTimeSeconds is set to 0.
From what I understood, the WaitTimeSeconds of the ReceiveMessage call must be 0 for this to happen in my case.
Is there any option where I can change this?
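For reference, these are the two places the wait time can live; a minimal sketch with the aws-sdk-sqs Ruby gem (the region and queue URL are placeholders), in case it helps to verify which value is actually in effect:

    require "aws-sdk-sqs"

    sqs = Aws::SQS::Client.new(region: "us-east-1")
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-worker-queue" # placeholder

    # Queue-level long polling: applies to every ReceiveMessage call that does not set its own wait time.
    sqs.set_queue_attributes(
      queue_url: queue_url,
      attributes: { "ReceiveMessageWaitTimeSeconds" => "20" }
    )

    # Per-call long polling: WaitTimeSeconds on the ReceiveMessage call itself.
    resp = sqs.receive_message(queue_url: queue_url, wait_time_seconds: 20)
    puts resp.messages.size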
I'm currently trying to optimize and scale an API built on Ruby on Rails, behind an AWS ALB that sends traffic to NGINX and then into Puma and our Rails application. Our API has a maximum timeout of 30 seconds, at which point we time out the request. Currently we have a controller action that queues a Sidekiq worker and then polls a Redis key every 100ms for the first 1 second, then every 500ms for the remaining 29 seconds. Many of our requests complete in under 1 second, but some of them take the full 30 seconds before they succeed or time out, telling the user to retry in a little while.
We're currently trying to load test this API and scale it to 500-1000 RPS, and we're running into problems where the slower requests block up all of our connections. When a slow request is running, shouldn't Puma be able to take in other requests during the sleep periods of the slow request?
If this were not an API, we could simply respond as soon as we queue the background worker, but in this case we need to wait for the result and hold the connection open for up to 30 seconds for the API request.
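Roughly, the controller flow looks like the sketch below (ResultWorker, the $redis client, and the Redis key format are placeholders here, not the real code):

    class ThingsController < ApplicationController
      TIMEOUT_SECONDS = 30

      def create
        job_id = SecureRandom.uuid
        ResultWorker.perform_async(job_id)            # enqueue the Sidekiq job

        waited = 0.0
        result = nil
        while waited < TIMEOUT_SECONDS
          result = $redis.get("result:#{job_id}")     # the worker writes its result to this key
          break if result

          interval = waited < 1 ? 0.1 : 0.5           # 100ms polls for the first second, 500ms after that
          sleep interval                              # this holds the Puma thread for the whole wait
          waited += interval
        end

        if result
          render json: JSON.parse(result)
        else
          render json: { error: "try again in a little while" }, status: :gateway_timeout
        end
      end
    end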
My first thought is that you can have multiple Redis-backed queues and push specific tasks to certain queues.
If you have a queue for the quick tasks and a queue for the slower tasks, then both can run in parallel without the slow tasks holding everything else up.
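A minimal sketch of what that could look like with Sidekiq (the queue and worker names are just examples):

    class QuickWorker
      include Sidekiq::Worker
      sidekiq_options queue: :fast     # short-lived tasks go here

      def perform(job_id)
        # fast work
      end
    end

    class SlowWorker
      include Sidekiq::Worker
      sidekiq_options queue: :slow     # long-running tasks go here

      def perform(job_id)
        # slow work
      end
    end

You can then run a dedicated Sidekiq process per queue (for example, sidekiq -q fast and sidekiq -q slow), so the slow jobs never occupy the threads meant for the quick ones.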
I have a job which takes more than 1 hour to execute. Because of that, the remaining jobs stay enqueued and cannot start. So I have decided to set a maximum run time for background jobs. Is there any way to set a timeout for jobs in Sidekiq?
You cannot time out or stop jobs in Sidekiq. Doing so is dangerous and can corrupt your application data.
Sounds like you only have one Sidekiq process with a concurrency of 1. You can start multiple Sidekiq processes and they will work on different jobs, and you can increase concurrency to the same effect.
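For example, with a config/sidekiq.yml along these lines (the value 10 is just an illustration), each process runs 10 worker threads, and you can start the process more than once against the same Redis:

    # config/sidekiq.yml
    :concurrency: 10
    :queues:
      - default

Start each process with bundle exec sidekiq -C config/sidekiq.yml; jobs are picked up from the shared queue, so two such processes give you up to 20 jobs in flight.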
I want to run a long-running job on Cloud Run. This task may execute for more than 30 minutes, and it mostly sends out API requests.
Cloud Run stops executing after about 20 minutes, and from the metrics it looks like it did not detect that my task is still running. So it probably thinks the container is idle and closes it. I guess I could make calls to the server while the job runs to keep the container alive, but is there a way to signal from the container to Cloud Run that the job is still active and the container should not be closed?
I can tell it is closing the container since the logs just stop, and then on the next call I make to the Cloud Run endpoint I can see the "listening" log again from the Node.js Express server.
I want to run a long-running job on Cloud Run.
This is a red herring.
On Cloud Run, there’s no guarantee that the same container will be used. It’s a best effort.
While you don’t process requests, your CPU will be throttled to nearly 0, so what you’re trying to do right now (running a background task and trying to keep the container alive by sending it requests) is not a great idea. Most likely your app model is not a fit for Cloud Run; I recommend other compute products that would let you run long-running processes as well.
According to the documentation, Cloud Run will time out after 15 minutes, and that limit can't be increased. Therefore, Cloud Run is not a very good solution for long running tasks. If you have work that needs to run for a long amount of time, consider delegating the work to Compute Engine or some other product that doesn't have time limits.
Yes, you can. You can create a timer that calls your own API every 5 minutes, so there is no timeout after 15 minutes. Whenever the timer executes, it will create a dummy request on your server.
Another option: you can increase the request timeout of the container from 5 minutes to 1 hour, if your backend request completes within 1 hour.
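Assuming the service is called my-service (a placeholder), the per-service request timeout can be raised with something like:

    gcloud run services update my-service --timeout=3600

The value is in seconds; check the current Cloud Run documentation for the maximum your setup allows.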
I hope this pool has a capability (or option) to automatically reduce the number of workers when they have been idle for a maximum idle timeout.
I have read the docs of poolboy and worker_pool and found there is only a maximum worker count option, but no option for when to shrink the pool again.
Does such an option exist, or how could I modify these libraries to add it?
poolboy automatically reduces the number of workers when there is no work for them.
You get a worker to do some work with checkout from the pool, and you release the worker with checkin; as an alternative, you can wrap the work in transaction, which automatically checks out the worker and checks it back in when the work is done.
When you start the pool, poolboy automatically creates size workers, waiting to handle work.
When you call checkout, poolboy tries to hand you one of the workers that is already started; if all of the workers are already checked out because they are doing work, it consults its max_overflow configuration and starts creating additional workers to handle the load, up to max_overflow.
When a worker is released, if there are no more jobs waiting for the workers, the overflow workers are killed.
So if you create a pool like
{pool, [
    {size, 100},          %% workers started up front
    {max_overflow, 900}   %% extra workers allowed under load
]}
It will start 100 processes right away. If you check out (either with checkout or transaction) more than 100 workers at a time, new checkouts will start creating processes until there are 1000 processes in total (the 100 created at the start plus a max overflow of 900 processes). If you keep trying to check out more processes than that, checkouts will start failing with timeouts (unless you call checkout with infinity, in which case it will block until a worker becomes free to do the job; note that you can also check out a worker without blocking the caller).
Now, if you need more behaviour than that, like keeping the overflow workers alive until they have been idle for 10 minutes, you will need to write your own code. In that case you can take the source code of poolboy (which is easy to read and follow; the main code is at https://github.com/devinus/poolboy/blob/master/src/poolboy.erl and is only about 350 lines) and change how workers are released according to your needs.
https://github.com/seth/pooler also has such a capability. Please refer to the https://github.com/seth/pooler#culling-stale-members section.
Whenever I deploy with Capistrano or run cap production delayed_job:restart, I end up with the currently-running delayed_job row remaining locked.
The delayed_job process is successfully stopped, a new delayed_job process is started, and a new row is locked by the new process. The problem is that the last process's row is still sitting there, marked as locked. So I have to go into the database manually, delete the row, and then manually add that job back into the queue for the new delayed_job process to get to.
Is there a way for the database cleanup & re-queue of the previous job to happen automatically?
I have the same problem. This happens whenever a job is forcibly killed. Part of the problem is that worker processes are managed by the daemons gem rather than delayed_job itself. I'm currently investigating ways to fix this, such as:
Setting a longer timeout before daemons forcibly terminates (nothing about this in the docs for delayed_job or daemons)
Clearing locks before starting delayed_job workers (see the sketch below)
I'll post back here when and if I come up with a solution.
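For the lock-clearing idea, this is the kind of sketch I have in mind (it assumes the delayed_job_active_record backend and that workers on the host have already been stopped, so any remaining locks on this host are stale):

    # lib/tasks/delayed_job_locks.rake
    require "socket"

    namespace :delayed_job do
      desc "Release locks held by workers that no longer exist on this host"
      task clear_stale_locks: :environment do
        Delayed::Job
          .where("locked_by LIKE ?", "%host:#{Socket.gethostname}%")
          .update_all(locked_by: nil, locked_at: nil)
      end
    end

Running that before delayed_job:start would let the new worker pick the job up again instead of leaving it stranded.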
Adjust your Daemon wait time or raise an exception on SIGINT.
@John Carney is correct. In short, all delayed_job workers get sent something like a SIGINT (nice interrupt) on a redeploy. delayed_job workers, by default, will complete their current job (if they are working on one) and then gracefully terminate.
However, if the job they are working on is a longer-running job, there's an amount of time the Daemon manager waits before it gets annoyed and sends a more serious interrupt signal, like a SIGTERM or SIGKILL. This wait time and what gets sent really depend on your setup and configuration.
When that happens, the delayed_job worker gets killed immediately without being able to finish the job it is working on or even cleanup after itself and mark the job as no longer locked.
This ends up in a "stranded" job that is marked as "locked" but locked to a process/worker that no longer exists. Not good.
That's the crux of the issue and what is happening. To get around this, you have two main options, depending on what your jobs look like (we use both):
1. Raise an exception when an interrupt is received.
You can do this by setting the raise_signal_exceptions configuration to either :term or true:
Delayed::Worker.raise_signal_exceptions = :term
This configuration option accepts :term, true, or false (the default). You can read more in the original commit here.
I would try first with :term and see if that solves your issue. If not, you may need to set it to true.
Setting it to :term or true will gracefully raise an exception and unlock the job for another delayed_job worker to pick up and start working on.
Setting it to true means that your delayed_job workers won't even attempt to finish the current job that they are working on. They will just immediately raise an exception, unlock the job and terminate themselves.
2. Adjust how your workers are interrupted/terminated/killed on a redeploy.
This really depends on your redeploy, etc. In our case, we are using Cloud66 to handle deploys so we just had to configure this with them. But this is what ours looks like:
stop_sequence: int, 172800, term, 90, kill # Allows long-running delayed jobs to finish before being killed (i.e. on redeploy). Sends SIGINT, waits 48 hours, sends SIGTERM, waits 90 seconds, sends SIGKILL.
On a redeploy, this tells the Daemon manager to follow these steps with each delayed_job worker:
Send a SIGINT.
Wait 172800 seconds (2 days) - we have very long-running jobs.
Send a SIGTERM, if the worker is still alive.
Wait 90 seconds.
Send a SIGKILL, if the worker is still alive.
Anyway, that should help get you on the right track to configuring this properly for yourself.
We use both methods by setting a lengthy timeout as well as raising an exception when a SIGTERM is received. This ensures that if there is a job that runs past the 2 day limit, it will at least raise an exception and unlock the job, allowing us to investigate instead of just leaving a stranded job that is locked to a process that no longer exists.