Sidekiq not adding jobs to queue - ruby-on-rails

Sometime ago I wrote a small Ruby application which uses Sidekiq to convert video files and pushes them further to few online video hosting services. I use two Workers and Queues, one to actually convert file and second to publish converted files. Jobs are pushed to first Queue by Rails application for conversion, and after successful processing Conversion Worker pushes Upload job to second queue.
Rails -> Converter Queue -> Uploader Queue
Recently I discover a massive memory leak in converter library which appears after every few jobs and overloads whole server, so I did a little hack to avoid this by stopping whole Sidekiq Worker process using Interrupt exception and starting it again by Systemd.
It works perfectly until yesterday. I get notification from my client that files are not converted. I did some investigation to find out whats failing and found that jobs are not added to Converter queue. It starts failing without any changes in code or services. When Rails adds jobs to Sidekiq Queue it receives proper Job ID, no exception or warning at all, but the job simply not appears in any Queue. I checked Redis logs, Systemd logs, dmesg, every logs that i could check and did not find even the slightest warning - it seems that jobs get lost in vacuum :/ In fact, after more digging and debugging I discover that if one job is pushed rapidly ( 100 times in a loop ), then there is a chance that Sidekiq will add job to Queue. Of course, sometimes it will add all jobs, and sometimes not even single one.
The second Queue works perfectly - it picks every single job that I add to it. When I try to add 1000 new jobs, second Queue queues them all, when Converter queue gets at best 10 jobs. Things gets really weird when I try to use another Queue - I pushed 100 jobs to a new Queue, of course all of them are added properly and then I instruct Conversion worker to use that new Queue. And it works - I can add new Jobs to that Queue and it seems that all of them are pushed successfully - but when Worker finish processing all jobs that were pushed before that Worker was assigned to this Queue it starts to failing again. Disabling code that restarts Worker after every job didn't help at all.
Funny thing is that in fact jobs are pushed to Queue but only when I pushes them multiple times, and it seems totally random when Job is added properly. This bugs appears from nowhere, for few months things works perfectly and recently starts failing without any changes in code or server. Logs are perfectly clear, Sidekiq is used with the same Redis server without any problems by few other applications - it seems that only this particular Worker have this problem. I did not found any references to similar bug on the web and I spent two days trying to debug this and find source of this weird behavior, and I found nothing, everything seems to work perfectly and Jobs are simply disappearing somewhere between push and Redis database.

Related

Quart.Net is Sometimes Running Overlapping Tasks

I am using Quartz.Net 3.0.7 to manage a scheduler. In my test environment I have two instances of the scheduler running. I have a test process that runs for exactly 2 hours before ending. Quartz is configured to start the process every 10 seconds and I am using the DisallowConcurrentExecution attribute to prevent multiple instances of the task from running at the same time. 80% of the time this is working as expected. Quartz will start up the process and prevent any other instances of the task from starting until after the initial one has completed. If I stop one of the two services hosting Quart, then the other instance picks up the task at the next 10-second mark.
However, after keeping these two Quartz services running for 48 uninterrupted hours, I have discovered a couple of times where things went horribly wrong. At times host B will start up the task, even though the task is still in the middle of its 2 hour execution on host A. At one point I even found the process had started up 3 times on host B, all within a 10 minute period. So, for a two hour period, the one task had three instances running simultaneously. After all three finished, Quartz went back to the expected schedule of only having one instance running at a time.
If these overlapping tasks were happening 100% of the time, I would think there is something wrong on my end, but since it seems to happen only about 20% of the time, I am thinking it must be something in the Quartz implementation. Is this by design or is it a bug? If there is an event I can capture from Quart.Net to tell me that another instance of a task has started up, I can listen for that and stop the existing task from running. I just need to make sure that DisallowConcurrentExecution is getting obeyed and prevent a task from running multiple instances concurrently. Thanks.
Edit:
I added logic that uses context.Scheduler.GetCurrentlyExecutingJobs to look for any jobs that have the same JobDetail.Key but a different FireInstanceId when my task starts up. If I find another currently executing job, I will prevent this instance from doing anything. I am finding that in the duplicate concurrent scenario, Quartz is reporting that there are no other jobs currently executing with the same JobDetail.Key. Should that be possible? Under what case would Quartz.Net start an IJob, lose track of it as an executing job after a few minutes, but allow it to continue executing without cancelling the CancellationToken?
Edit2:
I found an instance in my logs where Quartz started a task as expected. Then, one minute later, Quartz tried to start up 9 additional instances, each with a different FireInstanceId. My custom code blocked the 9 additional instances, because it can see that the original instance was still going, by calling GetCurrentlyExecutingJobs to get a list of running jobs. I double checked and the ConcurrentExecutionDisallowed flag is true on all of the tasks at runtime, so I would expect that Quartz would prevent the duplicate instances. This sounds like a bug. Am I expected to handle this manually or should I expect Quartz to get this right?
Edit3:
I am definitely looking at two different problems. In both cases Quartz.Net is launching my IJob instance with a new FireInstanceId while there is already another FireInstanceId running for the same JobKey. In one scenario I can see that both FireInstanceIds are active by calling GetCurrentlyExecutingJobs. In the second scenario calling GetCurrentlyExecutingJobs shows that the first FireInstanceId is no longer running, even though I can see from my logs that the original instance is still running. Both of these scenarios result in multiple instances of my IJob running at the same time, which is not acceptable. It is easy enough to tackle the first scenario by calling GetCurrentlyExecutingJobs when my IJob starts, but the second scenario is harder. I will have to ping GetCurrentlyExecutingJobs on an interval and stop the task if it’s FireInstanceId has disappeared from the active list. Has anyone else really not noticed this behavior?
I found that if I set this option, that I no longer have overlapping executing jobs. I still wish that Quartz would cancel the job’s cancellation token, though, if it lost track of the executing job.
QuartzProperties.Add("quartz.jobStore.clusterCheckinInterval", "60000");

how to continuously deploy with long running jobs

We currently use delayed_job and rails to manage some long running jobs in our system. Some of these jobs take potentially hours to run, but we also like to deploy rather frequently, often many times a day. The problem with this setup is that we have to restart delayed_job during deployment to pick up code changes, so that any new jobs are processed with the latest code.
The solution we've arrived at is that for any job that needs to run for more than some small amount of time, we fork the delayed job so that it returns immediately, and the forked process handles the work. This way a deploy can restart all the delayed job processes, while the long-running 'job' keeps going until it's finished as an orphaned process.
We've looked at sidekiq, but it looks like we'd have the same issue there when trying to deploy new code.
Has anyone developed a solution they would recommend for dealing with long-running background processes that span multiple deployments?

background tasks executing immediately and parallelly in rails

our rails web app has to download/unpack archives with html pages from ftp on request for user's viewing through the browser.
the archive can be quite big, so user has to wait until it downloads/unpacks on the server.
i implemented progress bar the way that i call fork/Process.detach in user's request, so that his request is done but downloading/unpacking process continues running in the background. and javascript rendered in his browser pings our server for status until all is ready and then it redirects him to unpacked html pages.
as long as user requests one archive, everything goes smoothly, but if he tries to run 2 or more requests at the same time(so that more forks are started), it seems that only one of them completes, and the rest expires/times outs/gets killed by passenger(?). i suppose its the issue with Passenger/forking.
i am not sure if its possible to fix it somehow so i guess i need to switch to another solution. the solution needs to permit immediate and parallel processing of downloads. so that if user requests multiple archives, he has to see download/decompression progress in all of them at the same time.
i was thinking about running background rake job immediately but it seems very slow to startup(also there's a lot of cron rake tasks happening every minute on our server). reason i liked fork was that it was very fast to start. i know there is delayed job, we also use it heavily for other tasks. but can it start multiple processes at the same time immediately without queues?
solved by keeping the fork and using single dj worker. this way i can have as many processes starting at the same time as needed without trouble with passenger/modifying our product's gemset (which we are trying to avoid since it resulted in bugs in the past)
not sure if forking inside dj worker can cause any troubles, so asked at
running fork in delayed job
if id be free to modify gemset, id probably use resque as wrdevos suggested, or sidekiq, or girl_friday(but thats less probable because it depends on the server running).
Use Resque: https://github.com/defunkt/resque
More on bg jobs and Resque here.
https://github.com/blog/542-introducing-resque

running fork in delayed job

we use delayed job in our web application and we need multiple delayed jobs workers happening parallelly, but we don't know how many will be needed.
solution i currently try is running one worker and calling fork/Process.detach inside the needed task.
i was trying to run fork directly in rails application previously but it didnt work too good with passenger.
this solution seems to work well. could there be any caveats in production?
one issue which happened to me today and which anyone trying that should take care of was following:
i noticed that worker is down so i started it. something i didnt think about was that there were 70 jobs waiting in queue. and since processes are forked, they pretty much killed our server for around half an hour by starting all almost immediately and eating all memory in process.. :]
so ensuring that there is god watching over the worker is important.
also worker seems to die often but not sure yet if its connected with forking.

rails backgroundjob running jobs in parallel?

I'm very happy with By so far, only I have this one issue:
When one process takes 1 or 2 hours to complete, all other jobs in the queue seem to wait for that one job to finish. Worse still is when uploading to a server which time's out regularly.
My question: is Bj running jobs in parallel or one after another?
Thank you,
Damir
BackgroundJob will only allow one worker to run per webserver instance. This is by design to keep things simple. Here is a quote from Bj's README:
If one ignores platform specific details the design of Bj is quite simple: the
main Rails application submits jobs to table, stored in the database. The act
of submitting triggers exactly one of two things to occur:
1) a new long running background runner to be started
2) an existing background runner to be signaled
The background runner refuses to run two copies of itself for a given
hostname/rails_env combination. For example you may only have one background
runner processing jobs on localhost in development mode.
The background runner, under normal circumstances, is managed by Bj itself -
you need do nothing to start, monitor, or stop it - it just works. However,
some people will prefer manage their own background process, see 'External
Runner' section below for more on this.
The runner simply processes each job in a highest priority oldest-in fashion,
capturing stdout, stderr, exit_status, etc. and storing the information back
into the database while logging it's actions. When there are no jobs to run
the runner goes to sleep for 42 seconds; however this sleep is interuptable,
such as when the runner is signaled that a new job has been submitted so,
under normal circumstances there will be zero lag between job submission and
job running for an empty queue.
You can learn more on the github page: Here

Resources