All background jobs failing with "max concurrent job limit reached" - Parse - iOS

I have a background job that runs every minute, and it has been working fine for the last few weeks. All of a sudden, when I logged in today, every job is failing instantly with "max concurrent job limit reached". I have tried deleting the job and waiting 15 minutes so that any currently running job can finish, but when I schedule the job again it just starts failing every time like before. I don't understand why Parse thinks I am running a job when I am not.

Someone deleted my previous answer (not sure why). This is, and continues to be, a Parse bug. See this Google Groups thread where other people report the issue (https://groups.google.com/forum/#!msg/parse-developers/_TwlzCSosgk/KbDLHSRmBQAJ); there are several open Parse bug reports about this as well.

Related

How can I schedule a Jenkins job to be started only when it is not already running?

The cron part is easy and covered elsewhere. What I have is a job that usually runs for about six hours, but sometimes only two or three. If I schedule the job every six hours and it only runs for two, I waste four hours waiting for the next run. If I schedule it every hour, then when it runs for six hours, five jobs stack up waiting their turn.
Rigging something in the job itself to figure out whether it is already running is suboptimal. There are four machines involved, so I can't just examine process tables. I would need a semaphore on disk or in some other shared resource, which means making sure it is cleared not only when the job finishes, but also when it dies.
I could also query Jenkins to see if a job is already running, but that's also more code I need to write and get right.
Is there a way to tell Jenkins directly: Schedule only if you are not already running?
Maybe disableConcurrentBuilds could help? See https://www.jenkins.io/doc/book/pipeline/syntax/ for the syntax.
And you can also clean up the build queue if needed.
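For the "query Jenkins first" route mentioned in the question, one option is a small wrapper script driven by cron outside Jenkins. A minimal sketch using the python-jenkins client (the server URL, credentials, and job name are placeholders, not anything from the question):

    # Sketch: trigger a Jenkins job only if its last build is not still running.
    # The URL, credentials, and job name below are placeholders.
    import jenkins

    server = jenkins.Jenkins('http://jenkins.example.com:8080',
                             username='user', password='api-token')

    JOB_NAME = 'long-running-job'

    def build_if_idle():
        info = server.get_job_info(JOB_NAME)
        last = info.get('lastBuild')
        if last is not None:
            build = server.get_build_info(JOB_NAME, last['number'])
            if build.get('building'):
                print('Previous build is still running; skipping this trigger')
                return
        server.build_job(JOB_NAME)
        print('Build triggered')

    if __name__ == '__main__':
        build_if_idle()

Run hourly from an external cron, this gives the "only if not already running" behaviour without any semaphore files, at the cost of keeping the schedule outside Jenkins.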

Jenkins job gets stuck in queue. How to make it fail automatically if the worker is offline

So on my Jenkins, my worker "slave02" sometimes goes offline and needs to be brought back manually. I won't get into the details, because that's not the point of this question.
The scenario so far:
I've intentionally configured a job to run on that exact worker. But obviously it will not start while the worker is offline. I want to get notified when that job gets stuck in the queue. I've tried the Build Timeout Jenkins plugin, configured to fail the build if it waits longer than 5 minutes for the job to complete.
The problem with this is that the plugin fails the job 5 minutes after the build gets started... which does not help in my case, because the job never starts; it sits in the queue waiting to be processed, and that never happens. So my question is: is there a way to check whether that worker is down and, if so, automatically fail the build and send a notification?
I am pretty sure this can be done, but I could not find a thread where this type of scenario is discussed.
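One approach, independent of any plugin, is a small watchdog that asks the Jenkins API whether the node is offline and then cancels and reports the queued items. A minimal sketch with the python-jenkins client (URL, credentials, node and job names are placeholders, and the notification is left as a stub):

    # Sketch: fail fast when a queued job is waiting on an offline worker.
    # URL, credentials, node name, and job name are placeholders.
    import jenkins

    server = jenkins.Jenkins('http://jenkins.example.com:8080',
                             username='user', password='api-token')

    NODE_NAME = 'slave02'
    JOB_NAME = 'job-pinned-to-slave02'

    def notify(message):
        # Placeholder: wire this up to email, Slack, etc.
        print('ALERT:', message)

    def check_stuck_queue():
        node = server.get_node_info(NODE_NAME)
        if not node.get('offline'):
            return  # worker is up, nothing to do
        for item in server.get_queue_info():
            if item.get('task', {}).get('name') == JOB_NAME:
                server.cancel_queue(item['id'])  # drop the stuck queue item
                notify(f'{JOB_NAME} is stuck in the queue because {NODE_NAME} is offline')

    if __name__ == '__main__':
        check_stuck_queue()

Run from cron or from a separate job on the built-in node, this turns a silently stuck queue item into an explicit failure and notification.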

Slow and weird draining process on Dataflow job

I have a Dataflow job in streaming mode, written with the Apache Beam Python SDK, and overall it works as expected. But since we want to update the code from time to time, we need to deploy a new job and drain the old one. As soon as I press the "Drain" button, the job starts to act weird. It increases the number of workers to the maximum allowed, which I know is expected, but the CPU usage of those workers goes crazy. Also, the "System Latency" metric keeps increasing. You can see in the screenshot below the point where I set it to drain; it's very visible.
You can see that the draining didn't finish for about 4 days, and I had to cancel the job.
One thing I don't understand is why the "Throughput" of the DoFns is 0 all the time while the job is draining.
How can I change my job to make it more drain-friendly?
You can find the code of the job at https://gist.github.com/PlugaruT/bd8ac4c007acbff5097cea33ccc1bbb1

Sidekiq not adding jobs to queue

Some time ago I wrote a small Ruby application which uses Sidekiq to convert video files and push them on to a few online video hosting services. I use two workers and two queues: one to actually convert the files and a second to publish the converted files. Jobs are pushed to the first queue by a Rails application for conversion, and after successful processing the conversion worker pushes an upload job to the second queue.
Rails -> Converter Queue -> Uploader Queue
Recently I discovered a massive memory leak in the converter library which appears after every few jobs and overloads the whole server, so I did a little hack to avoid this: stopping the whole Sidekiq worker process with an Interrupt exception and letting systemd start it again.
It worked perfectly until yesterday, when I got a notification from my client that files are not being converted. I did some investigation to find out what's failing and found that jobs are not being added to the converter queue. It started failing without any changes in code or services. When Rails adds a job to the Sidekiq queue it receives a proper job ID, with no exception or warning at all, but the job simply never appears in any queue. I checked the Redis logs, systemd logs, dmesg, every log I could check, and did not find even the slightest warning - it seems that jobs get lost in a vacuum :/ In fact, after more digging and debugging I discovered that if one job is pushed rapidly (100 times in a loop), there is a chance that Sidekiq will add a job to the queue. Sometimes it will add all of the jobs, and sometimes not a single one.
The second queue works perfectly - it picks up every single job that I add to it. When I try to add 1000 new jobs, the second queue queues them all, while the converter queue gets at best 10 jobs. Things get really weird when I try to use another queue: I pushed 100 jobs to a new queue, all of them were added properly, and then I instructed the conversion worker to use that new queue. And it worked - I could add new jobs to that queue and all of them seemed to be pushed successfully - but once the worker finished processing all the jobs that were pushed before it was assigned to this queue, it started failing again. Disabling the code that restarts the worker after every job didn't help at all.
The funny thing is that jobs are in fact pushed to the queue, but only when I push them multiple times, and it seems totally random when a job is added properly. This bug appeared out of nowhere: for a few months everything worked perfectly, and recently it started failing without any changes in code or on the server. The logs are perfectly clean, and Sidekiq is used with the same Redis server without any problems by a few other applications - it seems that only this particular worker has the problem. I did not find any references to a similar bug on the web. I spent two days trying to debug this and find the source of this weird behavior and found nothing; everything seems to work perfectly and jobs are simply disappearing somewhere between the push and the Redis database.
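One way to narrow this down is to bypass the Sidekiq client entirely and look at the Redis keys, since Sidekiq keeps each queue as a Redis list named queue:<name> and the set of queue names under the key "queues". A minimal sketch with redis-py (the queue names and connection details are assumptions for illustration):

    # Sketch: inspect Sidekiq's Redis keys directly to see whether pushed jobs
    # actually reach Redis. Queue names and connection details are assumptions.
    import redis

    r = redis.Redis(host='localhost', port=6379, db=0)

    # Sidekiq stores the known queue names in the "queues" set
    # and each queue's pending jobs in a list under "queue:<name>".
    print('Known queues:', r.smembers('queues'))

    for name in ('converter', 'uploader'):
        key = f'queue:{name}'
        print(f'{key}: {r.llen(key)} pending jobs')
        # Peek at the first few raw job payloads (JSON) without removing them.
        for raw in r.lrange(key, 0, 2):
            print('  ', raw[:120])

Comparing the list length immediately before and after a push from Rails shows whether the jobs are being lost on the client side or only after they reach Redis.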

Quartz.Net jobs not always running - can't find any reason why

We're using Quartz.Net to schedule about two hundred repeating jobs. Each job uses the same IJob-implementing class, but they can have different schedules. In practice they end up having the same schedule, so we have about two hundred job details scheduled, each with its own (identical) repeating/simple trigger. The interval is one hour.
The task each job performs is to download an RSS feed and then download all of the media files linked to in the feed. Prior to downloading, it wipes the directory where it is going to place the files. A single run of a job takes anywhere from a couple of seconds to a dozen seconds (occasionally more).
Our method of scheduling is to call GetScheduler() on a new StdSchedulerFactory (all jobs are scheduled at once into the same IScheduler instance). We follow the scheduling with an immediate Start().
The jobs appear to run fine, but upon closer inspection we are seeing that a minority of the jobs run only occasionally - or almost never.
So, for example, all two hundred jobs should have run at 6:40 pm this evening. Most of them did. But a handful did not. I determine this by looking at the file timestamps, which should certainly be updated if the job runs (because it deletes and redownloads the file).
I've enabled Quartz.Net logging, and added quite a few logging statements to our code as well.
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
After that, all activity stops. No jobs run, no log messages are created. Zero.
And then, at the next firing interval, Quartz starts up again and my log files update, and various files start downloading. But - it certainly appears like some JobDetail instances never make it to the head of the line (so to speak) or do so very infrequently. Over the entire weekend, some jobs appeared to update quite frequently, and recently, and others had not updated a single time since starting the process on Friday (it runs in a Windows Service shell, btw).
So ... I'm hoping someone can help me understand this behavior of Quartz.
I need to be certain that every job runs. If its trigger is missed, I need Quartz to run it as soon as possible. From reading the documentation, I thought this would be the default behavior - for a SimpleTrigger with an indefinite repeat count, it would reschedule the job for immediate execution if the trigger window was missed. This doesn't seem to be the case. Is there any way I can determine why Quartz is not firing these jobs? I am logging at the trace level and they simply aren't there. It creates and executes an awful lot of jobs, but when I notice one missing, all I can find is the last time it ran (sometimes it hasn't run for hours or days). There is nothing about why it was skipped (I expected Quartz to log something if it skips a job for any reason), etc.
Any help would really, really be appreciated - I've spent my entire day trying to figure this out.
After reading your post, it sounds a lot like the handful of jobs that are not executing are very likely misfiring. The reason that I believe this:
I get log messages that indicate Quartz is creating and executing jobs for roughly one minute after the round of jobs starts.
In Quartz.NET the default misfire threshold is 1 minute. Chances are, you need to examine your logging configuration to determine why those misfire events are not being logged. I bet if you throw open the floodgates on your logging (i.e. set everything to debug, and make sure that you definitely have a logging directive for the Quartz scheduler class) and then rerun your jobs, the misfires will show up. I'm almost positive that the problem is that the misfire events are not appearing in your logs because the logging configuration is lacking something. This is understandable, because logging configuration can get very confusing, very quickly.
Also, in the future, you might want to consult the quartz.net forum on google, since that is where some of the more thorny issues are discussed.
http://groups.google.com/group/quartznet?pli=1
Now, as for your other question about setting the policy for what the scheduler should do, I can't specifically help you there, but if you read the API docs closely and consult the Google discussion group, you should be able to easily set the misfire policy flag that suits your needs. I believe triggers have a MisfireInstruction property which you can configure.
Also, I would argue that misfires introduce a lot of "noise" and should be avoided; perhaps bumping up the thread count on your scheduler would be a way to avoid misfires? The other option would be to stagger your job execution into separate/multiple batches.
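For reference, both knobs mentioned above (thread pool size and the misfire threshold) live in the scheduler's configuration properties. A minimal sketch of a quartz.config-style properties snippet, with illustrative values only:

    # Illustrative Quartz.NET scheduler properties; the values are examples, not recommendations.
    # More worker threads, so ~200 triggers firing at the same time don't all queue behind each other:
    quartz.threadPool.threadCount = 20
    # Misfire threshold in milliseconds (the default is 60000, i.e. 1 minute):
    quartz.jobStore.misfireThreshold = 60000

Whether raising the threshold or adding threads is the better fix depends on how long each job actually runs; staggering the triggers, as suggested above, avoids the contention in the first place.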
Good luck!
