Clear worker cache in delayed jobs in production

I am using delayed jobs in my Rails application. It works fine, but an issue has come up on the production server. I created a class in lib and call its method from a controller to generate a CSV file through delayed jobs. It worked fine when I ran the delayed jobs locally and on the production server, but then I changed the file naming convention in this class and restarted the delayed jobs, first locally and then on the production server. Now when I call that method through a delayed job, it sometimes works according to the latest changes I made to the class and sometimes uses the old file naming logic.
What could be the issue?

Delayed Job has a hidden "feature": a running worker ignores any changes to your app and keeps using the old code, settings, env variables, email templates, etc. You can clear every cache and restart your server, and a worker that was never restarted will still hold onto data that no longer exists anywhere in your app's codebase.
See: delayed_job - Performs not up to date code?
Also be aware that DJ's "restart" does not always kill and restart all the workers, so you need to hunt them down with
ps aux | grep delay
and then kill each of them manually.
See: Rails + Delayed Job => email view template does not get updated
I have not yet found a "clear delayed job cache" function. If it exists, someone please post it here.
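To make the "hunt them down" step less manual, here is a minimal Ruby sketch that picks delayed_job worker PIDs out of ps aux-style output. The sample process list, the helper name delayed_job_pids, and the assumption that worker command lines contain "delayed_job" are all illustrative, not taken from the original answer.

```ruby
# Sketch: extract delayed_job worker PIDs from `ps aux`-style text.
# The helper name and the "delayed_job" match pattern are assumptions.

def delayed_job_pids(ps_output)
  ps_output.each_line.filter_map do |line|
    next unless line.match?(/delayed_job/)
    next if line.match?(/\bgrep\b/) # never report a grep process itself
    line.split[1].to_i              # second column of `ps aux` is the PID
  end
end

# Hypothetical process list, standing in for real `ps aux` output.
sample = <<~PS
  USER   PID %CPU %MEM    VSZ    RSS TTY   STAT START TIME COMMAND
  deploy 4312  0.0  1.2 512000  98000 ?     Sl   Aug01 0:42 delayed_job.0
  deploy 4977  0.0  1.1 498000  91000 ?     Sl   Jul14 3:10 delayed_job.1
  deploy 5120  0.0  0.0  11000    900 pts/0 S+   10:01 0:00 grep delay
  deploy 6001  0.3  2.0 700000 160000 ?     Sl   Aug01 9:13 puma
PS

puts delayed_job_pids(sample).inspect
```

On a real server the input would come from running ps aux yourself, and any PID whose start date is older than your last deploy is a candidate for a manual kill.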

In my case, I just spent almost 4 hours trying everything to delete failing delayed_jobs on Heroku. If you got here trying to kill a zombie delayed_job but you're on Heroku, the above won't work.
You cannot run ps aux like you would on a regular server, nor can you run rake jobs:clear. If you check via the Rails console, you'll see the jobs there, but they are not in the database, so there is nothing you can do there either.
What I did was put the app in maintenance mode, make one deployment with the delayed_job gem and all references to it completely removed, and then make another deployment reverting that change. That cleared the zombie cache and did the trick.

I had a similar issue in Dokku. My solution was to remove the worker=1 entry from my DOKKU_SCALE file (so all it contained was web=1) and also to remove the worker: bundle exec rake jobs:work line from my Procfile.
I pushed that to my production server, reversed the changes above, pushed again and it was fixed.

Related

Rails Production server/console not reloading module

I am using DelayedJob for a long running task in my app. I have the job defined in a class MyJob saved in app/jobs/my_job.rb.
All was well, but I added some code to the file and restarted the server, and the changes are not being picked up. Before, the job saved one field and now it should save two. I also added a logger.debug line to help me, but nothing shows up in the logs and the models aren't being saved with the new field.
This is in 'production' (though still using the Thin web server).
Development works.
Since the folder is in the autoload path (at least if I have not misunderstood), I didn't think I needed to do anything special. But since it is not working, something must be off. Help? Let me know if you need more info.

Well, turns out I had a worker who was off the radar.
I run my server in a docker container (should have mentioned that) so even though I did a
RAILS_ENV=production script/delayed_job restart
in both the container and the host (the container has the app from the host as a volume), a worker I probably started some other time continued on. I saw it in the logs when I went back, so I did a
kill {pid}
and that solved my problem. So Flavio was right, I just had to kill the worker myself because the script wasn't picking it up.

Heroku Scheduler not creating log

I recently set up the Scheduler add on and set up my rake task, 'rake cron_jobs:my_task'.
When I test it with
'heroku run rake cron_jobs:my_task', it works fine.
The scheduler also claims it ran when it was supposed to, and is scheduled to run again, but there's no logging associated with the process the way https://devcenter.heroku.com/articles/scheduler#inspecting-output says there should be.
'heroku ps' shows no scheduled dynos, 'heroku logs --ps scheduler.1' has no output.
What am I missing?

Actually I was trying to solve this myself, and did not find the answer anywhere, so here it is if someone else is struggling with this:
heroku logs --tail --ps scheduler
--tail is important to keep streaming the logs.

My best guess: the heroku ps and heroku logs commands only give you status logs for currently running processes/dynos.
So after the scheduled rake task is done, you can't reach the logs through the command line.
You can access the history of your logs by using one of the logging addons. Most of them offer a free tier too.
They are all based on log drains, which you could also use directly if you want to build it yourself.

Here is what I do for that:
Simply include puts statements in the task itself so you know when the job started running and when it finished.
You can also include puts statements in the executed job.
I'm using the Papertrail add-on, which is a very powerful logging tool that lets you search for any particular log entry at a specific time. You can also add an alert for when your scheduled job starts to run.
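As a rough illustration of the start/finish logging idea, the sketch below wraps a task body in a small helper. The name log_run is hypothetical (it is not part of Rake or Heroku), and the task name is simply the one from the question.

```ruby
# Sketch: print a line when a scheduled task starts, finishes, or fails.
# `log_run` is a hypothetical helper, not an existing Rake/Heroku API.

def log_run(name, out: $stdout)
  out.puts "[#{name}] started"
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  out.puts format("[%s] finished in %.2fs", name, elapsed)
  result
rescue StandardError => e
  out.puts "[#{name}] failed: #{e.class}: #{e.message}"
  raise # re-raise so the task still exits with an error
end

$stdout.sync = true # flush each line immediately instead of buffering

log_run("cron_jobs:my_task") do
  # the real work of the task would go here
  1 + 1
end
```

Inside a rake task, the block passed to log_run would hold the task body, so every scheduled run leaves at least two lines in whatever log store you use.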

I had a similar problem (using the newer Heroku PGBackups Service) and found an unexpected explanation for it.
The rake task rake pgbackups-archive was not run by Heroku Scheduler, but it worked when I ran it manually from the command line.
In my case, I noticed that my issues were caused by the different time zone used by the Heroku interface (which seems not to be CET). So my rake task which should have run at a specific time daily effectively never ran, as I changed the specific time throughout the day for testing and I always missed the specified time in the Heroku timezone.
You can try running the task every ten minutes and see if it works.
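Heroku's Scheduler documentation describes its times as UTC, so a quick conversion helps when picking the slot. The sketch below is only an example: the date and the +01:00 (CET) offset are assumptions, and the helper name utc_slot is made up for illustration.

```ruby
# Sketch: convert a local wall-clock time to the UTC time to enter
# in the scheduler. The date and the "+01:00" offset are example values.

def utc_slot(year, month, day, hour, min, offset)
  Time.new(year, month, day, hour, min, 0, offset).getutc.strftime("%H:%M")
end

# 09:00 CET on a winter day corresponds to 08:00 UTC.
puts utc_slot(2024, 1, 15, 9, 0, "+01:00")
```

During daylight saving time the local offset changes (for example +02:00 for CEST), so the UTC slot shifts by an hour unless the schedule is adjusted.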

"mapping values are not allowed in this context at line xx" when running DelayedJob

One day DelayedJob stopped working on my production server (a Rails 3.2.13 app) and there was no way to get it running again. I hadn't made any changes on the server beforehand. When I tried to run rake jobs:work I saw this error:
mapping values are not allowed in this context at line xx
This error always comes from parsing a YAML file.
While searching for the problem I restarted my app, checked for YAML problems, and checked for system problems, and everything seemed to be fine.
Where could the problem be?

Finally I tried to run the first job from the Rails console with Delayed::Job.find(x).invoke_job, and the problem turned out to be in one specific job and its handler description. I removed that job and then started delayed_job without a problem. So if you have this kind of problem, start searching from the first job in your queue.
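Rather than invoking jobs one at a time, the stored handlers can be syntax-checked in bulk. This is a minimal sketch under assumptions: handlers is a hypothetical Hash of job id to handler YAML that you would build from your jobs table, and the sample strings are invented. It uses YAML.parse, which only checks syntax and does not instantiate the serialized job objects.

```ruby
require "yaml"

# Sketch: report which handler strings are not valid YAML.
# `handlers` (job id => handler text) is a hypothetical input.

def unparseable_handlers(handlers)
  handlers.each_with_object({}) do |(id, yaml), bad|
    begin
      YAML.parse(yaml) # syntax check only; no objects are loaded
    rescue Psych::SyntaxError => e
      bad[id] = e.message
    end
  end
end

# Invented sample data: job 2 has a stray colon in a plain scalar.
handlers = {
  1 => "--- !ruby/object:MyJob\nuser_id: 7\n",
  2 => "--- !ruby/object:MyJob\nnote: broken: value\n",
}

unparseable_handlers(handlers).each do |id, message|
  puts "job #{id}: #{message}"
end
```

Any id that shows up in the result is a candidate for inspection or removal before restarting the worker.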

Rails.root points to the wrong directory in production during a Resque job

I have two jobs that are queued simultaneously and one worker runs them in succession. Both jobs copy some files from the builds/ directory in the root of my Rails project and place them into a temporary folder.
The first job always succeeds, never have a problem - it doesn't matter which job runs first either. The first one will work.
The second one receives this error when trying to copy the files:
No such file or directory - /Users/apps/Sites/my-site/releases/20130829065128/builds/foo
That releases folder is two weeks old and should not still be on the server. It is empty, housing only a public/uploads directory and nothing else. I have killed all of my workers and restarted them multiple times, and have redeployed the Rails app multiple times. When I delete that releases directory, it makes it again.
I don't know what to do at this point. Why would this worker always create/look in this old releases directory? Why would only the second worker do this? I am getting the path by using:
Rails.root.join('builds') - Rails.root is apparently a 2 week old capistrano release? I should also mention this only happens in the production environment. What can I do?

Resque is not being restarted (stopped and started) on deployments, which causes old versions of the code to be run. Each old worker continues to service the queue, resulting in strange errors or behaviors.
Based on the path name it looks like you are using Capistrano for deploying.
Are you using the capistrano-resque gem? If not, you should give that a look.

I had exactly the same problem and here is how I solved it:
In my case the problem was how Capistrano handles the PID files, which record which workers currently exist. These files are normally stored in tmp/pids/. You need to tell Capistrano NOT to store them in each release folder, but in shared/tmp/pids/. Otherwise, after a new deployment, Resque does not know which workers are currently running: it looks into the new release's pids folder, finds no file, and assumes there are no workers that need to be shut down. Resque then just creates new workers. All the old workers still exist, but you cannot see them in the Resque dashboard; you can only see them if you check the processes on the server.
Here is what you need to do:
Add the following lines in your deploy.rb (btw, I am using Capistrano 3.5)
append :linked_dirs, ".bundle", "tmp/pids"
set :resque_pid_path, -> { File.join(shared_path, 'tmp', 'pids') }
On the server, run htop in the terminal and press T to see all the processes that are currently running. It is easy to spot the resque worker processes, and you can see the release folder's name attached to each of them.
You need to kill all worker processes by hand. Exit htop and run the following command to kill all resque processes (I like to have it completely clean):
sudo kill -9 $(ps aux | grep '[r]esque' | awk '{print $2}')
Now you can make a new deploy. You also need to start the resque-scheduler again.
I hope that helps.
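To check whether the PID files in a directory still match live processes, something like the following sketch can help. The shared/tmp/pids path and the helper names are assumptions for illustration; Process.kill with signal 0 only tests whether a process exists and sends nothing to it.

```ruby
# Sketch: map each *.pid file in a directory to whether its process is alive.
# Helper names and the example path are assumptions, not Capistrano APIs.

def process_alive?(pid)
  return false unless pid.positive?
  Process.kill(0, pid) # signal 0: existence check only
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # process exists but belongs to another user
end

def pid_file_status(dir)
  Dir.glob(File.join(dir, "*.pid")).sort.to_h do |path|
    [File.basename(path), process_alive?(File.read(path).to_i)]
  end
end

# Example path; prints an empty Hash if the directory does not exist.
puts pid_file_status("shared/tmp/pids").inspect
```

A PID file whose process is gone is stale, while a running resque process with no PID file in the shared directory is one of the invisible workers described above.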

How do I clear stuck/stale Resque workers?

As you can see from the attached image, I've got a couple of workers that seem to be stuck. Those processes shouldn't take longer than a couple of seconds.
I'm not sure why they won't clear or how to manually remove them.
I'm on Heroku using Resque with Redis-to-Go and HireFire to automatically scale workers.

None of these solutions worked for me; I would still see this in resque-web:
0 out of 10 Workers Working
Finally, this worked for me to clear all the workers:
Resque.workers.each {|w| w.unregister_worker}

In your console:
queue_name = "process_numbers"
Resque.redis.del "queue:#{queue_name}"
Otherwise you can try to fake them as being done to remove them, with:
Resque::Worker.working.each {|w| w.done_working}
EDIT
A lot of people have been upvoting this answer and I feel that it's important that people try hagope's solution which unregisters workers off a queue, whereas the above code deletes queues. If you're happy to fake them, then cool.

You probably have the resque gem installed, so you can open the console and get current workers
Resque.workers
It returns a list of workers
#=> [#<Worker infusion.local:40194-0:JAVA_DYNAMIC_QUEUES,index_migrator,converter,extractor>]
Pick a worker and call prune_dead_workers on it, for example the first one:
Resque.workers.first.prune_dead_workers

Adding to answer by hagope, I wanted to be able to only unregister workers that had been running for a certain amount of time. The code below will only unregister workers running for over 300 seconds (5 minutes).
Resque.workers.each {|w| w.unregister_worker if w.processing['run_at'] && Time.now - w.processing['run_at'].to_time > 300}
I have an ongoing collection of Resque related Rake tasks that I have also added this to: https://gist.github.com/ewherrmann/8809350
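The age check in that one-liner can be pulled out and tried in isolation. This sketch works on plain Hashes instead of real Resque workers, and it uses Time.parse from the standard library in place of ActiveSupport's to_time; the helper name stale? and the sample timestamps are assumptions.

```ruby
require "time"

# Sketch: decide whether a worker's current job has run longer than a threshold.
# `processing` mimics the Hash a worker reports; {} means the worker is idle.

def stale?(processing, now: Time.now, threshold: 300)
  run_at = processing["run_at"]
  return false if run_at.nil?
  now - Time.parse(run_at) > threshold
end

now = Time.utc(2024, 1, 1, 12, 10, 0)
puts stale?({ "run_at" => "2024-01-01T12:00:00Z" }, now: now) # started 10 minutes ago
puts stale?({ "run_at" => "2024-01-01T12:08:00Z" }, now: now) # started 2 minutes ago
puts stale?({}, now: now)                                     # idle worker
```

Keeping the predicate separate makes it easy to tune the threshold before wiring it into the unregister loop.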

Run this command wherever you ran the command to start the server
$ ps -e -o pid,command | grep [r]esque
you should see something like this:
92102 resque: Processing ProcessNumbers since 1253142769
Make note of the PID (process id) in my example it is 92102
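Then you can stop the process in one of two ways.
Gracefully (the worker waits for its current job to finish): kill -QUIT 92102
Forcefully (the current job is killed immediately): kill -TERM 92102
* The signal name is passed to kill as a flag; the PID itself takes no dash.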
Let me know if you have any trouble.
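The same signals can be sent from Ruby with Process.kill. The sketch below uses a spawned sleep process as a stand-in for a worker (an assumption, so that the example is safe to run) and the helper name stop_process is made up; it assumes a Unix-like system.

```ruby
# Sketch: send a signal to a process and report how it ended.
# The `sleep` child is only a stand-in for a real resque worker.

def stop_process(pid, signal)
  Process.kill(signal, pid)        # e.g. "QUIT" or "TERM"
  _, status = Process.wait2(pid)   # only works for our own child processes
  status
end

pid = Process.spawn("sleep", "30")
status = stop_process(pid, "TERM")
puts "signaled: #{status.signaled?}, signal: #{Signal.signame(status.termsig)}"
```

For a real worker started elsewhere you would call only Process.kill, since Process.wait2 can wait solely on children of the current process.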

I just did:
% rails c production
irb(main):001:0>Resque.workers
Got the list of workers.
irb(main):002:0>Resque.remove_worker(Resque.workers[n].id)
... where n is the zero based index of the unwanted worker.

I had a similar problem: Redis had saved a DB snapshot to disk that included invalid (non-running) workers. Each time Redis/Resque was started, they reappeared.
Fix this using:
Resque::Worker.working.each {|w| w.done_working}
Resque.redis.save # Save the DB to disk without ANY workers
Make sure you restart Redis and your Resque workers.

Started working on https://github.com/shaiguitar/resque_stuck_queue/ recently. It's not a solution to how to fix stuck workers but it addresses the issue of resque hanging/being stuck, so I figured it could be helpful for people on this thread. From README:
"If resque doesn't run jobs within a certain timeframe, it will trigger a pre-defined handler of your choice. You can use this to send an email, pager duty, add more resque workers, restart resque, send you a txt...whatever suits you."
Been used in production and works pretty well for me thus far.

Here's how you can purge them from Redis by hostname. This happens to me when I decommission a server and workers do not exit gracefully.
Resque.workers.each { |w| w.unregister_worker if w.id.start_with?(hostname) }
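One caveat with a prefix match is that a hostname such as web1 also matches workers on web10. Assuming worker ids follow the hostname:pid:queues shape shown in the sample output earlier in this thread, a sketch of an exact hostname comparison looks like this; the helper names and the sample ids are illustrative only.

```ruby
# Sketch: select worker ids whose host part equals a hostname exactly.
# Assumes ids are shaped like "hostname:pid:queues".

def worker_host(worker_id)
  worker_id.split(":", 2).first
end

def ids_on_host(worker_ids, hostname)
  worker_ids.select { |id| worker_host(id) == hostname }
end

# Invented sample ids.
ids = ["web1:100:*", "web10:200:*", "web1:300:mailers"]
puts ids_on_host(ids, "web1").inspect
```

The same comparison could replace start_with? inside the unregister loop when hostnames share a common prefix.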

I ran into this issue and started down the path of implementing a lot of the suggestions here. However, I discovered the root cause that was creating this issue was that I was using the gem redis-rb 3.3.0. Downgrading to redis-rb 3.2.2 prevented these workers from getting stuck in the first place.

I've cleared them out from redis-cli directly. Luckily redistogo.com allows access from environments outside heroku.
Get dead worker ID from the list. Mine was
55ba6f3b-9287-4f81-987a-4e8ae7f51210:2
Run this command in redis directly.
del "resque:worker:55ba6f3b-9287-4f81-987a-4e8ae7f51210:2:*"
You can monitor redis db to see what it's doing behind the scenes.
redis xxx.redistogo.com> MONITOR
OK
1380274567.540613 "MONITOR"
1380274568.345198 "incrby" "resque:stat:processed" "1"
1380274568.346898 "incrby" "resque:stat:processed:c65c8e2b-555a-4a57-aaa6-477b27d6452d:2:*" "1"
1380274568.346920 "del" "resque:worker:c65c8e2b-555a-4a57-aaa6-477b27d6452d:2:*"
1380274568.348803 "smembers" "resque:queues"
Second last line deletes the worker.

In resque 2.0.0, here's one way that seems to work for removing only the workers that are apparently dead:
Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }
I am not an expert in what's going on here; it's possible there's a better way to do this, or that this will have problems. I'm just trying to figure this out too.
This seems to remove workers that haven't sent a "heartbeat" in much longer than expected from the resque worker list.
If the phantom worker was in a "running" state, then a new entry in the "failed" job queue will be created corresponding to phantom job.

I had stuck/stale resque workers here too, or should I say 'jobs', because the worker is actually still there and running fine, it's the forked process that is stuck.
I chose the brutal solution of killing, via a bash script, any forked process that has been "Processing" for more than 5 minutes. The worker then just spawns the next job in the queue, and everything keeps going.
Have a look at my script here: https://gist.github.com/jobwat/5712437

If you are using newer versions of Resque, you'll need to use the following command as the internal APIs have changed...
Resque::WorkerRegistry.working.each {|work| Resque::WorkerRegistry.remove(work.id)}

This avoids the problem as long as you have a resque version newer than 1.26.0:
resque: env QUEUE=foo TERM_CHILD=1 bundle exec rake resque:work
Keep in mind that it does not let the currently running job finish.

If you use Docker, you can also restart the worker's container, where <id> is the ID of the container running the worker:
docker stop <id>
docker start <id>