Sidekiq worker is leaking memory - ruby-on-rails

Using the sidekiq gem - I have a Sidekiq worker that runs a process (a git-tf clone of a big repository) using IO.popen and tracks stdout to check the progress of the clone.
When I run the worker, I see Sidekiq's memory usage growing over time until the kernel OOM killer kills the process. The subprocess (a Java process) takes only 5% of total memory.
How can I debug/check the memory leak in my code? And is the Sidekiq memory figure the total of my worker's memory plus the popen subprocess?
And does anyone have any idea how to fix it?
EDIT
This is the code of my worker -
https://gist.github.com/yosy/5227250
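In case the gist isn't available, here is a minimal sketch of the pattern described above - a Sidekiq worker that shells out via IO.popen and streams stdout to track progress. The class, command, and argument names are illustrative, not the author's actual code:

require 'sidekiq'

# Illustrative worker (not the gist): run git-tf clone and read its output
# line by line instead of buffering the whole stream in memory.
class CloneRepoWorker
  include Sidekiq::Worker

  def perform(repo_url, target_dir)
    IO.popen(["git-tf", "clone", repo_url, target_dir], err: [:child, :out]) do |io|
      io.each_line do |line|
        # Parse progress out of each line here; avoid accumulating the lines.
        Sidekiq.logger.info(line.chomp)
      end
    end
    raise "git-tf clone failed" unless $?.success?
  end
end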
EDIT 2
I ran the code without Sidekiq and there are no memory leaks. This is something strange with Sidekiq and big repositories in TFS.

I didn't find the cause of the memory leak in Sidekiq, but I found a way to get away from Sidekiq.
I have modified git-tf to have a server command that accepts commands from a Redis queue; it removes a lot of complexity from my code.
The modified version of git-tf is here:
https://github.com/yosy/gittf
I will add documentation about the server command later, once I fix some bugs.
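For reference, the general shape of such a server command - a long-lived process that blocks on a Redis list and runs commands as they arrive - is roughly the following. This is a sketch using the redis gem; the key name, payload format, and dispatch logic are assumptions, not the actual gittf implementation:

require 'redis'
require 'json'

# Sketch of a Redis-driven command loop, not the actual gittf server command.
redis = Redis.new

loop do
  # BRPOP blocks until the Rails app pushes a job onto the list.
  _queue, payload = redis.brpop("gittf:commands")
  job = JSON.parse(payload)

  case job["command"]
  when "clone"
    system("git-tf", "clone", job["url"], job["dir"])
  else
    warn "unknown command: #{job['command']}"
  end
end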

Related

Why are there so many sidekiq processes?

I ran htop on my production server to see what was eating my RAM. A lot of sidekiq processes are running - is this normal?
Press Shift-H. htop shows individual threads as processes by default. There is only one actual sidekiq process.
You seem to have configured it to have 25 workers.
By default, one sidekiq process creates 25 threads.
If that's crushing your machine with I/O, you can adjust it down:
sidekiq -c 10
https://github.com/mperham/sidekiq/wiki/Advanced-Options
If you are not using JRuby then it's likely these are all separate processes that consume memory.

Sidekiq worker shutdown

I have got strange errors
2016-07-15T14:34:09.334Z 16484 TID-33ld0 WARN: Terminating 1 busy worker threads
2016-07-15T14:34:09.334Z 16484 TID-33ld0 WARN: Work still in progress [#<struct Sidekiq::BasicFetch::UnitOfWork queue="queue:load_xml", job="{\"class\":\"GuitarmaniaWorker\",\"args\":[],\"retry\":false,\"queue\":\"load_xml\",\"jid\":\"56c01b371c3ee077c2ccf440\",\"created_at\":1468590072.35382,\"enqueued_at\":1468590072.3539252}">]
2016-07-15T14:34:09.334Z 16484 TID-33ld0 DEBUG: Re-queueing terminated jobs
2016-07-15T14:34:09.335Z 16484 TID-33ld0 INFO: Pushed 1 jobs back to Redis
2016-07-15T14:34:09.336Z 16484 TID-33ld0 INFO: Bye!
What can cause this?
Locally everything works great, but after deployment to the production server these errors appeared.
Any suggestions?
This means Sidekiq is being shut down. In the event of a normal "kill" operation (TERM signal), the Sidekiq server will attempt to shut down gracefully by waiting 8 seconds for jobs to complete. It then stops job execution and re-queues the unfinished work for the next time the server starts.
That raises the question: why is your Sidekiq shutting down? Possible reasons: you (or a script) killed the process from the command line; you or your hosting provider shut down the machine; or your OS ran out of memory. The last reason is the most likely if you're not sure of the cause.
If the out-of-memory shutdown occurs soon after startup, you may be running too many Sidekiq processes and/or your machine may not have enough memory to load the app. If it happens only after some time, you may have a memory leak - run free periodically after starting Sidekiq to see whether memory usage keeps climbing, which would indicate a leak. Sometimes a memory leak is due to a library, sometimes it's your own application. More about tracking down leaks in Ruby here.
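If you want something a bit more targeted than eyeballing free, a small script along the following lines can log the Sidekiq process's resident set size over time so that a steady climb stands out. The pgrep pattern and the one-minute interval are illustrative:

# Log the resident set size of the sidekiq process once a minute.
pid = `pgrep -f sidekiq`.split.first
abort "no sidekiq process found" unless pid

loop do
  rss_kb = `ps -o rss= -p #{pid}`.strip.to_i
  puts "#{Time.now} sidekiq RSS: #{rss_kb / 1024} MB"
  sleep 60
end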

How to detect and prevent spawning failing Unicorn workers

Situation: I am using Rails + Unicorn, deploying with Capistrano. Sometimes the Rails app fails to start in production mode (though it is not real production, but a staging env). This usually happens due to errors in deploy scripts or configuration, and is thus usually not detectable by tests. When this happens, the unicorn master process kills the worker that failed and spawns a new one, which also fails, and so on. During all that time unicorn consumes a lot of CPU and pollutes the logs with the same message.
Manual way (not good): Go to your home page to see if it works. Look at htop. Tail the logs. Kill unicorn manually. Cons: easy to forget; logs are polluted and the CPU is loaded while you are reacting.
Another solution: Use unicorn's preload_app true. This will cause the master process to fail fast. Cons: higher memory consumption in the happy scenario.
Best practice: - ???
Is there any way to cleverly detect that the unicorn master is uselessly trying to spawn failing children and stop it?
You have something like "unicorn start" in your Capistrano script, right? Make your Capistrano script ping Unicorn right after invoking that command. If Unicorn does not return an expected response within a timeout, then you know that something went wrong, and you can choose to roll back the deploy or perform some other action.
As for how to ping Unicorn, that depends. If you have Unicorn listening on a TCP socket then you can use curl. If you have Unicorn listening on a Unix domain socket then you have to write a little script that connects to it, like this:
require 'socket'

# Open the Unicorn Unix domain socket and send a minimal HTTP HEAD request.
sock = UNIXSocket.new('/path-to-unicorn.sock')
sock.write("HEAD / HTTP/1.0\r\n")
sock.write("Host: www.foo.com\r\n")
sock.write("Connection: close\r\n")
sock.write("\r\n")

# Replace /something/ with a pattern you expect in a healthy response
# (e.g. the status line); exit non-zero so the deploy script can react.
if sock.read !~ /something/
  exit 1
end
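For the TCP-socket case mentioned above, a Ruby equivalent of the curl check could look like this (host, port, and timeouts are illustrative):

require 'net/http'

# Ping Unicorn over TCP; exit non-zero if it doesn't answer with a 2xx in time.
http = Net::HTTP.new('127.0.0.1', 8080)
http.open_timeout = 5
http.read_timeout = 5

begin
  response = http.request_head('/')
  exit 1 unless response.is_a?(Net::HTTPSuccess)
rescue StandardError
  exit 1
end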
But it sounds like Phusion Passenger Enterprise solves your problem beautifully. It has a feature called "deployment error resistance". When you deploy a new version and Phusion Passenger detects that it cannot spawn any processes for your new codebase, it will stop trying to spawn the new version and keep the processes for the old version around indefinitely, until you manually give the signal that it's okay to spawn processes for the new version. In the meantime it will log all errors to the log file so that you can analyze the problem.
I would suggest brushing up on your bash skills. The functionality you need is already in Unicorn, as it leverages the Unix-y master/worker process model.
You need an init.d script, or at the very least godrb or monit. I recommend the init.d script route AND monitoring. It's more complex, but it can more easily be leveraged by your monitoring software and also gives you an automatic start on reboot.
The gist of it is:
Send the USR2 signal to the unicorn master process; this will fork a new master process.
Then send WINCH to the old master process; this will kill each of its workers.
Then you can send the old master process the QUIT signal.
Unicorn Signals
This will spin up a new master process running the new code and label the old one as (old). If the new one fails, the old one can be returned to its prior state and you shouldn't suffer an outage, just a restart error. This is the beauty of unicorn: you can get nearly instantaneous deploys of your code.
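In terms of concrete commands, the dance described above boils down to something like the following sketch (the pid file path is illustrative; after USR2, unicorn renames the old pid file to unicorn.pid.oldbin and the new master writes a fresh one):

# Sketch of the restart sequence, driven from Ruby; paths are illustrative.
old_master = File.read("/var/run/unicorn.pid").to_i

Process.kill("USR2", old_master)   # fork a new master that boots the new code
sleep 10                           # give the new master time to spawn workers
Process.kill("WINCH", old_master)  # old master gracefully winds down its workers
Process.kill("QUIT", old_master)   # then the old master itself exits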
I'm using a lot of hedge words because I did this work on my apps over a year ago, so there are a lot of cobwebs upstairs. Hope this helps!
This is by no means a correct script. It's a good starting point though... feel free to update the gist if you can improve upon it! :-)
Example Unicorn Control Script

Pragmatic ways to monitor Resque queues in Rails

I am looking to automate the starting/restarting of queues with Resque in my Ruby on Rails application. (running on JRuby)
I want to ensure the following criteria are met:
Workers are started after I deploy with capistrano
Workers are restarted if they die for whatever reason
Workers eating too much memory are stopped/restarted and can fire me an email alert
Are there tools that currently provide this functionality, or at least a subset of it? If there isn't anything that restarts the queue/worker, I would at minimum like to be notified so I can do it manually.
The easiest way to do it would be to use a program such as God or Monit to get #2 and #3. For #1, you can just set up your Capistrano script to send kill -INT to all the Resque workers; the monitoring program will then start them up again.
The advantage of using kill -INT rather than manually stopping and starting the jobs in the Capistrano script is that your deploy won't have to wait for every worker to stop processing its job before starting them back up. It also means that if you have a long-running job, whatever workers are free will be running the new code as quickly as possible.
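As a sketch, the Capistrano side of that can be as small as a task hooked into the deploy that sends INT to the running workers and lets God/Monit bring them back up on the new code. This uses Capistrano 2-style syntax; the process match and role name are assumptions:

# Sketch of a Capistrano 2 task: interrupt running Resque workers after deploy
# and rely on God/Monit to restart them against the freshly deployed code.
namespace :resque do
  task :restart_workers, roles: :app do
    # The [r]esque bracket trick stops pkill from matching its own shell.
    run "pkill -INT -f '[r]esque' || true"
  end
end

after "deploy:restart", "resque:restart_workers"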
I'm not especially familiar with it, but I believe the god gem is used frequently for process management.

Preventing delayed_job background jobs from consuming too much CPU on a single server

My Rails application has a number of tasks which are offloaded into background processes, such as image resizing and uploading to S3. I'm using delayed_job to manage these processes.
These processes, particularly thumbnailing PDFs (using Ghostscript) and resizing images (using ImageMagick), are CPU intensive and often consume 100% CPU time. Since these jobs are running on the same (RedHat Linux) server as the web application itself, as well as the DB, they can lead to our web application being unresponsive.
One solution is to get another server on which to run only the background jobs. I guess this would be the optimal solution? However, since this isn't something I can do immediately, I wonder whether it would be possible to somehow make the background jobs run at a lower operating-system priority, and hence consume fewer CPU cycles in doing their work?
Thoughts appreciated.
If I'm not mistaken, delayed_job uses worker processes that handle all the background jobs. It should be easy to alter the OS scheduling priority of the process when you start it.
So instead of, for example:
ruby script/delayed_job -e production -n 2 start
try:
nice -n 15 ruby script/delayed_job -e production -n 2 start
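If the workers are already running, their priority can also be lowered in place rather than only at startup. A rough Ruby sketch of that alternative (the pgrep pattern is illustrative, and renicing requires permission over the target processes):

# Lower the scheduling priority of already-running delayed_job workers.
`pgrep -f delayed_job`.split.each do |pid|
  # Positive values mean "nicer", i.e. lower priority.
  Process.setpriority(Process::PRIO_PROCESS, pid.to_i, 15)
end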
