How to debug slow Sidekiq jobs - ruby-on-rails

Often times my Sidekiq jobs will be running for greater than 1 minute. I've tried to debug by sending the Sidekiq process a TTIN signal but I don't see anything being logged. My intuition is that it's a network request that's making it hang but I'm making use of timeouts on all network requests to address this already.
Any suggestions? Thanks!

You are asking how to profile and tune slow Ruby code.
RubyProf.profile do
MyJob.new.perform(...)
end
Output the report and review it to find where the code is slow.
https://github.com/ruby-prof/ruby-prof#usage

You can use the benchmark module https://ruby-doc.org/stdlib-1.9.3/libdoc/benchmark/rdoc/Benchmark.html
You can then save the report to a log file.

Related

Error: too many emails per second on Heroku

I have rails 4 app on Heroku sending emails via deliver_later. Sidekiq is running and working and I have config.active_job.queue_adapter = :sidekiq in my config.
I also see this in the logs [ActiveJob] [ActionMailer::DeliveryJob] which makes me believe emails are being sent as background jobs.
So why do I still get errors about too many emails per second?
Net::SMTPUnknownError: could not get 3xx (550: 550 5.7.0 Requested action not taken: too many emails per second
I just noticed I have sidekiq concurrency set to 3. Maybe that's the problem?
This is not going to be related to Sidekiq (well, not directly, anyway). This relates to whatever SMTP server you are using and the server's rate limiting. If you have multiple instances of Sidekiq firing off emails near simultaneously, this error might get thrown if you have a relatively low rate limit.
For instance, if you are using Mailtrap as a SMTP server in a development or staging environment, a free account is limited to 2 emails per second. Pretty easy to run into this error.
If you can't avoid running into this issue for your purposes, you can try a few different strategies by using the keyword argument 'wait' for ActionMailer's deliver_later method, e.g.:
users_i_need_to_email.each_with_index do |user, index|
t = 5 * index
UserMailer.send_important_stuff(user)
.deliver_later(wait: t.seconds)
end
In the event your :deliver_later calls are operating in different processes or flows, you can get away with randomizing the delivery time (say, some random time up to 5 minutes).
Obviously, neither solution above is full proof, so you are best off using a SMTP server that is adequate for your purposes.
Definitely seems to be an issue with sidekiq concurrency set to 3.
Additional info: Setting concurrency to 1 fixes the issue but I don't want all jobs to be processed by a single worker. It would be great if Sidekiq would let use set concurrency per queue so emails could use a single worker while other job could use multiple workers. However, sidekiq author said it's not possible.
I found a solution to my particular problem here using heroku which involves putting each queue in it's own heroku dyno which allows you to set concurrency in each dyno.

Current Sidekiq job lost when deploying to Heroku

I have a Sidekiq job that runs for a while and when I deploy to Heroku and the job is running, it can't finish within in the few seconds.
That is fine, as the job is designed to be able to be re-run if needed.
The problem is that the job gets lost (instead of put back to redis and run again after deploy).
I found that it is advised to set :timeout: 8 on heroku and I tried it, but it had no effect (also tried seeting to 5).
When there is an exception, I get errors reported, but I don't see any. So not sure what could be wrong.
Any tips on how to debug this?
The free version of Sidekiq will push unfinished jobs back to Redis after the timeout has passed, default of 8 seconds. Heroku gives a process 10 seconds to shut down. That means we have 2 seconds to get those jobs back to Redis or they will be lost. If your network is slow, if the Redis server is swapping, etc, that 2 sec deadline might not be met and the jobs lost.
You were on the right track: one answer is to lower the timeout so you have a better chance of meeting that deadline. But network or swapping delay can't be predicted: even 5 seconds might not be enough time.
Under normal healthy conditions, things should work as designed. Keep your machines healthy (uncongested network, plenty of RAM) and the basic fetch should work well. Sidekiq Pro's reliable fetch feature is a fundamental redesign of how Sidekiq fetches jobs and works around all of these issues by keeping jobs in Redis all the time so they can't be lost. But it comes with serious trade offs too: it's more complicated, slower and more Redis intensive than "basic" fetch.
In short, I don't know why you are losing jobs but make sure your instances and Redis server are healthy and the latency is low.
https://github.com/mperham/sidekiq/wiki/Using-Redis#life-in-the-cloud
This is actually feature of sidekiq - designed to steer you toward paying pro version:
http://sidekiq.org/products/pro
RELIABILITY
More reliable message processing.
Cloud environments are noisy and unreliable. Seeing timeouts? Wild swings in latency or performance? Ruby VM crashes or processes disappearing?
If a Sidekiq process crashes while processing a job, that job is lost.
If the Sidekiq client gets a networking error while pushing a job to Redis, an exception is raised and the job is not delivered.
Sidekiq Pro uses Redis's RPOPLPUSH command to ensure that jobs will not be lost if the process crashes or gets a KILL signal.
The Sidekiq Pro client can withstand transient Redis outages or timeouts. It will enqueue jobs locally upon error and attempt to deliver those jobs once connectivity is restored.
Deploy terminates all processes that belongs to user, therefore job is lost. There is actually not much you can do there.
As #mike-perham and #esse noted, Sidekiq is designed the way it can loose jobs due to its fetching mechanism. Your options to get around this are:
To buy Sidekiq Pro (although it was reported to cause the same issue)
To write your own fetcher (but that would mean you can not use most of 3rd party libraries, as they will not work with your custom fetcher)
To mimic Sidekiq Pro's reliable fetch by backing up your jobs data. In case you are up for this way, check out attentive_sidekiq gem which does exactly that.

running fork in delayed job

we use delayed job in our web application and we need multiple delayed jobs workers happening parallelly, but we don't know how many will be needed.
solution i currently try is running one worker and calling fork/Process.detach inside the needed task.
i was trying to run fork directly in rails application previously but it didnt work too good with passenger.
this solution seems to work well. could there be any caveats in production?
one issue which happened to me today and which anyone trying that should take care of was following:
i noticed that worker is down so i started it. something i didnt think about was that there were 70 jobs waiting in queue. and since processes are forked, they pretty much killed our server for around half an hour by starting all almost immediately and eating all memory in process.. :]
so ensuring that there is god watching over the worker is important.
also worker seems to die often but not sure yet if its connected with forking.

Scaling Dynos with Heroku

I've currently got a ruby on rails app hosted on Heroku that I'm monitoring with New Relic. My app is somewhat laggy when using it, and my New Relic monitor shows me the following:
Given that majority of the time is spent in Request Queuing, does this mean my app would scale better if I used an extra worker dynos? Or is this something that I can fix by optimizing my code? Sorry if this is a silly question, but I'm a complete newbie, and appreciate all the help. Thanks!
== EDIT ==
Just wanted to make sure I was crystal clear on this before having to shell out additional moolah. So New Relic also gave me the following statistics on the browser side as you can see here:
This graph shows that majority of the time spent by the user is in waiting for the web application. Can I attribute this to the fact that my app is spending majority of its time in a requesting queue? In other words that the 1.3 second response time that the end user is experiencing is currently something that code optimization alone will do little to cut down? (Basically I'm asking if I have to spend money or not) Thanks!
Request Queueing basically means 'waiting for a web instance to be available to process a request'.
So the easiest and fastest way to gain some speed in response time would be to increase the number of web instances to allow your app to process more requests faster.
It might be posible to optimize your code to speed up each individual request to the point where your application can process more requests per minute -- which would pull requests off the queue faster and reduce the overall request queueing problem.
In time, it would still be a good idea to do everything you can to optimize the code anyway. But to begin with, add more workers and your request queueing issue will more than likely be reduced or disappear.
edit
with your additional information, in general I believe the story is still the same -- though nice work in getting to a deep understanding prior to spending the money.
When you have request queuing it's because requests are waiting for web instances to become available to service their request. Adding more web instances directly impacts this by making more instances available.
It's possible that you could optimize the app so well that you significantly reduce the time to process each request. If this happened, then it would reduce request queueing as well by making requests wait a shorter period of time to be serviced.
I'd recommend giving users more web instances for now to immediately address the queueing problem, then working on optimizing the code as much as you can (assuming it's your biggest priority). And regardless of how fast you get your app to respond, if your users grow you'll need to implement more web instances to keep up -- which by the way is a good problem since your users are growing too.
Best of luck!
I just want to throw this in, even though this particular question seems answered. I found this blog post from New Relic and the guys over at Engine Yard: Blog Post.
The tl;dr here is that Request Queuing in New Relic is not necessarily requests actually lining up in the queue and not being able to get processed. Due to how New Relic calculates this metric, it essentially reads a time stamp set in a header by nginx and subtracts it from Time.now when the New Relic method gets a hold of it. However, New Relic gets run after any of your code's before_filter hooks get called. So, if you have a bunch of computationally intensive or database intensive code being run in these before_filters, it's possible that what you're seeing is actually request latency, not queuing.
You can actually examine the queue to see what's in there. If you're using Passenger, this is really easy -- just type passenger status on the command line. This will show you a ton of information about each of your Passenger workers, including how many requests are sitting in the queue. If you run with preceded with watch, the command will execute every 2 seconds so you can see how the queue changes over time (so just execute watch passenger status).
For Unicorn servers, it's a little bit more difficult, but there's a ruby script you can run, available here. This script actually examines how many requests are sitting in the unicorn socket, waiting to be picked up by workers. Because it's examining the socket itself, you shouldn't run this command any more frequently than ~3 seconds or so. The example on GitHub uses 10.
If you see a high number of queued requests, then adding horizontal scaling (via more web workers on Heroku) is probably an appropriate measure. If, however, the queue is low, yet New Relic reports high request queuing, what you're actually seeing is request latency, and you should examine your before_filters, and either scope them to only those methods that absolutely need them, or work on optimizing the code those filters are executing.
I hope this helps anyone coming to this thread in the future!

Ruby mod_passenger process timeout

A few Ruby apps I've worked with hang for a long time on slow calls causing processes to backup on the machine eventually requiring a reboot. Is there a quick and easy way in Passenger to limit a execution time for a single Apache call.
In PHP if a process exceeds the max execution time setting in php.ini the process returns an error to Apache and the server keeps merrily plugging away.
I would take a look at fixing the application. Cutting off requests at the web server level is really more of a band aid and not addressing the core problem - which is request failures, one way or another. If the Ruby app is dependent on another service that is timing out, you can patch the code like this, using the timeout.rb library:
require 'timeout'
status = Timeout::timeout(5) {
# Something that should be interrupted if it takes too much time...
}
This will let the code "give up" and close out the request gracefully when needed.

Resources