Frequent SystemExit in Ruby when making HTTP calls - ruby-on-rails

I have a Ruby on Rails Website that makes HTTP calls to an external Web Service.
About once a day I get a SystemExit (stacktrace below) error email where a call to the service has failed. If I then try the exact same query on my site moments later it works fine.
It's been happening since the site went live and I've had no luck tracking down what causes it.
Ruby is version 1.8.6 and rails is version 1.2.6.
Anyone else have this problem?
This is the error and stacktrace.
A SystemExit occurred
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.6/lib/fcgi_handler.rb:116:in `exit'
/usr/local/lib/ruby/gems/1.8/gems/rails-1.2.6/lib/fcgi_handler.rb:116:in `exit_now_handler'
/usr/local/lib/ruby/gems/1.8/gems/activesupport-1.4.4/lib/active_support/inflector.rb:250:in `to_proc'
/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `call'
/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `sysread'
/usr/local/lib/ruby/1.8/net/protocol.rb:133:in `rbuf_fill'
/usr/local/lib/ruby/1.8/timeout.rb:56:in `timeout'
/usr/local/lib/ruby/1.8/timeout.rb:76:in `timeout'
/usr/local/lib/ruby/1.8/net/protocol.rb:132:in `rbuf_fill'
/usr/local/lib/ruby/1.8/net/protocol.rb:116:in `readuntil'
/usr/local/lib/ruby/1.8/net/protocol.rb:126:in `readline'
/usr/local/lib/ruby/1.8/net/http.rb:2017:in `read_status_line'
/usr/local/lib/ruby/1.8/net/http.rb:2006:in `read_new'
/usr/local/lib/ruby/1.8/net/http.rb:1047:in `request'
/usr/local/lib/ruby/1.8/net/http.rb:945:in `request_get'
/usr/local/lib/ruby/1.8/net/http.rb:380:in `get_response'
/usr/local/lib/ruby/1.8/net/http.rb:543:in `start'
/usr/local/lib/ruby/1.8/net/http.rb:379:in `get_response'

Using fcgi with Ruby is known to be very buggy.
Practically everybody has moved to Mongrel for this reason, and I recommend you do the same.

It's been a while since I used FCGI, but I think an FCGI process can raise a SystemExit if the thread takes too long. That could be the web service not responding, or even a slow DNS query. Some Google results show a similar error with Python and FCGI, so moving to Mongrel would be a good idea. This post is the reference I used to set up Mongrel, and I still refer back to it.
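Until you can switch, it may also help to put explicit timeouts on the HTTP call itself, so a slow service fails fast with a rescuable error instead of hanging until FCGI pulls the plug. A rough sketch (the URL and the 5-second limits below are placeholders):

require 'net/http'
require 'uri'

url = URI.parse('http://example.com/service')   # placeholder URL

begin
  http = Net::HTTP.new(url.host, url.port)
  http.open_timeout = 5   # give up if the connection can't be opened in 5s
  http.read_timeout = 5   # give up if a read blocks for more than 5s
  response = http.request_get(url.path)
rescue Timeout::Error, StandardError => e
  # Log and degrade gracefully instead of letting the whole request die
  puts "Web service call failed: #{e.class}: #{e.message}"
end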

I used to get these all the time on Apache1/fastcgi. I think it's caused by fastcgi hanging up before Ruby is done.
Switching to Mongrel is a good first step, but there's more to do. It's a bad idea to call out to web services from live pages, particularly from Rails. Rails is not thread-safe. The number of concurrent connections you can support equals the number of Mongrels (or Passenger processes) in your cluster.
If you have one mongrel and someone accesses a page that calls a web service that takes 10 seconds to time out, every request to your website will timeout during that time. Most of the load balancers just cycle through your mongrels blindly, so if you have two mongrels, every other request will timeout.
Anything that can be unpredictably slow needs to happen in a job queue. The first hit to /slow/action adds the job to the queue, and /slow/action keeps on refreshing via page refreshes or queries via ajax until the job is finished, and then you get your results from the job queue. There are a few job queues for Rails nowadays, but the oldest and probably most widely used one is BackgroundRB.
Another alternative, depending on the nature of your app, is to poll the service every N minutes via cron, cache the data locally, and have your live page read from the cache.
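A minimal sketch of that cron-and-cache idea (the rake task name, URL, and cache key are invented here; Rails.cache assumes a Rails version that ships it, and on older stacks you could write to memcached or a file instead):

# lib/tasks/refresh_service.rake -- run from cron every N minutes
require 'net/http'
require 'uri'

namespace :service do
  desc "Fetch the external service and cache the result locally"
  task :refresh => :environment do
    body = Net::HTTP.get_response(URI.parse('http://example.com/data')).body
    Rails.cache.write('external_service_data', body, :expires_in => 10.minutes)
  end
end

# In the controller, the live page only reads the cache:
def show
  @data = Rails.cache.read('external_service_data')
end

Cron then just runs rake service:refresh on whatever schedule fits how fresh the data needs to be.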

I would also take a look at Passenger. It's a lot easier to get going than the traditional solution of Apache/nginx + Mongrel.

Related

Random slow Rack::MethodOverride#call on rails app on Heroku

Environment:
Ruby: 2.1.2
Rails: 4.1.4
Heroku
In our Rails app hosted on Heroku, requests sometimes take a long time to execute. It happens in 1% of requests or less, but we cannot figure out why it is happening.
We have the New Relic agent installed, and it says it is not request queuing; it is the transaction itself that takes all that time to execute.
However, transaction trace shows this:
(the same request usually takes only 100ms to execute)
As far as I can tell, the time is being consumed before our controller gets invoked. It is consumed on
Rack::MethodOverride#call
and that is what we cannot understand.
Also, most of the time (or possibly always, we are not sure) this happens on POST requests sent by mobile devices. Could this have something to do with a slow connection? (The POST payload is very small, though.)
Has anyone experienced this? Any advice on how to keep exploring this issue is appreciated.
Thanks in advance!
Since the Ruby agent began to instrument middleware in version 3.9.0.229, we've seen this question arise for some users. One possible cause of the longer timings is that Rack::MethodOverride needs to examine the request body on POST in order to determine whether the POST parameters contain a method override. It calls Rack::Request#POST, which ends up triggering a read that reads in the entire request body.
This may be why you see that more time than expected is being spent in this middleware. Looking more deeply into how the POST body relates to the time spent in the middleware might be a fruitful avenue for investigation.
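To see why, here is a simplified paraphrase of what the middleware does on every POST (a sketch of the shape of it, not the actual Rack source):

require 'rack'

class MethodOverrideSketch
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["REQUEST_METHOD"] == "POST"
      req = Rack::Request.new(env)
      # req.POST parses the form body, which forces the whole request body
      # to be read -- this is where the time goes on a slow client
      override = req.POST["_method"]
      env["REQUEST_METHOD"] = override.upcase if override
    end
    @app.call(env)
  end
end

So on a slow mobile connection the middleware is effectively waiting for the body to arrive, and that wait is attributed to Rack::MethodOverride#call.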
In case anyone is experiencing this:
We finally made the switch from Unicorn to Passenger, and this issue has been resolved:
https://github.com/phusion/passenger-ruby-heroku-demo
I am not sure, but the problem may have something to do with POST requests on slow clients. Passenger/nginx says:
Request/response buffering - The included Nginx buffers requests and
responses, thus protecting your app against slow clients (e.g. mobile
devices on mobile networks) and improving performance.
So this may be the reason.

RoR: multiple calls in a row to the same long-time-response controller

Update:
Read "Indicate to an ajax process that the delayed job has completed" before if you have the same problem. Thanks Gene.
I have a problem with concurrency. I have a controller scraping a few web sites, but each call to my controller needs about 4-5 seconds to respond.
So if I call it 2 (or more) times in a row, the second call has to wait for the first one to finish before it can start.
So how can I fix this problem in my controller? Maybe with something like EventMachine?
Update & Example:
application_controller.rb
def func1
  i = 0
  while i <= 2
    puts "func1 at: #{Time.now}"
    sleep(2)
    i = i + 1
  end
end

def func2
  j = 0
  while j <= 2
    puts "func2 at: #{Time.now}"
    sleep(1)
    j = j + 1
  end
end
whatever_controller.rb
puts ">>>>>>>> Started At #{Time.now}"
func1()
func2()
puts "End at #{Time.now}"
So now I need to request http://myawesome.app/whatever several times at the same time from the same user/browser/etc.
I tried Heroku (and local) with Unicorn but without success; this is my setup:
unicorn.rb http://pastebin.com/QL0wdGx0
Procfile http://pastebin.com/RrTtNWJZ
Heroku setup https://www.dropbox.com/s/wxwr5v4p61524tv/Screenshot%202014-02-20%2010.33.16.png
Requirements:
I need a RESTful solution. This is an API, so I need to respond with JSON.
More info:
Right now I have 2 cloud servers running:
Heroku with Unicorn
Engine Yard Cloud with Nginx + Passenger
You're probably using webrick in development mode. Webrick only handles one request at a time.
You have several options; many Ruby web servers exist that can handle concurrency.
Here are a few of them.
Thin
Thin was originally based on mongrel and uses eventmachine for handling multiple concurrent connections.
Unicorn
Unicorn uses a master process that dispatches requests to web workers; 4 workers means 4 possible concurrent requests.
Puma
Puma is a relatively new Ruby server; its headline feature is that it handles concurrent requests in threads, so make sure your code is thread-safe!
Passenger
Passenger is a Ruby server that runs inside Nginx or Apache; it's great for both production and development.
Others
These are a few alternatives; many others exist, but I think they are the most used today.
To use any of these servers, check their instructions, which are generally available in their GitHub READMEs.
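For example, a minimal config/puma.rb could look like this (the worker and thread counts are arbitrary; tune them to your dyno or server size):

# config/puma.rb -- example values only
workers 2                        # forked worker processes
threads 1, 16                    # min and max threads per worker
port ENV.fetch("PORT") { 3000 }  # listen port
preload_app!                     # load the app before forking workers

Start it with bundle exec puma -C config/puma.rb. The same idea applies to Unicorn, except its workers are processes only and its config file uses worker_processes instead.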
For any long response time controller function, the delayed job gem
is a fine way to go. While it is often used for bulk mailing, it works as well for any long-running task.
Your controller starts the delayed job and responds immediately with a page that has a placeholder - usually a graphic with a progress indicator - and Ajax or a timed reload that updates the page with the full information when it's available. Some information on how to approach this is in this SO article.
Not mentioned in the article is that you can use redis or some other memory cache to store the results rather than the main database.
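As a rough sketch of the pattern (the controller, route, and cache-key names here are all made up; it assumes the delayed_job gem and a memory-backed Rails.cache such as redis or memcached):

require 'net/http'
require 'uri'
require 'securerandom'

class ScrapesController < ApplicationController
  # POST /scrapes -- enqueue the slow work and return immediately
  def create
    token = SecureRandom.hex(8)
    Scraper.delay.fetch_and_cache(params[:url], token)  # .delay comes from delayed_job
    render :json => { :token => token, :status => "queued" }
  end

  # GET /scrapes/:id -- polled by Ajax until the result shows up in the cache
  def show
    result = Rails.cache.read("scrape-#{params[:id]}")
    if result
      render :json => { :status => "done", :result => result }
    else
      render :json => { :status => "pending" }
    end
  end
end

class Scraper
  def self.fetch_and_cache(url, token)
    body = Net::HTTP.get_response(URI.parse(url)).body
    Rails.cache.write("scrape-#{token}", body)
  end
end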
Answers above are part of the solution: you need a server environment that can properly dispatch concurrent requests to separate workers; unicorn or passenger can both work by creating workers in separate processes or threads. This allows many workers to sit around waiting while not blocking other incoming requests.
If you are building a typical bot whose main job is to get content from other sources, these solutions may be fine. But if what you need is a simple controller that can accept hundreds of concurrent requests, all of which send independent requests to other servers, you will need to manage threads or processes yourself. Your goal is to have many workers waiting to do a simple job, and one or more masters whose job it is to send requests and then be there to receive the responses. Ruby's Thread class is simple and works well for cases like this with Ruby 2.x or 1.9.3.
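For the fan-out case, plain threads are often enough, since MRI releases the interpreter lock while waiting on network I/O. A minimal sketch (the URLs are placeholders):

require 'net/http'
require 'uri'

urls = %w[http://example.com/a http://example.com/b http://example.com/c]

threads = urls.map do |u|
  Thread.new do
    Net::HTTP.get_response(URI.parse(u)).body
  end
end

# Thread#value joins the thread and returns the block's result
# (or re-raises whatever exception the thread died with)
bodies = threads.map { |t| t.value }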
You would need to provide more detail about what you need to do for help getting to any more specific solution.
Try something like unicorn as it handles concurrency via workers. Something else to consider if there's a lot of work to be done per request, is to spin up a delayed_job per request.
The one issue with delayed job is that the response won't be synchronous, meaning it won't return to the user's browser.
However, you could have the delayed job save its responses to a table in the DB. Then you can query that table for all requests and their related responses.
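A sketch of that approach, using delayed_job's classic Struct-based job (ApiRequest is a hypothetical model with params, response, and status columns):

require 'net/http'
require 'uri'

# The job object only needs to respond to #perform
class FetchJob < Struct.new(:api_request_id)
  def perform
    record = ApiRequest.find(api_request_id)
    body = Net::HTTP.get_response(URI.parse(record.params)).body
    record.update_attributes(:response => body, :status => "done")
  end
end

# In the controller: create the row, enqueue the job, return right away
record = ApiRequest.create(:params => params[:url], :status => "pending")
Delayed::Job.enqueue FetchJob.new(record.id)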
What ruby version are you utilizing?
Ruby & Webserver
Ruby
If it's a simple application, I would recommend the following: try Rubinius (rbx) or JRuby, as they are better at concurrency. They have drawbacks, though; they're not mainline Ruby, so some extensions won't work. But if it's a simple app you should be fine.
Webserver
use Puma or Unicorn if you have the patience to set it up
If your app is hitting an API service
You indicate that the global lock is killing you when you are scraping other sites (presumably ones that allow scraping). If this is the case, something like Sidekiq or Delayed Job should be used, but with caution: these jobs should be idempotent, i.e. written so they can safely run multiple times. And if you start hitting a website repeatedly, you will hit its rate limit pretty quickly; e.g. Twitter limits you to 150 requests per hour. So use background jobs with caution.
If you're the one serving the data
However, reading your question, it sounds like your controller is the API and the lock is caused by users hitting it.
If this is the case, you should use dalli + memcached to serve your data. This way you won't be I/O-bound by the SQL lookup, since memcached is memory-based. MEMORY SPEED > I/O SPEED
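A bare-bones sketch with dalli talking directly to memcached (the server address, TTL, key, and the Product query are placeholders):

require 'dalli'

cache = Dalli::Client.new('localhost:11211', :expires_in => 300)

# fetch returns the cached value if present; otherwise it runs the block,
# stores the result under the key, and returns it
products = cache.fetch('hot-products') do
  Product.order('sales_count DESC').limit(20).to_a
end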

How to profile inconsistent H12 timeouts on Heroku

My users are seeing occasional request timeouts on Heroku. Unfortunately I can not consistently reproduce them which makes them really hard to debug. There's plenty of opportunity to improve performance - e.g. by reducing the huge number of database queries per request and by adding more caching - but without profiling that's a shot in the dark.
According to our New Relic analytics, many requests take between 1 and 5 seconds on the server. I know that's too slow, but it's nowhere near the 30 seconds needed for the timeout.
The error tab on New Relic shows me several different database queries where the timeout occurs, but these aren't particularly slow queries and it can be different queries for each crash. Also for the same URL it sometimes does and sometimes does not show a database query.
How do I find out what's going on in these particular cases? E.g. how do I see how much time it was spending in the database when the timeout occurred, as opposed to the time it spends in the database when there's no error?
One hypothesis I have is that the database gets locked in some cases; perhaps a combination of reading and writing.
You may have already seen it, but Heroku has a doc with some good background about request timeouts.
If your requests are taking a long time, and the processes servicing them are not being killed before the requests complete, then they should be generating transaction traces that will provide details about individual transactions that took too long.
If you're using Unicorn, it's possible that this is not happening because the requests are taking long enough that they're hitting up against Unicorn's timeout (after which the workers servicing those requests will be forcibly killed, not giving the New Relic agent enough time to report back in).
I'd recommend a two-step approach:
Configure the rack-timeout middleware to have a timeout below Heroku's 30s timeout. If this works, it will terminate requests taking longer than the timeout by raising a Timeout::Error, and such requests should generate transaction traces in New Relic (see the configuration sketch after these two steps).
If that yields nothing (which it might, because rack-timeout relies on Ruby's stdlib Timeout class, which has some limitations), you can try bumping the Unicorn request handling timeout up from its default of 60s (assuming you're using Unicorn). Be aware that long-running requests will tie up a Unicorn worker for a longer period in this case, which may further slow down your site, so use this as a last resort.
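For the first step, the setup is roughly this (the 25s value is arbitrary, chosen to sit below Heroku's 30s limit; the exact configuration API differs between rack-timeout versions, so treat this as a sketch and check the gem's README):

# Gemfile
gem 'rack-timeout'

# config/initializers/rack_timeout.rb
# Older rack-timeout versions expose a class-level setting:
Rack::Timeout.timeout = 25
# Newer versions are configured when inserting the middleware instead, e.g.:
# Rails.application.config.middleware.insert_before Rack::Runtime, Rack::Timeout, service_timeout: 25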
Two years late here. I have minimal experience with Ruby, but for Django the issue with Gunicorn is that it does not properly handle slow clients on Heroku because requests are not pre-buffered, meaning a server connection could be left waiting (blocking). This might be a helpful article to you, although it applies primarily to Gunicorn and Python.
You're pretty clearly hitting the issue with long running requests. Check out http://artsy.github.com/blog/2013/02/17/impact-of-heroku-routing-mesh-and-random-routing/ and upgrade to NewRelic RPM 3.5.7.59 - the wait time measuring will be accurately reported.

Strange TTFB (time to first byte) issue on Heroku

We're in the process of improving the performance of our Rails app hosted at Heroku (Rails 3.2.8 and Ruby 1.9.3). During this we've come across one alarming problem whose source seems to be extremely difficult to track down. Let me quickly explain how we experience the problem and how we've tried to isolate it.
--
Since around June we've experienced weird lag behavior in time to first byte all over the site. The problem is obvious from using the site (sometimes the application doesn't respond for 10-20 seconds), and it's also present in waterfall analysis via webpagetest.org.
We're based in Denmark but get this result from any host.
To confirm the problem we've performed a benchmark test where we send 300 identical requests to a simple page and measured the response time.
If we send 300 requests to the front page the median response time is below 1 second, which is fairly good. What scares us is that 60 requests take more than double that time and 40 of those take more than 4 seconds. Some requests take as much as 16 seconds.
None of these slow requests show up in New Relic, which we use for performance monitoring. No request queuing shows up and the results are the same no matter how high we scale our web processes.
Still, we couldn't reject that the problem was caused by application code, so we tried another experiment where we responded to the request via rack middleware.
By placing this middleware (TestMiddleware) at the beginning of the rack stack, we returned a response before the request even hit the application, ensuring that none of the following middleware or the Rails app could cause the delay.
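A minimal sketch of such a short-circuiting middleware (not the exact code we ran, but the same idea):

require 'rack'

class TestMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    # Answer immediately -- the rest of the middleware stack and the
    # Rails app below this point never run
    [200, { 'Content-Type' => 'text/plain' }, ['test']]
  end
end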
Middleware setup:
$ heroku run rake middleware
use Rack::Cache
use ActionDispatch::Static
use TestMiddleware
use Rack::Rewrite
use Rack::Lock
use Rack::Runtime
use Rack::MethodOverride
use ActionDispatch::RequestId
use Rails::Rack::Logger
use ActionDispatch::ShowExceptions
use ActionDispatch::DebugExceptions
use ActionDispatch::RemoteIp
use Rack::Sendfile
use ActionDispatch::Callbacks
use ActiveRecord::ConnectionAdapters::ConnectionManagement
use ActiveRecord::QueryCache
use ActionDispatch::Cookies
use ActionDispatch::Session::DalliStore
use ActionDispatch::Flash
use ActionDispatch::ParamsParser
use ActionDispatch::Head
use Rack::ConditionalGet
use Rack::ETag
use ActionDispatch::BestStandardsSupport
use NewRelic::Rack::BrowserMonitoring
use Rack::RailsExceptional
use OmniAuth::Builder
run AU::Application.routes
We then ran the same script to document response time and got pretty much the same result. The median response time was around 130ms (obviously faster because it doesn't hit the app), but still 60 requests took more than 400ms and 25 requests took more than 1 second. Again, some requests were as slow as 16 seconds.
One explanation could be related to slow hops on the network or DNS setup, but the results of traceroute looks perfectly OK.
This result was confirmed from running the response script on another rails 3.2 and ruby 1.9.3 application hosted on Heroku - no weird behavior at all.
The DNS setup follows Heroku's recommendations.
--
We're confused to say the least. Could there be something fishy with Heroku's routing network?
Why the heck are we seeing this weird behavior? How do we get rid of it? And why can't we see it in New Relic?
It turned out that it was a kind of request queuing. Sometimes the web server was busy, and since Heroku just routes incoming requests randomly to any dyno, I could end up in a queue behind a dyno that was totally stuck due to e.g. database problems. The strange thing is that this was hardly noticeable in New Relic (it's a good idea to uncheck all other resources when viewing things in their charts; then the queuing suddenly appears).
EDIT 21/2 2013: It turned out that the reason it was hardly noticeable in New Relic was that it wasn't measured! http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
We found this very frustrating, and we ended up leaving Heroku in favor of dedicated servers. This gave us 20 times better performance at 1/10 of the cost. Additionally, I must say that we are disappointed by Heroku, who at the time denied that the slowness was due to their infrastructure even though we suspected it and highlighted it several times. We even got answers like this back:
Heroku 28/8 2012: "If you're not seeing request queueing or other slowness reported in New Relic, then this is likely not a server-side issue. Heroku's internal routing should take <1ms. None of our monitoring systems are indicating any routing problems currently."
Additionally, we spoke to New Relic, who also seemed unaware of the issue, even though, by their own account, they have a very close working relationship with Heroku.
Newrelic 29/8 2012: "It looks like whatever is causing this is happening before the Ruby agent's visibility starts. The queue time that the agent records is from the time the request enters a dyno, so the slow down is occurring before then."
The bottom line was that we ended up spending hours and hours optimizing code that wasn't really the bottleneck, and running with too high a dyno scale in a desperate attempt to boost our performance. The only thing we really got from this was bigger receipts from both Heroku and New Relic - NOT COOL. I'm glad that we changed.
PS. At that time there was even a bug that caused New Relic Pro to be charged on ALL dynos, even though we (following New Relic's own advice) had disabled monitoring on our background worker processes. It took a lot of time and many emails before the mistake was admitted by both parties.
PPS. If you are not aware of the current ongoing discussion, then here is the link http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
EDIT 26/2 2013
Heroku has just announced in their newsletter, that Newrelic has released an update that apparently should cast some light on the situation at Heroku.
EDIT 8/4 2013
Heroku has just released an FAQ on the topic
traceroute is not a good measure of problems in the network; it's a tool that can find failures along the path, but it will not give you the full picture.
Try just putting up a static webpage and hitting it by IP address from your webpage tester. If it is still slow, blame the network.
If for some reason it is fast, then you have a different issue.

Ruby mod_passenger process timeout

A few Ruby apps I've worked with hang for a long time on slow calls, causing processes to back up on the machine and eventually requiring a reboot. Is there a quick and easy way in Passenger to limit the execution time for a single Apache call?
In PHP if a process exceeds the max execution time setting in php.ini the process returns an error to Apache and the server keeps merrily plugging away.
I would take a look at fixing the application. Cutting off requests at the web server level is really more of a band-aid that doesn't address the core problem - which is request failures, one way or another. If the Ruby app depends on another service that is timing out, you can patch the code like this, using the timeout.rb library:
require 'timeout'

status = Timeout.timeout(5) do
  # Something that should be interrupted if it takes too much time...
end
This will let the code "give up" and close out the request gracefully when needed.
