Cloudflare + Heroku (Rails) Failed Requests

Our Ruby on Rails application runs on Heroku behind Cloudflare. Occasionally, requests to our server simply don't make it. We learned about this through users emailing us to say their data wasn't saving. I installed Bugsnag on the front end to track how often this occurs. In short, it's a very small percentage of requests (<1%), but when it happens it's extremely frustrating for our users.
The response from Cloudflare contains a status of "-1" and no other information. I'm confident these requests are not reaching our server, since I've set up logging there, too. Further, simply retrying the request works approximately 2/3 of the time, but that has negative consequences for us as well (duplicate data in our database).
Has anyone experienced this before? Any ideas where to look next? We've followed the Cloudflare <> Heroku setup guide closely, and we're operating in Full (Strict) SSL mode. I've done a TON of reading on this. Any help would be GREATLY appreciated.
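One way to make those retries safe is a client-generated idempotency key, so a retried POST can't create a second row. This is not from the original post; it's a minimal sketch assuming a hypothetical Submission model with a unique index on :idempotency_key:

# Sketch only: assumes a hypothetical Submission model with a unique index
# on :idempotency_key. The client generates a UUID once per logical action
# and resends the same key on every retry.
class SubmissionsController < ApplicationController
  def create
    submission = Submission.find_or_initialize_by(idempotency_key: params[:idempotency_key])
    if submission.persisted?
      render json: submission, status: :ok      # retry of an already-saved request
    else
      submission.assign_attributes(submission_params)
      submission.save!                          # unique index guards against races
      render json: submission, status: :created
    end
  end

  private

  def submission_params
    params.require(:submission).permit(:body)
  end
end

With the unique index in place, two concurrent inserts with the same key raise ActiveRecord::RecordNotUnique instead of silently duplicating data.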

Related

Random slow Rack::MethodOverride#call on rails app on Heroku

Environment:
Ruby: 2.1.2
Rails: 4.1.4
Heroku
In our Rails app hosted on Heroku, requests sometimes take a very long time to execute. It happens on 1% of requests or less, but we cannot figure out why.
We have the New Relic agent installed, and it says the time is not request queuing; it is the transaction itself that takes all that time to execute.
However, the transaction trace shows this: [transaction trace screenshot] (the same request usually takes only 100ms to execute)
As far as I can tell, the time is being consumed before our controller gets invoked; it is spent in Rack::MethodOverride#call, and that is what we cannot understand.
Also, most of the time (or possibly always; we are not sure) this happens on POST requests sent by mobile devices. Could it have something to do with a slow connection? (Although the POST payload is very tiny.)
Has anyone experienced this? Any advice on how to keep exploring this issue is appreciated.
Thanks in advance!
Since the Ruby agent began instrumenting middleware in version 3.9.0.229, we've seen this question arise for some users. One possible cause of the longer timings is that Rack::MethodOverride needs to examine the request body on POST in order to determine whether the POST parameters contain a method override. It calls Rack::Request#POST, which ends up reading the entire request body.
This may be why you see that more time than expected is being spent in this middleware. Looking more deeply into how the POST body relates to the time spent in the middleware might be a fruitful avenue for investigation.
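For context, the relevant behavior of Rack::MethodOverride looks roughly like this (a paraphrase, not Rack's exact source):

# Paraphrased sketch of what Rack::MethodOverride does on POST.
class MethodOverride
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["REQUEST_METHOD"] == "POST"
      req = Rack::Request.new(env)
      # req.POST parses the form parameters, which forces a read of the
      # entire request body. On a slow client this blocks until the body
      # has fully arrived, and that wait is attributed to this middleware.
      method = req.POST["_method"].to_s
      env["REQUEST_METHOD"] = method.upcase if method =~ /\A(put|patch|delete)\z/i
    end
    @app.call(env)
  end
end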
In case anyone is experiencing this: we finally made the switch from Unicorn to Passenger, and this issue has been resolved:
https://github.com/phusion/passenger-ruby-heroku-demo
I am not sure, but the problem may have something to do with POST requests from slow clients. The Passenger/Nginx docs say:
Request/response buffering - The included Nginx buffers requests and
responses, thus protecting your app against slow clients (e.g. mobile
devices on mobile networks) and improving performance.
So this may be the reason.
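Based on the linked demo, the switch itself is small. Roughly (a sketch; check the repo for the current instructions):

# Gemfile: replace unicorn with passenger
gem "passenger"

# Procfile: start Passenger, whose bundled Nginx buffers request bodies
# so slow mobile clients no longer tie up an app process
web: bundle exec passenger start -p $PORT --max-pool-size 3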

Old HTTP requests arrive at server behind AWS ELB

We had an interesting event the other day in our system: a burst of month-old HTTP requests arrived at our ELB, and from there at one of our coupled servers. We could tell the requests were old from a timestamp we send from our client app (and from the fact that they contained no longer relevant data :) ).
Our system is hosted on AWS, using a group of EC2 instances behind an ELB that communicates with the EC2 instances over HTTP. Our client app runs on iOS.
One thing to notice: the old requests were dated to a day on which we had a server crash that led to heavy load on our remaining servers (resulting in a lot of hung HTTP requests, i.e. they were never processed).
Also, although the group of old messages originally spanned several minutes (which we know from the timestamps), they all arrived in a single bulk the other day (this is from the ELB metrics).
We are trying to figure out how or where these requests could have stacked up, and to understand why they were released when they were.
Any insights, similar experiences or suggestions will be appreciated, as we've failed to find similar events described on the web. Thanks!
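Not part of the original event, but since the client already sends a timestamp, one defensive option is to reject stale requests before they can mutate anything. A rough Rack sketch (the X-Client-Timestamp header name and 5-minute threshold are hypothetical):

# Sketch: drop requests whose client-sent timestamp (hypothetical
# "X-Client-Timestamp" header, epoch seconds) is older than MAX_AGE,
# so replayed or long-delayed requests can't write stale data.
class RejectStaleRequests
  MAX_AGE = 300 # seconds

  def initialize(app)
    @app = app
  end

  def call(env)
    ts = env["HTTP_X_CLIENT_TIMESTAMP"].to_i
    if ts > 0 && (Time.now.to_i - ts) > MAX_AGE
      return [400, { "Content-Type" => "text/plain" }, ["stale request rejected"]]
    end
    @app.call(env)
  end
end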

Strange TTFB (time to first byte) issue on Heroku

We're in the process of improving performance of our Rails app hosted on Heroku (Rails 3.2.8 and Ruby 1.9.3). In doing so we've come across one alarming problem whose source seems extremely difficult to track down. Let me quickly explain how we experience the problem and how we've tried to isolate it.
--
Since around June we've experienced weird lag behavior in time to first byte all over the site. The problem is obvious when using the site (sometimes the application doesn't respond for 10-20 seconds), and it's also present in waterfall analysis via webpagetest.org.
We're based in Denmark but get this result from any host.
To confirm the problem we performed a benchmark test where we sent 300 identical requests to a simple page and measured the response time.
If we send 300 requests to the front page, the median response time is below 1 second, which is fairly good. What scares us is that 60 requests take more than double that time, and 40 of those take more than 4 seconds. Some requests take as much as 16 seconds.
None of these slow requests show up in New Relic, which we use for performance monitoring. No request queuing shows up, and the results are the same no matter how high we scale our web processes.
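A benchmark along those lines takes only a few lines of Ruby (a sketch; the URL is a placeholder):

require "net/http"

# Send 300 identical GETs and report the median and the slow tail.
uri = URI("https://example.com/")
times = Array.new(300) {
  start = Time.now
  Net::HTTP.get_response(uri)
  (Time.now - start) * 1000.0
}.sort

median = times[times.size / 2]
puts "median:      #{median.round}ms"
puts "> 2x median: #{times.count { |t| t > median * 2 }} requests"
puts "> 4 seconds: #{times.count { |t| t > 4000 }} requests"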
Still, we couldn't rule out that the problem was caused by application code, so we tried another experiment where we responded to the request via Rack middleware.
By placing this middleware (TestMiddleware) at the beginning of the Rack stack, we returned a response before the request even hit the application, ensuring that none of the following middleware or the Rails app could cause the delay.
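The question's actual middleware isn't shown; a minimal version might look like this (the /rack-test probe path is made up):

# Sketch of a short-circuiting test middleware: it answers a probe path
# immediately, before the rest of the stack or the Rails app runs.
class TestMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    if env["PATH_INFO"] == "/rack-test"
      [200, { "Content-Type" => "text/plain" }, ["ok"]]
    else
      @app.call(env)
    end
  end
end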
Middleware setup:
$ heroku run rake middleware
use Rack::Cache
use ActionDispatch::Static
use TestMiddleware
use Rack::Rewrite
use Rack::Lock
use Rack::Runtime
use Rack::MethodOverride
use ActionDispatch::RequestId
use Rails::Rack::Logger
use ActionDispatch::ShowExceptions
use ActionDispatch::DebugExceptions
use ActionDispatch::RemoteIp
use Rack::Sendfile
use ActionDispatch::Callbacks
use ActiveRecord::ConnectionAdapters::ConnectionManagement
use ActiveRecord::QueryCache
use ActionDispatch::Cookies
use ActionDispatch::Session::DalliStore
use ActionDispatch::Flash
use ActionDispatch::ParamsParser
use ActionDispatch::Head
use Rack::ConditionalGet
use Rack::ETag
use ActionDispatch::BestStandardsSupport
use NewRelic::Rack::BrowserMonitoring
use Rack::RailsExceptional
use OmniAuth::Builder
run AU::Application.routes
We then ran the same script to document response times and got pretty much the same result. The median response time was around 130ms (obviously faster because it doesn't hit the app), but 60 requests still took more than 400ms and 25 requests took more than 1 second. Again, some requests were as slow as 16 seconds.
One explanation could be slow hops on the network or the DNS setup, but the results of traceroute look perfectly OK.
This result was confirmed by running the response script against another Rails 3.2 / Ruby 1.9.3 application hosted on Heroku - no weird behavior at all.
The DNS setup follows Heroku's recommendations.
--
We're confused to say the least. Could there be something fishy with Heroku's routing network?
Why the heck are we seeing this weird behavior? How do we get rid of it? And why can't we see it in New Relic?
It turned out that it was a kind of request queuing. Sometimes a web server was busy, and since Heroku just routes incoming requests randomly to any dyno, I could end up in a queue behind a dyno that was totally stuck due to e.g. database problems. The strange thing is that this was hardly noticeable in New Relic (it's a good idea to uncheck all other resources when viewing things in their charts; then the queuing suddenly appears).
EDIT 21/2 2013: It turned out that the reason it was hardly noticeable in New Relic was that it wasn't measured! http://rapgenius.com/Lemon-money-trees-rap-genius-response-to-heroku-lyrics
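To see why random routing produces this long tail, here is a toy simulation (my own sketch, not Heroku's actual router):

# Toy model: with random routing, a stuck dyno still receives its share
# of traffic, so roughly STUCK/DYNOS of all requests hang regardless of
# how fast the healthy dynos are.
DYNOS = 10
STUCK = 1          # dynos hung on e.g. a database problem
REQUESTS = 10_000

stuck_hits = REQUESTS.times.count { rand(DYNOS) < STUCK }
puts "#{(100.0 * stuck_hits / REQUESTS).round(1)}% of requests landed on a stuck dyno"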
We found this very frustrating, and we ended up leaving Heroku in favor of dedicated servers. This gave us 20 times better performance at 1/10 of the cost. Additionally, I must say that we are disappointed in Heroku, who at the time this happened denied that the slowness was due to their infrastructure, even though we suspected it and highlighted it several times. We even got answers like this back:
Heroku 28/8 2012: "If you're not seeing request queueing or other slowness reported in New Relic, then this is likely not a server-side issue. Heroku's internal routing should take <1ms. None of our monitoring systems are indicating any routing problems currently."
Additionally, we spoke to New Relic, who also seemed unaware of the issue, even though, according to themselves, they have a very close working relationship with Heroku.
Newrelic 29/8 2012: "It looks like whatever is causing this is happening before the Ruby agent's visibility starts. The queue time that the agent records is from the time the request enters a dyno, so the slow down is occurring before then."
The bottom line was that we ended up spending hours and hours optimizing code that wasn't really the bottleneck, and running with too high a dyno scale in a desperate attempt to boost our performance. The only thing we really got from this was bigger receipts from both Heroku and New Relic - NOT COOL. I'm glad that we changed.
PS. At the time there was even a bug that caused New Relic Pro to be charged on ALL dynos, even though we (following New Relic's own advice) had disabled monitoring on our background worker processes. It took a lot of time and many emails before the mistake was admitted by both parties.
PPS. If you are not aware of the ongoing discussion, here is the link: http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics
EDIT 26/2 2013
Heroku has just announced in their newsletter that New Relic has released an update that apparently should shed some light on the situation at Heroku.
EDIT 8/4 2013
Heroku has just released an FAQ on the topic.
traceroute is not a good measure of problems in the network; it's a tool that can find failures along the path, but it won't give you the full picture.
Try just putting up a static web page and hitting it by IP address from your web page tester. If it is still slow, blame the network.
If for some reason it is fast, then you have a different issue.
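For example (a sketch; substitute the app's real IP for the placeholder):

$ curl -o /dev/null -s -w "connect: %{time_connect}s  ttfb: %{time_starttransfer}s\n" http://203.0.113.10/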

Is Heroku dependable?

I have been hosting a site on Heroku for a few months that is very soon to go into production.
Since I began with them, there have been at least three significant outages, one of which was the disastrous Amazon outage last month and another of which is a multi-hour outage happening today.
I believe in Heroku's vision and I think they are a great company, but I am faced with the ultimate problem: if they can't keep sites up and running, everything I like about them doesn't really matter.
Is Heroku a reliable provider to run a production site on Rails?
Are there any other providers I might look into that have a better reputation for reliability than Heroku?
In my opinion, downtime can happen with almost any provider. What you need to look at is how well or badly the host handles the downtime and the effort they make to keep customers updated on a resolution.
In my opinion, Heroku is a great place to host your app. The advantages and ease of deploying there make up for the recent (and rare) downtime FOR ME.
I have been a user of Heroku with the Amazon RDS plugin for the past 7-8 months, and my conclusion is that there is nothing to appreciate about Heroku except their architecture. Here is why I think so:
Even though it was sold for $250 million+, they were still NOT using Amazon's multiple availability zones feature. Below is a link describing how SmugMug survived the Amazon crash by using multiple zones:
http://don.blogs.smugmug.com/2011/04/24/how-smugmug-survived-the-amazonpocalypse/
No phone support in the event of issues (not application issues, but Heroku's) - a lot to learn from Rackspace here.
With the application I am hosting, people will starve if it goes down for a few hours on a Friday, let alone 60 hours of downtime.
I see intermittent deployment and connectivity issues. Please visit this link for confirmation:
http://status.heroku.com/
I know developers love it because they throw in a cheap web process called a 'dyno' for free.
So far Heroku does not offer multiple-availability-zone redundancy. If you want something more reliable than Heroku, you can create your own EC2 instances in multiple availability zones. Of course, this will require significantly more server upkeep, administration, and deployment time.
I have found Heroku to be reliable. I highly recommend it for starting out and validating your idea. I believe that when you start your project you want to get it out quickly (to customers or to the public).
As mentioned in other comments, at some point you might need to switch over to EC2, as you might need zone redundancy, and it might actually become cheaper to run on EC2, especially if you already have an SA in the company.
No, it is not. As a customer I've experienced multiple critical outages. These things happen, and I get that. But what makes Heroku unreliable is their nearly non-existent support when things do go wrong. I would use caution when evaluating Heroku, or any provider for that matter, and really understand what you're paying for. Paying as much as I did for Heroku, I expected more.
As an example, one of their databases went offline early on a Sunday. I was made aware immediately, not by Heroku but by our customers and New Relic alerts. I contacted Heroku support just to get the ball rolling as I began to troubleshoot. 24 hours later I had literally no response from Heroku. I could not fork, follow, or take a snapshot of the database as they suggest (because they were experiencing issues), so I basically sat on my hands and waited, hoping that somebody would respond as I frantically attempted to recover somehow, someway.
Was this their fault? No, not at all. I should/could have done something to mitigate this failure. But for as much as I pay for their services each month, I expected something resembling a response to my critical issue.
Our app is hosted by Heroku and went down multiple times over the last 12 months.
Two of those times were caused by third-party add-ons that Heroku offers:
We used Zerigo (recommended by Heroku) for our DNS. This caused our site to go down twice - one time it took over 12 hours to recover. That is absolutely crazy for something like DNS, so we have switched to a more reliable provider.
The RedisToGo add-on went down once.
Heroku does bring some benefits, but be careful about the add-ons you select.
In my org I build simple SPA productivity apps, and I have been using Heroku to host them for the last year, after migrating from a physical box server to cloud VMs.
I've had multiple days lost to Heroku outages that hinder development. Usually running apps stay online and keep working, but when Heroku goes down you can't push updates or restart apps.
Let's also not forget the ridiculous windows for scheduled maintenance (usually 2PM EST, midweek... REALLY?).
As of writing this, Heroku's logging system has been acting up (more or less down) for over 24 hours.
Thankfully my apps aren't mission critical. While I like Heroku's ease of use, it's just not worth this much headache for what is nothing other than an AWS middleman.
That said, I'm moving over to pure AWS EC2 instances.

How to hunt down a long running request in Rails

We have a customer complaining about a long-running request. I found the request in production.log but am not sure how to dig deeper into why it took so long. Are there any artifacts in the log that I should look for?
Also, the DB and view times don't add up to the total request time.
Try New Relic RPM. It can parse your logs and show you the slowest requests, along with a lot of other information. It also shows live statistics about the app. The trial should be enough for you to fix your application. (As for the numbers not adding up: the gap between DB + view time and the total is typically time spent in Ruby code, garbage collection, or middleware, which the default log line doesn't break out.)
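If you want slow requests flagged in your own logs as well, ActiveSupport::Notifications can do it; a sketch (the 1000ms threshold is arbitrary):

# config/initializers/slow_request_logger.rb (sketch).
# process_action.action_controller fires once per request with db/view timings.
ActiveSupport::Notifications.subscribe("process_action.action_controller") do |_name, start, finish, _id, payload|
  ms = (finish - start) * 1000.0
  if ms > 1000
    Rails.logger.warn(
      "[SLOW] #{payload[:controller]}##{payload[:action]} #{ms.round}ms " \
      "(db: #{payload[:db_runtime].to_f.round}ms, view: #{payload[:view_runtime].to_f.round}ms)"
    )
  end
end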
