Meteor reports 500 errors randomly

I am running Meteor 1.2.1, but this issue occurred on 1.1 as well. It seems to happen fairly randomly, though I notice that when I take focus off the window, the errors start appearing more regularly. This is the error I see:
sockjs-0.3.4.js:854 POST http://blah.something.com/sockjs/770/bh33bcip/xhr 500 (Internal Server Error)
AbstractXHRObject._start # sockjs-0.3.4.js:854
(anonymous function) # sockjs-0.3.4.js:881
I recently installed natestrauser:connection-banner, which pops a banner at the top whenever Meteor.connection.status().status is anything other than "connected". Since installing it, the banner pops up every time I see the 500 error; the error seems to kick the connection into "waiting" status. It reconnects eventually, but it's a rather annoying error.
I don't see anything in the server logs, nor on the client side. Does anyone have ideas on how to debug this, or why I'm getting this error?
A picture is included here:
http://imgur.com/EtTowR4

I figured out the problem! I use pound as a reverse proxy, and the default installation has a very short timeout. I raised that timeout from 15 seconds to 60 seconds and the 500 errors disappeared. I don't know whether this is because pound's keep-alive was set to 30 seconds (which would not keep anything alive given a 15-second timeout), or because the Meteor client doesn't check in more often than every 15 seconds. Perhaps someone can chime in on why this is.
Either way, beware of your reverse proxy settings with Meteor!
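For anyone hitting the same thing, the relevant pound.cfg directives look roughly like this; the values mirror the fix described above (pound's defaults are TimeOut 15 and Alive 30), but your listener/service layout will differ:

```
# pound.cfg (fragment) -- illustrative values
TimeOut 60   # was 15 by default; long enough to outlive idle gaps between client check-ins
Alive   30   # back-end keep-alive/health-check interval
```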

Related

Request consistently returns 502 after 1 minute while it successfully finishes in the container (application)

To preface this: I know an HTTP request that takes longer than 1 minute is bad design, and I'm going to look into Cloud Tasks, but I do want to figure out why this issue is happening.
So as specified in the title, I have a simple API request to a Cloud Run service (fully managed) that takes longer than 1 minute; it does some DB operations, generates PDFs, and uploads them to GCS. When I make this request from the client (browser), it consistently returns a 502 response after 1 minute of waiting (presumably coming from the HTTP Load Balancer):
However, when I look at the logs, the request is successfully completed (in about 4 to 5 minutes):
I'm also getting one of these "errors" for each PDF being generated and uploaded to GCS, but from what I've read, these shouldn't really be the issue:
To verify that it's not just some timeout issue in the application code or the browser, I put a 5-minute sleep on a random API call in a local build, and everything worked fine and dandy.
I have set the request timeout on Cloud Run to the maximum (15 minutes), left max concurrency at the default 80, set CPU and RAM to 2 vCPUs and 2 GB respectively, and set the timeout on the Fastify (Node.js) server to 15 minutes as well. I also went through the logs and couldn't spot an error indicating that the instance ran out of memory, or any other error around the time I receive the 502. Finally, I followed the advice to use strace for a more in-depth look at system calls, just in case something was going very wrong there, but from what I saw, everything looked fine.
In the end, my suspicion is that there's some weird race condition in routing between the container and the gateway/load balancer, but I know next to nothing about Knative (on which Cloud Run is built), so it's just a hunch.
If anyone has any more ideas on why this is happening, please let me know!
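For reference, the timeout knobs mentioned above, plus the load balancer's own backend service timeout (which defaults to 30 seconds and is a common source of gateway 502s), can be set from the CLI. The service and backend names below are placeholders:

```shell
# Cloud Run request timeout (maximum is 15 minutes)
gcloud run services update my-service --timeout=900

# Backend service timeout on the external HTTP(S) load balancer in front of it
gcloud compute backend-services update my-backend --global --timeout=900
```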

Do I have to worry about my logs containing long requests for signalr/poll and signalr/connect

In my logs I have requests to signalr/poll and signalr/connect that take around 30 seconds.
The application had some issues recently caused by thread starvation. Might these requests be the root cause, or is this expected behaviour and a normal duration?
When I request the site with Chrome I see WebSocket traffic, so I guess it is running OK for most clients.
The application is accessed via VPN, and sometimes the connection is bad. Could this be a reason for falling back to long polling?
If you do not have enough threads, you end up in a deadlock: the app will start to error out and stop working properly. At that point you would be forced to restart your AppPool or web server. If your application is falling back to long polling and a client is connected but not doing anything, the poll will remain open until it gets a response, or, if a timeout is configured (the default is 30 seconds, I believe), it will close on timeout. I would try restarting your AppPool and see if that helps; if not, something is wrong in the transport layer. It should only need to fall back to long polling in extreme circumstances.

Random and occasional network error (NSURLErrorDomain Code=-1001 and NSURLErrorDomain Code=-1005)

For the last couple of days I've been trying to debug a network error from d00m. I'm starting to run out of ideas/leads, and my hope is that other SO users have valuable experience that might be useful. I hope to provide all the relevant information, but I'm not personally in control of the server environments.
The whole thing started when users noticed a couple of "network errors" in our app. The errors seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version, or backend updates. The two errors that occur behind the scenes are:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
and more frequently:
Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost."
After debugging for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approx. 10 random (GET and POST) requests at our backend with a random sleep (1-20 seconds) between each request. However, it only occurs in periods. What I've experienced over the last couple of days is that when a "period of error" starts, I get one of the two errors once or twice per run (an error rate of 1/10 or 1/20 requests). This error rate continues for a couple of hours, then the errors disappear for a couple of hours, and then it starts all over.
Some quick facts about the setup:
Happens on device and simulator
Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
We use NSURLSession for our network requests. We also have AFNetworking included (updated to the latest version), but we only use the Security part for SSL pinning. Even with SSL pinning turned off entirely, the error still occurs.
Some findings I've written down during the last couple of days:
It seems to only happen on our production environments, which have a somewhat different configuration than our staging environments. This led me to think it might be related to the keep-alive bug as discussed here and here. However, our ops department set up a new staging environment sending the same keep-alive header as the production environments, and this did not make the error occur on staging.
Our Android version of the app was unable to reproduce the error using the same set of requests. Further, we've not received any customer reports of "network errors" in the Android app.
My gut feeling says it's related to the server environment and the HTTP implementation in iOS. I'm, however, unable to track down a convincing pattern that proves anything. I've built the same setup as a simple Rails script, and when the next "error period" occurs, I'll be ready to try to reproduce it outside of iOS land. I'll update the question when this happens.
I'm not looking for solutions involving resetting Wi-Fi settings, shutting down the simulator, or similar, as I do not see these as feasible solutions in a production environment. I've also considered the retry-loop fix mentioned in the GitHub issue, but I see that as a last resort.
Please let me know if you need any more information.
In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.
Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.
Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:
http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/
From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.
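A variant of the same idea, if the Wi-Fi-sharing setup is awkward: on a Mac with Xcode's tools installed you can capture the device's traffic directly through a remote virtual interface. The UDID below is a placeholder:

```shell
# Create a remote virtual interface (rvi0) for the attached iOS device
rvictl -s 0123456789abcdef0123456789abcdef01234567
# Capture its traffic while reproducing the error
sudo tcpdump -i rvi0 -w capture.pcap
# When done, tear the interface down
rvictl -x 0123456789abcdef0123456789abcdef01234567
```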
OK, I lost a lot of time investigating a similar issue.
-1005 can be caused by a known iOS bug, and there are a couple of fixes. For example, add the header
"Connection" with the value "close".
More info
-1001 is a different story. In my case, the problem was a strange (bad?) firewall on the server. It was banning the device when there were many (not even that many) requests in a short period of time.
I believe you can do an easy test if you suspect you are facing a similar issue:
1. Send a lot of requests in a loop (how many depends on the firewall settings; let's say 50 in 1 second).
2. Close/kill the app (this will close the connections to the server).
3. (Optional) Wait a while (let's say 60 seconds).
4. Start the app again and try to send a request.
If you now get timeouts for all subsequent requests, you probably have the same issue and should talk to the server guys.
PS: if you don't have access to the server, you can tell users to restart Wi-Fi on the device to break out of that timeout loop. It could be a last resort in some cases.
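The test above could be scripted roughly like this; BASE_URL is a placeholder for your own endpoint:

```shell
BASE_URL="https://api.example.com/ping"

# Step 1: fire ~50 requests within about a second
for i in $(seq 1 50); do
  curl -s -o /dev/null -m 5 "$BASE_URL" &
done
wait

# Steps 2-3: kill the app (here, the finished curl processes) and wait a while
sleep 60

# Step 4: if this single request now times out, a rate-limiting firewall
# banning the client is the likely culprit
curl -v -m 10 "$BASE_URL"
```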

end of file reached EOFError (Databasedotcom + Rails + Heroku)

After much frustration trying to figure it out myself, I am reaching out to the SO guys (you!) to help me trace this formidable error:
Message: end of file reached (EOFError)
Backtrace:
["/app/vendor/ruby-1.9.3/lib/ruby/1.9.1/openssl/buffering.rb:174:in `sysread_nonblock
Background: My app is a Rails 3 app hosted on Heroku, 100% back-end. It uses Redis/Resque workers to process its payload, received from Salesforce via the Chatter REST API.
Trouble: Unlike other similar EOF errors in HTTPS/OpenSSL in Ruby, my error happens very randomly (I can't yet predict when it will come up).
Usual Suspects: The error shows up quite frequently when I create 45 Resque workers and try to sync data from 45 different Salesforce Chatter REST API connections all at once! It's so frequent that 20% or more of my processing fails, all due to this error.
Remedy Steps:
I am using the Databasedotcom gem, which uses HTTPS and follows all the required steps to create a sane HTTPS connection.
So...
Use SSL set in HTTPS - checked
URI Encode - checked
Ruby 1.9.3 - checked
HTTP read timeout is set to 900 (15 minutes)
I retry on this EOF error a maximum of 30 times, sleeping 30 seconds before each retry!
Still, it fails some of the data.
Any help here please?
Have you considered that Salesforce might not like so many connections at once from a single source, and that you are being blocked by a DDoS preventer?
Also, setting those long timeouts is quite useless. If the connection fails, drop it and reschedule a new one yourself; that is something Resque handles fine. Otherwise it will just add up those waiting times if it keeps having problems...
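A rough sketch of the fail-fast-and-retry idea, with a capped exponential backoff instead of a fixed 30-second sleep. `with_retries` is a hypothetical helper, not part of Databasedotcom or Resque:

```ruby
# Illustrative sketch: retry a block on connection-level failures with
# capped exponential backoff, instead of holding a 15-minute read timeout.
def with_retries(max_attempts: 5, base_delay: 1)
  attempt = 0
  begin
    yield
  rescue EOFError, Errno::ECONNRESET
    attempt += 1
    raise if attempt >= max_attempts   # give up; let Resque reschedule the job
    sleep [base_delay * (2**attempt), 60].min
    retry
  end
end

# Usage: wrap the Chatter API call that intermittently raises EOFError.
# with_retries { client.query("SELECT Id FROM FeedItem") }
```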

Heroku app drops initial requests

Whenever I hit my application on Heroku for the first time (after a gap of about 10 minutes), it fails with a "Something went wrong" error. But a refresh always fixes the problem. Any ideas what might be causing this? Thanks for your help!
If you are running with 1 dyno (the free way), your dyno will shut down after some period of inactivity and get started back up on the next request. So when you leave it alone for 10 minutes, it gets shut down and has to spin back up on that first request. That process is usually pretty fast; you will see a 3-5 second startup lag, but not enough to time you out.
Do you have anything going on during startup that would take a long time?
Also, if it is worth paying a little bit per month, you can bump it up to 2 dynos; Heroku does not spin down paid apps.
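For reference, scaling up can be done from the CLI; "web" assumes the default process type in your Procfile:

```shell
heroku ps:scale web=2
```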
I was having the same issue when I added the compass framework to my application. In my error logs I was getting this error:
Errno::EACCES (Permission denied - /app/public/stylesheets/screen.css)
Following these instructions solved the problem
http://devcenter.heroku.com/articles/using-compass
To build on @Ben's answer, which is 100% correct, there's one issue: it doesn't seem you're getting a timeout error. The "Something went wrong" error indicates a 500 error, so your app is being loaded, but something is throwing an exception. If it only happens on the first request, then something that is loaded/executed only on the first request is causing the problem.
If this is the case, then to see the error, check your logs:
$ heroku logs
Or sign-up for an error-reporting add-on, like Exceptional (it's free!):
$ heroku addons:add exceptional
You can then access Exceptional from your Heroku dashboard for your app - once there, use the "Add-ons" menu in the upper-right.
This has happened to me on all my apps for the past couple of years. I was never annoyed enough to really figure it out until now.
In my logs on the first request, I get this: Errno::EACCES (Permission denied - /app/public/stylesheets/screen.css)
The second, and subsequent, requests work fine without this error. I can't think of anything wrong with my screen.css file.
