Request consistently returns 502 after 1 minute while it successfully finishes in the container (application) - google-cloud-run

To preface this: I know an HTTP request that takes longer than 1 minute is bad design, and I'm going to look into Cloud Tasks, but I do want to figure out why this issue is happening.
So as specified in the title, I have a simple API request to a Cloud Run service (fully managed) that takes longer than 1 minute; it does some DB operations, generates PDFs and uploads them to GCS. When I make this request from the client (browser), it consistently gives me back a 502 response after 1 minute of waiting (presumably coming from the HTTP Load Balancer).
However, when I look at the logs, the request is successfully completed (in about 4 to 5 min).
I'm also getting one of these "errors" for each PDF that's being generated and uploaded to GCS, but from what I've read, these shouldn't really be the issue.
To verify that it's not just some timeout issue with the application code or the browser, I put a 5 min sleep on a random API call on a local build and everything worked fine and dandy.
I have set the request timeout on Cloud Run to the maximum (15 min), left the max concurrency at the default of 80, allocated 2 CPUs and 2 GB of RAM, and set the timeout on the Fastify (Node.js) server to 15 min as well. Furthermore, I went through the logs and couldn't spot an error indicating that the instance was out of memory, or any other error around the time I receive the 502. Finally, I also followed the advice to use strace for a more in-depth look at system calls, just in case something was going very wrong there, but from what I saw everything looked fine.
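For reference, this is roughly how the server-side timeouts are raised; a minimal sketch, assuming Fastify v3 on Node.js (the route is a placeholder and the values simply mirror the 15 min Cloud Run timeout):

// Minimal sketch, assuming Fastify v3 on Node.js; the route is a placeholder.
const fastify = require('fastify')({ logger: true });

fastify.get('/generate-pdfs', async (request, reply) => {
  // ... long-running DB work, PDF generation and GCS uploads would go here ...
  return { status: 'done' };
});

const start = async () => {
  await fastify.listen(process.env.PORT || 8080, '0.0.0.0');
  // Raise the underlying Node http.Server timeouts so the container itself
  // never drops the connection before Cloud Run's 15 min request timeout.
  fastify.server.setTimeout(15 * 60 * 1000);             // socket inactivity timeout
  fastify.server.keepAliveTimeout = 15 * 60 * 1000;      // idle keep-alive connections
  fastify.server.headersTimeout = 15 * 60 * 1000 + 1000; // should exceed keepAliveTimeout
};

start();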
In the end, my suspicion is that there's some weird race condition in the routing between the container and the gateway/load balancer, but I know next to nothing about Knative (on which Cloud Run is built), so again, it's just a hunch.
If anyone has any more ideas on why this is happening, please let me know!

Related

Properly handle timeout on CloudRun

We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything and it's difficult to debug, because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message:
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadlineExceededError for Python and DeadlineExceededException for Java.
We are currently evaluating the following approach:
Explicitly set Cloud Run's request timeout
Provide the same value as an environment variable, so it's available to the container
When receiving a request, we start a timer that calls a "cleanup" function just before the timeout is exceeded
The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage
This feels a little complicated, so any feedback is very much appreciated.
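A minimal sketch of that approach in a Fastify (Node.js) handler; runAnalysis, stopAnalysis and uploadLogs are hypothetical placeholders, and REQUEST_TIMEOUT_SECONDS is the environment variable mentioned above:

// Minimal sketch of the timer-based cleanup, assuming a Fastify (Node.js) app.
// runAnalysis, stopAnalysis and uploadLogs are hypothetical placeholders for the
// real work; REQUEST_TIMEOUT_SECONDS mirrors the Cloud Run request timeout.
const fastify = require('fastify')({ logger: true });

const TIMEOUT_MS = Number(process.env.REQUEST_TIMEOUT_SECONDS || 300) * 1000;
const SAFETY_MARGIN_MS = 10 * 1000; // run cleanup shortly before Cloud Run kills the request

fastify.post('/analyse', async (request, reply) => {
  const analysis = runAnalysis(request.body); // spawns the R script

  const cleanupTimer = setTimeout(async () => {
    await stopAnalysis(analysis); // stop the R process
    await uploadLogs(analysis);   // push the current stdout/stderr files to GCS
  }, TIMEOUT_MS - SAFETY_MARGIN_MS);

  try {
    return await analysis.finished; // resolves when the script exits normally
  } finally {
    clearTimeout(cleanupTimer);
  }
});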
Since the default timeout is 5 minutes and can be extended up to 60 minutes, I would simply start by increasing it to 10 minutes. Then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected and if it's perhaps due to a forever-growing result set.
If there's no result-set scalability concern, then bumping the timeout up from the 5-minute default seems to be the most reasonable and simple fix. It would only become a problem again if your script has to deal with more data at some point in the future.
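For reference, that bump is a one-line change; a sketch, with SERVICE_NAME and REGION as placeholders:

gcloud run services update SERVICE_NAME --timeout=600 --region=REGION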

Cloud Run: 500 Server Error with no log output

We are investigating an issue on a deployed Cloud Run service, where requests made to the service occasionally fail with a StatusCodeError: 500, while no log of said requests appears on Cloud Run.
Served requests usually produce two log lines detailing the request, route and exit code (POST 200 on https://service-name.a.run.app/route/...):
One with log name projects/XXX/logs/run.googleapis.com/stdout is produced by our application to log the serving of every request
One with log name projects/XXX/logs/run.googleapis.com/requests is automatically produced by cloud run on every request
When the incident occurs, none of those are logged. The client (running in a GKE pod in the same project) has the only log of the failing requests, with the following message:
StatusCodeError: 500 - "\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>500 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"
Rough timeline of the last incident:
14:41 - Service is serving requests as expected, producing both log lines each time
14:44 to 14:56 - Cloud Run logs are empty; every request made to the service (~30) gets the 500 error message
14:56 - Cloud Run terminates the currently running container instance (as happens after some inactivity, for instance), which is correctly logged by the application ([INFO] Handling signal: term)
14:58 - Cloud Run instantiates a new container instance and starts serving incoming requests (which are logged normally)
The absence of logs during the incident makes it hard to investigate its cause, and at this stage we would be grateful for any kind of lead.
Our service has another known issue that may or may not be related. The service is designed to avoid multiple replicas, as a single one should be able to handle the load and serve concurrent requests (Cloud Run concurrency = 80), but it has a relatively long cold start time (~30 s). This leads to 429 errors when a spike of requests arrives while no replica is available (because Cloud Run hard-caps concurrency to 1 during cold start). This issue was somewhat mitigated by allowing some replication (currently maxScale = 3), since each replica can put a request on hold during the cold start, but it will require some work on the client side to handle correctly (simple retries after the cold start, as sketched below).
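A rough sketch of what that client-side retry could look like, assuming a Node 18+ client (global fetch); the URL and retry/backoff limits are placeholders:

// Rough sketch of client-side retry on 429 during cold start (assumes Node 18+).
// SERVICE_URL and the retry/backoff limits are placeholders.
const SERVICE_URL = 'https://service-name.a.run.app/route';

async function callWithRetry(body, maxRetries = 5) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await fetch(SERVICE_URL, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(body),
    });
    // Retry only on 429 (no instance available yet) and 5xx; back off to give
    // the ~30 s cold start time to complete.
    if (res.status !== 429 && res.status < 500) return res;
    await new Promise((r) => setTimeout(r, Math.min(2 ** attempt * 1000, 30000)));
  }
  throw new Error('Service still unavailable after retries');
}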
I have found this PIT that describes the aforementioned behavior. It seems to happen because a part of Cloud Run thinks that there are already provisioned instances handling the traffic, but there aren't. This issue is currently being worked on internally, but there's no ETA for a fix at the moment.
The current workaround is to set a maximum number of instances to at least 4.
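For the record, that workaround is also a one-line change; a sketch, with SERVICE_NAME and REGION as placeholders:

gcloud run services update SERVICE_NAME --max-instances=4 --region=REGION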

In what cases does Google Cloud Run respond with "The request failed because the HTTP connection to the instance had an error."?

We've been running Google Cloud Run for a little over a month now and noticed that we periodically have Cloud Run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* preceded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown 120 seconds after receiving the request. I figured out that the issue was Node 12's default request timeout of 120 seconds. So if you are using a Node server, you can either change the default timeout or upgrade to Node 13, where the default timeout was removed: https://github.com/nodejs/node/pull/27558.
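A minimal sketch of the first option on an older Node version, using a plain http server (the handler is a placeholder):

// Minimal sketch: Node 12's http server times out idle sockets after 120 s by default.
// Setting the timeout explicitly (0 disables it) avoids the mid-request disconnect.
const http = require('http');

const server = http.createServer((req, res) => {
  // ... long-running handler (placeholder) ...
  res.end('ok');
});

server.setTimeout(0);                 // disable the 120 s default socket timeout
// server.setTimeout(15 * 60 * 1000); // or align it with the Cloud Run request timeout

server.listen(process.env.PORT || 8080);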
If your logs didn't catch anything useful, most probably the instance is crashing because you are running heavy CPU tasks. A mention of this can be found on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use a lot of CPU and as the container is out of resources it is unable to process some requests
For me, the issue was resolved by upgrading Node in the Dockerfile from "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build".

Meteor reports 500 errors randomly

I am running Meteor 1.2.1, but this issue has occurred on 1.1 as well. It seems to happen pretty randomly, though I notice that if I take focus off the window, the errors start appearing more regularly. This is the error that I see:
sockjs-0.3.4.js:854 POST http://blah.something.com/sockjs/770/bh33bcip/xhr 500 (Internal Server Error)
AbstractXHRObject._start # sockjs-0.3.4.js:854
(anonymous function) # sockjs-0.3.4.js:881
I recently installed natestrauser:connection-banner which pops a banner at the top when Meteor.connection.status().status is anything other than "connected". Since I installed it, this pops up every time I see the 500 error. The 500 error seems to kick it into "waiting" status. It reconnects eventually, but it's a rather annoying error.
I don't see anything on the server side whatsoever, nor on the client side. Does anyone have ideas on how to debug this, or why I'm getting this error?
A picture is included here:
http://imgur.com/EtTowR4
I figured out the problem! I use Pound as a reverse proxy, and the default installation has a very short timeout. I changed that timeout from 15 seconds to 60 seconds and the 500 errors disappeared. I don't know if this is because Pound's keep-alive was set to 30 (which would likely not keep anything alive, given that the timeout was 15 seconds), or because the Meteor client doesn't check in more frequently than every 15 seconds. Perhaps someone can chime in as to why this is?
Either way, beware of your reverse proxy settings with Meteor!
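For anyone hitting the same thing, a sketch of the relevant pound.cfg directives, assuming a stock Pound install proxying to a Meteor app; the backend address and port are placeholders:

# pound.cfg sketch - raise the backend response timeout from the 15 s default
TimeOut 60   # seconds Pound waits for a backend response
Alive   30   # seconds between backend liveness checks

ListenHTTP
    Address 0.0.0.0
    Port    80
End

Service
    BackEnd
        Address 127.0.0.1   # Meteor app (placeholder)
        Port    3000
    End
End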

end of file reached EOFError (Databasedotcom + Rails + Heroku)

After much frustration trying to figure it out myself, I am reaching out to you SO folks to help me trace this formidable error:
Message: end of file reached EOFError Backtrace:
["/app/vendor/ruby-1.9.3/lib/ruby/1.9.1/openssl/buffering.rb:174:in
`sysread_nonblock
Background: My app is a Rails 3 app hosted on Heroku and is a 100% back-end app. It uses Redis/Resque workers to process its payload received from Salesforce using the Chatter REST API.
Trouble: Unlike other similar EOF errors in HTTPS/OpenSSL in Ruby, my error happens very randomly (I can't yet predict when it will come up).
Usual Suspects: The error has been noticed quite frequently when I try to create 45 Resque workers and sync data from 45 different Salesforce Chatter REST API connections all at once! It's so frequent that 20% or more of my processing fails, all due to this error.
Remedy Steps:
I am using the Databasedotcom gem, which uses HTTPS and follows all the required steps to create a sane HTTPS connection.
So...
Use SSL set in HTTPS - checked
URI Encode - checked
Ruby 1.9.3 - checked
HTTP read timeout is set to 900 (15 minutes)
I retry on this EOF error a maximum of 30 times, sleeping 30 seconds before each retry!
Still, some of the data fails.
Any help here please?
Have you considered that Salesforce might not like so many connections at once from a single source, and that you are being blocked by a DDoS preventer?
Also, setting those long timeouts is quite useless. If the connection fails, drop it and reschedule a new one yourself. That is what Resque does fine; it will add up those waiting times if it keeps having problems...
