AWS ECS Fargate CannotPullContainerError: ref pull has been retried failed to copy: httpReadSeeker: failed open: unexpected status code - docker

From ECS console I started to see this issue
I think I understand pretty well the cause of this. ECS pulls these images as an anonymous user, since I haven't provided any credentials, and since the task was set to run every 5 minutes it was hitting the anonymous quota. No big deal, I set the task to run every 10 minutes and for now the problem is solved.
What drives me nuts is:
From CloudWatch console you can see that the task was executed. If I graph the number of executions I will see a chart with data points every 5 minutes. This is wrong to me, because in order to execute the task ECS first needs to pull something, and it can't, therefore there is no execution.
I can't find the error (CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/amazon/aws-for-fluent-bit/manifests/sha256:f134723f733ba4f3489c742dd8cbd0ade7eba46c75d574...) in any CloudWatch log stream.
I need to be able to monitor for this error. I spot checked it because I found it by chance since I was looking into the console. How can I monitor for this?
FYI, I use Data Dog too for log ingestion. Maybe that can help?

Related

Properly handle timeout on CloudRun

We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything and it's difficult to debug, because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadllineExceededError for Python and DeadlineExceededException for Java.
We currently evaluate the following approach
Explicitly set Cloud Run's request timeout
Provide the same value as an environment variable, so it's available to the container
When receiving a request, we start a timer that calls a "cleanup" function just before the timeout is exceeded
The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage
This feels a little complicated so any feedback very appreciated.
Since the default timeout is 5 minutes and can extend up to 60 minutes, I would simply start by increasing this to 10 minutes. Then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected and if it's perhaps due to a forever-growing result set.
If there's no result set scalability concern, then bumping the default timeout up from 5-minutes seems to be the most reasonable and simple fix. It would only be a problem until your script has to deal with more data in the future for some reason.

Request consistently returns 502 after 1 minute while it successfully finishes in the container (application)

To preface this, I know an HTTP request that's longer than 1 minute is bad design and I'm going to look into Cloud Tasks but I do want to figure out why this issue is happening.
So as specified in the title, I have a simple API request to a Cloud Run service (fully managed) that takes longer than 1 minute which does some DB operations and generates PDFs and uploads them to GCS. When I make this request from the client (browser), it consistently gives me back a 502 response after 1 minute of waiting (presumably coming from the HTTP Load Balancer):
However when I look at the logs the request is successfully completed (in about 4 to 5 min):
I'm also getting one of these "errors" for each PDF that's being generated and uploaded to GCS, but from what I read these shouldn't really be the issue?:
To verify that it's not just some timeout issue with the application code or the browser, I put a 5 min sleep on a random API call on a local build and everything worked fine and dandy.
I have set the request timeout on Cloud Run to the maximum (15min), the max concurrency to the default 80, amount of CPU and RAM to 2 and 2GB respectively and the timeout on the Fastify (node.js) server to 15 min as well. Furthermore I went through the logs and couldn't spot an error indicating that the instance was out of memory or any other error around the time that I'm receiving the 502 error. Finally, I also followed the advice to use strace to have a more in depth look at system calls, just in case something's going very wrong there but from what I saw, everything looked fine.
In the end my suspicion is that there's some weird race condition in routing between the container and gateway/load balancer but I know next to nothing about Knative (on which Cloud Run is built) so again, it's just a hunch.
If anyone has any more ideas on why this is happening, please let me know!

In what cases does Google Cloud Run respond with "The request failed because the HTTP connection to the instance had an error."?

We've been running Google Cloud Run for a little over a month now and noticed that we periodically have cloud run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* proceeded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown after 120 seconds from receiving the request. I figured out the issue that Node 12 default request timeout is 120 seconds. So If you are using Node server you either can change the default timeout or update Node version to 13 as they removed the default timeout https://github.com/nodejs/node/pull/27558.
If your logs didn't catch anything useful, most probably the instance crashes because you run heavy CPU tasks. A mention about this can be found on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use
a lot of CPU and as the container is out of resources it is unable to
process some requests
For me the issue got resolved by upgrading node "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build" in docker file it got resolved by upgarding the node.

Incomplete GCLOUD SQL UPDATE operation upon STORAGE FULL is preventing any further operations including START

Stopped SQL service before realizing storage was full.
The UPDATE OPERATION during stopping did not cleanly complete.
Now any further operations, including START and attempting to change storage size give only the error:
"Operation failed because another operation was already in progress."
Tried from both web cloud console and gcloud command line. Same error on both.
How can I clear this incomplete UPDATE OPERATION so I can then increase storage size and start the SQL server?
It's not a perfect solution, but apparently the OPERATION does eventually complete after 2 to 3 hours. And this continues, every 2 to 3 hours, until an operation which increases the storage size, which again takes 2 to 3 hours to complete, and then the interface works normally again.
Posting this in case someone else runs into this same problem. There may still be a better solution, but giving it a lot of time does seem to work.

Error creating the GCE VMs or starting Dataflow

I'm getting the following error in the recent jobs I'm trying to submit:
2015-01-07T15:51:56.404Z: (893c24e7fd2fd6de): Workflow failed.
Causes: (893c24e7fd2fd601):
There was a problem creating the GCE VMs or starting Dataflow on the VMs so no data was processed. Possible causes:
1. A failure in user code on in the worker.
2. A failure in the Dataflow code.
Next Steps:
1. Check the GCE serial console for possible errors in the logs.
2. Look for similar issues on http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
There are no other errors.
What does this error mean?
Sorry for the trouble.
The Dataflow starts up VM instances and then launches an agent on those VMs. Those agents then do the heavy lifting of executing your code (e.g. ParDo's, reading and writing) your Data.
The error indicates the job failed because no agents were requesting work. As a result, the service marked the job as a failure because it wasn't making any progress and never would since there weren't any agents to process your data.
So we need to figure out where in the agent startup process things failed.
The first thing to check is whether the VMs actually started. When you run your job do you see any VMs created in your project? It might take a minute or two for the VMs to startup but they should appear shortly after the runner prints out the message "Starting worker pool setup". The VMs should be named something like
<PREFIX-OF-JOB-NAME>-<TIMESTAMP>-<random hexadecimal number>-<instance number>
Only a prefix of the job name is used to ensure we don't exceed GCE name limits.
If the VMs startup the next thing to do is to inspect the worker logs to look for errors indicating problems in launching the agent.
The easiest way to access the logs is using the UI. Go to the Google Cloud Console and then select the Dataflow option in the left hand frame. You should see a list of your jobs. You can click on the job in question. This should show you a graph of your job. On the right side you should see a button "view logs". Please click that. You should then see a UI for navigating the logs and you can look for errors.
The second option is to look for the logs on GCS. The location to look for is:
gs://PATH TO YOUR STAGING DIRECTORY/logs/JOB-ID/VM-ID/LOG-FILE
You might see multiple log files. The one we are most interested in is the one that starts with "start_java_worker". If that log file doesn't exist then the worker didn't make enough progress to actually upload the file; or else there might have been a permission problem uploading the log file.
In that case the best thing to do is to try to ssh into one of the VMs before it gets torn down. You should have about 15 minutes before the job fails and the VMs are deleted.
Once you login to the VM you can find all the logs in
/var/log/dataflow/...
The log we care most about at this point is:
/var/log/dataflow/taskrunner/harness/start_java_worker-SOME ID.log
If there is a problem starting the code that runs on the VM that log should tell us. That log and the other logs should also tell us if there is a permission problem that prevents the code running on the worker from being able to access Dataflow.
Please take a look and let us know if you find anything.
Apart from Jeremy Lewi's great answer, I would like to add that I've seen this error appear when you don't enable the proper Google APIs in the Developers Console, as mentioned here, which leads to a permission issue, like Jeremy said.

Resources