Incomplete Google Cloud SQL UPDATE operation after STORAGE FULL is preventing any further operations, including START

I stopped the Cloud SQL instance before realizing its storage was full.
The UPDATE operation triggered by the stop did not complete cleanly.
Now any further operation, including START and attempts to change the storage size, gives only the error:
"Operation failed because another operation was already in progress."
I tried from both the web Cloud Console and the gcloud command line; same error on both.
How can I clear this incomplete UPDATE operation so I can then increase the storage size and start the SQL server?

It's not a perfect solution, but the stuck operation does eventually complete after 2 to 3 hours. Each subsequent operation also takes 2 to 3 hours, until an operation that increases the storage size finally goes through (again after 2 to 3 hours), after which the interface works normally again.
Posting this in case someone else runs into this same problem. There may still be a better solution, but giving it a lot of time does seem to work.
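If you'd rather watch the stuck operation programmatically instead of refreshing the console, here is a minimal sketch using the Cloud SQL Admin API through google-api-python-client; the project and instance names are placeholders, and credentials are assumed to come from the environment:

# Poll the Cloud SQL Admin API for the instance's operations so you can see
# when the stuck UPDATE finally reaches DONE. Requires google-api-python-client
# and application-default credentials; the names below are placeholders.
from googleapiclient.discovery import build

PROJECT = 'my-project'        # placeholder
INSTANCE = 'my-sql-instance'  # placeholder

sqladmin = build('sqladmin', 'v1beta4')
response = sqladmin.operations().list(project=PROJECT, instance=INSTANCE).execute()

for op in response.get('items', []):
    # status is PENDING, RUNNING or DONE
    print(op['name'], op['operationType'], op['status'])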

Related

AWS ECS Fargate CannotPullContainerError: ref pull has been retried failed to copy: httpReadSeeker: failed open: unexpected status code

From the ECS console I started to see this issue.
I think I understand the cause pretty well. ECS pulls these images as an anonymous user, since I haven't provided any credentials, and since the task was set to run every 5 minutes it was hitting Docker Hub's anonymous pull rate limit. No big deal; I set the task to run every 10 minutes and for now the problem is solved.
What drives me nuts is:
From the CloudWatch console you can see that the task was executed. If I graph the number of executions, I see a chart with data points every 5 minutes. This seems wrong to me, because in order to execute the task ECS first needs to pull the image, and it can't, so there is no real execution.
I can't find the error (CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/amazon/aws-for-fluent-bit/manifests/sha256:f134723f733ba4f3489c742dd8cbd0ade7eba46c75d574...) in any CloudWatch log stream.
I need to be able to monitor for this error. I only spotted it by chance because I happened to be looking at the console. How can I monitor for this?
FYI, I also use Datadog for log ingestion. Maybe that can help?
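One hedged option is to poll ECS itself rather than the logs, since the pull failure ends up in the stopped task's stoppedReason. A rough boto3 sketch (the cluster name is a placeholder; this could run on a schedule and push an alert or metric to Datadog):

# Scan recently stopped ECS tasks for image-pull failures. ECS only keeps
# stopped tasks around briefly, so this needs to run regularly.
import boto3

CLUSTER = 'my-cluster'  # placeholder

ecs = boto3.client('ecs')
task_arns = ecs.list_tasks(cluster=CLUSTER, desiredStatus='STOPPED')['taskArns']

if task_arns:
    for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)['tasks']:
        reason = task.get('stoppedReason', '')
        if 'CannotPullContainerError' in reason:
            print(task['taskArn'], reason)  # replace with your alerting/metric call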

Google Cloud Run is very slow vs. local machine

We have a small script that scrapes a webpage (~17 entries) and writes them to a Firestore collection. For this, we deployed a service on Google Cloud Run.
The execution of this code takes ~5 seconds when tested locally using the Docker container image.
The same image when deployed to Cloud Run takes over 1 minute.
Even a simple command such as "Delete all Documents in a Collection", which takes 2-3 seconds locally, takes over 10 seconds when deployed on Cloud Run.
We are aware of Cold Start, and so we tested the performance of Cloud Run on the third, fourth and fifth subsequent runs, but it's still quite slow.
We also experimented with the number of CPUs, instances, concurrency, and memory, using both the default values and extreme values at both ends, but Cloud Run's performance stayed slow.
Is this expected? Are individual instances of Cloud Run really this weak? Can we do something to make it faster?
The problem with this slowness is that if we run our code for a large number of entries, Cloud Run would eventually time out (not to mention the cost of Cloud Run per second).
Posting an answer to my own question, as we experimented a lot with this and found issues in our own implementation.
In our case, the reason for the extremely slow performance was async calls made without awaiting their Promises or callbacks.
What we initially missed was this section of the Cloud Run documentation: Avoiding background activities.
Our code didn't wait for the async operations to finish and responded to the request right away. The async work then continued as background activity and took forever to finish.
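Our service is written in Node, but the same mistake is easy to sketch in Python with asyncio; write_to_firestore below is just a stand-in for the real writes:

# Anti-pattern vs. fix: work that is started but not awaited becomes background
# activity after the response is sent, and Cloud Run throttles the CPU then.
import asyncio

async def write_to_firestore(entries):
    await asyncio.sleep(1)  # stand-in for the real Firestore writes

async def handle_request_wrong(entries):
    asyncio.create_task(write_to_firestore(entries))  # fire-and-forget
    return 'ok'  # response is sent before the writes finish

async def handle_request_right(entries):
    await write_to_firestore(entries)  # wait for the work to complete
    return 'ok'  # respond only once the writes are done

asyncio.run(handle_request_right([{'id': 1}]))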
Responding to comments posted, or similar questions that may arise:
1. We didn't try to reproduce this locally by setting up a VM with the same configuration, because we figured out the cause before getting to that.
2. We are not writing anything to the filesystem (yet), and the operations are simple calls. But this is a good question, and we'll keep it in mind when we start storing/writing data.

In what cases does Google Cloud Run respond with "The request failed because the HTTP connection to the instance had an error."?

We've been running Google Cloud Run for a little over a month now and have noticed that we periodically have Cloud Run instances that simply fail with:
The request failed because the HTTP connection to the instance had an error.
This message is nearly always* preceded by the following message (those are the only messages in the log):
This request caused a new container instance to be started and may thus take longer and use more CPU than a typical request.
* I cannot find, nor recall, a case where that isn't true, but I have not done an exhaustive search.
A few things that may be of importance:
Our concurrency level is set to 1 because our requests can take up to the maximum amount of memory available, 2GB.
We have received errors that we've exceeded the maximum memory, but we've dialed back our usage to obviate that issue.
This message appears to occur shortly after 30 seconds (e.g., 32, 35) and our timeout is set to 75 seconds.
In my case, this error was always thrown 120 seconds after receiving the request. I figured out that the issue was Node 12's default request timeout of 120 seconds. So if you are using a Node server, you can either change the default timeout or update to Node 13, as they removed the default timeout: https://github.com/nodejs/node/pull/27558.
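A related note in case the service is Python rather than Node: gunicorn's default worker timeout is 30 seconds, which would likewise kill requests shortly after the 30-second mark regardless of the Cloud Run timeout. A minimal gunicorn.conf.py sketch that disables it, as Google's Cloud Run Python samples do with --timeout 0 (the worker/thread counts are just placeholders):

# gunicorn.conf.py - rely on Cloud Run's own request timeout instead of
# gunicorn's 30-second default, which otherwise kills long requests.
import os

bind = '0.0.0.0:' + os.environ.get('PORT', '8080')
workers = 1   # placeholder
threads = 8   # placeholder
timeout = 0   # 0 disables gunicorn's worker timeout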
If your logs didn't catch anything useful, the instance most probably crashed because you run heavy CPU tasks. A mention of this can be found on the Google Issue Tracker:
A common cause for 503 errors on Cloud Run would be when requests use
a lot of CPU and as the container is out of resources it is unable to
process some requests
For me, the issue was resolved by upgrading Node from "FROM node:13.10.1 AS build" to "FROM node:14.10.1 AS build" in the Dockerfile.

Docker container on AWS is constantly restarting

One of the service's containers is constantly restarting. From the logs I can see that some requests take around 20 seconds, and for some of them there are exceptions like "An exception occurred in the database while iterating the results of a query. System.InvalidOperationException: An operation is already in progress. at Npgsql.NpgsqlConnection", or timeouts. When I try to access the database from my local environment, I cannot reproduce these exceptions. On random requests that take too long, the container restarts. Has anybody had a similar issue?
As the exception says, your application is likely trying to use the same physical connection at the same time from multiple threads - but it's impossible to know without seeing some code. Make sure you understand exactly when connections are being used and by which thread, and if you're still stuck try to post a minimal code sample that demonstrates the issue.
If you are using an ELB (Elastic Load Balancer), then increase its idle timeout.
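The question's stack is .NET/Npgsql, so the real fix belongs in that code, but the general pattern is the same in any language: give each concurrent request its own connection instead of sharing one object across threads. A rough Python/psycopg2 sketch of that pattern, with placeholder connection details:

# Each request/thread checks a connection out of a pool rather than reusing a
# single shared connection concurrently (which is what triggers
# "An operation is already in progress").
from psycopg2.pool import ThreadedConnectionPool

pool = ThreadedConnectionPool(minconn=1, maxconn=10,
                              dsn='dbname=app user=app password=secret host=db')  # placeholder DSN

def handle_request(query, params):
    conn = pool.getconn()  # one connection per request/thread
    try:
        with conn.cursor() as cur:
            cur.execute(query, params)
            return cur.fetchall()
    finally:
        pool.putconn(conn)  # hand the connection back when done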

PythonOperator task hangs accessing Cloud Storage and is stuck as SCHEDULED

One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):  # bucket/folder set earlier in the task
    hook.download(bucket=bucket, object=input_file)  # execution appears to stop here
In my tests the folder contains a single 20 MB JSON file.
The task normally takes 20-30 seconds, but in some cases it will run for 5 minutes, and after that its state is updated to SCHEDULED and stays stuck there (I waited for more than 6 hours). I suspect the 5 minutes are due to the scheduler_zombie_task_threshold setting of 300 seconds, but I'm not sure.
If I clear the task manually in the web UI, the task is quickly queued and runs again correctly. I am working around the issue by setting an execution_timeout, which updates the task correctly to the FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes, but I'd like to fix the underlying issue to avoid relying on a fixed timeout threshold. Any suggestions?
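For reference, the execution_timeout workaround looks roughly like this; the task id, the callable wrapping the download loop above, and the retries value are placeholders:

# Fail (or retry) the task if it runs longer than 10 minutes instead of letting
# it hang in SCHEDULED. Assumes an Airflow 1.x DAG defined elsewhere in the file.
from datetime import timedelta
from airflow.operators.python_operator import PythonOperator

download_task = PythonOperator(
    task_id='download_from_gcs',
    python_callable=download_input_files,  # hypothetical wrapper around the loop above
    execution_timeout=timedelta(minutes=10),
    retries=1,  # with retries > 0 the task goes to UP_FOR_RETRY, otherwise FAILED
    dag=dag,
)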
There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.
Although Composer is working on a fix, if you want this to happen less frequently in the current version, you may consider reducing the parallelism setting in your Airflow configuration or creating a new environment with a larger machine type.
