PythonOperator task hangs accessing Cloud Storage and is stuck as SCHEDULED - google-cloud-composer

One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook
hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):
    hook.download(bucket=bucket, object=input_file)  # the task appears to hang here
In my tests the folder contains a single 20 MB JSON file.
The task normally takes 20-30 seconds, but in some cases it runs for 5 minutes, after which its state is updated to SCHEDULED and it stays stuck there (I waited for more than 6 hours). I suspect the 5 minutes come from the scheduler_zombie_task_threshold setting (300 seconds), but I am not sure.
If I clear the task manually in the web UI, it is quickly queued and runs again correctly. I am working around the issue by setting an execution_timeout, which correctly moves the task to the FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes, but I'd like to fix the underlying issue rather than rely on a fixed timeout threshold. Any suggestions?

There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.
Although Composer is working on a fix, you can make this happen less often in the current version by lowering the parallelism setting in your Airflow configuration or by creating a new environment with a larger machine type.

Related

Properly handle timeout on Cloud Run

We use Google Cloud Run to wrap an analysis developed in R behind a web API. For this, we have a small Fastify app that launches an R script and uploads the results to Google Cloud Storage. The process' stdout and stderr are written to a file and are also uploaded at the end of the analysis.
However, we sometimes run into issues when a process takes longer to execute than expected. In these cases, we fail to upload anything, and debugging is difficult because stdout and stderr are "lost" on the instance. The only thing we see in the Cloud Run logs is this message:
The request has been terminated because it has reached the maximum request timeout
Is there a recommended way to handle a request timeout?
In App Engine there used to be a descriptive error: DeadlineExceededError for Python and DeadlineExceededException for Java.
We are currently evaluating the following approach:
1. Explicitly set Cloud Run's request timeout.
2. Provide the same value as an environment variable, so it's available to the container.
3. When receiving a request, start a timer that calls a "cleanup" function just before the timeout is exceeded.
4. The cleanup function stops the running analysis and uploads the current stdout and stderr files to Cloud Storage.
This feels a little complicated, so any feedback is much appreciated.
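The approach outlined above could be sketched roughly as follows. The service in question is a Fastify (Node.js) app, so this Python version is for illustration only. The environment variable name REQUEST_TIMEOUT_SECONDS, the 10-second safety margin, and both helper names are assumptions made for this sketch, not part of Cloud Run's API.

```python
import os
import threading


def cleanup_delay(timeout_seconds, margin_seconds=10.0):
    """Seconds to wait before cleanup so it fires just before the request timeout."""
    return max(timeout_seconds - margin_seconds, 0.0)


def start_cleanup_timer(cleanup, env_var="REQUEST_TIMEOUT_SECONDS",
                        default_timeout=300.0, margin_seconds=10.0):
    """Start a timer that calls `cleanup` shortly before the configured timeout.

    `env_var` is a hypothetical variable that you would set to the same value
    as the Cloud Run request timeout when deploying the service.
    """
    timeout = float(os.environ.get(env_var, default_timeout))
    timer = threading.Timer(cleanup_delay(timeout, margin_seconds), cleanup)
    timer.daemon = True  # do not keep the process alive just for the timer
    timer.start()
    return timer
```

A request handler would call start_cleanup_timer(...) before launching the analysis and call timer.cancel() if the analysis finishes in time. The cleanup callable is where stopping the analysis and uploading the stdout and stderr files would go.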
Since the default timeout is 5 minutes and can extend up to 60 minutes, I would simply start by increasing this to 10 minutes. Then observe over the course of a month how that affects your service.
Aside from that fix, I would start investigating why your process is taking longer than expected and if it's perhaps due to a forever-growing result set.
If there's no concern about result-set scalability, then raising the timeout above the 5-minute default seems to be the simplest reasonable fix. It would only become a problem again if your script has to deal with more data in the future.

How can I see how long my Cloud Run deployed revision took to spin up?

I deployed a Vue.js app and a Kotlin server app. Cloud Run promises to put a service to sleep if no requests reach it for a certain time. I had not opened my app for a day, and when I opened it, it was available almost immediately. Since I know how long it takes to start up locally, I don't quite trust that Cloud Run really put the app to sleep and spun it up again that fast.
I'd love to know how I can see how long the spin-up actually took, also to help improve the startup time of the backend service.
After the service has been inactive for some time, note the current time and then request the service URL.
Then go to the logs for the Cloud Run service, and use this filter to see the logs for the service:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.
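The comparison in the last step can be done by hand, or with a small helper like the one below. This is a minimal sketch that assumes both times are available as UTC timestamp strings with at most microsecond precision; the function names are hypothetical. Note that the result also includes network latency and any clock difference between your machine and the logging backend, so it is only an upper bound on the startup time.

```python
from datetime import datetime


def parse_rfc3339(timestamp):
    """Parse a UTC timestamp such as '2021-03-04T10:15:30.350000Z'."""
    # datetime.fromisoformat() only accepts a trailing 'Z' from Python 3.11 on,
    # so it is rewritten as an explicit UTC offset here.
    if timestamp.endswith("Z"):
        timestamp = timestamp[:-1] + "+00:00"
    return datetime.fromisoformat(timestamp)


def startup_delay_seconds(request_sent_at, first_log_at):
    """Seconds between sending the request and the first app log entry."""
    delta = parse_rfc3339(first_log_at) - parse_rfc3339(request_sent_at)
    return delta.total_seconds()
```

Timestamps with more than six fractional digits would need to be truncated before parsing.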
You can't know when the instance will be evicted or whether it is kept in memory. Eviction could happen quickly, or it could take hours or days. It's "serverless".
As for startup time: when I test, I deploy a new revision and send it a request. In the logging service, the first log entry of the new revision gives me the cold start duration (usually 300+ ms, compared to the usual 20-50 ms with a warm start).
The billed instance time is the sum of the running times of all containers. A container is considered "running" while it processes one or more requests.

Google Cloud Run is very slow vs. local machine

We have a small script that scrapes a webpage (~17 entries) and writes the entries to a Firestore collection. For this, we deployed a service on Google Cloud Run.
The execution of this code takes ~5 seconds when tested locally using the Docker container image.
The same image when deployed to Cloud Run takes over 1 minute.
Even a simple command such as "Delete all Documents in a Collection", which takes 2-3 seconds locally, takes over 10 seconds when deployed on Cloud Run.
We are aware of Cold Start, and so we tested the performance of Cloud Run on the third, fourth and fifth subsequent runs, but it's still quite slow.
We also experimented with the number of CPUs, instances, concurrency, memory, using both default values as well as extreme values at both ends, but Cloud Run's performance is slow.
Is this expected? Are individual instances of Cloud Run really this weak? Can we do something to make it faster?
The problem with this slowness is that if we run our code for a large number of entries, Cloud Run would eventually time out (not to mention the per-second cost of Cloud Run).
Posting an answer to my own question, as we experimented a lot with this and found issues in our own implementation.
In our case, the reason for super slow performance was async calls without Promises or callbacks.
What we initially missed was this section: Avoiding background activities
Our code didn't wait for the async operation to finish and responded to the request right away. The async operation then became a background activity and took forever to finish.
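The difference between the two patterns can be illustrated with a small sketch. The service in this thread is a Node.js app, so the Python asyncio version below is only an analogy; write_entry stands in for a Firestore write, and all names are hypothetical.

```python
import asyncio


async def write_entry(store, entry):
    """Stand-in for an asynchronous database write."""
    await asyncio.sleep(0.01)
    store.append(entry)


async def handler_fire_and_forget(store, entries):
    # Anti-pattern: the writes are only scheduled, and the handler "responds"
    # before any of them has run. On Cloud Run, the remaining work then
    # happens as background activity after the response.
    tasks = [asyncio.create_task(write_entry(store, e)) for e in entries]
    return "done", tasks


async def handler_awaited(store, entries):
    # Fix: wait for every write to finish before responding.
    await asyncio.gather(*(write_entry(store, e) for e in entries))
    return "done"
```

With handler_awaited, the store already holds every entry by the time the handler returns. With handler_fire_and_forget, the store is still empty at that point.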
Responding to comments posted, or similar questions that may arise:
1. We didn't try reproducing this locally on a VM with the same configuration, because we found the cause before getting to that.
2. We are not writing anything to the filesystem (yet), and the operations are simple calls. But this is a good question, and we'll keep it in mind when we store or write data.

Cloud Run is closing the container even if my script is still running

I want to run a long-running job on Cloud Run. This task may run for more than 30 minutes, and it mostly sends out API requests.
Cloud Run stops executing after about 20 minutes, and from the metrics it looks like it did not detect that my task is still running, so it probably thinks the container is idle and closes it. I guess I could send requests to the server while the job runs to keep the container alive, but is there a way to signal from the container to Cloud Run that the job is still active and that the container should not be closed?
I can tell it is closing the container because the logs just stop. Then, on the next call I make to the Cloud Run endpoint, I see the "listening" log from the Node.js Express app again.
"I want to run a long-running job on Cloud Run."
This is a red herring.
On Cloud Run, there’s no guarantee that the same container will be used. It’s a best effort.
While you don’t process requests, your CPU will be throttled to nearly 0, so what you’re trying to do right now (running a background task and trying to keep the container alive by sending it requests) is not a great idea. Most likely your app model is not a fit for Cloud Run; I recommend other compute products that let you run long-running processes.
According to the documentation, Cloud Run will time out after 15 minutes, and that limit can't be increased. Therefore, Cloud Run is not a very good solution for long running tasks. If you have work that needs to run for a long amount of time, consider delegating the work to Compute Engine or some other product that doesn't have time limits.
Yes, you can. You can create a timer that calls your own API after 5 minutes, so that the 15-minute timeout is not hit. Whenever the timer fires, it sends a dummy request to your server.
Another option is to increase the container's request timeout from 5 minutes to 1 hour, provided your backend request completes within 1 hour.

Service vs Scheduled Task intervals

If you have a recurring task that runs once per day, you use a Scheduled Task.
If you have a recurring task that runs every 10 seconds, you use a Service.
At what point do you switch between the two? Is there official guidance on this somewhere?
I'm not sure the interval is the main issue here.
Here are a few things to consider:
How much state does this task need in memory? Do you load data from a file or a DB?
Does the system that needs this task to run need to communicate with the task other than when it's running?
Do you need more control over the process lifecycle while the task is up?
You can see where I'm going with this: a service is a resident entity, and a scheduled task isn't.
I think it depends on whether your program is made for only one task or for more. If it's just doing one "stupid" thing (like running a stored procedure in a database every 20 seconds), I would consider a scheduled task, but if it does more than that and has some dependencies (perhaps on the time it runs, or on some file operations), I would consider a service.
I would also consider a service if the intervals between operations vary. Let's say your program runs a single stored procedure in a database, and the next run depends on whether it made "real" changes to the DB: if it did, the next run is in 5 seconds, and if not, the next run is in 20 seconds. That's a perfect example for a service.
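The variable-interval case can be sketched as a simple loop. This is a minimal illustration in Python, not a full service implementation: task is any callable that returns True when it made real changes, the 5- and 20-second values come from the example above, and the function names are hypothetical.

```python
import time


def next_delay(made_changes, busy_delay=5.0, idle_delay=20.0):
    """Pick the wait before the next run based on the last run's outcome."""
    return busy_delay if made_changes else idle_delay


def run_service(task, max_runs, sleep=time.sleep):
    """Run `task` repeatedly, sleeping for a state-dependent interval each time.

    A real service would loop until told to stop; `max_runs` keeps the sketch
    finite, and `sleep` can be swapped out for testing.
    """
    delays = []
    for _ in range(max_runs):
        made_changes = task()
        delay = next_delay(made_changes)
        delays.append(delay)
        sleep(delay)
    return delays
```

A fixed-schedule task runner cannot easily express this, because the next start time depends on the result of the previous run, which the resident process keeps in memory.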
