Prometheus alerting when a pod is running for too long - docker

I have run into a bit of trouble with what seems to be an easy question.
My scenario:
I have a k8s job which can be run at any time (not a CronJob) and which in turn creates a pod to perform some task. Once the pod performs its task it completes, thus completing the job that spawned it.
What I want:
I want to alert via Prometheus if the pod is in a running state for more than 1h, signalling that the task is taking too much time.
I'm interested in alerting ONLY when the duration symbolised by the arrow in the attached image exceeds 1h, and in having no alerts triggered once the pod is no longer running.
What I tried:
The following Prometheus metric, which is an instant vector that can be either 0 (pod not running) or 1 (pod is running):
kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}
I figured I would try to use this metric with the following formula to compute the duration for which the metric was 1 during a day:
(1 - avg_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d])) * 86400 > 3600
Because these pods come and go and are not always present, I'm encountering the following problems:
The expression above starts from the 86400 value and only drops once the container is running, so it would trigger an alert right away
The pod eventually goes away, and I would not like to send out false alerts for pods which are no longer running (even though they took over 1h to run)

Thanks to the suggestion from @HelloWorld, I think this is the best solution to achieve what I wanted:
(sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600) and (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}==1)
Count the number of seconds the pod has been running in the past day/6h/3h and verify whether that exceeds 1h (3600s)
AND
Check that the pod is still running, so that old pods are not taken into consideration and nothing fires once the pod terminates.
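For anyone wanting to wire this into Alertmanager, a minimal alerting-rule sketch built around that expression could look like the following (the group name, alert name, for: duration and labels are made up, so adjust them to your setup):

groups:
- name: pod-runtime
  rules:
  - alert: PodRunningTooLong
    # Fires only while the pod is still running AND it has accumulated
    # more than 3600 seconds of "ready" time in the lookback window.
    expr: |
      (sum_over_time(kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"}[1d:1s]) > 3600)
      and
      (kube_pod_status_ready{condition="true",pod_name=~".+POD-A.+"} == 1)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.pod_name }} has been running for more than 1h"

Keep in mind that a 1-second-resolution subquery over a full day can be expensive to evaluate; a coarser subquery step (with the threshold scaled accordingly) may be cheaper.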

How to increase Node Drain Timeout for AKS node upgrade rollout

Problem:
PDB with maxUnavailable of 1
Pods have a LONG termination grace period of 15 hours (a technical requirement of the business case at hand, which handles stateful connections; it is also imposed by an external dependency, so there is no way for me to change this)
Node Drain timeout is only 1 hour?
During an upgrade, a pod that needs to be evicted from a node might take longer than the node drain timeout, which yields the following error:
(UpgradeFailed) Drain of NODE_NAME did not complete pods [STS_NAME:POD_NAME]: Pod
POD_NAME still in state Running on node NODE_NAME, pod termination grace period 15h0m0s was
greater than remaining per node drain timeout. See http://aka.ms/aks/debugdrainfailures
Code: UpgradeFailed
After this, the cluster is in a failed state.
Since the grace period of the pods is not in my control, I would like to increase the node drain timeout to 31 hours, as there can be 2 of those long-grace-period pods on a single node. I haven't been able to find anything regarding the node drain timeout, though.
I can't even figure out whether it's part of Kubernetes or AKS specifically.
How can I increase the per-node drain timeout so that my long-grace-period pods don't interrupt my node upgrade operations?
EDIT:
In the kubectl CLI reference, the drain command takes a timeout parameter. As I don't invoke the drain myself, I don't see how this helps me. It led me to believe that, if anywhere, this needs to be dealt with on the AKS side of things.
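For reference, this is roughly what that timeout looks like when draining a node yourself with kubectl (the node name below is just a placeholder); it only affects drains you invoke manually, not the one AKS runs during an upgrade:

# Manual drain with a timeout large enough for two 15h grace periods
# (node name is a placeholder)
kubectl drain aks-nodepool1-00000000-vmss000000 --ignore-daemonsets --timeout=31h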
Not an answer to the actual question, but a possible workaround:
Scale up to double the number of nodes you need to run the workload
Manually evict first half of nodes that need upgrading
Start the upgrade
Upgrade fails somewhere in the second half
Manually evict second half of nodes
Start the upgrade
Upgrade completes
Scale down to the required number of nodes again
Disadvantages:
Doubled infrastructure cost for the duration of the upgrades
Lots of manual steps for upscaling, evicting, upgrading, evicting again, upgrading again, downscaling
Might require additional quota requests to actually perform due to possible vCore quota limits
Manual nature of this upgrade will prevent auto-upgrading the cluster successfully
At least doubles the time for the entire operation, because the entire workload needs to be evicted completely twice, instead of just once
It's a terrible workaround, but a workaround.

How can I see how long my Cloud Run deployed revision took to spin up?

I deployed a Vue.js app and a Kotlin server app. Cloud Run promises to put a service to sleep if no requests arrive for a certain time. I had not opened my app for a day. When I opened it, it was available almost immediately. Since I know how long it takes to spin up when started locally, I don't quite trust the promise that Cloud Run really put the app to sleep and then spun it up so incredibly fast.
I'd love to know how I can actually see how long the spin-up took, also so I can improve the startup of the backend service.
After the service has been inactive for some time, record the time at which you request the service URL and then make the request.
Then go to the logs for the Cloud Run service, and use this filter to see the logs for the service:
resource.type="cloud_run_revision"
resource.labels.service_name="$SERVICE_NAME"
Look for the log entry with the normal app output after your request, check its time and compare it with the recorded time.
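If you prefer the command line, roughly the same query can be run with gcloud (the service name here is a placeholder):

gcloud logging read \
  'resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"' \
  --limit=20 --format='table(timestamp, textPayload)'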
You can't know when the instance will be evicted or whether it is kept in memory. It could happen quickly, or it could take hours or days before eviction. It's "serverless".
As for the startup time: when I test, I deploy a new revision and try it out. In the logging service, the first log entries of the new revision give me the cold-start duration (usually 300+ ms, compared with the usual 20-50 ms of a warm start).
The billable instance time is the sum of all the containers' running times. A container is considered "running" when it is processing request(s).

Cloud Run is closing the container even if my script is still running

I want to run a long-running job on Cloud Run. The task may execute for more than 30 minutes and mostly sends out API requests.
Cloud Run stops executing after about 20 minutes, and from the metrics it looks like it did not detect that my task is still running, so it probably thinks the container is idle and closes it. I guess I could make calls to the server while the job runs to keep the container alive, but is there a way to signal from the container to Cloud Run that the job is still active, so it does not close the container?
I can tell it is closing the container because the logs just stop; then, on the next call I make to the Cloud Run endpoint, I see the "listening" log again from the Node.js Express app.
I want to run a long-running job on cloud run.
This is a red herring.
On Cloud Run, there’s no guarantee that the same container will be used. It’s a best effort.
While you don't process requests, your CPU will be throttled to nearly 0, so what you're trying to do right now (running a background task and trying to keep the container alive by sending it requests) is not a great idea. Most likely your app model is not a fit for Cloud Run; I recommend other compute products that would let you run long-running processes.
According to the documentation, Cloud Run will time out after 15 minutes, and that limit can't be increased. Therefore, Cloud Run is not a very good solution for long running tasks. If you have work that needs to run for a long amount of time, consider delegating the work to Compute Engine or some other product that doesn't have time limits.
Yes, you can work around it: create a timer that calls your own API every 5 minutes, so there is no timeout after 15 minutes. Whenever the timer executes, it creates a dummy request to your server.
Another option: you can increase the container's request timeout from 5 minutes to 1 hour, if your backend request completes within 1 hour.
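To illustrate the timer idea (the asker's service is Node.js/Express, but the approach is the same in any language), here is a minimal Python sketch; the service URL and path are placeholders, and the caveat above about CPU throttling between requests still applies:

import threading
import urllib.request

SERVICE_URL = "https://my-service-xyz-uc.a.run.app/healthz"  # placeholder URL

def keep_alive(stop_event, interval_seconds=300):
    # Ping our own public URL every few minutes so the service keeps
    # receiving requests while the long job runs.
    while not stop_event.wait(interval_seconds):
        try:
            urllib.request.urlopen(SERVICE_URL, timeout=10)
        except Exception:
            pass  # a failed ping should not kill the job

def run_long_job():
    stop = threading.Event()
    threading.Thread(target=keep_alive, args=(stop,), daemon=True).start()
    try:
        pass  # ... the actual long-running work goes here ...
    finally:
        stop.set()  # stop pinging once the job is done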

PythonOperator task hangs accessing Cloud Storage and is stuck as SCHEDULED

One of the tasks in my DAG sometimes hangs when accessing Cloud Storage. It seems the code stops at the download function here:
from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

hook = GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default')
for input_file in hook.list(bucket, prefix=folder):
    hook.download(bucket=bucket, object=input_file)
In my tests the folder contains a single 20 MB JSON file.
The task normally takes 20-30 seconds, but in some cases it runs for 5 minutes, after which its state is updated to SCHEDULED and it is stuck there (I have waited for more than 6 hours). I suspect the 5 minutes are due to the scheduler_zombie_task_threshold configuration (300 seconds), but I'm not sure.
If I clear the task manually in the web UI, the task is quickly queued and runs again correctly. I am getting around the issue by setting an execution_timeout, which correctly updates the task to the FAILED or UP_FOR_RETRY state when it takes longer than 10 minutes; but I'd like to fix the underlying issue to avoid relying on a fixed timeout threshold. Any suggestions?
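For reference, the execution_timeout workaround mentioned above looks roughly like this in a DAG definition (the DAG id, task id and callable are placeholders, and the imports follow the Airflow 1.x style that matches the hook used in the question):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def download_input_files(**context):
    # the GoogleCloudStorageHook list/download logic from above goes here
    pass

with DAG(
    dag_id="gcs_download_example",  # placeholder
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    download_task = PythonOperator(
        task_id="download_input_files",  # placeholder
        python_callable=download_input_files,
        provide_context=True,
        execution_timeout=timedelta(minutes=10),  # mark FAILED/UP_FOR_RETRY instead of hanging
        retries=2,
        retry_delay=timedelta(minutes=5),
    )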
There was a discussion on the Cloud Composer Discuss group about this: https://groups.google.com/d/msg/cloud-composer-discuss/alnKzMjEj8Q/0lbp3bTlAgAJ. It is a problem with the Celery executor when Airflow workers die.
Although Composer is working on a fix, if you want this to happen less frequently in the current version, you may consider reducing your parallelism Airflow configuration or creating a new environment with a larger machine-type.

Jenkins: jobs in queue are stuck and not triggered to be restarted

For a while, our Jenkins has been experiencing critical problems: we have hung jobs, and our job scheduler does not trigger the builds. After a Jenkins service restart everything is back to normal, but after some period of time all the problems return (this period can be a week, a day, or even less). Any idea where we can start looking? I'd appreciate any help with this issue.
Muatik has made a good point in his comment: the recommended approach is to run jobs on agent (slave) nodes. If you already do that, you can look at:
Jenkins master machine CPU, RAM and hard disk usage. Access the machine and/or use a plugin like Java Melody. I have seen missing graphics in build test results and stuck builds due to lack of hard disk space. You could also have hit the RAM or CPU limit for the slaves/jobs you are executing. You may need more heap space.
Look at the Jenkins log files, starting with severe exceptions. If the files are too big or you see logrotate exceptions, you can change the logging levels so that fewer exceptions are logged. For more details see my article on this topic. Try to fix the exceptions that you see logged.
Go through recently made changes that could be the cause of such behaviour, for example new plugins or changes to config files (jenkins.xml).
Look at TCP connections: run netstat -a (see the example after this list). Are there suspicious connections (e.g. many sockets in the CLOSE_WAIT state)?
Delete old builds that you do not need.
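For the TCP connection check above, a quick way to spot a pile-up of half-closed connections on the master (exact flags may vary per OS) is:

netstat -an | grep CLOSE_WAIT | wc -l   # count sockets stuck in CLOSE_WAIT
netstat -an | grep CLOSE_WAIT | head    # look at a few of them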
We have been facing this issue for the last 4 months and have tried everything: changing CPU and memory resources, increasing the desired nodes in the ASG. But nothing seemed to work.
Solution:
1. Go to Manage Jenkins -> System Configuration -> Maven project configuration
2. In the "Usage" field, select "Only build jobs with label expressions matching this node"
Doing this resolved it, and Jenkins is working like a rocket now :)
