Airflow job fails after detecting successful Dataflow job as a zombie.
I run an hourly Dataflow job that's triggered by an external Airflow instance using the Python DataflowTemplateOperator. A couple of times a week, Dataflow becomes completely unresponsive to status pings. When I've caught the error in real time and tried looking at the status of the Dataflow job in the GCP UI, the page won't load, despite my having a network connection and being able to view other pages on the GCP site. After a few minutes, everything returns to normal working order. This seems to happen towards the end of a job's run or when workers are shutting down.
The Dataflow jobs don't fail and don't report any errors. Airflow thinks they've failed because, when Dataflow becomes unresponsive, Airflow assumes the jobs are zombies. I needed a fast solution and just increased my number of retries, but I would like to understand the problem and find a better solution.
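For context, the retry workaround looks roughly like this minimal sketch (the template path, project, bucket, and intervals are placeholders, and the import path depends on your Airflow version):

```python
# Minimal sketch of the hourly DAG with the increased-retries workaround.
# Import path varies by Airflow version; this is the older contrib path.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

default_args = {
    "retries": 3,                          # bumped up so a zombie detection doesn't fail the run
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="hourly_dataflow_job",
    default_args=default_args,
    schedule_interval="@hourly",
    start_date=datetime(2020, 1, 1),
    catchup=False,
) as dag:
    run_template = DataflowTemplateOperator(
        task_id="run_dataflow_template",
        template="gs://my-bucket/templates/my_template",  # placeholder
        dataflow_default_options={
            "project": "my-gcp-project",                  # placeholder
            "tempLocation": "gs://my-bucket/tmp",         # placeholder
        },
    )
```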
I'm considering deploying Flink with K8s. I'm new to Flink and have a simple question:
Say I use K8s to manage the Docker containers and deploy the TaskManagers into those containers.
As I understand it, a container can be restarted by K8s when it fails, and a task can be restarted by Flink when it fails.
If a task is running in a Docker container and the container suddenly fails for some reason, then from Flink's point of view a task has failed and should be restarted, while from K8s' point of view a container has failed and should be restarted. In this case, should we worry about a conflict between these two kinds of restart?
I think you want to read up on the official Kubernetes setup guide here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html
It describes 3 ways of getting it to work:
Session Cluster: This involves spinning up the 2 deployments from the appendix of that guide and requires you to submit your Flink job manually, or via a script, at the beginning (see the sketch after this list). This is very similar to a local standalone cluster when you are developing, except it now lives in your Kubernetes cluster.
Job Cluster: By deploying Flink as a k8s job, you would be able to eliminate the job submission step.
Helm chart: By the look of it, the project has not updated this for 2 years, so your mileage may vary.
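For the Session Cluster option, bringing the cluster up is essentially a matter of applying the resource definitions from the appendix of that guide; here is a rough sketch (the file names are the ones used in the guide and may differ in your setup):

```python
# Rough sketch: bring up a Flink session cluster by applying the resource
# definitions from the appendix of the linked setup guide.
import subprocess

SESSION_CLUSTER_MANIFESTS = [
    "flink-configuration-configmap.yaml",
    "jobmanager-service.yaml",
    "jobmanager-deployment.yaml",   # deployment 1
    "taskmanager-deployment.yaml",  # deployment 2
]

for manifest in SESSION_CLUSTER_MANIFESTS:
    subprocess.run(["kubectl", "create", "-f", manifest], check=True)

# Afterwards the job still has to be submitted manually (Flink CLI, REST API,
# or a script), which is exactly the step the Job Cluster mode removes.
```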
I have had success with a Session Cluster, but I would eventually like to try the "proper" way, which by the looks of it is the second method: deploying it as a Kubernetes job.
Depending on your Flink source and the kind of failure, your Flink job will fail in different ways. You shouldn't worry about a "conflict": either Kubernetes restarts the container, or Flink handles the errors it can handle. After a certain number of retries the job is cancelled, depending on how you configured the restart strategy; see the Configuration page for details. If the container exits with a non-zero code, Kubernetes will try to restart it. However, it may or may not resubmit the job, depending on whether you deployed the job in a Job Cluster or whether you had an initialization script in the image you used. In a Session Cluster, this can be problematic depending on whether the job submission is done through the task manager or the job manager. If the job was submitted through the task manager, then you need to cancel the existing failed job so the resubmitted job can start.
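To make the retry/cancel behaviour concrete, here is a minimal sketch of a fixed-delay restart strategy, assuming a recent PyFlink; the same policy can also be set through the restart-strategy options in flink-conf.yaml:

```python
# Sketch: cap how often Flink restarts a failed job before cancelling it.
from pyflink.common.restart_strategy import RestartStrategies
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Allow at most 3 restart attempts, 10000 ms apart; once they are used up,
# the job goes to a terminal failed state instead of restarting again.
env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10000))
```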
Note: if you did go with the Session Cluster and use a filesystem-based state backend (i.e. not the RocksDB state backend) for checkpoints, you will need to figure out a way for the job manager and task manager to share a checkpoint directory.
If the task manager uses a checkpoint directory that is inaccessible to the job manager, checkpoint data will accumulate on the task manager and eventually cause some kind of out-of-disk-space error. This may not be a problem if you decide to go with RocksDB and enable incremental checkpoints.
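As a minimal sketch of what "sharing a checkpoint directory" means in code (the path and interval are placeholders, and this assumes a recent PyFlink), the checkpoint URI just has to point at storage reachable from every pod:

```python
# Sketch: point checkpoints at storage that both the job manager and the
# task managers can reach (a PersistentVolume, NFS mount, or object store),
# so checkpoint data does not pile up on a task-manager-local disk.
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import FsStateBackend

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60 seconds

# A filesystem state backend writes checkpoint files to this URI; it must be
# a shared location, not a directory inside a single container.
env.set_state_backend(FsStateBackend("s3://my-flink-bucket/checkpoints"))
```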
I want to upgrade my Jenkins master without aborting long-running jobs on the slaves or waiting for them to finish. Is there a plugin available that provides this feature?
We have several build jobs running regression and integration tests which take hours to run. Often at least one of those jobs is running, making it hard to restart Jenkins after updates. I know that it is possible to block the queue. We tried this, but it hinders more than it helps.
What we are looking for is a plugin that runs jobs on slaves, caches the output as soon as the connection to the master is interrupted, and sends the remaining output to the master when the master is up again. Does anybody know of a plugin providing this feature?
I'm running a Jenkins server and some slaves on a Docker swarm that's hosted on preemptible Google instances (akin to AWS spot instances). I've got everything set up so that at any given moment there is a Jenkins master running on a single server and slaves running on every other server in the swarm. When one server gets terminated another is spun up to replace it, so eventually Jenkins is back up and running on another machine even if its server was stopped, and slaves get replaced as they die.
I'm facing two problems:
My first problem is that when the Jenkins master dies and comes back online, it tries to resume the jobs that were previously running, and they end up stuck mid-build. Is there any way to have Jenkins automatically restart jobs that were interrupted instead of trying to resume them?
The second is that when a slave dies, I'd like to automatically restart any jobs that were running on it elsewhere. Is there any way to do that?
Currently I'm dealing with both situations by having an external application retry the failed build jobs, but that's not really optimal.
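To make it concrete, the kind of external retry I mean is roughly the following sketch (the URL, job names, and credentials are placeholders):

```python
# Illustrative sketch of an external "retry" helper: poll each job's last
# build over the Jenkins REST API and re-trigger it if it ended FAILURE or
# ABORTED (which is how builds interrupted by a dying master/slave show up).
import requests

JENKINS_URL = "http://jenkins.example.com"      # placeholder
AUTH = ("retry-bot", "api-token")               # placeholder credentials
JOBS = ["integration-tests", "nightly-build"]   # placeholder job names

for job in JOBS:
    info = requests.get(
        f"{JENKINS_URL}/job/{job}/lastBuild/api/json", auth=AUTH
    ).json()
    if info.get("result") in ("FAILURE", "ABORTED"):
        # Depending on the Jenkins security config, a CSRF crumb may also
        # be required for POST requests.
        requests.post(f"{JENKINS_URL}/job/{job}/build", auth=AUTH)
```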
Thanks!
I have more than 30 rake tasks added to Jenkins for scheduling jobs. (Rails project)
But the Jenkins server goes down frequently and uses 100% of the CPU most of the time.
Please suggest a better job scheduler than Jenkins, one that is also capable of things like:
Notify an email when jobs fail
Log the jobs terminal output
Add dependency to jobs
Your question seems to boil down to "recommend me a CI server".
But why does Jenkins fall over and/or use 100% CPU most of the time? I'd be looking into why that is. My experience of Jenkins is that it is pretty stable and low-overhead. If your hardware, OS, or something else is flaky, or just under-provisioned for the task, then swapping Jenkins out isn't going to fix that.
I have set up jobs correctly using Jenkins on CloudBees, Janky, and Hubot. Hubot and Janky work and are pushing the jobs to the Jenkins server.
The job has been sitting in the Jenkins queue for over an hour now. I don't see anywhere to configure the number of executors, and this is a completely default instance from CloudBees.
Is the CloudBees service just taking a while or is something misconfigured?
This was a problem in early March caused by the build containers failing to start cleanly at times.
The root cause was a kernel oops that was occurring in the build container as it launched.
This has since been resolved, and you should not experience these long pauses waiting for an executor.
Anything more than 10 minutes is definitely a bug, and typically anything more than about 5 seconds is unusual (although when a lot of jobs are launched simultaneously, the time to get a container can be on the order of 3 minutes).