Dataflow job not doing anything after starting workers - google-cloud-dataflow

There are no worker logs in a particular environment where I am running this pipeline. Workers are started, but after that nothing happens and the batch pipeline keeps running for a long time. I have set parameters such as --subnetwork for the shared VPC, and the worker region is the same as the shared VPC region.
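For reference, the networking options are passed roughly like this (a minimal sketch; the project, bucket, region, and subnetwork values below are placeholders, not the real ones):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; the subnetwork must be the full URL of a subnet that lives
# in the host project of the shared VPC, in the same region as the workers.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-service-project",
    region="europe-west1",
    temp_location="gs://my-bucket/temp",
    subnetwork=(
        "https://www.googleapis.com/compute/v1/projects/host-project/"
        "regions/europe-west1/subnetworks/my-subnet"
    ),
    use_public_ips=False,  # workers on a shared VPC often run with private IPs only
)

# Trivial pipeline just to show how the options are attached.
with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)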

Related

How to handle flink management and k8s management

I'm considering deploying Flink with K8s. I'm a newbie on Flink and have a simple question:
Say that I use K8s to manage Docker containers and deploy the TaskManager into those containers.
As I understand it, a container can be restarted by K8s when it fails, and a Task can be restarted by Flink when it fails.
If a Task is running in a Docker container and the container suddenly fails for some reason, then from Flink's point of view a Task failed and should be restarted, and from K8s' point of view a container failed and should be restarted. In this case, should we worry about a conflict between the two kinds of "being restarted"?
I think you want to read up on the official kubernetes setup guide here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/ops/deployment/kubernetes.html
It describes 3 ways of getting it to work:
Session Cluster: This involves spinning up the 2 deployments from the appendix of that page and requires you to submit your Flink job manually, or via a script, at the start. This is very similar to a local standalone cluster when you are developing, except it now lives in your Kubernetes cluster.
Job Cluster: By deploying Flink as a k8s job, you would be able to eliminate the job submission step.
Helm chart: By the look of it, the project has not updated this for 2 years, so your mileage may vary.
I have had success with a Session Cluster, but I would eventually like to try the "proper" way, which is to deploy it as a Kubernetes job using the 2nd method, by the looks of it.
Depending on your Flink source and the kind of failure, your Flink job will fail differently, but you shouldn't worry about a "conflict". Either Kubernetes is going to restart the container, or Flink is going to handle the errors it can handle. After a certain number of retries Flink will cancel the job, depending on how you configured this; see Configuration for more details. If the container exits with a non-zero code, Kubernetes will try to restart it. However, it may or may not resubmit the job, depending on whether you deployed the job in a Job Cluster or whether you had an initialization script for the image you used. In a Session Cluster this can be problematic, depending on whether the job submission is done through the task manager or the job manager. If the job was submitted through the task manager, then we need to cancel the existing failed job so the resubmitted job can start.
Note: if you did go with the Session Cluster and have a filesystem-based state backend (a non-RocksDB state backend) for checkpoints, you will need to figure out a way for the job manager and task manager to share a checkpoint directory.
If the task manager uses a checkpoint directory that is inaccessible to the job manager, the task manager's checkpoint data will build up and eventually cause some kind of out-of-disk-space error. This may not be a problem if you decide to go with RocksDB and enable incremental checkpoints.
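If it helps, this is roughly how those two pieces (restart attempts and a shared checkpoint directory) look in a PyFlink job; it is only a sketch assuming a recent PyFlink, the path is a placeholder, and the same values can equally be set in flink-conf.yaml:

from pyflink.datastream import StreamExecutionEnvironment, FsStateBackend
from pyflink.common.restart_strategy import RestartStrategies

env = StreamExecutionEnvironment.get_execution_environment()

# Retry a failed job 3 times with 10 seconds between attempts, then cancel it;
# this is the "after a certain number of retries" behaviour mentioned above.
env.set_restart_strategy(RestartStrategies.fixed_delay_restart(3, 10000))

# Checkpoint every 60 seconds into a directory that both the job manager and
# the task managers can reach (an NFS mount, a bucket, etc.). Placeholder path.
env.enable_checkpointing(60000)
env.set_state_backend(FsStateBackend("file:///shared/checkpoints"))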

Apache Airflow Job Fails and Thinks Successful Dataflow Jobs are Zombies

Airflow job fails after detecting successful Dataflow job as a zombie.
I run an hourly Dataflow job that's triggered by an external Airflow instance using the python DataflowTemplateOperator. A couple of times a week, Dataflow becomes completely unresponsive to status pings. When I've caught the error in real-time and tried looking at the status of the Dataflow job in the GCP UI, the page won't load despite my having a network connection and being able to look at other pages on the GCP site. After a few minutes, everything returns to normal working order. This seems to happen towards the end of a job's run or when workers are shutting down. The Dataflow jobs don't fail, and don't report any errors. Airflow thinks they've failed because, when Dataflow becomes unresponsive, Airflow assumes the jobs are zombies. I needed a fast solution and just increased my number of retries, but I would like to understand the problem and find a better solution.
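For context, the stop-gap looks roughly like this (a sketch only; the DAG name, template path, job parameters, and intervals are placeholders, and the operator's import path depends on the Airflow version):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataflowTemplateOperator

default_args = {
    # The quick fix: retry when Airflow wrongly decides the Dataflow job is a zombie.
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="hourly_dataflow_job",        # placeholder name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    default_args=default_args,
    catchup=False,
) as dag:
    run_template = DataflowTemplateOperator(
        task_id="run_dataflow_template",
        template="gs://my-bucket/templates/my-template",   # placeholder template
        parameters={"input": "gs://my-bucket/input/"},      # placeholder job parameters
        dataflow_default_options={"project": "my-project",
                                  "region": "us-central1"},
    )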

Jenkins workers on AWS ECS Fargate: run a few jobs in parallel

I have an AWS ECS cluster in Fargate mode used only for Jenkins workers (slaves).
The cluster consists of one service called jenkins, which has the Desired tasks value set to 5.
But when I start a few jobs that have the same label, they're queued up instead of executing in parallel.
How can parallel execution be set up?
The problem was related to Fargate's speed: the next job only gets a worker after 2-3 minutes, so if the jobs are very fast they aren't run in parallel.

Sidekiq jobs in cluster

I run my app on two servers. Each has a job, which checks
every 2.minutes do
  runner "MailmanCheckJob.perform_later"
end
Now this job runs on each server. It checks for new emails and processes them. If the email processing takes 4-5 minutes, one message gets picked up by two jobs.
How do I make sure that each email is picked up only once? Each email is marked as read once all the processing is over.
It connects to a remote Redis instance. The processes are monitored by monit.
Is there an option to run Sidekiq in a cluster so that only one server picks up and runs the job?
-A
You are looking for periodic jobs within Sidekiq
https://github.com/mperham/sidekiq/wiki/Related-Projects#recurring-jobs

Is it possible to make Jenkins create workers from attached clouds faster?

I have an instance of Jenkins that uses the mesos plugin. Nearly all of my jobs get triggered via Mesos tasks. I would like to make worker generation a bit more aggressive.
The current issue is that, for the mesos plugin, I have all of the jobs marking the Mesos tasks as one-time-use slaves, and when a build is in progress on one of these slaves, Jenkins forces any queued jobs to wait for a potential executor on these slaves instead of spinning up new instances.
Based on the logs, it also seems like Jenkins has a timer that periodically checks to see if any slaves should be spun up based on the # of queued jobs / excess workload. Is it possible to decrease the polling interval for that process?
From the Mesos Jenkins Plugin README: over-provisioning flags
By default, Jenkins spawns slaves conservatively. Say, if there are 2 builds in the queue, it won't spawn 2 executors immediately. It will spawn one executor and wait for some time for the first executor to be freed before deciding to spawn the second executor. Jenkins makes sure every executor it spawns is utilized to the maximum. If you want to override this behavior and spawn an executor for each build in the queue immediately, without waiting, you can use these flags during Jenkins startup:
-Dhudson.slaves.NodeProvisioner.MARGIN=50 -Dhudson.slaves.NodeProvisioner.MARGIN0=0.85
