Sensor won't be re-scheduled on worker failure - docker

I'm in the process of learning the ins and outs of Airflow to end all our cron woes. When trying to mimic failure of (CeleryExecutor) workers, I've got stuck with Sensors. I'm using ExternalTaskSensors to wire up top-level DAGs together, as described here.
My current understanding is that since a Sensor is just a type of Operator, it must inherit basic traits from BaseOperator. If I kill a worker (the docker container), all ordinary (non-Sensor) tasks running on it get rescheduled on other workers.
However, upon killing a worker, an ExternalTaskSensor does not get rescheduled on a different worker; rather, it gets stuck.
Then either of the following things happens:
- I just keep waiting for several minutes, and then sometimes the ExternalTaskSensor is marked as failed but the workflow resumes (this has happened a few times but I don't have a screenshot)
- I stop all docker containers (including those running the scheduler / celery etc.) and then restart them all, after which the stuck ExternalTaskSensor gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers to get the stuck ExternalTaskSensor to resume
Sensor still stuck after single docker container stop-start cycle
Sensor resumes after several docker container stop-start cycles
My questions are:
- Does docker have a role in this weird behaviour?
- Is there a difference between Sensors (particularly ExternalTaskSensor) and other operators in terms of scheduling / retry behaviour?
- How can I ensure that a Sensor is also rescheduled when the worker it is running on gets killed?
I'm using puckel/docker-airflow with
- Airflow 1.9.0-4
- Python 3.6-slim
- CeleryExecutor with redis:3.2.7
This is the link to my code.
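For reference, this is roughly the kind of wiring I mean (a minimal sketch with Airflow 1.9-style imports; the DAG and task names here are illustrative, not my actual code):

    # Minimal sketch of wiring two top-level DAGs together with an
    # ExternalTaskSensor (Airflow 1.9-style imports); names are illustrative.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator
    from airflow.operators.sensors import ExternalTaskSensor

    with DAG(dag_id="downstream_dag",
             start_date=datetime(2018, 1, 1),
             schedule_interval="@daily") as dag:

        # Wait for a task in another top-level DAG to finish for the same
        # execution date before continuing.
        wait_for_upstream = ExternalTaskSensor(
            task_id="wait_for_upstream",
            external_dag_id="upstream_dag",
            external_task_id="final_task",
            timeout=60 * 60,                  # give up after an hour
            retries=3,                        # retried like any other task
            retry_delay=timedelta(minutes=5),
        )

        do_work = DummyOperator(task_id="do_work")
        wait_for_upstream >> do_work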

Related

Does it make sense to run multiple similar processes in a container?

A brief background to give context to the question.
Currently my team and I are in the midst of migrating our microservices to k8s to lessen the effort of having to maintain multiple deployment tools & pipelines.
One of the microservices that we are planning to migrate is an ETL worker that listens to messages on SQS and performs multi-stage processing.
It is built using PHP Laravel, and we use supervisord to control how many processes run on each worker instance on AWS EC2. Each process basically executes a Laravel command to poll different queues for new messages. We also periodically adjust the number of processes to maximize utilization of each instance's compute power.
So the questions are:
Is this method of deployment still feasible when moving to k8s? Is there still a need to "maximize" compute usage? Are we better off just running one process in each container, the "container way" (not sure what the tool is called; runit?)
I read from multiple sources (e.g. https://devops.stackexchange.com/questions/447/why-it-is-recommended-to-run-only-one-process-in-a-container) that it is ideal for a container to run only one process. There's also the question of recovering crashed processes and how running supervisord might interfere with how the container performs recovery. But I am not very sure whether that applies to our use case.
You should absolutely restructure this to run one process per container and one container per pod. You do not typically need an init system or a process manager like supervisord or runit (there is an argument to have a dedicated init like tini that can do the special pid-1 things).
You mention two concerns here: restarting failed processes and process placement in the cluster. Kubernetes handles both of these automatically for you.
If the main process in a Pod fails, Kubernetes will restart it. You don't need to do anything for this. If it fails repeatedly, it will start delaying the restarts. This functionality only works if the main process fails – if your container's main process is a supervisor process, you will never get a pod restart and you may not directly notice if a process can't start up at all.
Typically you'll run containers via Deployments that have some number of identical replica Pods. Kubernetes itself takes responsibility for deciding which node will run each pod; you don't need to manually specify this. The smaller the pods are, the easier it is to place them. Since you're controlling the number of replicas of a pod, you also want to separate concerns like Web servers vs. queue workers so you can scale these independently.
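As a rough illustration (not your actual setup), a Deployment for the queue worker might look like the sketch below, written with the official Python Kubernetes client; the same object shape applies to a YAML manifest. The image name, command, and replica count are placeholders.

    # Rough sketch using the Python Kubernetes client; image, command, and
    # replica count are placeholders, not your actual values.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() in-cluster
    apps = client.AppsV1Api()

    labels = {"app": "etl-worker"}

    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name="etl-worker"),
        spec=client.V1DeploymentSpec(
            replicas=4,  # scale queue workers independently of the web pods
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="worker",
                            image="registry.example.com/etl-worker:latest",
                            # one foreground queue-polling process per container
                            command=["php", "artisan", "queue:work", "--queue=etl"],
                        )
                    ]
                ),
            ),
        ),
    )

    apps.create_namespaced_deployment(namespace="default", body=deployment)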
Kubernetes has some ability to auto-scale, though the typical direction is to size the cluster based on the workload: in a cloud-oriented setup, if you add a new pod that requests more CPUs than your cluster currently has available, it will provision a new node. The HorizontalPodAutoscaler is something of an advanced setup, but you can configure it so that the number of workers is a function of your queue length. Again, this works better if the only thing it's scaling is the worker pods, and not a collection of unrelated things packaged together.

Killing tasks spawned by a job

I am considering replacing Celery with Dask. Currently we have a cluster where different jobs are submitted, each one generating multiple tasks that run in parallel. Celery has a killer feature, the "revoke" command: I can kill all the tasks of a given job without interfering with the other jobs that are running at the same time. How can I do that with Dask? I only find references saying that this is not possible, but for us it would be a disaster. I don't want to be forced to shut down the entire cluster when a calculation goes rogue, thus killing the jobs of the other users.
You can cancel tasks using the Client.cancel command.
If they haven't started yet then they won't start; however, if they're already running in a thread somewhere, there isn't much that Python can do to stop them other than tearing down the process.
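A minimal sketch of what that looks like; the scheduler address and the processing function are placeholders:

    # Cancel one job's tasks without touching another job's tasks.
    from dask.distributed import Client

    def process_record(x):
        return x * 2  # stand-in for the real computation

    client = Client("tcp://scheduler:8786")  # placeholder scheduler address

    # Keep the futures for each "job" grouped so they can be cancelled together.
    job_a = client.map(process_record, range(0, 1000))      # one user's job
    job_b = client.map(process_record, range(1000, 2000))   # another user's job

    # Cancel only job A: tasks that haven't started won't start; tasks already
    # running in a worker thread will still finish unless the worker is torn down.
    client.cancel(job_a)

    # Job B is untouched.
    results_b = client.gather(job_b)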

How to delay Docker Swarm updating a stateful container until it's ready?

Problem domain
Imagine that a stateful container is being managed by Swarm, e.g. a database, and another container relies on it, e.g. a service that is executing a long-running job (minutes, sometimes hours) and cannot tolerate the database (or even itself) going down while it's executing.
To give an example: a database importing a multi-GB dump.
There's also a CI/CD system in place which takes care of building new versions of the containers and deploying them to the Swarm, or pushing the image to Docker Hub, which then calls a defined webhook that fires off the deployment event.
Question
Is there any way I can build my containers so that Swarm knows whether it's OK to update them or not? Similarly to how HEALTHCHECK reports whether a container needs to be restarted, something that would let Swarm know that 'it's safe to restart this container now'.
Or is it the CI/CD system's responsibility to check whether the stateful containers are safe to restart, and only then issue the update command to swarm?
Thanks in advance!
Docker will not check with a container whether it is ready to be stopped; once you give Docker the command to stop a container, it will perform that action. However, it performs the stop in two steps. The first step is a SIGTERM that your container can trap and handle gracefully. By default, after 10 seconds, a SIGKILL is sent, which the Linux kernel applies immediately and which cannot be trapped by the container. For your goals, you'll want to make sure your app knows when it's safe to exit after receiving the first signal, and you'll probably want to extend the time between the two signals to much longer than 10 seconds.
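For example, the worker process can trap SIGTERM and exit only once it reaches a safe point; a sketch in Python (the "work" here is a placeholder, and the window between the signals would be raised separately, e.g. with the service's stop_grace_period or docker stop -t):

    # Sketch of graceful SIGTERM handling so the container can reach a safe
    # point before the follow-up SIGKILL arrives.
    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # Don't exit immediately; just remember that a stop was requested.
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while True:
        if shutting_down:
            print("SIGTERM received, exiting at a safe point")
            sys.exit(0)
        # ... do one small, interruptible unit of work ...
        time.sleep(1)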
The healthcheck won't tell docker that your container is at a safe point to stop. It does tell swarm when your container has finished starting, or when it's misbehaving and needs to be stopped and replaced. The healthcheck defines a command to run inside your container, and the exit code is checked for whether it's 0 (healthy) or 1 (unhealthy). No other exit codes are currently valid.
If you need more than the simple signal handling inside the container, then yes, you're likely moving up the stack to a CI/CD tool to manage the deployment.

How to schedule jobs with monitoring of CPU, memory, disk IO, etc.

My problem is that I have a dedicated server, but its resources are still limited: IO, memory, CPU, etc.
I need to run a lot of jobs every day. Some jobs are IO-intensive, some are computation-intensive. Is there a way to monitor the current status and decide whether or not to start a new job from my job pool?
For example, when it knows the currently running jobs are IO-intensive, it can launch a job that does not rely on much IO. Or it can choose a running job that uses a lot of disk IO, stop it, and reschedule it later.
I came up with a solution based on Docker, since it can monitor processes, but I do not know of such a scheduler built on top of Docker.
Thanks
You can check the docker stats command to get basic metrics on what is running in the containers managed by a Docker daemon.
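If you want those numbers programmatically rather than on a terminal, the Docker SDK for Python exposes the same data; a rough sketch (which fields are populated can vary by platform and cgroup setup):

    # Rough sketch: read the same data as `docker stats` through the Docker SDK
    # for Python, one snapshot per running container.
    import docker

    client = docker.from_env()

    for container in client.containers.list():
        stats = container.stats(stream=False)  # single snapshot, not a stream
        mem = stats.get("memory_stats", {})
        print(container.name, "memory:", mem.get("usage"), "/", mem.get("limit"))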
You cannot exactly assign a job to a node depending on its dynamic behavior: that would mean knowing in advance what type of resources a job will use, which is not something Docker describes at all.
Docker provides a way to tag nodes, which enables Swarm filters; this allows a cluster manager like Swarm to select the right node based on criteria represented by a tag.
But Docker doesn't know about the "job" about to be launched.
Depending on the Docker version you're on, you have a number of options for prod. You can use the native Docker Swarm (just went GA in v1.9), you can give the more mature Kubernetes a try or HashiCorp's Nomad (early days) and there's of course Apache Mesos+Marathon. See also this comparison for more info on the topic.

Separate clock process from sidekiq workers on Docker

I am currently working on moving my environment off Heroku, and part of my application runs a clock process that sets off a Sidekiq background job.
As I understand it, Sidekiq is composed of a client, which sends jobs off to be queued in Redis, and a server, which pulls requests off the queue and processes them. I am now trying to split out my application into the following containers on Docker:
- Redis container
- Clock container (Using Clockwork gem)
- Worker container
- Web application container (Rails)
However, I am not sure how one is supposed to split up the Sidekiq server and client. Essentially, the clock container needs to be running Sidekiq so that the client can send jobs off to the Redis queue every so often. Meanwhile, the worker containers should also run Sidekiq (the server, in this case) so that they can process the jobs. I assume that splitting up the responsibilities between different containers should be quite possible, since Heroku allows you to split this across various dynos.
I can imagine one way to do this would be to assign the clock container to pull from a non-existent queue so that it never picks up any jobs, and then set the worker to pull from a queue that does exist. However, this doesn't seem like the optimal approach to me, since the clock container will still be checking for new jobs in this non-existent queue.
Any tips or guides on how I can start going about this?
The Sidekiq client just publishes jobs into Redis. A Sidekiq daemon process just subscribes to Redis and starts worker threads as jobs are published.
So you can just install the redis gem in both containers (the clock container and the worker container), start the worker daemon only in the worker container, and provide a proper Redis config to both. You also have to make sure that the worker source code is available in both containers, as the Sidekiq client just stores the name of the worker class and the daemon then instantiates it through metaprogramming.
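To make that mechanism concrete, here is a rough Python analogy using redis-py (not Sidekiq's actual implementation): the client only pushes the worker class name plus arguments into Redis, and the daemon pops that description and instantiates the class by name.

    # Python analogy of the mechanism described above, NOT Sidekiq itself:
    # only the class name and arguments travel through Redis.
    import json
    import redis

    r = redis.Redis(host="redis", port=6379)

    # --- clock/client side: just publish a job description ---
    r.lpush("queue:default", json.dumps({"class": "HardWorker", "args": [42]}))

    # --- worker/daemon side: pop the job and instantiate the class by name ---
    class HardWorker:
        def perform(self, n):
            print("working on", n)

    WORKERS = {"HardWorker": HardWorker}

    _, raw = r.brpop("queue:default")
    job = json.loads(raw)
    WORKERS[job["class"]]().perform(*job["args"])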
But actually, you can also just include a Sidekiq daemon process together with every application which needs to process a worker job. Yes, there is the Docker best practice of one process per container, but IMHO this is not an all-or-nothing rule. In this case I see both processes as one unit; it's just a way of running some code in the background. You would then configure instances of the same application to work against the same Sidekiq queues. Or you could even configure every physical node to run a separate queue.

Resources