Task on marathon never ends - docker

I am new to the Mesos/Marathon framework. I formed a cluster with three Mesos (0.27.0) masters and two Mesos slaves. Marathon (0.15.1) is installed on the masters. From the Marathon UI I scheduled one task that echoes "hello" into a file: echo "hello" > /tmp/sample.txt.
I observed that "hello" is written to the file, but the task keeps being run over and over; ideally it should stop once the write is done. I have the same trouble when I try to launch containers: new containers keep getting created until I have no memory left. Can anyone suggest what to do to stop the echoing and to stop Marathon from creating new containers?

This is the expected behaviour for Marathon, which is meant to be used for long-running tasks, that is, things like a web server, app server, etc.
When Marathon sees the app terminate, it will launch it again (potentially on a different node).
For one-shot jobs, you can use Chronos, Cook, or write your own framework.
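To make the echo example fit Marathon's model, the command itself has to be long-running. A minimal sketch of a Marathon app definition along those lines (the id and command here are made up; field names as in the Marathon app JSON):

```json
{
  "id": "/hello-loop",
  "cmd": "while true; do echo hello >> /tmp/sample.txt; sleep 60; done",
  "cpus": 0.1,
  "mem": 32,
  "instances": 1
}
```

The explicit while loop makes the "never ends" behaviour the intended one; a one-shot command belongs in a scheduler like Chronos instead.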

Related

Sensor won't be re-scheduled on worker failure

I'm in the process of learning the ins and outs of Airflow to end all our cron woes. When trying to mimic failure of (CeleryExecutor) workers, I got stuck with Sensors. I'm using ExternalTaskSensors to wire up top-level DAGs together as described here.
My current understanding is that since a Sensor is just a type of Operator, it must inherit basic traits from BaseOperator. If I kill a worker (the docker container), all ordinary (non-Sensor) tasks running on it get rescheduled on other workers.
However, upon killing a worker, the ExternalTaskSensor does not get rescheduled on a different worker; rather, it gets stuck.
Then either of the following things happens:
I just keep waiting for several minutes, and then sometimes the ExternalTaskSensor is marked as failed but the workflow resumes (it has happened a few times but I don't have a screenshot)
I stop all docker containers (including those running the scheduler / celery etc.) and then restart them all; then the stuck ExternalTaskSensor gets rescheduled and the workflow resumes. Sometimes it takes several stop-start cycles of the docker containers to get the stuck ExternalTaskSensor resuming again
Sensor still stuck after single docker container stop-start cycle
Sensor resumes after several docker container stop-start cycles
My questions are:
Does docker have a role in this weird behaviour?
Is there a difference between Sensors (particularly ExternalTaskSensor) and other operators in terms of scheduling / retry behaviour?
How can I ensure that a Sensor is also rescheduled when the worker it is running on gets killed?
I'm using puckel/docker-airflow with
Airflow 1.9.0-4
Python 3.6-slim
CeleryExecutor with redis:3.2.7
This is the link to my code.
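For reference, a hypothetical sketch of wiring two DAGs together with ExternalTaskSensor, assuming the Airflow 1.9 import paths (the module moved in later versions); all DAG and task ids here are made up. Bounding the sensor with timeout and retries at least limits how long a stuck sensor can occupy a worker slot:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.sensors import ExternalTaskSensor  # Airflow 1.9 path
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="child_dag",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

# Wait for task "final" in "parent_dag" to succeed for the same execution date.
wait_for_parent = ExternalTaskSensor(
    task_id="wait_for_parent",
    external_dag_id="parent_dag",
    external_task_id="final",
    timeout=60 * 60,                    # give up the poke loop after an hour
    retries=2,                          # allow re-attempts after a failure
    retry_delay=timedelta(minutes=5),
    dag=dag,
)

run = BashOperator(task_id="run", bash_command="echo running", dag=dag)
wait_for_parent >> run
```

This does not by itself explain the stuck-worker behaviour, but it gives the scheduler a chance to fail and re-queue the sensor instead of waiting indefinitely.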

Port allocation when running build job in Jenkins

My project is structured in such a way that the build job in Jenkins is triggered by a push to Git. As part of my application logic, I spin up Kafka and Elasticsearch instances to be used in my test cases downstream.
The issue I have right now is that when a developer pushes his changes to Git, it triggers a build in Jenkins, which in turn runs our code and spawns a Kafka broker on localhost:9092 and Elasticsearch on localhost:9200.
When another developer, working on some other change, pushes his code at the same time, it triggers the build job again, which tries to spin up another instance of Kafka/Elasticsearch but fails with the exception "Port already in use".
I am looking at options for how to handle this scenario.
Will running these instances inside a docker container help to some extent? How do I handle the port issue in that case?
Yes, dockerizing these instances can indeed help, as you can spawn them multiple times.
You could create a docker container per component, including your application, and then let them talk to each other by linking them or by using docker-compose.
That way you would not have to expose the ports to the "outside" world but keep them internal within the docker environment.
That way you would not get the "Port already in use" error. The only remaining problem is memory: e.g. if 100 pushes are made to the git repo, you might run out of memory...
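If you do need host-exposed ports (e.g. for tests that connect from outside the containers), another common workaround is to let the OS pick a free ephemeral port per build instead of hard-coding 9092/9200. A minimal sketch in Python (the helper name is ours):

```python
import socket

def find_free_port() -> int:
    # Binding to port 0 asks the OS for any free ephemeral port; the chosen
    # port can then be passed on, e.g. as `docker run -p <port>:9092 ...`.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note there is a small race between releasing the port and the container binding it, but for CI builds on one host it is usually good enough.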

Restart task in docker service after a certain time

I have a swarm with 3 nodes. On it, I want to launch one service for a database and then another, with some replicas, that runs a Python application. The program takes approximately 30 minutes to finish; after that, the container shuts down and a new one starts. Sometimes, however, a problem occurs and the container does not stop. Is there any option I can use when launching the service so that, after 1 hour, a container is automatically killed and a new one is created?
This is not a built-in feature of Docker; you would have to implement it yourself using the Docker Remote API: create the container, deploy it to the swarm, wait for one hour, and then remove it if it is still running.
You can find here a complete list of Docker client libraries to help you get started.
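A sketch of the watchdog logic: parse the StartedAt timestamp that docker inspect reports and kill anything older than the limit. The age check below is plain Python; the docker-py calls in restart_overdue are an assumption about your client library and are not exercised here.

```python
from datetime import datetime, timedelta, timezone

def is_overdue(started_at: str, max_age: timedelta) -> bool:
    # `started_at` in the RFC 3339 form docker inspect reports,
    # e.g. "2018-03-01T12:00:00.000000000Z"; fractional seconds ignored.
    started = datetime.strptime(started_at[:19], "%Y-%m-%dT%H:%M:%S")
    started = started.replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - started > max_age

def restart_overdue(max_age=timedelta(hours=1)):
    # Hypothetical loop using the docker-py SDK; with a restart policy in
    # place, the swarm re-creates the task once the stuck container is killed.
    import docker
    client = docker.from_env()
    for container in client.containers.list():
        if is_overdue(container.attrs["State"]["StartedAt"], max_age):
            container.kill()
```

Run on a cron (or in its own service) on a manager node, this approximates the missing "kill after 1 hour" option.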

Marathon - do not redeploy app when return code = 0?

We have a Spring Boot application deployed in a docker container and managed using Mesosphere (Marathon + Mesos). The Spring Boot app is intended to be deployed via Marathon and, once complete, it exits with code 0.
Currently, every time the Boot application terminates, Marathon redeploys the app, which I wish to disable. Is there a setting in the application's Marathon JSON config file that will prevent Marathon from redeploying an app unless it exits with a non-zero code?
If you just want to run one-time jobs, I think Chronos would be the right tool. Marathon is, as Michael wrote, for long-running tasks.
I think there's a fundamental misunderstanding of what Marathon does: it is meant for long-running tasks (or, put in other words, there's a while loop somewhere in there, maybe an implicit one). If your app exits, Marathon sees this, assumes it has failed, and restarts it.
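For comparison, a one-shot job in Chronos is declared with a schedule instead of being kept alive. A minimal sketch, as I understand the Chronos job JSON (the name and command are made up; "R1//PT1M" is the ISO 8601 repeating-interval form: one repetition, starting immediately, with a nominal interval):

```json
{
  "name": "boot-app-once",
  "command": "java -jar /opt/app/boot-app.jar",
  "schedule": "R1//PT1M",
  "cpus": 0.5,
  "mem": 512
}
```

Chronos records the exit code and does not restart a job that finished with code 0, which is exactly the behaviour being asked for here.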

How Mesos Marathon handle application data persistence?

I have been exploring the Mesos/Marathon framework to deploy applications. I have a doubt about how Marathon handles application files when an application is killed.
For example, we run Jenkins through Marathon; if the Jenkins server fails, it will be restarted by Marathon, but the old jobs defined on it will be lost.
Now my question is: how can I ensure that if an application restarts, those old application jobs are still available?
Thanks.
As of right now, Mesos/Marathon is great at supporting stateless applications, but support for stateful applications is increasing.
By default, task data is written into the sandbox and hence will be lost when a task fails or is restarted. Note that usually only a small percentage of tasks fail (e.g. only the tasks on the failed node).
Now let us have a look at different failure scenarios.
Recovering from slave process failures:
When only the Mesos slave process fails (or is upgraded), the framework can use slave checkpointing to reconnect to the running executors.
Executor failures (e.g. Jenkins process failures):
In this case the framework could persist its own metadata on some persistent medium and use it to restart. Note that this is highly application-specific, so Mesos/Marathon cannot offer a generic way to do it (and I am actually not sure what that would look like in the case of Jenkins). Persistent data could be written to HDFS or Cassandra, or you could have a look at the concept of dynamic reservations.
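For the dynamic-reservation route: newer Marathon versions (after the 0.15 used here) let an app request a local persistent volume, so a restarted task lands on the same node with its data intact. A sketch of the relevant app-definition fragment (field names as in the Marathon persistent-volumes feature; the id, path, and size are made up):

```json
{
  "id": "/jenkins",
  "container": {
    "volumes": [
      {
        "containerPath": "jenkins_home",
        "mode": "RW",
        "persistent": { "size": 512 }
      }
    ]
  },
  "residency": { "taskLostBehavior": "WAIT_FOREVER" }
}
```

With JENKINS_HOME pointed at that volume, job definitions would survive a restart of the Jenkins task on the same node.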
