How does Mesos/Marathon handle application data persistence? - Jenkins

I have been exploring the Mesos/Marathon framework for deploying applications. My doubt is about how Marathon handles application files when an application is killed.
For example, we run Jenkins through Marathon. If the Jenkins server fails, it will be restarted by Marathon, but the previously defined jobs will be lost.
So my question is: how can I ensure that when an application restarts, its old jobs are still available?
Thanks.

As of right now Mesos/Marathon is great at supporting stateless applications, and support for stateful applications is increasing.
By default, task data is written into the sandbox and is therefore lost when a task fails or is restarted. Note that usually only a small percentage of tasks fail (e.g. only the tasks running on a failed node).
Now let us have a look at different failure scenarios.
Recovering from slave process failures:
When only the Mesos slave process fails (or is upgraded), the framework can use slave checkpointing to reconnect to the running executors.
Executor failures (e.g. Jenkins process failures):
In this case the framework could persist its own metadata on some persistent medium and use it to restart. Note that this is highly application specific, so Mesos/Marathon cannot offer a generic way to do it (and I am actually not sure what that would look like in the case of Jenkins). Persistent data could be written to HDFS or Cassandra, or you could have a look at the concept of dynamic reservations.
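For illustration, newer Marathon versions (1.0 and later) support local persistent volumes, which keep a task's data on the agent across restarts; these build on the dynamic reservations mentioned above. A minimal sketch of an app definition posted to Marathon's REST API, where the Marathon URL, resource sizes, and Jenkins command are placeholder assumptions:

# Sketch: Jenkins with a local persistent volume (assumes Marathon 1.0+).
# The URL, resource sizes and command are placeholders for your cluster.
curl -X POST http://marathon.example.com:8080/v2/apps \
  -H 'Content-Type: application/json' \
  -d '{
    "id": "/jenkins",
    "cmd": "java -jar jenkins.war --httpPort=$PORT0",
    "cpus": 1, "mem": 2048, "instances": 1,
    "container": {
      "type": "MESOS",
      "volumes": [{
        "containerPath": "jenkins_home",
        "mode": "RW",
        "persistent": { "size": 1024 }
      }]
    },
    "residency": { "taskLostBehavior": "WAIT_FOREVER" }
  }'

Marathon then pins the task to the agent holding the volume, so data under jenkins_home (pointing JENKINS_HOME there is omitted for brevity) can survive a restart.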

Related

How can I deploy to Docker Swarm using docker stack with services starting in order

I have an issue with Docker Swarm.
I have tried to deploy my app with Docker Swarm mode.
But I cannot make my services start in order, although I used depends_on (which is documented as not supported by docker stack deploy).
How can I deploy so that the services start in order?
E.g.:
Service 1 starts.
Service 2 waits for Service 1.
Please help.
This is not supported by Swarm.
Swarm is designed for high availability. When encountering problems (services or hosts fail), services will be restarted in the order they failed.
If you have clear dependencies between your services and they can't handle waiting for the other service to be available or reconnecting, your system won't work.
Your services should be written in a way that they can handle any service being redeployed at any time.
There is no orchestration system that recommends or supports this feature.
So forget about it; it is a very bad idea.
Application infrastructure (here the container) should not depend on the database health; your application itself must depend on the database health.
Do you see the difference?
For instance, the application could display an error message like "Not ready yet" or "This feature is disabled because elasticsearch is down", etc.
So even if it is possible to implement this pattern (aka "wait-for"; with Kubernetes, you can use an initContainer to wait for another service to be up and ready; see the sketch below), I strongly recommend moving this logic into your application.
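For illustration, the infrastructure-level workaround looks like this entrypoint sketch; the dependency name db and port 5432 are assumptions for your stack:

# Sketch: block until a dependency accepts TCP connections, then start.
# "db" and "5432" are placeholders for your dependency.
until nc -z db 5432; do
  echo "waiting for db..."
  sleep 2
done
exec "$@"   # hand over to the real service process

The same retry loop is more robust inside the application itself, since the dependency can also go away again after startup.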

Need help designing the application architecture and deployment strategy using Docker

Let me first explain the application that I will develop.
We have the following set of workflows to be developed in Camunda:
Global Subprocess Workflow like fetchImageAttributes,
fetchFileAttributes, etc...
FileTransfer Workflow.
FileConverter Workflow.
The FileTransfer workflow uses the global subprocesses via call activity tasks in Camunda, and the FileConverter workflow uses them the same way.
The global subprocesses are long running: whenever one starts, it sends a message to a specific RabbitMQ queue and waits for the response on another specific queue, resuming the subprocess via a receive task.
The FileTransfer and FileConverter workflows can be invoked independently. We have created RabbitMQ queue listeners in Spring that listen to specific queues for the respective workflows; whenever a message is dropped into one of those queues, the corresponding workflow is invoked.
During development, all three workflows are deployed and tested in a single Tomcat instance, so they run without any concerns.
The plan now is to host them in the cloud using Docker, with the three workflows in three containers:
Container 1 will contain Global Subprocess Workflow.
Container 2 will contain FileTransfer Workflow.
Container 3 will contain FileConverter Workflow.
All three Camunda workflows will use the same database to store their workflow activities and variables.
Challenges faced:
Since the FileTransfer and FileConverter workflows both use the global subprocesses via call activities, the calls will fail because the subprocesses are not available in the same runtime engine. Should we use Camunda REST services?
To overcome the above challenge I thought of
Deployment plan 2:
Container 1 will contain Global Subprocess Workflow & FileTransfer Workflow.
Container 2 will contain Global Subprocess Workflow & FileConverter Workflow.
Challenges faced:
Since the Global Subprocess Workflow is present in both containers, there may be scenarios where the response meant for the FileTransfer workflow gets pulled by the FileConverter workflow, because the global subprocesses in both containers listen on the same RabbitMQ queue. This can lead to errors where the process instance is not found.
So if anyone can help me with a better architecture, or if anyone with good experience in Camunda and its deployment in heterogeneous clusters can guide me, that would be great.
Thanks.
You should reconsider how the communication between the processes is implemented. Since you separate the deployments, a subprocess/call activity is not an option. A better approach is to use BPMN messages and create a choreography between the processes. Since you are already using RabbitMQ, you can develop a BPMN-message-to-Rabbit adapter and pass messages around.
There are two other approaches for connecting systems:
using external service tasks
using a new approach called Zeebe
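For illustration, such an adapter boils down to correlating a BPMN message whenever a Rabbit response arrives; this is also where the Camunda REST services from the question come in. A sketch against Camunda's REST message endpoint, where the message name, business key, and engine URL are assumptions:

# Sketch: correlate a BPMN message to the waiting receive task.
# Message name, business key and URL are placeholders.
curl -X POST http://camunda.example.com:8080/engine-rest/message \
  -H 'Content-Type: application/json' \
  -d '{
    "messageName": "fileTransferCompleted",
    "businessKey": "transfer-42",
    "processVariables": {
      "status": { "value": "OK", "type": "String" }
    }
  }'

Correlating by business key also defuses the shared-queue challenge of deployment plan 2: the engine delivers the message to the one process instance with that key, no matter which container consumed it from Rabbit.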

Launching jobs with large Docker images in Mesos via Aurora can be slow

When launching a task over Mesos via Aurora that uses a rather large Docker image (~2 GB), there is a long wait before the task actually starts.
Even when the task has been launched before, so we would expect the Docker image to already be available on the worker node, there is still a wait proportional to the image size before the task launches. With plain Docker you can launch a container almost instantly as long as the image is already in your image list; does the Mesos containerizer not support this kind of caching as well? Is this something that can be configured?
I haven't tried the Docker containerizer, but it is my understanding that it will be phased out soon anyway and that GPU resource isolation, which we require, only works with the Mesos containerizer.
I am assuming you are talking about the unified containerizer running Docker images? Which backend are you using? By default Mesos agents use the copy backend, which is why you are seeing slow launches. You can check which backend an agent is using by hitting the /flags endpoint on the agent. Switch the backend to aufs or overlayfs to see if it speeds up the launch. You can specify the backend through the --image_provisioner_backend=VALUE flag on the agent.
NOTE: There are a few bug fixes related to the aufs and overlayfs backends in the latest Mesos release (1.2.0-rc1) that you might want to pick up. There is also an auto-backend feature in 1.2.0-rc1 that will automatically select the fastest backend available.
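For illustration, inspecting and switching the backend from a shell; the agent hostname is an assumption (5051 is the default agent port), and the accepted backend names depend on your Mesos version:

# Check which image provisioner backend the agent currently uses.
curl -s http://agent.example.com:5051/flags | grep image_provisioner_backend

# Restart the agent with a faster backend (the kernel must support overlayfs).
mesos-agent --image_provisioner_backend=overlay \
  --work_dir=/var/lib/mesos   # plus whatever flags you already pass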

Task on Marathon never ends

I am new to the Mesos/Marathon framework. I formed a cluster with three Mesos (0.27.0) masters and two Mesos slaves. Marathon (0.15.1) is installed on the masters. From the Marathon UI I scheduled one task that echoes hello into a file: echo "hello" > /tmp/sample.txt.
I observed that hello is written to the file, but the writing keeps happening over and over; ideally it should stop once the file has been written. I have the same trouble when I try to launch containers: containers keep getting created until I have no memory left. Can anyone suggest what I should do to stop the echoing and to stop Marathon from creating new containers?
This is the expected behaviour for Marathon, which is meant for long-running tasks, i.e. things like a web server, app server, etc.
When Marathon sees the app terminate, it will launch it again (potentially on a different node).
For one-shot tasks, you can use Chronos, Cook, or write your own framework.
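For illustration, a run-once job submitted to Chronos's REST API; the Chronos URL and job fields are placeholder assumptions, and the ISO 8601 schedule R1//PT1M means exactly one repetition:

# Sketch: a one-shot job on Chronos.
# URL and field values are placeholders.
curl -X POST http://chronos.example.com:4400/scheduler/iso8601 \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "write-sample",
    "command": "echo hello > /tmp/sample.txt",
    "schedule": "R1//PT1M",
    "owner": "ops@example.com"
  }'

Unlike Marathon, Chronos will not relaunch the task once it exits successfully.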

How to schedule jobs while monitoring CPU, memory, disk IO, etc.

My problem is that I have a dedicated server, but its resources are still limited: IO, memory, CPU, etc.
I need to run a lot of jobs every day. Some jobs are IO intensive, some are computation intensive. Is there a way to monitor the current status and decide whether or not to start a new job from my job pool?
For example, when it knows the currently running jobs are IO intensive, it could launch a job that does not rely much on IO. Or it could pick a running job that uses a lot of disk IO, stop it, and reschedule it later.
I came up with a Docker-based solution, since Docker can monitor processes, but I do not know of such a scheduler built on top of Docker.
Thanks
You can check the docker stats command to get basic metrics on what is running in the containers managed by a Docker daemon.
You cannot exactly assign a job to a node depending on its dynamic behavior; that would mean knowing in advance what type of resources a job will use, which is not described in Docker at all.
Docker provides a way to tag nodes, which enables Swarm filters, letting a cluster manager like Swarm select the right node based on criteria represented by a tag.
But Docker doesn't know about the "job" about to be launched.
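For illustration, a crude version of the monitor-and-cap idea with plain Docker; the format placeholders need a recent Docker version, and the image name and limits are assumptions:

# Poll per-container CPU/memory/block-IO metrics in a parseable form.
docker stats --no-stream --format "{{.Name}}: {{.CPUPerc}} {{.MemUsage}} {{.BlockIO}}"

# Launch a batch job with explicit caps so it cannot starve other jobs.
docker run -d --name batch-job \
  --cpu-shares 512 \
  --memory 2g \
  my-batch-image:latest   # placeholder image

A wrapper script could parse the stats output and only start the next job from the pool when current usage is below a threshold.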
Depending on the Docker version you're on, you have a number of options for prod. You can use the native Docker Swarm (just went GA in v1.9), you can give the more mature Kubernetes a try or HashiCorp's Nomad (early days) and there's of course Apache Mesos+Marathon. See also this comparison for more info on the topic.
