I need a curl command that lists the applications restarted in the last 30 minutes in Mesos Marathon, along with the time of each restart.
For example, I run curl in the terminal like this:
curl http://marathon:5050/............
Then the output should be like:
APP TIME_OF_RESTART
app1 2018-07-01 23:45PM IST
If I can get the curl command then I can write a script to automate it to provide the details required.
It looks like you may want to use the Marathon Event Bus, which streams all Marathon events.
The event type you would be interested in is unhealthy_task_kill_event, if you are looking for tasks that failed health checks enough times to require a "restart".
From the Marathon REST API documentation:
If a task fails more than maxConsecutiveFailures health checks consecutively, that task is killed causing Marathon to start more instances. These restarts are modulated like any other failing app by backoffSeconds, backoffFactor and maxLaunchDelaySeconds. The kill of the unhealthy task is signalled via unhealthy_task_kill_event event.
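As a rough sketch of how the event stream could be turned into the APP / TIME_OF_RESTART listing from the question: the function below reads Marathon's server-sent event stream on stdin and prints the app ID and timestamp of each unhealthy_task_kill_event. The function name format_kill_events, the marathon:8080 host and port, and the sed-based field extraction are assumptions for illustration; a JSON tool such as jq would be more robust. The stream only delivers events while a client is connected, so covering "the last 30 minutes" means leaving the collector running and filtering its output by timestamp.

```shell
#!/bin/bash
# Hypothetical helper: read a Marathon event stream (SSE format) on stdin and
# print "appId timestamp" for every unhealthy_task_kill_event.
format_kill_events() {
  while IFS= read -r line; do
    # Keep only data lines that carry the event type we care about.
    case "$line" in
      data:*'"unhealthy_task_kill_event"'*) ;;
      *) continue ;;
    esac
    # Naive field extraction; assumes each event is a single-line JSON object.
    app=$(printf '%s\n' "$line" | sed -n -E 's/.*"appId" *: *"([^"]*)".*/\1/p')
    ts=$(printf '%s\n' "$line" | sed -n -E 's/.*"timestamp" *: *"([^"]*)".*/\1/p')
    printf '%s %s\n' "$app" "$ts"
  done
}

# Usage against a live Marathon (host and port are assumptions):
# curl -sN -H 'Accept: text/event-stream' http://marathon:8080/v2/events | format_kill_events
```

The timestamps come out in whatever format Marathon emits, so converting them to local time would be a separate step in the script.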
I have looked for a bit on Stack Overflow for a way to have a container start up and wait for an external connection but have not seen anything.
Here is what my process looks like currently:
Non-Docker external process reaches out at X interval and tells system to run a command.
Command runs.
System should remain idle until the next interval.
Now I have seen a few options with --wait or sleep but I would think that would not allow the container to receive the connection.
I also looked at the wait-for-container script that is often recommended, but in this case I need the container to wait for a script that calls it at undefined intervals.
I have tried having this just run the help command for my process but it then fails the container after a bit of time and makes it a mess for finding anything.
Additionally I have tried to have the container start with no command just to run the base OS and wait for the call but that did not work either.
I was looking at this wrong.
I ended up just running it like any other web server or database server.
We're running spring-cloud microservices using eureka on AWS ECS. We're also doing continuous deployment, and we've run into an issue where rolling production deployments cause a short window of service unavailability. I'm focusing here on @LoadBalanced RestTemplate clients using ribbon. I think I've gotten retry working adequately in my local testing environment, but I'm concerned about new service instance eureka registration lag time and the way ECS rolling deployments work.
When we merge a new commit to master, if the build passes (compiles and tests pass) our jenkins pipeline builds and pushes a new docker image to ECR, then creates a new ECS task definition revision pointing to the updated docker image, and updates the ECS service. As an example, we have an ECS service definition with desired task count set to 2, minimum healthy percent set to 100%, and maximum percent set to 200%. The ECS service scheduler starts 2 new docker containers using the new image, leaving the existing 2 docker containers running on the old image. We use container health checks that pass once the actuator health endpoint returns 200, and as soon as that happens, the ECS service scheduler stops the 2 old containers running on the old docker image.
My understanding here could be incorrect, so please correct me if I'm wrong about any of this. Eureka clients fetch the registry every 30 seconds, so there's up to 30 seconds where all the client has in the server list is the old service instances, so retry won't help there.
I asked AWS support about how to delay ECS task termination during rolling deploys. When ECS services are associated with an ALB target group, there's a deregistration delay setting that ECS respects, but no such option exists when a load balancer is not involved. The AWS response was to run the java application via an entrypoint bash script like this:
#!/bin/bash
cleanup() {
    date
    echo "Received SIGTERM, sleeping for 45 seconds"
    sleep 45
    date
    echo "Killing child process"
    kill -- -$$
}
trap 'cleanup' SIGTERM
"${@}" &
wait $!
When ECS terminates the old instances, it sends SIGTERM to the docker container; this script traps it, sleeps for 45 seconds, then continues with the shutdown. I'll also have to change an ecs config parameter in /etc/ecs that controls the grace period before ECS sends a SIGKILL after the SIGTERM. It defaults to 30 seconds, which is not quite long enough.
This feels dirty to me. I'm not sure that script isn't going to cause some other unforeseen issue; does it forward all signals appropriately? It feels like an unwanted complication.
Am I missing something? Can anyone spot anything wrong with AWS support's suggested entrypoint script approach? Is there a better way to handle this and achieve the desired result, which is zero downtime rolling deployments on services registered in eureka on ECS?
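On the signal-forwarding question: the script above traps only SIGTERM, and `kill -- -$$` signals a whole process group rather than the child itself. A minimal sketch of a variant that forwards SIGTERM to just the child after the delay, and then returns the child's exit status, could look like the following. The name run_with_delayed_term and its delay argument are hypothetical, and other signals are still not forwarded, so treat it as a starting point rather than a verified fix.

```shell
#!/bin/bash
# Hypothetical helper: run a command in the background; on SIGTERM, wait
# DELAY seconds, then forward SIGTERM to that command only.
run_with_delayed_term() {
  delay=$1; shift
  got_term=0
  "$@" &
  child=$!
  trap 'got_term=1; sleep "$delay"; kill -TERM "$child" 2>/dev/null || true' TERM
  status=0
  # wait returns early (status > 128) when a trapped signal arrives;
  # the trap body then runs before the next command.
  wait "$child" || status=$?
  if [ "$got_term" -eq 1 ]; then
    # Wait again to collect the child's real exit status after forwarding.
    status=0
    wait "$child" || status=$?
  fi
  trap - TERM
  return "$status"
}

# Possible entrypoint usage (45 seconds mirrors the delay discussed above):
# run_with_delayed_term 45 "$@"
```

The grace period before SIGKILL still has to be longer than the delay, as with the original script.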
I'm using jenkins to do a few actions in a remote server.
I have an Execute Shell command in which I do the following:
sudo ssh <remote server> 'sudo service supervisor restart'
sleep 30
When jenkins reaches the first line I can see 'Restarting Supervisor' but after a moment I see that jenkins closed the ssh connection and moved on to the second line.
I tried adding a 'sleep 30' after the restart command but it still doesn't work.
It seems jenkins doesn't wait for the supervisor restart command to complete.
The problem doesn't always happen, only sometimes, but it causes a lot of trouble when it does.
I think you can never be certain all processes started by supervisord are in a 'ready' state after a restart. Even if the restart action waited for the processes to be started, it wouldn't know whether they are 'ready'.
In docker-compose setups that need to know if a certain service is available I've used an extra 'really ready' check for this - optionally in a loop with a sleep/wait. If the process that you are starting opens a port you can use one of the variations of 'wait-for' for this.
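A 'really ready' check of this kind can be sketched as a small retry loop. The helper name wait_until and the one-second polling interval are assumptions; the check command itself (for example a port probe) depends on the service being restarted.

```shell
#!/bin/bash
# Hypothetical helper: retry a check command once per second until it
# succeeds, giving up after TIMEOUT seconds of sleeping.
wait_until() {
  timeout=$1; shift
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1   # check never succeeded within the timeout
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Possible usage after the restart (port 80 is an assumed example; nc -z is
# available in most netcat variants):
# wait_until 30 nc -z <remote server> 80
```

In the Jenkins step this would replace the fixed sleep 30, so the job continues as soon as the check passes and fails visibly if it never does.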
I'm in the process of learning ins-and-outs of Airflow to end all our Cron woes. When trying to mimic failure of (CeleryExecutor) workers, I've got stuck with Sensors. I'm using ExternalTaskSensors to wire-up top-level DAGs together as described here.
My current understanding is that since Sensor is just a type of Operator, it must inherit basic traits from BaseOperator. If I kill a worker (the docker container), all ordinary (non-Sensor) tasks running on it get rescheduled on other workers.
However, upon killing a worker, ExternalTaskSensor does not get rescheduled on a different worker; rather, it gets stuck.
Then either of the following happens:
I just keep waiting for several minutes and then sometimes the ExternalTaskSensor is marked as failed but workflow resumes (it has happened a few times but I don't have a screenshot)
I stop all docker containers (including those running scheduler / celery etc) and then restart them all, then the stuck ExternalTaskSensor gets rescheduled and workflow resumes. Sometimes it takes several stop-start cycles of docker containers to get the stuck ExternalTaskSensor resuming again
Sensor still stuck after single docker container stop-start cycle
Sensor resumes after several docker container stop-start cycles
My questions are:
Does docker have a role in this weird behaviour?
Is there a difference between Sensors (particularly ExternalTaskSensor) and other operators in terms of scheduling / retry behaviour?
How can I ensure that a Sensor is also rescheduled when the worker it is running on gets killed?
I'm using puckel/docker-airflow with
Airflow 1.9.0-4
Python 3.6-slim
CeleryExecutor with redis:3.2.7
This is the link to my code.
I am new to the mesos and marathon frameworks. I formed a cluster with three mesos (0.27.0) masters and two mesos slaves. Marathon (0.15.1) is installed on the masters. I scheduled one task from the marathon UI that echoes hello into a file: echo "hello" > /tmp/sample.txt.
I observed that hello is written to the file, but the process of writing it keeps running. Ideally it should stop once it has written. I have the same trouble when I try to launch containers: containers keep getting created until I have no memory left. Can anyone suggest what to do to stop the echoing and to stop marathon from creating new containers?
This is the expected behaviour for Marathon, which is meant to be used for long-running tasks, that is, things like a web server, app server, etc.
When Marathon sees that the app has terminated, it will launch it again (potentially on a different node).
For one-shots, you can use Chronos, Cook or write your own framework.