DCOS Upgrade docker version on agent nodes - docker

We are running DC/OS 11.1 on Azure cloud and have Docker engine version 17.09 on our agent nodes. We would like to upgrade Docker engine to 17.12.1 on each agent node.
Has anyone had experience with such procedure and would it cause any instability / side effects with the rest of the DC/OS components?

I have not done this upgrade myself in the exact environment you are running, but I would not be terribly concerned. It goes without saying that you should test this in a non-production environment before doing it in production.
I would suggest draining the agent node before upgrading Docker. By draining I mean stopping all the containers (tasks) running on the node. This ensures that the Mesos agent stops the tasks and informs the frameworks that the tasks are no longer running, so the frameworks can take appropriate action.
To drain a private agent, run:
sudo sh -c 'systemctl kill -s SIGUSR1 dcos-mesos-slave && systemctl stop dcos-mesos-slave'
To drain a public agent, run:
sudo sh -c 'systemctl kill -s SIGUSR1 dcos-mesos-slave-public && systemctl stop dcos-mesos-slave-public'
You should see the agent disappear from the Nodes section of the UI, and all tasks that were running on it marked as TASK_LOST. Ideally they would be marked TASK_KILLED, but that is a topic for another time.
Now perform your Docker upgrade.
After you have upgraded Docker, start the agent service again. On a private agent:
sudo systemctl start dcos-mesos-slave
On a public agent:
sudo systemctl start dcos-mesos-slave-public
The nodes should now start showing up in the UI and start accepting tasks.
To be safe:
Verify this in a non-production environment before you do it in production, to iron out any operational issues you might encounter.
Upgrade one agent, or a small subset of agents, at a time so that you are never left with a cluster without any nodes during the upgrade.
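The per-agent sequence described above could be scripted roughly as follows. This is only a sketch: the `DRY_RUN` switch, the `run` helper and the `upgrade_docker` placeholder are assumptions added for illustration, and the actual Docker upgrade command depends on your distribution and package source. With `DRY_RUN` left at its default of 1 the script only prints what it would do.

```shell
# Sketch: drain one DC/OS agent, upgrade Docker, then start the agent again.
# DRY_RUN, run and upgrade_docker are hypothetical helpers, not part of DC/OS.

# Map the agent type to the systemd unit names used in the answer.
agent_unit() {
  case "$1" in
    private) echo "dcos-mesos-slave" ;;
    public)  echo "dcos-mesos-slave-public" ;;
    *) echo "unknown agent type: $1" >&2; return 1 ;;
  esac
}

# Print the command when DRY_RUN=1 (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Placeholder: replace the body with your own package manager command.
# It fails on purpose so that nothing proceeds until it has been filled in.
upgrade_docker() {
  echo "upgrade_docker is a placeholder; fill it in for your distro" >&2
  return 1
}

# Signal the agent, stop it, upgrade Docker, start the agent again.
# Each step only runs if the previous one succeeded.
upgrade_agent() {
  unit="$(agent_unit "$1")" || return 1
  run sudo systemctl kill -s SIGUSR1 "$unit" &&
  run sudo systemctl stop "$unit" &&
  run upgrade_docker &&
  run sudo systemctl start "$unit"
}
```

Calling `upgrade_agent private` on one node at a time, and checking the UI before moving on to the next node, keeps to the "one agent at a time" advice.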

Related

Puppet agent disabled in Puppet master

In my environment, there is a RHEL Puppet master which is successfully managing over 500 nodes.
When I run Puppet on the master server (puppet agent -t), I get the error below. It seems the Puppet agent is disabled on the master. Is there any impact if I enable the Puppet agent on the master?
[root@puppet-master]# puppet agent -t
Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
Use 'puppet agent --enable' to re-enable.
Puppet should be enforcing its own configuration, and the default behavior for PE is to run the agent on the master every 30 minutes.
You can test what would happen if you enabled Puppet using the following steps:
Run systemctl stop puppet. This only stops the agent service; it does not stop the Puppet server.
Run puppet agent --enable to re-enable Puppet runs.
Run puppet agent -t --noop. In noop mode Puppet does not apply any changes; it only reports what it would change.
At this point, if it is not going to make any changes, it is safe to run systemctl enable puppet and start the service so that the master enforces its own configuration again. If it is going to make changes you don't want, run puppet agent --disable so that the agent doesn't accidentally start running after a reboot, then investigate further.
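If you want to confirm the agent's state before and after these steps, one option is to look for the lock file that puppet agent --disable creates. The sketch below assumes the path is the one reported by `puppet config print agent_disabled_lockfile`; the `agent_state` function name is made up for illustration.

```shell
# Sketch: report whether the Puppet agent is administratively disabled.
# Assumption: the argument is the lock file path printed by
#   puppet config print agent_disabled_lockfile
# "puppet agent --disable" creates that file, "puppet agent --enable" removes it.
agent_state() {
  if [ -e "$1" ]; then
    echo "disabled"
  else
    echo "enabled"
  fi
}

# Usage on the master (not executed here):
#   agent_state "$(puppet config print agent_disabled_lockfile)"
```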

When using sudo in a Jenkins pipeline sh command, it always asks for a password even though I've set sudoers to NOPASSWD

I'm running Jenkins 2.319.2 installed from Debian Bullseye repository and set up some nodes to run my tasks.
In my Jenkins pipeline task, which runs on a node rather than on the master, I set up several stages, including checkout, build and deploy, and finally I have to restart a system service using systemctl. The last step needs root privileges, so I added the user that runs the job on the node to the sudoers config to let it run systemctl without a password (NOPASSWD). However, my task always asks for a password when running the final step, and hence fails.
If I log in as that user over ssh, I can run sudo systemctl without entering a password. In another freestyle task, I ran sudo systemctl restart myservice the same way in the "execute shell" step without any problem. But in the pipeline stage it always asks for a password. No idea why.

Reset RabbitMQ-node for integration testing

I am using RabbitMQ in a project and am running my integration tests against it. Since the tests need to be independent from each other, I would like to reset the RabbitMQ instance before each test and currently solve this by restarting the (automatically created) RabbitMQ docker container. However, this is extremely slow (for integration tests).
I know from this answer that it's possible to reset the rabbitmq-instance with rabbitmqctl stop && rabbitmqctl reset && rabbitmqctl start - but in case of the docker-image, the stop-signal kills the main container process (which is rabbitmq-server), which in turn leads to dockerd killing the complete container.
The only solution I found so far is running the management-api-plugin, iterating over all queues, exchanges, policies, etc. and deleting them through that - which in turn takes a while as well and requires the management-plugin to run.
Is it possible to reset a running rabbitmq-node programmatically via AMQP, some other API-endpoint or running a command, without stopping it first?
The answer you're referring to is correct in that you should be using stop_app, not stop like in your message.
There's an important difference between the two:
stop:
...stops RabbitMQ and its runtime (Erlang VM)
stop_app:
...stops the RabbitMQ application, leaving the runtime (Erlang VM) running
Because the process containing the Erlang VM is PID 1 in the rabbitmq container, stopping it causes the container to stop. Luckily, the RabbitMQ authors added the stop_app command specifically to improve the user experience related to testing.
The code from the answer you're referring to should work just fine. Here's the same code as a one-liner:
docker exec my_queue sh -c "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app"
The output will look like this:
$ docker exec my_queue sh -c "rabbitmqctl stop_app; rabbitmqctl reset; rabbitmqctl start_app"
Stopping rabbit application on node rabbit@40420e95dcee
Resetting node rabbit@40420e95dcee
Starting node rabbit@40420e95dcee
$
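For use in a test setup hook, the one-liner can be wrapped in a small helper. This is a sketch, not part of RabbitMQ or Docker: the `reset_rabbit` name, the `RABBIT_CONTAINER` variable and the `DRY_RUN` switch are assumptions for illustration. It chains the three commands with `&&` instead of `;` so that a failed step is not followed by the next one.

```shell
# Sketch: reset a RabbitMQ node inside a running container between tests.
# reset_rabbit, RABBIT_CONTAINER and DRY_RUN are hypothetical names.
reset_rabbit() {
  # Container name: first argument, else $RABBIT_CONTAINER, else my_queue.
  container="${1:-${RABBIT_CONTAINER:-my_queue}}"
  cmd='rabbitmqctl stop_app && rabbitmqctl reset && rabbitmqctl start_app'
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show the command that would be executed.
    echo "docker exec $container sh -c \"$cmd\""
  else
    docker exec "$container" sh -c "$cmd"
  fi
}

# Usage before each test (not executed here):
#   reset_rabbit my_queue
```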

SLURM+Docker: How to kill docker-created processes using SLURMs scancel

We have set up a GPU computing cluster with SLURM as a resource manager. As this is a cluster for deep learning, we manage dependencies by using nvidia-docker images to facilitate different frameworks and CUDA versions.
Our typical use case is to allocate resources with srun and give a command to run nvidia-docker which runs the experiment scripts as per the following:
srun --gres=gpu:[num gpus required] nvidia-docker run --rm -u $(id -u):$(id -g) /bin/bash -c [python scripts etc..] &
We have discovered an issue where, if a SLURM job is cancelled using the scancel command, the docker process on the node is cancelled, but whatever experiment scripts were started in the container continue to run. As far as we understand, this is not a fault in SLURM; rather, killing a docker process does not kill the processes it spawned, and they are only killed with the docker kill command. While there might be some way to execute the docker kill command in a SLURM prologue script, we were wondering if anyone else has had this problem and whether they have solved it somehow. To summarize, we would like to know:
How can we ensure that processes started in a nvidia-docker container, which in turn was started by a SLURM SRUN, are killed with SCANCEL?
Configuring Slurm to use cgroups might help here. With cgroups enabled, any process belonging to a job is attached to a cgroup and destroyed when the job ends. Destruction is taken care of by the kernel, so there is no way a regular process can escape it.
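A quick way to see whether a node is already set up this way is to look for the cgroup plugins in slurm.conf. The sketch below assumes the usual `ProctrackType=proctrack/cgroup` and `TaskPlugin=task/cgroup` settings; the `uses_cgroups` function name and the config path in the usage line are illustrative. Whether processes inside a Docker container actually end up in the job's cgroup depends on how the container runtime starts them, so this is worth verifying on your own cluster.

```shell
# Sketch: check that a slurm.conf enables cgroup-based process tracking.
# uses_cgroups is a hypothetical helper; pass it the path to slurm.conf.
uses_cgroups() {
  # Both the process tracking plugin and the task plugin should use cgroups.
  grep -Eiq '^[[:space:]]*ProctrackType[[:space:]]*=[[:space:]]*proctrack/cgroup' "$1" &&
  grep -Eiq '^[[:space:]]*TaskPlugin[[:space:]]*=.*task/cgroup' "$1"
}

# Usage (path is an assumption, not executed here):
#   uses_cgroups /etc/slurm/slurm.conf && echo "cgroup tracking enabled"
```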

First execution of Docker on a new EC2 Jenkins Slave does not work

I'm using the EC2 Plugin in Jenkins to spin up slave instances when we need them. Recently I've wanted to play around with Docker so I installed it on the AMI we use as a slave - but the first run on the slave never seems to work.
+ docker ps
time="2015-04-17T15:38:20Z" level="fatal" msg="Get http:///var/run/docker.sock/v1.16/containers/json: dial unix /var/run/docker.sock: no such file or directory. Are you trying to connect to a TLS-enabled daemon without TLS?"
Any runs after this seem to work - why won't the slave work on the first job? I've tried using sudo, executing docker ps before docker build but nothing seems to fix the problem.
The problem is that Jenkins only waits for the slave to respond to an SSH connection; it does not wait for Docker to be running.
To prevent the slave from becoming "online" too quickly, put a check in the "Init Script" section in the EC2 Slave Plugin configuration section. Here's an example of the one I use against the base AMI.
sleep_counter=0
while [[ -z $(/sbin/service docker status | grep " is running...") && $sleep_counter -lt 300 ]]; do
  sleep 1
  ((sleep_counter++))
  echo "Waiting for docker $sleep_counter seconds - $(/sbin/service docker status)"
done
Amazingly, it can take up to 60 seconds between the slave coming up and the Docker service starting, so I've set the timeout to be 5 minutes.
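The same wait-with-timeout idea can be written as a small reusable helper that does not depend on the wording of the service status output. This is a sketch under assumptions: `wait_until` is a made-up name, and checking for the socket at /var/run/docker.sock (the path from the error message above) is one possible readiness test, not necessarily the best one for every AMI.

```shell
# Sketch: retry a command once per second until it succeeds or a limit is hit.
# wait_until is a hypothetical helper: wait_until <seconds> <command...>
wait_until() {
  limit="$1"
  shift
  waited=0
  until "$@"; do
    # Give up once the limit has been reached.
    if [ "$waited" -ge "$limit" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage in an init script (not executed here): wait up to 5 minutes
# for the Docker socket to appear.
#   wait_until 300 test -S /var/run/docker.sock
```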

Resources