Docker 1.12.1: after swarm init, workers unable to join swarm - docker

I am seeing the same problem as described here and here. I have tried everything that worked in those two cases to no avail - I still see the same behavior. Can someone offer alternatives I might try?
My setup:
I am running 3 Centos 7.2 boxes. Network Time Protocol (ntpd) running on all machines. All have been yum updated. Here is some detailed info:
Linux version 3.10.0-327.28.2.el7.x86_64 (builder#kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) )
Docker version:
# docker version
Client:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built:
OS/Arch: linux/amd64
Server:
Version: 1.12.1
API version: 1.24
Go version: go1.6.3
Git commit: 23cf638
Built:
OS/Arch: linux/amd64
Setup the swarm manager:
>docker swarm init --advertise-addr 10.1.1.40:2377 --force-new-cluster
// on some retry attempts (after 'docker swarm leave --force') I ran:
>docker swarm init --advertise-addr 10.1.1.40:2377 --force-new-cluster
Manager status:
>docker node inspect self
[
{
"ID": "3x5q1n9v956g3ptdle2eve856",
"Version": {
"Index": 10
},
"CreatedAt": "2016-08-27T13:01:13.400345797Z",
"UpdatedAt": "2016-08-27T13:01:13.580143388Z",
"Spec": {
"Role": "manager",
"Availability": "active"
},
"Description": {
"Hostname": "mymanagerhost.mycompany.com",
"Platform": {
"Architecture": "x86_64",
"OS": "linux"
},
"Resources": {
"NanoCPUs": 4000000000,
"MemoryBytes": 16659128320
},
"Engine": {
"EngineVersion": "1.12.1",
"Plugins": [
{
"Type": "Network",
"Name": "bridge"
},
{
"Type": "Network",
"Name": "host"
},
{
"Type": "Network",
"Name": "null"
},
{
"Type": "Network",
"Name": "overlay"
},
{
"Type": "Volume",
"Name": "local"
}
]
}
},
"Status": {
"State": "ready"
},
"ManagerStatus": {
"Leader": true,
"Reachability": "reachable",
"Addr": "10.1.1.40:2377"
}
}
]
On the worker node (I have two, but they both behave the same).
Join Swarm:
>docker swarm join --token SWMTKN-1-4fjh7kncdpwjvxnxisamhldgenmmnqyvhnx9qdi8d4hkkfuacv-168gs9okd5ck0r4lokdgpef92 10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
Output of Docker info command:
>docker info
Plugins:
Volume: local
Network: null host bridge overlay
Swarm: pending
NodeID:
Error: rpc error: code = 1 desc = context canceled
Is Manager: false
Node Address: 10.1.1.50
Runtimes: runc
Default Runtime: runc
Security Options: seccomp
Kernel Version: 3.10.0-327.28.2.el7.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.52 GiB
Name: myWorkerNode.mycompany.com
ID: DAWE:VDRA:ZUVS:P7PH:ADCP:MFNU:2LOS:C6TG:XSIS:Y7EX:I46S:KFXT
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Insecure Registries:
127.0.0.0/8
Edit per first answer below
So I tried leaving with stop/start surrounding commands. I did:
# docker swarm leave --force
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop docker.service
#
# service docker start
Redirecting to /bin/systemctl start docker.service
# docker swarm init --advertise-addr 10.1.1.40:2377
Swarm initialized: current node (0e0y2k2hngnwyeg86ilzbrjmu) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc-7v48qak22wd03y3jyv903a9if \
10.1.1.40:2377
Then on the worker I did:
# docker swarm leave
Node left the swarm.
# service docker stop
Redirecting to /bin/systemctl stop docker.service
# service docker start
Redirecting to /bin/systemctl start docker.service
# docker swarm join \
> --token SWMTKN-1-2ggj60tnbppgjlg63a58oe5pqtv0vfrpj81hheawanf76x7cjc- 7v48qak22wd03y3jyv903a9if \
> 10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. Attempt to join the cluster will continue in the background. Use "docker info" command to see the current swarm status of your node.
Which is obviously the same behavior...
UPDATE
I have tried all the steps outlined by #Miad Abrin. I still get the same behavior. I am guessing the cause is related to the CERTS errors I see when I do:
# journalctl -xe
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.554904435-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555400400-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555478782-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555528929-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Aug 29 12:26:15 dockerd[6577]: time="2016-08-29T12:26:15.555685464-04:00" level=warning msg="failed to retrieve remote root CA certificate: rpc
Does anyone know the cause of this and how to correct?

You need to restart your docker daemon service before leaving the swarm and after it. do it both for the swarm leader and the works. This is a bug in 1.12 version and it is fixed in 1.12.1 since I had the same problems.
My Results when trying this
In the two sections below I numbered the steps with (num) to show the order between the worker and the manager:
On the worker:
(1)# docker swarm leave --force
Error response from daemon: This node is not part of a swarm
(2)# service docker stop
Redirecting to /bin/systemctl stop docker.service
(6)# service docker start
Redirecting to /bin/systemctl start docker.service
#
(7)# docker swarm join \
> --token SWMTKN-1-4gsdy8jshxmd58mvpcm0tlmbbnrrqdrf51ggcwvdv0bvkltxmy-am9o4dsl4ovx6b4lbsabn0fc7 \
> 10.1.1.40:2377
Error response from daemon: Timeout was reached before node was joined. The attempt to join the swarm will continue in the background. Use the "docker info" command to see the current swarm status of your node.
(8)# nmap -p2377 10.1.1.40
Starting Nmap 6.40 ( http://nmap.org ) at 2016-08-29 10:32 EDT
Nmap scan report for (10.1.0.123)
Host is up (0.00085s latency).
PORT STATE SERVICE
2377/tcp filtered unknown
MAC Address: 00:50:56:B9:76:32
On the manager node:
(3)# docker swarm leave --force
Error response from daemon: This node is not part of a swarm
(4)# service docker stop
Redirecting to /bin/systemctl stop docker.service
(5)# service docker start
Redirecting to /bin/systemctl start docker.service
(7)# docker swarm init --advertise-addr 10.1.1.40 --force-new-cluster
Swarm initialized: current node (7z52d3bcoiou61ltgike42dnn) is now a manager.
To add a worker to this swarm, run the following command:
docker swarm join \
--token SWMTKN-1-4gsdy8jshxmd58mvpcm0tlmbbnrrqdrf51ggcwvdv0bvkltxmy-am9o4dsl4ovx6b4lbsabn0fc7 \
10.1.1.40:2377

Related

mounting permission denied in docker

I was facing issues installing docker on cloud server according to the official guide(Install Docker Engine on Ubuntu). I finished old version's uninstallation, the repository setting up and docker engine installation (sudo apt-get install docker-ce docker-ce-cli containerd.io). However, I got an error when running hello-world.
wyf#VM1103-Timi:~$ sudo docker run hello-world
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/var/lib/docker/overlay2/e9fedf64e8983aa01e513cee591cdfd7fc60962466a476b51fc1ead682ec8022/merged\\\" at \\\"/proc\\\" caused \\\"permission denied\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
I tried restart docker and server, but the problem still exists.
So, it would be great if someone can guide me in fixing this error.
Please let me know if you have any idea about this issue.
Thank you very much!
Ps:
My system is Ubuntu 18.04. Thus, I did not have selinux. Instead of selinux, I checked AppArmor log.
May 19 21:14:55 VM1103-Timi networkd-dispatcher[155]: WARNING:Unknown index 37 seen, reloading interface list
May 19 21:14:55 VM1103-Timi systemd-networkd[126]: veth71cf495: Link UP
May 19 21:14:55 VM1103-Timi containerd[170]: time="2020-05-19T21:14:55.679793295+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/4c207ce1273d2c863ee419c5ebb271163a031394bd4c17ee75d44267d631954d/shim.sock" debug=false pid=106265
May 19 21:14:55 VM1103-Timi containerd[170]: time="2020-05-19T21:14:55.767796543+08:00" level=info msg="shim reaped" id=4c207ce1273d2c863ee419c5ebb271163a031394bd4c17ee75d44267d631954d
May 19 21:14:55 VM1103-Timi dockerd[15100]: time="2020-05-19T21:14:55.776863367+08:00" level=error msg="stream copy error: reading from a closed fifo"
May 19 21:14:55 VM1103-Timi dockerd[15100]: time="2020-05-19T21:14:55.776953910+08:00" level=error msg="stream copy error: reading from a closed fifo"
May 19 21:14:55 VM1103-Timi systemd-networkd[126]: veth71cf495: Link DOWN
May 19 21:14:55 VM1103-Timi dockerd[15100]: time="2020-05-19T21:14:55.927805156+08:00" level=error msg="4c207ce1273d2c863ee419c5ebb271163a031394bd4c17ee75d44267d631954d cleanup: failed to delete container from containerd: no such container"
The strange thing is that there is no record of permission-denied error.
Here are my ubuntu version, kernal version and docker info:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.4 LTS
Release: 18.04
Codename: bionic
5.3.18-3-pve
Client:
Debug Mode: false
Server:
Containers: 8
Running: 0
Paused: 0
Stopped: 8
Images: 1
Server Version: 19.03.8
Storage Driver: overlay2
Backing Filesystem: <unknown>
Supports d_type: true
Native Overlay Diff: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 5.3.18-3-pve
Operating System: Ubuntu 18.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 4GiB
Name: VM1103-Timi
ID: 3G3F:LTVZ:NO25:C7LA:XKQV:ETMB:B6QU:3ZFJ:KBA5:R3KK:QZEA:ZONC
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
It seemed that the AppArmor Profile "docker-default" was lost. "docker-default" was not correctly generated. Check as follows:
root#VM1103-Timi:/etc/apparmor.d# aa-status
apparmor module is loaded.
12 profiles are loaded.
12 profiles are in enforce mode.
/sbin/dhclient
/usr/bin/man
/usr/lib/NetworkManager/nm-dhcp-client.action
/usr/lib/NetworkManager/nm-dhcp-helper
/usr/lib/connman/scripts/dhclient-script
/usr/lib/lightdm/lightdm-guest-session
/usr/lib/lightdm/lightdm-guest-session//chromium
/usr/sbin/mysqld
/usr/sbin/tcpdump
docker-default
man_filter
man_groff
0 profiles are in complain mode.
1 processes have profiles defined.
1 processes are in enforce mode.
/usr/sbin/mysqld (258)
0 processes are in complain mode.
0 processes are unconfined but have a profile defined.
Solution is probably to open ports needed. Your system might be running selinux and (ufw or firewalld or iptables) ?and/or others?. Read up a bit on linux firewall tools, in particular the ones running on your system.
For the selinux case, you need to check selinux logs, is it blocking access? Add exceptions using selinux commands.
https://wiki.centos.org/HowTos/SELinux These tools are well worth learning but can be complicated. A quick test disabling selinux and firewalld can confirm that this is the source of problem and you can enable selinux and firewalld later and allow/open ports in a secure way.
Simple test: disable selinux and firewalld, e.g. on CentOS
systemctl stop firewalld;
setenforcing 0;
If you can create containers with selinux disabled then you have confirmed selinux is your problem. You can enable firewall and selinux and then add exceptions and open ports as needed later.
This looks good (specific to ubuntu but general enough IMHO), It details ufw commands, firewalld commands and iptables commands needed for opening ports to allow docker swarm to work) https://www.digitalocean.com/community/tutorials/how-to-configure-the-linux-firewall-for-docker-swarm-on-ubuntu-16-04
I originally got useful info on ufw commands to open ports needed from here:
Error response from daemon: attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded
ufw allow 2376/tcp
ufw allow 2377/tcp
ufw allow 7946/tcp
ufw allow 7946/udp
ufw allow 4789/udp
ufw enable #maybe
ufw reload
systemctl restart docker
This is a common enough problem where something usually selinux is not allowing access to ports needed.
e.g.
https://github.com/google/cadvisor/issues/333

Kubernetes kube-dns Pause container in crashloop with Error adding network: failed to Statfs \"/proc/54226/ns/net\":

I have a Kubernetes onebox deployment with the following (containerized) components, all running as --net=host, with kubelet running as a privileged Docker container with the kubernetes flag --allow-privileged set to true.
gcr.io/google_containers/hyperkube-amd64:v1.7.9 "/bin/bash -c './hype" kubelet
gcr.io/google_containers/hyperkube-amd64:v1.7.9 "/bin/bash -c './hype" kube-proxy
gcr.io/google_containers/hyperkube-amd64:v1.7.9 "/bin/bash -c './hype" kube-scheduler
gcr.io/google_containers/hyperkube-amd64:v1.7.9 "/bin/bash -c './hype" kube-controller-manager
gcr.io/google_containers/hyperkube-amd64:v1.7.9 "/bin/bash -c './hype" kube-apiserver
quay.io/coreos/etcd:v3.1.0 "/usr/local/bin/etcd " etcd
On top of this, I enabled the addon manager with kubectl create -f https://github.com/kubernetes/kubernetes/blob/master/test/kubemark/resources/manifests/kube-addon-manager.yaml,
with the default yaml manifests for calico 2.6.1 and kube-dns 1.14.5 mounted to /etc/kubernetes/addons/. The calico pod comes up with two nodes (install-cni and calico-node) as expected.
However, kube-dns gets stuck in ContainerCreating or ContainerCannotRun, with the following error while trying to start the Kubernetes pause container:
{"log":"I1111 00:35:19.549318 1 manager.go:913] Added container: \"/kubepods/burstable/pod3173eef3-c678-11e7-ac4b-e41d2d59689e/1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54\" (aliases: [k8s_POD_kube-dns-v20-141138543-pmdww_kube-system_3173eef3-c678-11e7-ac4b-e41d2d59689e_0 1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54], namespace: \"docker\")\n","stream":"stderr","time":"2017-11-11T00:35:19.5526284Z"}
{"log":"I1111 00:35:19.549433 1 cni.go:291] About to add CNI network cni-loopback (type=loopback)\n","stream":"stderr","time":"2017-11-11T00:35:19.5526748Z"}
{"log":"I1111 00:35:19.549504 1 handler.go:325] Added event \u0026{/kubepods/burstable/pod3173eef3-c678-11e7-ac4b-e41d2d59689e/1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54 2017-11-11 00:35:19.3931718 +0000 UTC containerCreation {\u003cnil\u003e}}\n","stream":"stderr","time":"2017-11-11T00:35:19.5527217Z"}
{"log":"I1111 00:35:19.551134 1 container.go:407] Start housekeeping for container \"/kubepods/burstable/pod3173eef3-c678-11e7-ac4b-e41d2d59689e/1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54\"\n","stream":"stderr","time":"2017-11-11T00:35:19.5527441Z"}
{"log":"E1111 00:35:19.555099 1 cni.go:294] Error adding network: failed to Statfs \"/proc/54226/ns/net\": no such file or directory\n","stream":"stderr","time":"2017-11-11T00:35:19.5553606Z"}
{"log":"E1111 00:35:19.555122 1 cni.go:237] Error while adding to cni lo network: failed to Statfs \"/proc/54226/ns/net\": no such file or directory\n","stream":"stderr","time":"2017-11-11T00:35:19.5553887Z"}
{"log":"I1111 00:35:19.600281 1 manager.go:970] Destroyed container: \"/kubepods/burstable/pod3173eef3-c678-11e7-ac4b-e41d2d59689e/1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54\" (aliases: [k8s_POD_kube-dns-v20-141138543-pmdww_kube-system_3173eef3-c678-11e7-ac4b-e41d2d59689e_0 1dd57d6f6c996d7abe061f6236fc8a0150cf6f95d16d5c3c462c9ed7158d3c54], namespace: \"docker\")\n","stream":"stderr","time":"2017-11-11T00:35:19.6005722Z"}
I see \pause containers keep coming up just to exit a second later, with an innocuous error message (this one is old, I stopped the cluster so it wouldn't keep spawning more containers):
ubuntu#r172-16-6-39:~$ docker ps -a | grep 216e39defa36
216e39defa36 gcr.io/google_containers/pause-amd64:3.0 "/pause" About an hour ago Exited (0) About an hour ago k8s_POD_kube-dns-v20-141138543-xvdmv_kube-system_0594732f-c688-11e7-9da5-e41d2d59689e_17
ubuntu#r172-16-6-39:~$ docker logs 216e39defa36
shutting down, got signal: Terminated
The dir /proc/54226 doesn't exist on my host, which I assume is why CNI is complaining. But the pause containers for Calico are fine, running the same image, so something must be either failing to write only in the case of kube-dns, or not trying to write in the case of Calico. I found some references to a similar SELinux-related error on Openshift, but I'm running a bare Ubuntu 14.04 VM without SELinux even installed.
ubuntu#r172-16-6-39:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.4 LTS
Release: 14.04
Codename: trusty
ubuntu#r172-16-6-39:~$ setenforce
The program 'setenforce' is currently not installed. You can install it by typing:
sudo apt-get install selinux-utils
My CNI conf is also pretty simple, generated by the install-cni calico container:
ubuntu#r172-16-6-39:~$ cat /etc/cni/net.d/10-calico.conf
{
"name": "k8s-pod-network",
"cniVersion": "0.1.0",
"type": "calico",
"log_level": "debug",
"datastore_type": "kubernetes",
"nodename": "172.16.6.39",
"mtu": 1500,
"ipam": {
"type": "host-local",
"subnet": "usePodCidr"
},
"policy": {
"type": "k8s",
"k8s_auth_token": "****"
},
"kubernetes": {
"k8s_api_root": "https://168.16.0.1:443",
"kubeconfig": "/etc/kubernetes/kubeconfig"
}
}
Has anyone hit something similar?

access docker daemon remote api in contanier

I use official version of docker-ce at centos7, start a docker daemon in container:
[root#5cae7be526b4 /]# rpm -qa docker-ce
docker-ce-17.09.0.ce-1.el7.centos.x86_64
Here is my daemon config
{
"hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:5555"],
"live-restore": true,
"insecure-registries": ["172.17.0.6:9980"]
}
Without changing config, docker daemon can start and restart in container:
[root#5cae7be526b4 /]# docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 17.09.0-ce
Storage Driver: vfs
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 3.10.0-514.el7.x86_64
Operating System: CentOS Linux 7 (Core) (containerized)
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 31.26GiB
Name: 5cae7be526b4
ID: N3Y4:VTIJ:WCHK:AQL3:MU3F:DNHE:BIXO:7ISI:4D4V:Q4IG:VYIT:FOH3
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
But change the config, it failed:
[root#5cae7be526b4 /]# systemctl restart docker
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
[root#5cae7be526b4 /]# systemctl status docker -l
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2017-11-02 05:51:02 UTC; 2s ago
Docs: https://docs.docker.com
Process: 260 ExecStart=/usr/bin/dockerd (code=exited, status=1/FAILURE)
Main PID: 260 (code=exited, status=1/FAILURE)
Nov 02 05:51:02 5cae7be526b4 systemd[1]: Failed to start Docker Application Container Engine.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: Unit docker.service entered failed state.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: docker.service failed.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: docker.service holdoff time over, scheduling restart.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: start request repeated too quickly for docker.service
Nov 02 05:51:02 5cae7be526b4 systemd[1]: Failed to start Docker Application Container Engine.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: Unit docker.service entered failed state.
Nov 02 05:51:02 5cae7be526b4 systemd[1]: docker.service failed.
Of course, this daemon config can run at host.
I has start container with --privileged and -v /sys/fs/cgroup:/sys/fs/cgroup to enable use systemctl in container.
The root cause is "hosts": ["unix:///var/run/docker.sock", "tcp://0.0.0.0:5555"], i do not know how to fix it but i need to set host indeed.
I want to make this container as a repo and start other container to do docker action like pull from this docker daemon.
How can i enable it?
I share my solution here, please tell me if wrong or has a better way.
First, check docker version, only new official version can run in container. For centos, it means docker-ce, refer to https://docs.docker.com/engine/installation/linux/docker-ce/centos/
Second, check the host port doesn't be set as -p when start container. docker daemon can listen it and specify with -p would make conflicts.
Third, start docker daemon in container should assign a volume for storage. If not, the storage option only can be vfs.

When using mesos, marathon, and zookeeper my mesos-slave doesnt start when I specify the "containerizers" file with "docker,mesos"?

I have 3 CentOS VMs and I have installed Zookeeper, Marathon, and Mesos on the master node, while only putting Mesos on the other 2 VMs. The master node has no mesos-slave running on it. I am trying to run Docker containers so i specified "docker,mesos" in the containerizes file. One of the mesos-agents starts fine with this configuration and I have been able to deploy a container to that slave. However, the second mesos-agent simply fails when I have this configuration (it works if i take out that containerizes file but then it doesn't run containers). Here are some of the logs and information that has come up:
Here are some "messages" in the log directory:
Apr 26 16:09:12 centos-minion-3 systemd: Started Mesos Slave.
Apr 26 16:09:12 centos-minion-3 systemd: Starting Mesos Slave...
WARNING: Logging before InitGoogleLogging() is written to STDERR
[main.cpp:243] Build: 2017-04-12 16:39:09 by centos
[main.cpp:244] Version: 1.2.0
[main.cpp:247] Git tag: 1.2.0
[main.cpp:251] Git SHA: de306b5786de3c221bae1457c6f2ccaeb38eef9f
[logging.cpp:194] INFO level logging started!
[systemd.cpp:238] systemd version `219` detected
[main.cpp:342] Inializing systemd state
[systemd.cpp:326] Started systemd slice `mesos_executors.slice`
[containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
[linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
[provisioner.cpp:249] Using default backend 'copy'
[slave.cpp:211] Mesos agent started on (1)#172.22.150.87:5051
[slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
[slave.cpp:541] Agent resources: cpus(*):1; mem(*):919; disk(*):2043; ports(*):[31000-32000]
[slave.cpp:549] Agent attributes: [ ]
[slave.cpp:554] Agent hostname: node3
[status_update_manager.cpp:177] Pausing sending status updates
[state.cpp:62] Recovering state from '/var/lib/mesos/meta'
[state.cpp:706] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
[status_update_manager.cpp:203] Recovering status update manager
[docker.cpp:868] Recovering Docker containers
[containerizer.cpp:599] Recovering containerizer
[provisioner.cpp:410] Provisioner recovery complete
[group.cpp:340] Group process (zookeeper-group(1)#172.22.150.87:5051) connected to ZooKeeper
[group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
[group.cpp:418] Trying to create path '/mesos' in ZooKeeper
[detector.cpp:152] Detected a new leader: (id='15')
[group.cpp:699] Trying to get '/mesos/json.info_0000000015' in ZooKeeper
[zookeeper.cpp:259] A new leading master (UPID=master#172.22.150.88:5050) is detected
Failed to perform recovery: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Apr 26 16:09:13 centos-minion-3 systemd: Unit mesos-slave.service entered failed state.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service failed.
Logs from docker:
$ sudo systemctl status docker
● docker.service - Docker Application Container Engine Loaded:
loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/docker.service.d
└─flannel.conf Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT;
24h ago Docs: docs.docker.com Main PID: 872 (code=exited, status=0/SUCCESS)
Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine.
Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'
Logs from flannel:
[flanneld-start: network.go:102] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured
You have answer in your logs
Failed to perform recovery: Collect failed:
Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1;
stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Mesos keeps it state/metadata on local disk. When it's restarted it try to load this state. If configuration changed and is not compatible with previous state it won't start.
Just bring docker to live by fixing problems with flannel and etcd and everything will be fine.
add the following flag while starting agent,
--reconfiguration_policy=additive
more details here: http://mesos.apache.org/documentation/latest/agent-recovery/

Setting multiple DOCKER_OPTS arguments

If you want to pass an option to the Docker Engine at startup on Ubuntu, you can edit the /etc/defaults/docker file.
Here I'm setting the storage driver to AUFS:
DOCKER_OPTS="--storage-driver=aufs"
However, if I pass more than one argument, Docker doesn't start. For example:
DOCKER_OPTS="--insecure-registry=0.0.0.0:5000 --storage-driver=aufs"
Now Docker fails to start:
# service docker stop && service docker start
docker start/running, process 31569
# service docker status
docker stop/waiting
From /var/log/syslog:
Mar 11 14:55:30 myhost kernel: [ 2788.030270] init: docker main process (31253) terminated with status 1
Mar 11 14:55:30 myhost kernel: [ 2788.030279] init: docker main process ended, respawning
Mar 11 14:55:30 myhost kernel: [ 2788.085931] init: docker main process (31287) terminated with status 1
Mar 11 14:55:30 myhost kernel: [ 2788.085940] init: docker respawning too fast, stopped
Each argument works on its own, but if passed together the Docker service refuses to start. I am using Docker version 1.10.3, build 20f81dd on Ubuntu 14.04 3.13.0-74-generic.
How can I pass more than one argument to DOCKER_OPTS?
The arguments must be separated by ,
This format works:
DOCKER_OPTS="--insecure-registry=0.0.0.0:5000,--storage-driver=aufs"

Resources