When using Mesos, Marathon, and ZooKeeper, my mesos-slave doesn't start when I specify the "containerizers" file with "docker,mesos"

I have 3 CentOS VMs and I have installed Zookeeper, Marathon, and Mesos on the master node, while only putting Mesos on the other 2 VMs. The master node has no mesos-slave running on it. I am trying to run Docker containers, so I specified "docker,mesos" in the containerizers file. One of the mesos-agents starts fine with this configuration and I have been able to deploy a container to that slave. However, the second mesos-agent simply fails when I have this configuration (it works if I take out the containerizers file, but then it doesn't run containers). Here are some of the logs and information that has come up:
Here are some "messages" in the log directory:
Apr 26 16:09:12 centos-minion-3 systemd: Started Mesos Slave.
Apr 26 16:09:12 centos-minion-3 systemd: Starting Mesos Slave...
WARNING: Logging before InitGoogleLogging() is written to STDERR
[main.cpp:243] Build: 2017-04-12 16:39:09 by centos
[main.cpp:244] Version: 1.2.0
[main.cpp:247] Git tag: 1.2.0
[main.cpp:251] Git SHA: de306b5786de3c221bae1457c6f2ccaeb38eef9f
[logging.cpp:194] INFO level logging started!
[systemd.cpp:238] systemd version `219` detected
[main.cpp:342] Inializing systemd state
[systemd.cpp:326] Started systemd slice `mesos_executors.slice`
[containerizer.cpp:220] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
[linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
[provisioner.cpp:249] Using default backend 'copy'
[slave.cpp:211] Mesos agent started on (1)#172.22.150.87:5051
[slave.cpp:212] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="docker,mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --http_heartbeat_interval="30secs" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
[slave.cpp:541] Agent resources: cpus(*):1; mem(*):919; disk(*):2043; ports(*):[31000-32000]
[slave.cpp:549] Agent attributes: [ ]
[slave.cpp:554] Agent hostname: node3
[status_update_manager.cpp:177] Pausing sending status updates
[state.cpp:62] Recovering state from '/var/lib/mesos/meta'
[state.cpp:706] No committed checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
[status_update_manager.cpp:203] Recovering status update manager
[docker.cpp:868] Recovering Docker containers
[containerizer.cpp:599] Recovering containerizer
[provisioner.cpp:410] Provisioner recovery complete
[group.cpp:340] Group process (zookeeper-group(1)#172.22.150.87:5051) connected to ZooKeeper
[group.cpp:830] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
[group.cpp:418] Trying to create path '/mesos' in ZooKeeper
[detector.cpp:152] Detected a new leader: (id='15')
[group.cpp:699] Trying to get '/mesos/json.info_0000000015' in ZooKeeper
[zookeeper.cpp:259] A new leading master (UPID=master#172.22.150.88:5050) is detected
Failed to perform recovery: Collect failed: Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1; stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service: main process exited, code=exited, status=1/FAILURE
Apr 26 16:09:13 centos-minion-3 systemd: Unit mesos-slave.service entered failed state.
Apr 26 16:09:13 centos-minion-3 systemd: mesos-slave.service failed.
Logs from docker:
$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
  Drop-In: /usr/lib/systemd/system/docker.service.d
           └─flannel.conf
   Active: inactive (dead) since Tue 2017-04-25 18:00:03 CDT; 24h ago
     Docs: docs.docker.com
 Main PID: 872 (code=exited, status=0/SUCCESS)
Apr 26 18:25:25 centos-minion-3 systemd[1]: Dependency failed for Docker Application Container Engine.
Apr 26 18:25:25 centos-minion-3 systemd[1]: Job docker.service/start failed with result 'dependency'
Logs from flannel:
[flanneld-start: network.go:102] failed to retrieve network config: client: etcd cluster is unavailable or misconfigured

You have the answer in your logs:
Failed to perform recovery: Collect failed:
Failed to run 'docker -H unix:///var/run/docker.sock ps -a': exited with status 1;
stderr='Cannot connect to the Docker daemon. Is the docker daemon running on this host?'
To remedy this do as follows:
Step 1: rm -f /var/lib/mesos/meta/slaves/latest
This ensures agent doesn't recover old live executors.
Step 2: Restart the agent.
Mesos keeps its state/metadata on local disk. When it is restarted, it tries to load this state. If the configuration has changed and is not compatible with the previous state, it won't start.
Just bring Docker back to life by fixing the problems with flannel and etcd, and everything will be fine.
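A minimal shell sketch of that repair order, assuming flannel, etcd, and Docker are all managed by systemd (unit names are the common CentOS ones; adjust them to your setup):
# 1. Get etcd healthy first, since flannel reads its network config from it
sudo systemctl restart etcd
# 2. Then flannel, which provides the docker.service drop-in (flannel.conf above)
sudo systemctl restart flanneld
# 3. Docker's dependencies are now satisfied
sudo systemctl start docker
docker -H unix:///var/run/docker.sock ps -a    # the same check the agent runs during recovery
# 4. Apply the remedy printed in the Mesos log, then restart the agent
sudo rm -f /var/lib/mesos/meta/slaves/latest
sudo systemctl restart mesos-slave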

Add the following flag when starting the agent:
--reconfiguration_policy=additive
More details here: http://mesos.apache.org/documentation/latest/agent-recovery/
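If the agent is started from the Mesosphere packages, which read flags from files under /etc/mesos-slave (the same mechanism as the containerizers file in the question), a hypothetical way to set this flag would be:
echo 'additive' | sudo tee /etc/mesos-slave/reconfiguration_policy
sudo systemctl restart mesos-slave
Note that not every Mesos release recognizes this flag (the logs above show 1.2.0); running mesos-slave --help will confirm whether it is available on your version.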

Related

Docker service fails to start due to dependency

I have docker 20.10.6 & CentOS 7.5
-bash-4.2$ docker version
Client: Docker Engine - Community
Version: 20.10.6
API version: 1.41
Go version: go1.13.15
Git commit: 370c289
Built: Fri Apr 9 22:45:33 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
when I try to run the service with
sudo systemctl start docker
I get an error of
A dependency job for docker.service failed. See 'journalctl -xe' for details.
systemctl returns this
systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/etc/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: https://docs.docker.com
I am following the guide from https://docs.docker.com/engine/install/centos/
I have tried reinstalling docker & dependencies, and tried creating a /etc/docker/daemon.json file with the contents
{
"storage-driver": "overlay2"
}
but no success
The command
export VERSION_STRING=20.10.6
sudo yum install docker-ce-${VERSION_STRING} docker-ce-cli-${VERSION_STRING} containerd.io
indicates no missing dependency
The logs in journalctl are not very informative:
sudo journalctl -fu docker
-- Logs begin at .... --
Dependency failed for Docker Application Container Engine.
systemd[1]: Job docker.service/start failed with result 'dependency'.
systemd[1]: Dependency failed for Docker Application Container Engine.
systemd[1]: Job docker.service/start failed with result 'dependency'.
systemd[1]: Dependency failed for Docker Application Container Engine.
systemd[1]: Job docker.service/start failed with result 'dependency'.
systemd[1]: Dependency failed for Docker Application Container Engine.
systemd[1]: Job docker.service/start failed with result 'dependency'.
The following did the trick:
sudo /usr/bin/dockerd -H unix://
So I start the docker engine that way, and I can start running containers, etc.
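Starting dockerd by hand works as a stopgap, but it bypasses systemd entirely. A hedged diagnostic sketch for finding which dependency unit is actually failing (on Docker 20.10 the usual suspects are docker.socket and containerd.service):
# List the units docker.service depends on and check their state
systemctl list-dependencies docker.service
systemctl status docker.socket containerd.service
# The failing unit's own journal usually contains the real error
sudo journalctl -xe -u containerd.service
sudo journalctl -xe -u docker.socket
# Once the broken unit is fixed, go back to the normal service
sudo systemctl start docker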

Cannot initialize Kubernetes cluster on Ubuntu 18.04 (Virtual Box)

I am struggling to initialize a simple Kubernetes cluster using Ubuntu on VirtualBox. I tried both the server and desktop versions, following the official documentation:
https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/
I also tried to follow some other guides, thinking the issue was because I'm using VirtualBox VMs, like this one:
https://medium.com/@gunjangarge/create-kubernetes-cluster-using-kubeadm-on-ubuntu-virtualbox-step-by-step-68a3eeb1f74c
But every time I have the same issue with port 6443 not being exposed. Sometimes the process starts correctly, giving me the join command:
kubeadm init --pod-network-cidr=192.168.0.0/16
W1029 08:47:53.841460 11540 configset.go:348] WARNING: kubeadm cannot validate component configs
for API groups [kubelet.config.k8s.io kubeproxy.config.k8s.io]
[init] Using Kubernetes version: v1.19.3
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[addons] Applied essential addon: CoreDNS
[addons] Applied essential addon: kube-proxy
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join 192.168.1.192:6443 --token ztnoww.t8ng5a3jo2kx5cb2 \
--discovery-token-ca-cert-hash
sha256:907dde6cc6d72ed4cd7fe7e7f252e2cf657dd3256fba6ee5ec92027132a9c5af
Sometimes it doesn't start at all and times out:
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
Anyway, even when it's starting, port 6443 is never exposed, and kubelet is not happy with it:
kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Thu 2020-10-29 08:48:15 CET; 20s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 13262 (kubelet)
Tasks: 14 (limit: 4666)
CGroup: /system.slice/kubelet.service
└─13262 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-contai
Okt 29 08:48:22 master kubelet[13262]: E1029 08:48:22.588386 13262 controller.go:136] failed to ensure node lease exists, will retry in 800ms, error: Get
"https://192.168.1.192:6443/apis/coordination.k8s.io/v1/names
Okt 29 08:48:22 master kubelet[13262]: E1029 08:48:22.785951 13262 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Service: failed to list *v1.Service: Get "https://192.168.1.192:644
Okt 29 08:48:23 master kubelet[13262]: I1029 08:48:23.022354 13262 kubelet_node_status.go:70] Attempting to register node master
Okt 29 08:48:24 master kubelet[13262]: I1029 08:48:24.188510 13262 request.go:645] Throttling request took 1.097264312s, request: POST:https://192.168.1.192:6443/api/v1/namespaces/kube-system/pods
Okt 29 08:48:25 master kubelet[13262]: I1029 08:48:25.678880 13262 kubelet_node_status.go:108] Node master was previously registered
Okt 29 08:48:25 master kubelet[13262]: I1029 08:48:25.679004 13262 kubelet_node_status.go:73] Successfully registered node master
Okt 29 08:48:25 master kubelet[13262]: W1029 08:48:25.765981 13262 cni.go:239] Unable to update cni config: no networks found in /etc/cni/net.d
Okt 29 08:48:27 master kubelet[13262]: E1029 08:48:27.148246 13262 kubelet.go:2103] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: c
Okt 29 08:48:30 master kubelet[13262]: W1029 08:48:30.767511 13262 cni.go:239] Unable to update cni config: no networks found in /etc/cni/net.d
Okt 29 08:48:32 master kubelet[13262]: E1029 08:48:32.164211 13262 kubelet.go:2103] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: c
I have to say I don't know what to do now. I tried for hours with different Ubuntu versions, trying to find solutions on the Internet, but I didn't find any. I also went through the logs and found that maybe the config file is not created correctly for some reason:
failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml
but I found nothing about it, except "try to init the cluster again", which I did several times.
Thank you in advance for your help!
OK, I think I finally found the problem. I tried the same process on another PC and everything worked smoothly, so for any of you having a similar issue: just don't try to use VirtualBox and WSL at the same time (even if WSL is shut off).
I just did what's explained here: https://stackoverflow.com/a/63229718/2428805 and now everything's fine...
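For anyone hitting the same symptom, a few hedged checks that help distinguish "kubeadm init never finished" from "the network plugin is missing" (paths are the standard kubeadm ones already shown above):
sudo ss -tlnp | grep 6443                  # kube-apiserver should be listening on this port
ls -l /var/lib/kubelet/config.yaml         # written by kubeadm init; missing means init did not complete
kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes
sudo journalctl -u kubelet -f              # watch the kubelet while the control plane comes up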

Error starting docker service. "A dependency job for docker.service failed. See 'journalctl -xe' for details."

So, after some minor changes in the Docker configuration I tried to restart Docker, and it resulted in the error message below:
A dependency job for docker.service failed. See 'journalctl -xe' for details.
Kubernetes is also running on the same machine where this docker daemon was running.
Below are the logs of the docker service (output of journalctl -u docker.service).
May 15 08:56:06 ilcepoc500 systemd[1]: Stopping Docker Application Container Engine...
May 15 08:56:07 ilcepoc500 oci-umount[42741]: umounthook <debug>: 5148572ffa9c: only runs in prestart stage, ignoring
May 15 08:56:07 ilcepoc500 oci-systemd-hook[42824]: systemdhook <debug>: 4676114a4bcd: Skipping as container command is /fission-bundle, not init or systemd
May 15 08:56:07 ilcepoc500 oci-systemd-hook[43025]: systemdhook <debug>: 92140d272e14: Skipping as container command is /go/bin/all-in-one-linux, not init or systemd
May 15 08:56:18 ilcepoc500 oci-umount[44315]: umounthook <debug>: prestart container_id:12d638f87c0d rootfs:/storage/docker/overlay2/ab7502908ea8a939e9ea7379f9715e40b563717404fc5c2ee923062e67520f15/merged
May 15 08:56:21 ilcepoc500 systemd[1]: Dependency failed for Docker Application Container Engine.
May 15 08:56:21 ilcepoc500 systemd[1]: Job docker.service/start failed with result 'dependency'.
I followed some links from GitHub and SO but no luck so far; any hints are appreciated.
Below are the things that I have tried:
Deleted /var/log/docker, reloaded the docker daemon, and tried restarting Docker; didn't work.
Created a file named override.conf inside the directory /etc/systemd/system/containerd.service.d and tried restarting the Docker service; didn't work (see the sketch below).
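Since the override was added under containerd.service.d, a hedged next step is to check whether containerd itself is the dependency that fails, and whether the drop-in parses cleanly:
# Show the effective containerd unit, including the override.conf drop-in
systemctl cat containerd.service
# Check whether containerd is the failed dependency
systemctl status containerd.service
sudo journalctl -xe -u containerd.service
# After fixing or removing the drop-in, reload units and retry
sudo systemctl daemon-reload
sudo systemctl restart containerd docker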

Start Docker daemon as another user?

Installed Docker 17.x in RHEL and we are getting the exception below.
-bash-4.2$ docker version
Client:
Version: 17.09.1-ce
API version: 1.32
Go version: go1.8.3
Git commit: 19e2cf6
Built: Thu Dec 7 22:23:40 2017
OS/Arch: linux/amd64
Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Get http://%2Fvar%2Frun%2Fdocker.sock/v1.32/version: dial unix /var/run/docker.sock: connect: permission denied
-bash-4.2$
To solve this, we introduced another user group (docker-user) and added all the users to this group. After that we ran these commands and were able to run Docker.
sudo systemctl stop docker
sudo systemctl start docker
cd /var/run
sudo chown :docker-user docker.sock
But we are facing another issue: whenever the VM is restarted, Docker is not running. So we decided to set up Docker to run as a daemon process and followed the steps below.
1. Created a docker.conf file under the /etc/systemd/system/docker.service.d folder.
2. Added this entry in the docker.conf file:
[Service]
ExecStart=
ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock
ExecStartPost=/bin/chown :docker-user /var/run/docker.sock
After adding this entry, we ran:
1. sudo systemctl daemon-reload
2. sudo systemctl stop docker
3. sudo systemctl start docker
We are getting the exception below:
-bash-4.2$ sudo systemctl status docker.service
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/docker.service.d
└─docker.conf
Active: failed (Result: start-limit) since Wed 2018-03-28 09:10:50 PDT; 12s ago
Docs: https://docs.docker.com
Process: 23395 ExecStart=/usr/bin/docker daemon -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock (code=exited, status=1/FAILURE)
Main PID: 23395 (code=exited, status=1/FAILURE)
Mar 28 09:10:50 hostname systemd[1]: Failed to start Docker Application Container Engine.
Mar 28 09:10:50 hostname systemd[1]: Unit docker.service entered failed state.
Mar 28 09:10:50 hostname systemd[1]: docker.service failed.
Mar 28 09:10:50 hostname systemd[1]: docker.service holdoff time over, scheduling restart.
Mar 28 09:10:50 hostname systemd[1]: start request repeated too quickly for docker.service
Mar 28 09:10:50 hostname systemd[1]: Failed to start Docker Application Container Engine.
Mar 28 09:10:50 hostname systemd[1]: Unit docker.service entered failed state.
Mar 28 09:10:50 hostname systemd[1]: docker.service failed.
Please guide me on how to set up Docker as a daemon process.
So, you have already dug in pretty good.
However, this behavior is built into Docker. For user groups, the Docker daemon will allow users in the docker group to access the server (important to remember: this is effectively the same as giving root access to any user in that group!). If you want to specify a different group, you can start the daemon with the --group (-G) option.
Installing docker also installs a systemd service unit to run the daemon. The correct way to enable that (so that it restarts automatically) is
sudo systemctl enable docker
At this point, you haven't included enough of the journalctl logs to actually say why the daemon is not starting for you. If it is an option, I would try starting over, since I don't know everything you have changed; if that is not an option, the journalctl logs will likely explain the problem (probably that the user 'docker' no longer has access to the socket after you chowned it, but that is just a guess).
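To tie this back to the drop-in from the question: a hypothetical, untested rewrite that calls dockerd directly and sets the socket group with -G/--group, which removes the need for the ExecStartPost chown. Note that exposing tcp://0.0.0.0:2375 without TLS gives root-equivalent access to anyone who can reach that port.
# /etc/systemd/system/docker.service.d/docker.conf -- sketch only, verify against your Docker version
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock -G docker-user
Then reload the units and enable the service so it also comes back after a VM reboot, matching the advice above:
sudo systemctl daemon-reload
sudo systemctl enable docker
sudo systemctl start docker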

Docker fails to start due to "volume store metadata database: timeout"

I have followed the installation instructions for Docker CE on CentOS. Initially this worked. At some point the system was restarted, and now starting Docker fails. I'd appreciate expert eyes on this matter.
systemctl start docker produces:
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
systemctl status docker.service produces:
Apr 21 11:25:23 sec-services-build-1 systemd[1]: Starting Docker Application Container Engine...
Apr 21 11:25:23 sec-services-build-1 dockerd[9693]: time="2017-04-21T11:25:23.370390797+03:00" level=info msg="libcontainerd: previous instance of containerd still alive (8908)"
Apr 21 11:25:23 sec-services-build-1 dockerd[9693]: time="2017-04-21T11:25:23.382492171+03:00" level=warning msg="overlay: the backing xfs filesystem is formatted without d_type support, which leads to incorrect behavior. Reformat the filesystem with ftype=1 to enable d_type support. Running without d_type support will no longer be supported in Docker 17.12."
Apr 21 11:25:23 sec-services-build-1 dockerd[9693]: time="2017-04-21T11:25:23.382547668+03:00" level=info msg="[graphdriver] using prior storage driver: overlay"
Apr 21 11:25:24 sec-services-build-1 dockerd[9693]: Error starting daemon: error while opening volume store metadata database: timeout
Apr 21 11:25:24 sec-services-build-1 systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
Apr 21 11:25:24 sec-services-build-1 systemd[1]: Failed to start Docker Application Container Engine.
Apr 21 11:25:24 sec-services-build-1 systemd[1]: Unit docker.service entered failed state.
Apr 21 11:25:24 sec-services-build-1 systemd[1]: docker.service failed.
From here: https://github.com/moby/moby/issues/22507
I ran:
ps axf | grep docker | grep -v grep | awk '{print "kill -9 " $1}' | sudo sh
I was then able to restart docker using:
sudo systemctl start docker
Step 1: systemctl status docker (check whether Docker is running; if it is, stop it).
Step 2: systemctl stop docker
Step 3: dockerd
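Running dockerd in the foreground like this is only a way to see the startup error printed directly on the console (for example the volume store timeout above); a hedged follow-up once you have the message:
# Stop the foreground dockerd with Ctrl-C, then return to the systemd-managed service
sudo systemctl start docker
sudo systemctl status docker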
I got this message when copying volumes from a production machine: I ended up overwriting metadata.db inside /var/lib/docker/volumes, and then it crashed. The fix is simple:
docker system prune --volumes -f && rm /var/lib/docker/volumes/metadata.db && docker-compose up -d
I encountered the same error.
❶ I tried
sudo kill -9 1452
multiple times, but it didn't work. There was still a defunct dockerd process:
1452 ? Zsl 127:42 [dockerd] <defunct>
❷ Then I tried what @Artur Mustafin suggested:
sudo mv /var/lib/docker/volumes/metadata.db /var/lib/docker/volumes/metadata.db.bk
It worked.
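A hedged version of that workaround, keeping a backup and verifying the result; the assumption here is that only the volume metadata database is corrupt, while the volume data itself (under /var/lib/docker/volumes/<name>/_data) is untouched:
# Move the metadata database aside instead of deleting it outright
sudo mv /var/lib/docker/volumes/metadata.db /var/lib/docker/volumes/metadata.db.bk
sudo systemctl start docker   # Docker recreates an empty metadata.db on startup
docker volume ls              # confirm the volumes you expect are still listed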
I tried all of the suggestions above and nothing worked. What did work was removing all the containers from /var/lib/docker/containers. Then I killed all Docker processes (ps -ef | grep docker), then restarted Docker and the Docker socket. When Docker became active, I added the containers back one at a time, and one container turned out to be the cause of the issue.
