I deployed a GitLab EE Omnibus instance on a Google Compute Engine VM following these steps: https://docs.gitlab.com/ee/install/google_cloud_platform/
Everything was OK. I updated my VM to Ubuntu 18.04 and restarted the instance; everything was still OK.
But when I upgraded the VM to increase its capacity, a problem occurred at system boot:
dockerd[1286]: time="2020-09-28T15:25:42.251633392+02:00" level=info msg="Starting up"
dockerd[1286]: time="2020-09-28T15:25:42.256850354+02:00" level=info msg="parsed scheme: \"unix\"" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.257123624+02:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.257177289+02:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}] <nil>}" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.257196202+02:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.258723872+02:00" level=info msg="parsed scheme: \"unix\"" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.258801514+02:00" level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.258922665+02:00" level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}] <nil>}" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.258947831+02:00" level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
dockerd[1286]: time="2020-09-28T15:25:42.264233013+02:00" level=error msg="failed to mount overlay: no such device" storage-driver=overlay2
dockerd[1286]: time="2020-09-28T15:25:42.264281115+02:00" level=error msg="[graphdriver] prior storage driver overlay2 failed: driver not supported"
dockerd[1286]: failed to start daemon: error initializing graphdriver: driver not supported
systemd[1]: docker.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: docker.service: Failed with result 'exit-code'.
systemd[1]: Failed to start Docker Application Container Engine.
And now I am unable to connect to the instance via SSH (error 4003 because the back end is not working, and then I get a timeout when I try SSH on port 22)...
I suppose the GitLab instance relies on Docker to work...
Do you have any idea about this issue?
Thanks a lot.
I followed the guide you shared, and I was able to install a GitLab instance without problems.
After my installation I resized my instance and it continued working well.
Also, I can see that my GitLab installation works without Docker. You can install GitLab with Docker, but the guide you are following is not for that purpose.
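If you want to double-check that the Omnibus installation does not depend on Docker, you can look at GitLab's own supervised services (gitlab-ctl ships with the Omnibus package); a quick sketch:
sudo gitlab-ctl status          # lists the Omnibus-managed services (nginx, sidekiq, postgresql, ...)
systemctl status docker         # Docker failing here does not stop the services above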
As for the problem you are having with Docker: it looks like Docker is trying to use overlay2 without the proper backing filesystem; you can check this documentation for further information.
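For example, a rough check from inside the VM (assuming the default data root /var/lib/docker) would be:
grep overlay /proc/filesystems   # 'overlay' must be listed for the overlay2 driver
sudo modprobe overlay            # try to load the kernel module if it is missing
df -T /var/lib/docker            # overlay2 needs ext4, or xfs created with ftype=1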
Now, regarding your SSH problem: error code 4003 can have several explanations:
Your instance can’t start the ssh service.
Your firewall is not configured properly to allow port 22.
You have modified something in your instance and your metadata stopped working.
If your instance is running, we can check points 1 and 2 using a port scanner: probe port 22 to see whether SSH is reachable. If it is not reachable, the problem is likely in the firewall or in the SSH service.
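As a sketch, with EXTERNAL_IP standing in for your instance's external IP, you could probe port 22 from another machine like this:
nmap -Pn -p 22 EXTERNAL_IP
# or, without nmap:
nc -vz -w 5 EXTERNAL_IP 22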
If you have problems with your firewall, please check this documentation to create the proper rules.
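For example, a minimal rule allowing SSH from anywhere could look like this (allow-ssh is just an assumed rule name; you may want to restrict --source-ranges):
gcloud compute firewall-rules create allow-ssh --allow tcp:22 --source-ranges 0.0.0.0/0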
If the problem is with your metadata, it may be necessary to install the guest environment in place. You can check whether the guest environment is working by inspecting the system logs emitted to the console while the instance starts up; please check the document on validating the guest environment for more information.
In your case, because you are using Ubuntu, you should see something like 'Started Google Compute Engine Startup Scripts.' if your metadata is working properly.
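You can read that console output without SSH, for example (INSTANCE_NAME and ZONE are placeholders for your instance's name and zone):
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE | grep "Started Google Compute Engine"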
Since you are not able to connect to the instance via the SSH button, you have to enable the serial console of the VM instance to get to its command line.
Please create a startup script that adds a user and password so that you can log in to the VM from the serial port; the steps are as follows (a gcloud equivalent is sketched after this list):
Go to the VM instances page in Google Cloud Platform console.
Click on the instance for which you want to add a startup script.
Click the Edit button at the top of the page.
Click on Enable connecting to serial ports
Under Custom metadata, click Add item.
Set 'Key' to 'startup-script' and set 'Value' to this script:
#! /bin/bash
useradd -G sudo <user>
echo '<user>:<password>' | chpasswd
Click Save and then click RESET at the top of the page. You might need to wait some time for the instance to reboot.
Click "Connect to serial port" on the page.
In the new window, you might need to wait a bit and press Enter on your keyboard once; then, you should see the login prompt.
Log in using the USERNAME and PASSWORD you provided, and you will be logged in.
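If you prefer the CLI, roughly the same procedure can be done with gcloud (a sketch; INSTANCE_NAME, ZONE, and a local startup.sh containing the script above are assumptions):
gcloud compute instances add-metadata INSTANCE_NAME --zone ZONE --metadata-from-file startup-script=startup.sh
gcloud compute instances add-metadata INSTANCE_NAME --zone ZONE --metadata serial-port-enable=TRUE
gcloud compute instances reset INSTANCE_NAME --zone ZONE
gcloud compute connect-to-serial-port INSTANCE_NAME --zone ZONE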
If your metadata is not working properly, you should install the guest environment in your VM:
Ensure that the version of your operating system is supported.
Enable the Universe repository. Canonical publishes packages for its guest environment to the Universe repository.
sudo apt-add-repository universe
Update package lists:
sudo apt update
Install the guest environment packages:
sudo apt install -y gce-compute-image-packages
Restart the instance and inspect its console log to make sure the guest environment loads as it starts back up.
Verify that you can connect to the instance using SSH.
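From your workstation, that restart-and-verify step could look roughly like this (again, INSTANCE_NAME and ZONE are placeholders):
gcloud compute instances reset INSTANCE_NAME --zone ZONE
gcloud compute instances get-serial-port-output INSTANCE_NAME --zone ZONE | tail -n 50
gcloud compute ssh INSTANCE_NAME --zone ZONE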
I hope this information is useful to you.
Related
I start Docker (19.03.1) using this command on CentOS 7.4:
[root@ops001 ~]# dockerd
INFO[2020-04-12T22:35:39.495452068+08:00] Starting up
INFO[2020-04-12T22:35:39.496948674+08:00] parsed scheme: "unix" module=grpc
INFO[2020-04-12T22:35:39.496992096+08:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2020-04-12T22:35:39.497023029+08:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}] } module=grpc
INFO[2020-04-12T22:35:39.497038420+08:00] ClientConn switching balancer to "pick_first" module=grpc
INFO[2020-04-12T22:35:39.497209216+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc00071db20, CONNECTING module=grpc
INFO[2020-04-12T22:35:39.497753603+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc00071db20, READY module=grpc
INFO[2020-04-12T22:35:39.498779029+08:00] parsed scheme: "unix" module=grpc
INFO[2020-04-12T22:35:39.498797692+08:00] scheme "unix" not registered, fallback to default scheme module=grpc
INFO[2020-04-12T22:35:39.498815974+08:00] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}] } module=grpc
INFO[2020-04-12T22:35:39.498826873+08:00] ClientConn switching balancer to "pick_first" module=grpc
INFO[2020-04-12T22:35:39.498878131+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc000139b40, CONNECTING module=grpc
INFO[2020-04-12T22:35:39.499129163+08:00] pickfirstBalancer: HandleSubConnStateChange: 0xc000139b40, READY module=grpc
INFO[2020-04-12T22:35:39.533436628+08:00] Loading containers: start.
ERRO[2020-04-12T22:35:39.540803539+08:00] Failed to load container mount d794ba7c8b44c278e832ca8acb03d9feaf27b86c3760fd3ee8b2ea9ceebb04b7: mount does not exist
INFO[2020-04-12T22:35:40.217719280+08:00] Removing stale sandbox 0a43d02f8e048bd40681184ea250eea2544097479f309a22428accb45323c2fe (e4dbf0fc8fe729d18d13610a927d2bccff29a0a3a61fe1fcca24b15f3e9106be)
WARN[2020-04-12T22:35:40.252193267+08:00] Error (Unable to complete atomic operation, key modified) deleting object [endpoint de761b3ebcb8035eaf1e0ddae667c31d09794f7ddb1df6b44b495cd17f5a59b1 4d464ba20130c163e7084407f25b5cff86fa478596834683c3c28ea1c84badae], retrying....
INFO[2020-04-12T22:35:40.298580208+08:00] Default bridge (docker0) is assigned with an IP address 172.30.224.0/21. Daemon option --bip can be used to set a preferred IP address
INFO[2020-04-12T22:35:40.806008604+08:00] Loading containers: done.
INFO[2020-04-12T22:35:40.836875725+08:00] Docker daemon commit=74b1e89 graphdriver(s)=overlay2 version=19.03.1
INFO[2020-04-12T22:35:40.836965995+08:00] Daemon has completed initialization
INFO[2020-04-12T22:35:40.848651741+08:00] API listen on /var/run/docker.sock
But when I try to list containers, the docker command shows an error like this:
[root@ops001 ~]# docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
[root@ops001 ~]# docker version
Client: Docker Engine - Community
Version: 19.03.1
API version: 1.40
Go version: go1.12.5
Git commit: 74b1e89
Built: Thu Jul 25 21:21:07 2019
OS/Arch: linux/amd64
Experimental: false
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Why does dockerd appear to start up, yet the docker command cannot connect to it? Where is it going wrong, and what should I do to fix it?
To start the Docker daemon as a service, use: sudo service docker start.
If you want the Docker daemon to start on system boot: sudo systemctl enable docker
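In other words, instead of keeping dockerd in a foreground terminal, the usual sequence on a systemd-based host is roughly:
sudo systemctl start docker     # start the daemon under systemd
sudo systemctl enable docker    # start it automatically on boot
systemctl status docker         # confirm it is active
docker ps                       # the client should now reach /var/run/docker.sock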
Docker-CE 19.03.8
Swarm init
Setup: 1 manager node, nothing more.
We deploy many new stacks per day, and sometimes I see the following line:
level=error msg="Failed to allocate network resources for node sdlk0t6pyfb7lxa2ie3w7fdzr" error="could not find network allocator state for network qnkxurc5etd2xrkb53ry0fu59" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
After 2 to 3 days, no new services can be started because there are no free VIPs.
I see the following line in my logs:
level=error msg="Could not parse VIP address while releasing"
level=error msg="error deallocating vip" error="invalid CIDR address: " vip.addr= vip.network=oqcsj99taftdu3b0t3nrgbgy1
level=error msg="Event api.EventUpdateTask: Failed to get service idid0u7vjuxf2itpv8n31da57 for task 6vnc8jdkgxwxqbs3ixly2i6u4 state NEW: could not find service idid0u7vjuxf2itpv8n31da57" module=node ...
level=error msg="Event api.EventUpdateTask: Failed to get service sbjb7nk0wk31c2ayg8x898fhr for task noo21whnbwkyijnqavseirfg0 state NEW: could not find service sbjb7nk0wk31c2ayg8x898fhr" module=node ...
level=error msg="Failed to find network y73pnq85mjpn1pon38pdbtaw2 on node sdlk0t6pyfb7lxa2ie3w7fdzr" module=node node.id=yp0u6n9c31yh3xyekondzr4jc
We tried to investigate this using debug mode.
Here are some lines that stand out to me:
level=debug msg="Remove interface veth84e7185 failed: Link not found"
level=debug msg="Remove interface veth64c3a65 failed: Link not found"
level=debug msg="Remove interface vethf1703f1 failed: Link not found"
level=debug msg="Remove interface vethe069254 failed: Link not found"
level=debug msg="Remove interface veth2b81763 failed: Link not found"
level=debug msg="Remove interface veth0bf3390 failed: Link not found"
level=debug msg="Remove interface veth2ed04cc failed: Link not found"
level=debug msg="Remove interface veth0bc27ef failed: Link not found"
level=debug msg="Remove interface veth444343f failed: Link not found"
level=debug msg="Remove interface veth036acf9 failed: Link not found"
level=debug msg="Remove interface veth62d7977 failed: Link not found"
and
level=debug msg="Request address PoolID:10.0.0.0/24 App: ipam/default/data, ID: GlobalDefault/10.0.0.0/24, DBIndex: 0x0, Bits: 256, Unselected: 60, Sequence: (0xf7dfeeee, 1)->(0xedddddb7, 1)->(0x77777777, 3)->(0x77777775, 1)->(0x77ffffff, 1)->(0xffd55555, 1)->end Curr:233 Serial:true PrefAddress:<
When the Unselected count goes to 0, no new containers can be deployed; they are stuck in the NEW state.
Has anyone experienced something like this? Or can someone help me?
We believe the problem has something to do with the release of the 10.0.0.0/24 (our ingress) addresses.
Did you try to stop and restart the Docker daemon?
sudo service docker stop
sudo service docker start
Also, you may find it useful to have a look at the magnificent documentation on https://dockerswarm.rocks/
I usually use this sequence to update a service (assuming $service_name is set to the name of the stack):
export DOMAIN=xxxx.xxxxx.xxx
docker stack rm $service_name
export NODE_ID=$(docker info -f '{{.Swarm.NodeID}}')
# export environment vars if needed
# update data if needed
docker node update --label-add $service_name.$service_name-data=true $NODE_ID
docker stack deploy -c $service_name.yml $service_name
If you see your containers stuck in the NEW state, you are probably affected by this problem: https://github.com/moby/moby/issues/37338, reported by cintiadr:
Docker stack fails to allocate IP on an overlay network, and gets stuck in NEW current state #37338
Reproducing it:
Create a swarm cluster (1 manager, 1 worker). I created AWS t2.large Amazon linux instances, installed docker using their docs, version 18.06.1-ce.
# Deploy a new overlay network from a stack (docker-network.yml)
$ ./deploy-network.sh
Deploy 60 identical services attaching to that network - 3 replicas each - from stacks (docker-network.yml)
$ ./deploy-services.sh
You can verify that all services are happily running.
Now let's bring the worker down.
Run:
docker node update --availability drain <node id> && docker node rm --force <node id>
Note: drain is an async operation (something I wasn't aware of), so to reproduce this case you shouldn't wait for the drain to complete.
Create a new worker (completely new node/machine), and join the cluster.
You are going to see that very few services are actually able to start. All the others will be continuously rejected because no IP is available.
In past versions (17 I believe), the containers wouldn't be rejected (but rather be stuck in NEW).
How to avoid that problem?
If you drain and patiently wait for all the containers to be terminated before removing the node, it appears that this problem is completely avoided.
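A sketch of that safer sequence, with NODE_ID as a placeholder for the node you want to remove:
docker node update --availability drain NODE_ID
# wait until no tasks are left running on the node before removing it
while [ -n "$(docker node ps NODE_ID --filter desired-state=running -q)" ]; do sleep 5; done
docker node rm NODE_ID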
Docker for Windows Server
Windows Server version 1709, with containers
Docker version 17.06.2-ee-6, build e75fdb8
Swarm mode (worker node, part of swarm with ubuntu masters)
After containers connected to an overlay network started intermittently losing their network adapters, I restarted the machine. Now the daemon will not start. Below are the last lines of output from running docker -D.
Please let me know how to fix this.
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option Experimental: false"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultDriver: nat"
time="2018-05-15T15:10:06.731160000Z" level=debug msg="Option DefaultNetwork: nat"
time="2018-05-15T15:10:06.734183700Z" level=info msg="Restoring existing overlay networks from HNS into docker"
time="2018-05-15T15:10:06.735174400Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:12:06.789120400Z" level=debug msg="Network (d4d37ce) restored"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (4114b6e) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.796122200Z" level=debug msg="Endpoint (819eb70) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.797124900Z" level=debug msg="Endpoint (ade55ea) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (d0054fc) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.798125600Z" level=debug msg="Endpoint (e2af8d8) restored to network (d4d37ce)"
time="2018-05-15T15:12:06.854118500Z" level=debug msg="[GET]=>[/networks/] Request : "
time="2018-05-15T15:14:06.860654000Z" level=debug msg="start clean shutdown of all containers with a 15 seconds timeout..."
Error starting daemon: Error initializing network controller: hnsCall failed in Win32: Server execution failed (0x80080005)
Here is the complete set of steps to fully rebuild the Docker networking state on a Swarm host. Sometimes only some of the steps are sufficient (specifically the HNS part), so you can try those first.
Remove all Docker services and user-defined networks (that is, all Docker networks except `nat` and `none`).
Leave the swarm cluster (docker swarm leave --force)
Stop the docker service (PS C:\> stop-service docker)
Stop the HNS service (PS C:\> stop-service hns)
In regedit, delete all of the registry keys under these paths:
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\SwitchList
HKLM:\SYSTEM\CurrentControlSet\Services\vmsmp\parameters\NicList
Now go to Device Manager, and disable then remove all network adapters that are “Hyper-V Virtual Ethernet…” adapters
Now rename your HNS.data file (the goal is to effectively “delete” it by renaming it):
C:\ProgramData\Microsoft\Windows\HNS\HNS.data
Also rename C:\ProgramData\docker folder (the goal is to effectively “delete” it by renaming it)
C:\ProgramData\docker
Now reboot your machine
I have installed docker-ce from repository following instructions at:
https://docs.docker.com/install/linux/docker-ce/centos/
I receive an error attempting to start docker:
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
journalctl has the following:
...
dockerd[3647]: time="2018-02-05T14:47:05-08:00" level=info msg="containerd successfully booted in 0.002946s" module=containerd
dockerd[3647]: time="2018-02-05T14:47:05.456552594-08:00" level=error msg="There are no more loopback devices available."
dockerd[3647]: time="2018-02-05T14:47:05.456585240-08:00" level=error msg="[graphdriver] prior storage driver devicemapper failed: loopback attach failed"
dockerd[3647]: Error starting daemon: error initializing graphdriver: loopback attach failed
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
I have seen articles about using something other than loopback devices, but as far as I can tell, those indicate an optimization to be made - and do not imply that the initial startup should fail.
CentOS Linux release 7.4.1708 (Core)
If you run Linux in a VM on Xen, you need to install a kernel inside the VM, boot it with pygrub (see https://wiki.debian.org/PyGrub), and update to Docker version 19.03.0.
Install pygrub
1. In your VM, execute:
mkdir /boot/grub
apt-get install -y linux-image-amd64
cat > /boot/grub/menu.lst << EOF
default 0
timeout 2
title Debian GNU/Linux
root (hd0,0)
kernel /vmlinuz root=/dev/xvda2 ro
initrd /initrd.img
title Debian GNU/Linux (recovery mode)
root (hd0,0)
kernel /vmlinuz root=/dev/xvda2 ro single
initrd /initrd.img
EOF
2. Halt your VM, for example:
xl destroy vm01
3. Edit your Xen config.
For example, for your VM /etc/xen/vm01.cfg in your dom0 (comment out the first two lines and add the last three):
#kernel = '/boot/vmlinuz-4.9.0-9-amd64'
#ramdisk = '/boot/initrd.img-4.9.0-9-amd64'
extra = 'elevator=noop'
bootloader = '/usr/lib/xen-4.8/bin/pygrub'
bootloader_args = [ '--kernel=/vmlinuz', '--ramdisk=/initrd.img', ]
4. Start your VM:
xl create /etc/xen/vm01.cfg
I have the same problem in a Debian 9 VM and in a Debian 8 VM, both on the same Debian Xen 4.8 host.
The loopback devices seem not to exist:
# losetup -f
losetup: cannot find an unused loop device: No such device
You can create them with:
#!/bin/bash
ensure_loop(){
num="$1"
dev="/dev/loop$num"
if test -b "$dev"; then
echo "$dev is a usable loop device."
return 0
fi
echo "Attempting to create $dev for docker ..."
if ! mknod -m660 $dev b 7 $num; then
echo "Failed to create $dev!" 1>&2
return 3
fi
return 0
}
ensure_loop 0
But this is just a tip toward the right solution; it didn't solve the problem completely. Now that /dev/loop0 exists, I get the error:
Error opening loopback device: open /dev/loop0: no such device or address
[graphdriver] prior storage driver devicemapper failed: loopback attach failed
Update:
I installed docker-ce, docker-ce-cli, and containerd.io (apt-get install docker-ce docker-ce-cli containerd.io) as described in the latest docs, and now with the latest version:
$ docker --version
Docker version 19.03.0, build aeac9490dc
still the same issue:
failed to start daemon: error initializing graphdriver: loopback attach failed
This is the full log:
level=info msg="Starting up"
level=warning msg="failed to rename /var/lib/docker/tmp for background deletion: rename /var/lib/docker/tmp
/var/lib/docker/tmp-old: file exists. Deleting synchronously"
level=info msg="parsed scheme: \"unix\"" module=grpc
level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}
] }" module=grpc
level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc0005e8660, CONNECTING" module=grpc
level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc0005e8660, READY" module=grpc
level=info msg="parsed scheme: \"unix\"" module=grpc
level=info msg="scheme \"unix\" not registered, fallback to default scheme" module=grpc
level=info msg="ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock 0 <nil>}
] }" module=grpc
level=info msg="ClientConn switching balancer to \"pick_first\"" module=grpc
level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc0007f5b10, CONNECTING" module=grpc
level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc0007f5b10, READY" module=grpc
level=error msg="There are no more loopback devices available."
level=error msg="[graphdriver] prior storage driver devicemapper failed: loopback attach failed"
failed to start daemon: error initializing graphdriver: loopback attach failed
Update 2:
In the end I found out that pygrub was missing in the VM, which seems to be a new dependency since some version.
This answer was a dead end; I added another answer, but I leave this one here so that users with a different problem can get some hints.
I met this issue too, and I resolved it!
In my VMware Workstation, the VM had TWO virtual network interfaces.
I removed one of the virtual network interfaces and kept only one.
After starting VMware Workstation and the Docker service again, it works successfully!
I installed Docker on CentOS 7.6 (1810), but when I start Docker:
#systemctl start docker
Docker fails to start.
#journalctl -xe
It shows messages like "start daemon: error initializing graphdriver: loopback attach failed".
After stopping Docker, it refused to start again. It complained that another bridge called docker0 already exists:
level=warning msg="devmapper: Base device already exists and has filesystem xfs on it. User specified filesystem will be ignored."
level=info msg="[graphdriver] using prior storage driver \"devicemapper\""
level=info msg="Graph migration to content-addressability took 0.00 seconds"
level=info msg="Firewalld running: false"
level=info msg="Default bridge (docker0) is assigned with an IP address 172.17.0.0/16. Daemon option --bip can be used to set a preferred IP address"
level=fatal msg="Error starting daemon: Error initializing network controller: Error creating default \"bridge\" network: cannot create network fa74b0de61a17ffe68b9a8f7c1cd698692fb56f6151a7898d66a30350ca0085f (docker0): conflicts with network bb9e0aab24dd1f4e61f8e7a46d4801875ade36af79d7d868c9a6ddf55070d4d7 (docker0): networks have same bridge name"
docker.service: Main process exited, code=exited, status=1/FAILURE
Failed to start Docker Application Container Engine.
docker.service: Unit entered failed state.
docker.service: Failed with result 'exit-code'.
Deleting the bridge with ip link del docker0 and then starting Docker leads to the same result with another ID.
In my case, I downgraded my OS (CentOS Atomic Host) and came across this error message. The Docker version on the older CentOS Atomic was 1.9.1. I did not have any running Docker containers or pulled images before running the downgrade.
I simply ran the commands below and Docker was happy again:
sudo rm -rf /var/lib/docker/network
sudo systemctl start docker
More info.
The problem seems to be in /var/lib/docker/network/. There are a lot of sockets stored there that reference the bridge by its old ID. To solve the problem you can delete all the sockets, delete the interface, and then start Docker, but all your containers will refuse to work since their sockets are gone. In my case I did not care about my stateless containers anyway, so this fixed the problem:
ip link del docker0
rm -rf /var/lib/docker/network/*
mkdir /var/lib/docker/network/files
systemctl start docker
# delete all containers
docker ps -aq | xargs -r docker rm -f
# recreate all containers
It may sound obvious, but you may want to consider rebooting, especially if there was some major system update recently.
It worked for me, since I hadn't rebooted my VM after installing some kernel updates, which probably left many network modules in an undefined state.