I want to run a cassandra container from a nomad job. It seems to start, but after a few seconds it dies (it seems to be killed by nomad itself).
If I run the container from the command line, with:
docker run --name some-cassandra -p 9042:9042 -d cassandra:3.0
the container starts flawlessly. But if I create a nomad job like this:
job "cassandra" {
datacenters = ["dc1"]
type = "service"
update {
max_parallel = 1
min_healthy_time = "10s"
healthy_deadline = "5m"
progress_deadline = "10m"
auto_revert = false
canary = 0
}
migrate {
max_parallel = 1
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "5m"
}
group "cassandra" {
restart {
attempts = 2
interval = "240s"
delay = "120s"
mode = "delay"
}
task "cassandra" {
driver = "docker"
config {
image = "cassandra:3.0"
network_mode = "bridge"
port_map {
cql = 9042
}
}
resources {
memory = 2048
cpu = 800
network {
port "cql" {}
}
}
env {
CASSANDRA_LISTEN_ADDRESS = "${NOMAD_IP_cql}"
}
service {
name = "cassandra"
tags = ["global", "cassandra"]
port = "cql"
}
}
}
}
Then it never starts. Nomad's web interface shows nothing in the stdout logs of the created allocation, and the stderr stream only shows Killed.
I know that while this is happening, Docker containers are created and then removed after a few seconds. I cannot read the logs of these containers, because when I try (with docker logs <container_id>), all I get is:
Error response from daemon: configured logging driver does not support reading
And the allocation overview shows this message:
12/06/18 14:16:04 Terminated Exit Code: 137, Exit Message: "Docker container exited with non-zero exit code: 137"
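Exit code 137 is 128 + 9, i.e. the container received SIGKILL, which usually points at the kernel OOM killer or a memory limit. Assuming one can catch the container in the brief window before Nomad removes it, something like the following should reveal whether it was OOM-killed (a sketch; <container_id> is a placeholder):
# watch for oom/die events while Nomad keeps restarting the task
docker events --filter event=oom --filter event=die
# or, while the container briefly exists, inspect its state
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container_id>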
According to docker:
If there is no database initialized when the container starts, then a
default database will be created. While this is the expected behavior,
this means that it will not accept incoming connections until such
initialization completes. This may cause issues when using automation
tools, such as docker-compose, which start several containers
simultaneously.
But I doubt this is the source of my problem, because I've increased the restart stanza values with no effect, and because the task fails after just a few seconds.
Not long ago I experienced a somewhat similar problem with a Kafka container that, as it turned out, was not happy because it wanted more memory. But in this case I've provided generous values for memory and CPU in the resources stanza, and it doesn't seem to make any difference.
My host OS is Arch Linux, with kernel 4.19.4-arch1-1-ARCH. I'm running consul as a systemd service, and the nomad agent with this command line:
sudo nomad agent -dev
What can I possibly be missing? Any help and/or pointers will be appreciated.
Update (2018-12-06 16:26 GMT): by reading the nomad agent's output in detail, I found that some valuable information can be read in the host's /tmp directory. A snippet of that output:
2018/12/06 16:03:03 [DEBUG] memberlist: TCP connection from=127.0.0.1:45792
2018/12/06 16:03:03.180586 [DEBUG] driver.docker: docker pull cassandra:latest succeeded
2018-12-06T16:03:03.184Z [DEBUG] plugin: starting plugin: path=/usr/bin/nomad args="[/usr/bin/nomad executor {"LogFile":"/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/executor.out","LogLevel":"DEBUG"}]"
2018-12-06T16:03:03.185Z [DEBUG] plugin: waiting for RPC address: path=/usr/bin/nomad
2018-12-06T16:03:03.235Z [DEBUG] plugin.nomad: plugin address: timestamp=2018-12-06T16:03:03.235Z address=/tmp/plugin681788273 network=unix
2018/12/06 16:03:03.253166 [DEBUG] driver.docker: Setting default logging options to syslog and unix:///tmp/plugin559865372
2018/12/06 16:03:03.253196 [DEBUG] driver.docker: Using config for logging: {Type:syslog ConfigRaw:[] Config:map[syslog-address:unix:///tmp/plugin559865372]}
2018/12/06 16:03:03.253206 [DEBUG] driver.docker: using 2147483648 bytes memory for cassandra
2018/12/06 16:03:03.253217 [DEBUG] driver.docker: using 800 cpu shares for cassandra
2018/12/06 16:03:03.253237 [DEBUG] driver.docker: binding directories []string{"/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/alloc:/alloc", "/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/local:/local", "/tmp/NomadClient073551030/1c315bf2-688c-2c7b-8d6f-f71fec1254f3/cassandra/secrets:/secrets"} for cassandra
2018/12/06 16:03:03.253282 [DEBUG] driver.docker: allocated port 127.0.0.1:29073 -> 9042 (mapped)
2018/12/06 16:03:03.253296 [DEBUG] driver.docker: exposed port 9042
2018/12/06 16:03:03.253320 [DEBUG] driver.docker: setting container name to: cassandra-1c315bf2-688c-2c7b-8d6f-f71fec1254f3
2018/12/06 16:03:03.361162 [INFO] driver.docker: created container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199
2018/12/06 16:03:03.754476 [INFO] driver.docker: started container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199
2018/12/06 16:03:03.757642 [DEBUG] consul.sync: registered 1 services, 0 checks; deregistered 0 services, 0 checks
2018/12/06 16:03:03.765001 [DEBUG] client: error fetching stats of task cassandra: stats collection hasn't started yet
2018/12/06 16:03:03.894514 [DEBUG] client: updated allocations at index 371 (total 2) (pulled 0) (filtered 2)
2018/12/06 16:03:03.894584 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2018/12/06 16:03:05.190647 [DEBUG] driver.docker: error collecting stats from container 29b0764bd2de69bda6450ebb1a55ffd2cbb4dc3002f961cb5db71b323d611199: io: read/write on closed pipe
2018-12-06T16:03:09.191Z [DEBUG] plugin.nomad: 2018/12/06 16:03:09 [ERR] plugin: plugin server: accept unix /tmp/plugin681788273: use of closed network connection
2018-12-06T16:03:09.194Z [DEBUG] plugin: plugin process exited: path=/usr/bin/nomad
2018/12/06 16:03:09.223734 [INFO] client: task "cassandra" for alloc "1c315bf2-688c-2c7b-8d6f-f71fec1254f3" failed: Wait returned exit code 137, signal 0, and error Docker container exited with non-zero exit code: 137
2018/12/06 16:03:09.223802 [INFO] client: Restarting task "cassandra" for alloc "1c315bf2-688c-2c7b-8d6f-f71fec1254f3" in 2m7.683274502s
2018/12/06 16:03:09.230053 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 1 services, 0 checks
2018/12/06 16:03:09.233507 [DEBUG] consul.sync: registered 0 services, 0 checks; deregistered 0 services, 0 checks
2018/12/06 16:03:09.296185 [DEBUG] client: updated allocations at index 372 (total 2) (pulled 0) (filtered 2)
2018/12/06 16:03:09.296313 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 2)
2018/12/06 16:03:11.541901 [DEBUG] http: Request GET /v1/agent/health?type=client (452.678µs)
But the contents of /tmp/NomadClient.../<alloc_id>/... are deceptively simple:
[root@singularity 1c315bf2-688c-2c7b-8d6f-f71fec1254f3]# ls -lR
.:
total 0
drwxrwxrwx 5 nobody nobody 100 Dec 6 15:52 alloc
drwxrwxrwx 5 nobody nobody 120 Dec 6 15:53 cassandra
./alloc:
total 0
drwxrwxrwx 2 nobody nobody 40 Dec 6 15:52 data
drwxrwxrwx 2 nobody nobody 80 Dec 6 15:53 logs
drwxrwxrwx 2 nobody nobody 40 Dec 6 15:52 tmp
./alloc/data:
total 0
./alloc/logs:
total 0
-rw-r--r-- 1 root root 0 Dec 6 15:53 cassandra.stderr.0
-rw-r--r-- 1 root root 0 Dec 6 15:53 cassandra.stdout.0
./alloc/tmp:
total 0
./cassandra:
total 4
-rw-r--r-- 1 root root 1248 Dec 6 16:19 executor.out
drwxrwxrwx 2 nobody nobody 40 Dec 6 15:52 local
drwxrwxrwx 2 nobody nobody 60 Dec 6 15:52 secrets
drwxrwxrwt 2 nobody nobody 40 Dec 6 15:52 tmp
./cassandra/local:
total 0
./cassandra/secrets:
total 0
./cassandra/tmp:
total 0
Both cassandra.stdout.0 and cassandra.stderr.0 are empty, and the full contents of the executor.out file are:
2018/12/06 15:53:22.822072 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin278120866
2018/12/06 15:55:53.009611 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin242312234
2018/12/06 15:58:29.135309 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin226242288
2018/12/06 16:00:53.942271 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin373025133
2018/12/06 16:03:03.252389 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin559865372
2018/12/06 16:05:19.656317 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin090082811
2018/12/06 16:07:28.468809 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin383954837
2018/12/06 16:09:54.068604 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin412544225
2018/12/06 16:12:10.085157 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin279043152
2018/12/06 16:14:48.255653 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin209533710
2018/12/06 16:17:23.735550 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin168184243
2018/12/06 16:19:40.232181 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin839254781
2018/12/06 16:22:13.485457 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin406142133
2018/12/06 16:24:24.869274 [DEBUG] syslog-server: launching syslog server on addr: /tmp/plugin964077792
Update (2018-12-06 16:40 GMT): since it's apparent that logging to syslog is desirable for the agent, I've set up and launched a local syslog server, to no avail. The syslog server receives no messages whatsoever.
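Another option I considered (a sketch only; I haven't verified this config) would be to override the docker driver's logging type to json-file in the task config, which should make docker logs readable again, assuming Nomad's driver passes it straight through to Docker:
config {
  image = "cassandra:3.0"
  # use Docker's json-file driver instead of the default syslog
  logging {
    type = "json-file"
  }
}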
Problem solved. Its nature is twofold:
Nomad's docker driver is (very efficiently) encapsulating the
behaviour of the containers, making them at times very silent.
Cassandra is very demanding of resources, much more than I originally thought. I was convinced that 4 GB of RAM would be enough for it to run comfortably, but as it turns out it needs (at least in my environment) 6 GB, as sketched in the resources stanza below.
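For reference, the fix boiled down to raising the memory in the task's resources stanza; a sketch of what I ended up with (exact values will depend on your environment):
resources {
  memory = 6144  # up from 2048; roughly 6 GB was needed here
  cpu    = 800
  network {
    port "cql" {}
  }
}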
Disclaimer: I'm actually using bitnami/cassandra now instead of cassandra, because I believe their images are of very high quality, secure, and configurable by means of environment variables. I made this discovery using bitnami's image, and I haven't tested how the original one reacts to having this amount of memory.
As to why it doesn't fail when running the container directly from Docker's CLI, I think that's because no limits are specified when running it that way. Docker simply takes as much memory as it needs for its containers, so if the host's memory eventually becomes insufficient for all containers, the realisation comes much later (and possibly more painfully). So this early failure should be a welcome benefit of an orchestration platform such as Nomad. If I have any complaint, it's that finding the problem took so long because of the container's lack of visibility!
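As a sanity check of that theory (a sketch, not something I ran during the original troubleshooting), one can impose the same limit on a plain docker run and watch the container die the same way:
# reproduce the constrained environment outside Nomad: cap memory at 2 GB
docker run --name some-cassandra -m 2048m -p 9042:9042 -d cassandra:3.0
# once it exits, check whether the kernel OOM-killed it
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' some-cassandra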
Related
I'm setting up the development environment following the instructions on Hyperledger fabric's official website:
https://hyperledger-fabric.readthedocs.io/en/latest/peer-chaincode-devmode.html
I have started the orderer successfully using:
ORDERER_GENERAL_GENESISPROFILE=SampleDevModeSolo orderer
This command didn't work at first, but it worked after I ran cd fabric/sampleconfig.
2020-12-21 11:23:15.084 CST [orderer.common.server] Main -> INFO 009 Starting orderer: Version: 2.3.0 Commit SHA: dc2e59b3c Go version: go1.15.6 OS/Arch: darwin/amd64
2020-12-21 11:23:15.084 CST [orderer.common.server] Main -> INFO 00a Beginning to serve requests
but when I start the peer using:
export PATH=$(pwd)/build/bin:$PATH
export FABRIC_CFG_PATH=$(pwd)/sampleconfig
export FABRIC_LOGGING_SPEC=chaincode=debug
export CORE_PEER_CHAINCODELISTENADDRESS=0.0.0.0:7052
peer node start --peer-chaincodedev=true
An error is spotted:
2020-12-21 11:25:13.047 CST [nodeCmd] serve -> INFO 001 Starting peer: Version: 2.3.0 Commit SHA: dc2e59b3c Go version: go1.15.6 OS/Arch: darwin/amd64 Chaincode: Base Docker Label: org.hyperledger.fabric Docker Namespace: hyperledger
2020-12-21 11:25:13.048 CST [peer] getLocalAddress -> INFO 002 Auto-detected peer address: 10.200.83.208:7051
2020-12-21 11:25:13.048 CST [peer] getLocalAddress -> INFO 003 Host is 0.0.0.0 , falling back to auto-detected address: 10.200.83.208:7051 Error: failed to initialize operations subsystem: listen tcp 127.0.0.1:9443: bind: address already in use
this is the error:
Error: failed to initialize operations subsystem: listen tcp 127.0.0.1:9443: bind: address already in use
I checked this issue and it seems this happens because the peer node is using the same port 9443 as the orderer node for the same service. How can I get the two nodes running separately? Docker itself seems to be running fine as well.
If you look at your error, it is easy to follow:
Error: failed to initialize operations subsystem: listen tcp 127.0.0.1:9443: bind: address already in use
It says that port 9443 is already in use.
It seems that you are not running the orderer and the peer as separate containers on the Docker-based virtual network, but directly on the host PC. That means two servers end up requesting the same port 9443 on your machine.
Referring to the configuration of fabric-2.3/sampleconfig below, you can see that port 9443 is assigned to both servers. Assigning one of them a different port solves this.
fabric-2.3/sampleconfig/orderer.yaml
configuration of orderer
# orderer.yaml
...
Admin:
    # host and port for the admin server
    ListenAddress: 127.0.0.1:9443
...
fabric-2.3/sampleconfig/core.yaml
configuration of peer
# core.yaml
...
operations:
    # host and port for the operations server
    # listenAddress: 127.0.0.1:9443
    listenAddress: 127.0.0.1:10443
...
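To double-check which process is actually holding the port before and after the change, something like this works on both macOS and Linux (a sketch):
# show the process listening on TCP 9443
lsof -nP -iTCP:9443 -sTCP:LISTEN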
This is not a direct answer to the port mapping / collision issue, but we've had great success using the new Kubernetes Test Network as a development platform running on a local system with a virtual Kubernetes cluster running in KIND (Kubernetes in Docker).
In this mode, applications can be developed using the Gateway client (exposed via a port forward or ingress), and smart contracts running as a service can be launched either in the cluster or on the local host OS as a container, a binary, or under a debugger.
The documentation for the development setup is still sparse, but we'd love to hear feedback on the overall approach, as it offers an exponentially better experience for working with a test network in a development context. In general the process of "port juggling" with Compose is no longer relevant when working on a local Kubernetes cluster. In this mode, you can run services on the host network, instructing peers/orderers/etc. to connect to the remote process running on the host OS.
Please do as I do on your VPS, and the issue may be reproduced; replace the variable $vps_ip with your real VPS IP in the steps below.
wget https://saimei.ftp.acc.umu.se/debian-cd/current/amd64/iso-cd/debian-10.4.0-amd64-netinst.iso
transmission-create -o debian.torrent debian-10.4.0-amd64-netinst.iso
This creates a trackerless torrent. Show info on it:
transmission-show debian.torrent
Name: debian-10.4.0-amd64-netinst.iso
File: debian.torrent
GENERAL
Name: debian-10.4.0-amd64-netinst.iso
Hash: a7fbe3ac2451fc6f29562ff034fe099c998d945e
Created by: Transmission/2.92 (14714)
Created on: Mon Jun 8 00:04:33 2020
Piece Count: 2688
Piece Size: 128.0 KiB
Total Size: 352.3 MB
Privacy: Public torrent
TRACKERS
FILES
debian-10.4.0-amd64-netinst.iso (352.3 MB)
On your VPS, open the port that transmission is listening on.
firewall-cmd --zone=public --add-port=51413/tcp --permanent
firewall-cmd --reload
Check it from your local pc.
sudo nmap $vps_ip -p51413
Host is up (0.24s latency).
PORT STATE SERVICE
51413/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 1.74 seconds
Add the torrent and seed it, using transmission's default username and password on your VPS (or your own, if you have already changed it):
transmission-remote -n "transmission:transmission" --add debian.torrent
localhost:9091/transmission/rpc/ responded: "success"
transmission-remote -n "transmission:transmission" --list
ID Done Have ETA Up Down Ratio Status Name
1 0% None Unknown 0.0 0.0 None Idle debian-10.4.0-amd64-netinst.iso
Sum: None 0.0 0.0
transmission-remote -n "transmission:transmission" -t 1 --start
localhost:9091/transmission/rpc/ responded: "success"
Get debian.torrent from your VPS onto your local PC.
scp root@$vps_ip:/root/debian.torrent /tmp
Now try to download it on your local PC.
aria2c --enable-dht=true /tmp/debian.torrent
06/08 09:28:04 [NOTICE] Downloading 1 item(s)
06/08 09:28:04 [NOTICE] IPv4 DHT: listening on UDP port 6921
06/08 09:28:04 [NOTICE] IPv4 BitTorrent: listening on TCP port 6956
06/08 09:28:04 [NOTICE] IPv6 BitTorrent: listening on TCP port 6956
*** Download Progress Summary as of Mon Jun 8 09:29:04 2020 ***
===============================================================================
[#a34431 0B/336MiB(0%) CN:0 SD:0 DL:0B]
FILE: /tmp/debian-10.4.0-amd64-netinst.iso
-------------------------------------------------------------------------------
I waited about an hour; the download progress stayed at 0%.
If you're using DHT, you have to open a UDP port in your firewall and then, depending on what you're doing, you can specify that port to aria2c. From the docs:
DHT uses UDP. Since aria2 doesn't configure firewalls or routers for port forwarding, it's up to you to do it manually.
$ aria2c --enable-dht --dht-listen-port=6881 file.torrent
See this page for some more examples of using DHT with aria2c.
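For the setup in this question, that likely means two things (my reading of the docs, not verified end-to-end): the VPS firewall should also allow UDP on transmission's peer port, and aria2c should be told which UDP port to use for DHT. Roughly:
# on the VPS (seeder): DHT and uTP use UDP on the same peer port
firewall-cmd --zone=public --add-port=51413/udp --permanent
firewall-cmd --reload
# on the local PC: pin the DHT listen port (and open it there too if that machine is firewalled)
aria2c --enable-dht=true --dht-listen-port=6881 /tmp/debian.torrent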
Most Docker commands never finish. I have to interrupt them manually with CTRL+C. Even simple commands like docker ps or docker info do not respond.
However, docker help and docker version still work.
I think there is something like a deadlock with a particular container, so commands related to containers won't complete.
How can I handle such a situation?
My Docker version is 1.12.3. I don't use Swarm mode. The docker logs command doesn't work either. Using dmesg I can see a lot of I/O errors, but I don't know if they are related to my problem:
[12898.121287] loop: Write error at byte offset 8882749440, length 4096.
[12898.122837] loop: Write error at byte offset 8883666944, length 4096.
[12898.124685] loop: Write error at byte offset 8882814976, length 4096.
[12898.126459] loop: Write error at byte offset 8883404800, length 4096.
[12898.128201] loop: Write error at byte offset 8883470336, length 4096.
[12898.129921] loop: Write error at byte offset 8883535872, length 4096.
[12898.131774] loop: Write error at byte offset 8883601408, length 4096.
[12898.133594] loop: Write error at byte offset 8883732480, length 4096.
[12917.269786] loop: Write error at byte offset 8883798016, length 4096.
[12917.270331] quiet_error: 632 callbacks suppressed
[12917.270334] Buffer I/O error on device dm-6, logical block 1313320
[12917.270540] lost page write due to I/O error on dm-6
[12917.270543] Buffer I/O error on device dm-6, logical block 1313321
[12917.270740] lost page write due to I/O error on dm-6
[12917.270742] Buffer I/O error on device dm-6, logical block 1313322
[12917.270957] lost page write due to I/O error on dm-6
[12917.270959] Buffer I/O error on device dm-6, logical block 1313323
[12917.271177] lost page write due to I/O error on dm-6
[12917.271179] Buffer I/O error on device dm-6, logical block 1313324
[12917.271377] lost page write due to I/O error on dm-6
[12917.271379] Buffer I/O error on device dm-6, logical block 1313325
[12917.271573] lost page write due to I/O error on dm-6
[12917.301759] loop: Write error at byte offset 8883863552, length 4096.
[12917.312038] loop: Write error at byte offset 8883929088, length 4096.
[12917.312396] Buffer I/O error on device dm-6, logical block 1313328
[12917.312635] lost page write due to I/O error on dm-6
[12917.312638] Buffer I/O error on device dm-6, logical block 1313329
[12917.312867] lost page write due to I/O error on dm-6
[12917.312869] Buffer I/O error on device dm-6, logical block 1313330
[12917.313121] lost page write due to I/O error on dm-6
[12917.313123] Buffer I/O error on device dm-6, logical block 1313331
[12917.313346] lost page write due to I/O error on dm-6
[13090.853726] INFO: task kworker/u8:0:17212 blocked for more than 120 seconds.
[13090.854055] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Using the command sudo systemctl status -l docker, the following messages are printed, but I cannot tell if they are related:
dockerd[1344]: time="2016-11-24T17:49:01.184874648+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T17:49:02.627116016+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T17:49:02.627152661+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:19:51.472701647+01:00" level=warning msg="libcontainerd: container c9f35af1836bf856001ca6156663f713c1217a697e8d2451927c67797fb5a770 restart canceled"
dockerd[1344]: time="2016-11-24T18:19:56.712126199+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:19:56.712159759+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
dockerd[1344]: time="2016-11-24T18:34:24.301786606+01:00" level=info msg="No non-localhost DNS nameservers are left in resolv.conf. Using default external servers : [nameserver 8.8.8.8 nameserver 8.8.4.4]"
dockerd[1344]: time="2016-11-24T18:34:24.302208751+01:00" level=info msg="IPv6 enabled; Adding default IPv6 external servers : [nameserver 2001:4860:4860::8888 nameserver 2001:4860:4860::8844]"
This bug where Docker commands hang happened after I deleted a container.
The daemon dockerd was in an abnormal state: it couldn't be started (sudo service docker start) after having been stopped (service docker stop).
# sudo service docker start
Redirecting to /bin/systemctl start docker.service
Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
# journalctl -xe
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-d6f74dd67f106d6bfa483df4ee534dd9545dc8ca
...
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
systemd[1]: Unit docker.service entered failed state.
systemd[1]: docker.service failed.
polkitd[896]: Unregistered Authentication Agent for unix-process:22551:34177094 (system bus name :1.290, object path /org
kernel: dev_remove: 41 callbacks suppressed
kernel: device-mapper: ioctl: unable to remove open device docker-253:0-19468577-fc63401af903e22d05a4518e02504527f0d7883f9d997d7d97fdfe72ba789863
...
dockerd[22566]: time="2016-11-28T10:18:09.840268573+01:00" level=fatal msg="Error starting daemon: timeout"
systemd[1]: docker.service: main process exited, code=exited, status=1/FAILURE
systemd[1]: Failed to start Docker Application Container Engine.
Moreover, many zombie Docker processes could be observed using ps -eax | grep docker (presence of a "Z" in the "STAT" column), for example docker-proxies.
After rebooting the server and restarting Docker, the zombie processes disappeared and Docker commands were working again.
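For reference, a quick way to spot such zombie processes (a generic one-liner, not from the original session) is:
# list processes stuck in the Z (zombie) state
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'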
I had a similar issue as well. Rebooting the server did not work for me. I got this issue because I had just installed a new container with some kind of errors. After that, most Docker commands did not respond. I fixed it by executing the following command:
docker system prune -a
This removes all unused containers; in my case, that included the container I had just added. More information:
https://docs.docker.com/engine/reference/commandline/system_prune/
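If pruning everything is too aggressive, a narrower variant (assuming a Docker version new enough to have the prune subcommands, i.e. 1.13+) is to remove only stopped containers, or just the one that misbehaves:
# remove all stopped containers only
docker container prune
# or force-remove just the suspect container (placeholder id)
docker rm -f <container_id>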
I had the same problem (commands not responding) and I fixed it by increasing the resources allocated to Docker.
Docker Desktop -> Preferences -> Advanced
In my case, I increased:
Memory from 2GB to 8GB
Swap from 1GB to 2GB
Try different values according to your machine.
From the symptoms you present, it seems like something I struggled with as well.
I did the following, hope it helps!
After checking that the service was not responding, using:
systemctl status docker.service
I used the following command to put it to work:
sudo dockerd --debug
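If you go that route, it is worth stopping the systemd-managed daemon first so the debug instance does not fight it over the socket (a sketch):
sudo systemctl stop docker
sudo dockerd --debug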
Restarting my PC worked for me
I just followed this tutorial step by step for setting up a docker swarm in EC2 -- https://docs.docker.com/swarm/install-manual/
I created 4 Amazon Servers using the Amazon Linux AMI.
manager + consul
manager
node1
node2
I followed the instructions to start the swarm, and everything seemed to go OK as far as creating the Docker instances.
Server 1
Running docker ps shows the containers running (output not reproduced here). The Consul logs show this:
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d 172.17.0.2
2016/07/05 20:18:47 [INFO] serf: EventMemberJoin: 729a440e5d0d.dc1 172.17.0.2
2016/07/05 20:18:48 [INFO] raft: Node at 172.17.0.2:8300 [Follower] entering Follower state
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [INFO] consul: adding server 729a440e5d0d.dc1 (Addr: 172.17.0.2:8300) (DC: dc1)
2016/07/05 20:18:48 [ERR] agent: failed to sync remote state: No cluster leader
2016/07/05 20:18:49 [WARN] raft: Heartbeat timeout reached, starting election
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Candidate] entering Candidate state
2016/07/05 20:18:49 [INFO] raft: Election won. Tally: 1
2016/07/05 20:18:49 [INFO] raft: Node at 172.17.0.2:8300 [Leader] entering Leader state
2016/07/05 20:18:49 [INFO] consul: cluster leadership acquired
2016/07/05 20:18:49 [INFO] consul: New leader elected: 729a440e5d0d
2016/07/05 20:18:49 [INFO] raft: Disabling EnableSingleNode (bootstrap)
2016/07/05 20:18:49 [INFO] consul: member '729a440e5d0d' joined, marking health alive
2016/07/05 20:18:50 [INFO] agent: Synced service 'consul'
I registered each node using the following command, with the appropriate IPs:
docker run -d swarm join --advertise=x-x-x-x:2375 consul://x-x-x-x:8500
Each of those created a docker instance
Node1
Running docker ps shows a container running (output not reproduced here), with logs that suggest there's a problem:
time="2016-07-05T21:33:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:36:20Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:37:20Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
time="2016-07-05T21:39:50Z" level=error msg="cannot set or renew session for ttl, unable to operate on sessions"
time="2016-07-05T21:40:50Z" level=info msg="Registering on the discovery service every 1m0s..." addr="172.31.17.35:2375" discovery="consul://172.31.3.233:8500"
...
And lastly, when I get to the last step of trying to get host information on my Consul machine, like so:
docker -H :4000 info
I see no nodes. And when I try the step of running an app, I get the obvious error:
[ec2-user@ip-172-31-3-233 ~]$ docker -H :4000 run hello-world
docker: Error response from daemon: No healthy node available in the cluster.
See 'docker run --help'.
[ec2-user@ip-172-31-3-233 ~]$
Thanks for any insight on this. I'm still pretty confused by much of the swarm model and not sure where to go from here to diagnose.
It looks like Consul is either not binding to a public IP address, or is not accessible on the public IP due to security group or VPC settings. You are setting the discovery URL to consul://172.31.3.233:8500 on the Docker nodes, so I would suggest trying to connect to that address from an external IP, either in your browser or via curl like this:
% curl http://172.31.3.233:8500/ui/dist/
HTML
If you cannot connect (connection refused or timeout) then add a TCP port 8500 ingress rule to your AWS VMs, and try again.
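If you prefer the CLI over the console, adding that rule looks roughly like this (a sketch; the security group id is a placeholder, and you would normally restrict the CIDR to your own nodes rather than 0.0.0.0/0):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8500 --cidr 0.0.0.0/0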
After investigating your issue, I see that you forgot to open port 2375 for Docker Engine on all four nodes.
Before starting the Swarm manager or a Swarm node, you have to open a TCP port for Docker Engine, so Swarm can talk to Docker Engine via that port.
With Docker on Ubuntu 14.04, you can open the port by changing the file /etc/default/docker and adding -H tcp://0.0.0.0:2375 to DOCKER_OPTS. For example:
DOCKER_OPTS="-H tcp://0.0.0.0:2375 -H unix:///var/run/docker.sock"
After that, restart Docker Engine:
service docker restart
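To verify the engine is actually listening on that port after the restart, a quick check from another machine (a sketch; <node-ip> is a placeholder) is:
docker -H tcp://<node-ip>:2375 info
# or hit the remote API directly
curl http://<node-ip>:2375/version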
If you are using CentOS, the solution is the same; you can read my blog article: https://sonnguyen.ws/install-docker-docker-swarm-centos7/
One other thing: I think you should install and run Consul on all nodes (4 servers), so your Swarm can work with Consul on its own node.
I feel so dumb admitting this, but I am struggling with the uWSGI tutorial for Django here.
My problem is after making a test.py file as described in the tutorial, and running the command:
uwsgi --http :8000 --wsgi-file test.py
I go to port :8000 on the IP address for my VPS and the connection times out. I have been playing around with nginx and have been able to get the "Welcome to nginx" screen to successfully show itself. The output on my terminal after starting uWSGI with the above command is:
--wsgi-file test.py
*** Starting uWSGI 1.9.17.1 (64bit) on [Thu Oct 10 20:58:40 2013] ***
compiled with version: 4.6.3 on 10 October 2013 20:17:02
os: Linux-3.9.3-x86_64-linode33 #1 SMP Mon May 20 10:22:57 EDT 2013
nodename: Name
machine: x86_64
clock source: unix
detected number of CPU cores: 8
current working directory: /usr/local/uwsgi-tutorail/mytest
detected binary path: /usr/local/bin/uwsgi
!!! no internal routing support, rebuild with pcre support !!!
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
*** WARNING: you are running uWSGI without its master process manager ***
your processes number limit is 7883
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uWSGI http bound on :8000 fd 4
spawned uWSGI http 1 (pid: 18638)
uwsgi socket 0 bound to TCP address 127.0.0.1:52306 (port auto-assigned) fd 3
Python version: 2.7.3 (default, Sep 26 2013, 20:13:52) [GCC 4.6.3]
*** Python threads support is disabled. You can enable it with --enable-threads ***
Python main interpreter initialized at 0x26599f0
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 72792 bytes (71 KB) for 1 cores
*** Operational MODE: single process ***
WSGI app 0 (mountpoint='') ready in 0 seconds on interpreter 0x26599f0 pid: 18637 (default app)
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI worker 1 (and the only) (pid: 18637, cores: 1)
I am a complete newb at uWSGI; any help would be greatly appreciated.
Not an elegant solution but I was able to "fix" the problem by rebuilding my VPS and starting from scratch.
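Before rebuilding, a less drastic first check (a sketch, run on the VPS itself) would have been to confirm whether uWSGI answers locally, which separates a uWSGI problem from a firewall problem:
# does uWSGI answer on the VPS itself?
curl -I http://127.0.0.1:8000/
# is anything listening on :8000, and on which address?
ss -tlnp | grep :8000
# is a firewall rule dropping external traffic to port 8000?
sudo iptables -L -n | grep 8000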