RabbitMQ fails to boot from docker-compose - docker

I'm trying to set up a RabbitMQ instance with docker-compose.
My docker-compose YAML:
version: '3.8'
services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: rabbit
    container_name: 'rabbitmq'
    volumes:
      - ./etc/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
      - ./data:/var/lib/rabbitmq/mnesia/rabbit@rabbit
      - ./logs:/var/log/rabbitmq/log
      - ./etc/ssl/CERT_LAB_CA.pem:/etc/rabbitmq/ssl/cacert.pem
      - ./etc/ssl/CERT_LAB_RABBITMQ.pem:/etc/rabbitmq/ssl/cert.pem
      - ./etc/ssl/KEY_LAB_RABBITMQ.pem:/etc/rabbitmq/ssl/key.pem
    ports:
      - 5672:5672
      - 15672:15672
      - 15671:15671
      - 5671:5671
    environment:
      - RABBITMQ_DEFAULT_USER=secret
      - RABBITMQ_DEFAULT_PASS=secret
When I run docker compose up for the first time, everything works fine. But when I add queues and exchanges (loaded from definitions.json), shut down and remove the container, and try to docker compose up again, I get this error:
2022-09-29 13:32:09.522956+00:00 [notice] <0.44.0> Application mnesia exited with reason: stopped
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0>
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0> BOOT FAILED
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0> ===========
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0> Error during startup: {error,
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0> {schema_integrity_check_failed,
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0> [{table_missing,rabbit_listener}]}}
2022-09-29 13:32:09.523096+00:00 [error] <0.229.0>
BOOT FAILED
===========
Error during startup: {error,
{schema_integrity_check_failed,
[{table_missing,rabbit_listener}]}}
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> crasher:
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> initial call: application_master:init/4
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> pid: <0.228.0>
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> registered_name: []
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> exception exit: {{schema_integrity_check_failed,
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> [{table_missing,rabbit_listener}]},
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> {rabbit,start,[normal,[]]}}
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> in function application_master:init/4 (application_master.erl, line 142)
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> ancestors: [<0.227.0>]
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> message_queue_len: 1
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> messages: [{'EXIT',<0.229.0>,normal}]
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> links: [<0.227.0>,<0.44.0>]
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> dictionary: []
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> trap_exit: true
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> status: running
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> heap_size: 2586
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> stack_size: 28
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> reductions: 180
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0> neighbours:
2022-09-29 13:32:10.524073+00:00 [error] <0.228.0>
And here is my rabbitmq.conf file
listeners.tcp.default = 5672
listeners.ssl.default = 5671
ssl_options.cacertfile = /etc/rabbitmq/ssl/cacert.pem
ssl_options.certfile = /etc/rabbitmq/ssl/cert.pem
ssl_options.keyfile = /etc/rabbitmq/ssl/key.pem
#Generate client cert and uncomment this if client has to provide cert.
#ssl_options.verify = verify_peer
#ssl_options.fail_if_no_peer_cert = true
collect_statistics_interval = 10000
#load_definitions = /path/to/exported/definitions.json
#definitions.skip_if_unchanged = true
management.tcp.port = 15672
management.ssl.port = 15671
management.ssl.cacertfile = /etc/rabbitmq/ssl/cacert.pem
management.ssl.certfile = /etc/rabbitmq/ssl/cert.pem
management.ssl.keyfile = /etc/rabbitmq/ssl/key.pem
management.http_log_dir = /var/log/rabbitmq/http
What am I missing?

Try to substitute ./data:/var/lib/rabbitmq/mnesia/rabbit@rabbit in your config with ./data:/var/lib/rabbitmq.
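In compose terms that means mounting the whole data directory rather than only the node's mnesia subdirectory; a minimal sketch of the adjusted service, reusing the settings from the question:
services:
  rabbitmq:
    image: rabbitmq:3-management
    hostname: rabbit
    volumes:
      - ./etc/rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf
      - ./data:/var/lib/rabbitmq        # whole data dir, not ./data:/var/lib/rabbitmq/mnesia/rabbit@rabbit
      - ./logs:/var/log/rabbitmq/log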
I had the same error and spent quite some time trying to figure out the problem. My configuration was slightly different from yours and looked like this:
rabbitmq:
  image: rabbitmq:3.11.2-management-alpine
  hostname: rabbitmq
  environment:
    RABBITMQ_DEFAULT_USER: tester
    RABBITMQ_DEFAULT_PASS: qwerty
    RABBITMQ_MNESIA_DIR: /my-custom-data-folder-path-inside-container
    RABBITMQ_NODENAME: rabbitmq
  volumes:
    - type: bind
      source: /my-custom-data-folder-path-on-host
      target: /my-custom-data-folder-path-inside-container
I'm not an expert in RabbitMQ, and my idea was just to make RabbitMQ persist its database in the /my-custom-data-folder-path-on-host folder on the host. Just like in your case, it started successfully on the first run, but after a container restart I was getting the following error:
BOOT FAILED
Error during startup: {error, {schema_integrity_check_failed, [{table_missing,rabbit_listener}]}}
What I learned from the documentation is that rabbit_listener is a table inside the Mnesia database used by RabbitMQ, and that "listeners" are the TCP listeners configured in RabbitMQ to accept client connections.
For RabbitMQ to accept client connections, it needs to bind to one or more interfaces and listen on (protocol-specific) ports. One such interface/port pair is called a listener in RabbitMQ parlance. Listeners are configured using the listeners.tcp.* configuration option(s).
I wanted to dig into the Mnesia database to troubleshoot, but didn't manage to do that without Erlang knowledge. It seems that for some reason RabbitMQ does not create the rabbit_listener table on the first run, but requires it on subsequent runs.
Finally, I managed to workaround the problem by changing my initial configuration as follows:
service-bus:
  image: rabbitmq:3.11.2-management-alpine
  hostname: rabbitmq
  environment:
    RABBITMQ_DEFAULT_USER: tester
    RABBITMQ_DEFAULT_PASS: qwerty
    RABBITMQ_NODENAME: rabbitmq
  volumes:
    - type: bind
      source: /my-custom-data-folder-path-on-host
      target: /var/lib/rabbitmq
Instead of overriding just the RABBITMQ_MNESIA_DIR folder, I've overridden the entire /var/lib/rabbitmq. This did the trick, and now my RabbitMQ survives restarts.

I hit this problem and changed my docker-compose.yml file to use rabbitmq:3.9-management rather than rabbitmq:3-management.
The problem happened for me when I restarted the stack and the rabbitmq image moved to 3.11.
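If the goal is to avoid surprise upgrades when the stack is recreated, the image tag can be pinned in the compose file; a sketch (pick whichever minor version you have tested):
services:
  rabbitmq:
    image: rabbitmq:3.9-management   # pinned, instead of the floating rabbitmq:3-management tag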

Related

One node of a RabbitMQ cluster will occasionally consume large memory (until OOM)

Environment:
OpenStack Train, deployed by kolla-ansible
RabbitMQ 3.7.10 on Erlang 20.2.2
three control nodes (also running other components)
Problem:
node-34 rabbitmq consumed a large amount of memory (30G) from 04-20 16:31 to 04-20 16:46 (I restarted the rabbitmq process manually; otherwise it keeps consuming memory until it triggers the OOM killer, even though vm_memory_high_watermark is set to 0.1 [on another cluster with the same environment])
node-33 rabbitmq consumed 15G of virtual memory but only a little physical memory from 04-20 16:26 to 04-20 16:28
restarting the node-34 rabbitmq process was enough to fix the problem
Question:
What is the root cause of this issue?
How can I fix it for good, instead of restarting whenever the issue happens?
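For reference, the watermark mentioned above is a rabbitmq.conf setting; a sketch of the relative form (0.1 means roughly 10% of detected RAM):
vm_memory_high_watermark.relative = 0.1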
Component logs:
node-33 rabbitmq
2022-04-20 16:20:00.731 [info] <0.30576.499> connection <0.30576.499> (1.1.1.45:33314 -> 1.1.1.33:5672 - nova-compute:7:1dab7694-168e-491c-8aa5-5e5a9f993750): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:25:25.678 [info] <0.14459.449> closing AMQP connection <0.14459.449> (1.1.1.32:53356 -> 1.1.1.33:5672 - nova-compute:7:facf1224-83df-4e48-8189-d78213ee5bc2, vhost: '/', user: 'openstack')
2022-04-20 16:25:25.679 [info] <0.21656.462> closing AMQP connection <0.21656.462> (1.1.1.32:58944 -> 1.1.1.33:5672 - nova-compute:7:9c706aca-9db6-4e61-bebd-568a6f282307, vhost: '/', user: 'openstack')
2022-04-20 16:25:25.679 [error] <0.3679.462> Supervisor {<0.3679.462>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2022-04-20 16:25:25.683 [info] <0.13987.330> closing AMQP connection <0.13987.330> (1.1.1.32:35890 -> 1.1.1.33:5672 - nova-compute:7:5fdd2029-8f50-4a81-b861-06b071fffc98, vhost: '/', user: 'openstack')
2022-04-20 16:25:41.101 [info] <0.1613.508> accepting AMQP connection <0.1613.508> (1.1.1.33:54246 -> 1.1.1.33:5672)
2022-04-20 16:25:41.104 [info] <0.1613.508> Connection <0.1613.508> (1.1.1.33:54246 -> 1.1.1.33:5672) has a client-provided name: nova-conductor:24:71983386-ad60-4608-8186-a4aef8644d9d
2022-04-20 16:25:41.104 [info] <0.1613.508> connection <0.1613.508> (1.1.1.33:54246 -> 1.1.1.33:5672 - nova-conductor:24:71983386-ad60-4608-8186-a4aef8644d9d): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:25:42.000 [warning] <0.32.0> lager_error_logger_h dropped 2 messages in the last second that exceeded the limit of 1000 messages/sec
2022-04-20 16:27:36.137 [info] <0.24964.510> accepting AMQP connection <0.24964.510> (1.1.1.33:38314 -> 1.1.1.33:5672)
2022-04-20 16:27:36.141 [info] <0.24964.510> Connection <0.24964.510> (1.1.1.33:38314 -> 1.1.1.33:5672) has a client-provided name: nova-compute:7:be0a8525-b04c-465d-a938-e90599bd54d3
2022-04-20 16:27:36.142 [info] <0.24964.510> connection <0.24964.510> (1.1.1.33:38314 -> 1.1.1.33:5672 - nova-compute:7:be0a8525-b04c-465d-a938-e90599bd54d3): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:29.549 [error] <0.2946.6153> closing AMQP connection <0.2946.6153> (1.1.1.35:58822 -> 1.1.1.33:5672 - nova-conductor:21:e037e12d-2911-47d1-90ef-d00c3c288380):
missed heartbeats from client, timeout: 60s
2022-04-20 16:34:30.557 [info] <0.414.521> accepting AMQP connection <0.414.521> (1.1.1.35:38810 -> 1.1.1.33:5672)
2022-04-20 16:34:30.558 [info] <0.414.521> Connection <0.414.521> (1.1.1.35:38810 -> 1.1.1.33:5672) has a client-provided name: nova-conductor:21:e037e12d-2911-47d1-90ef-d00c3c288380
2022-04-20 16:34:30.559 [info] <0.414.521> connection <0.414.521> (1.1.1.35:38810 -> 1.1.1.33:5672 - nova-conductor:21:e037e12d-2911-47d1-90ef-d00c3c288380): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:32.117 [error] <0.13587.486> closing AMQP connection <0.13587.486> (1.1.1.31:36248 -> 1.1.1.33:5672 - nova-compute:7:e592f063-ff69-4387-86a5-d552bb43572e):
missed heartbeats from client, timeout: 60s
2022-04-20 16:40:36.440 [error] <0.13109.8083> closing AMQP connection <0.13109.8083> (1.1.1.35:47356 -> 1.1.1.33:5672 - cinder-volume:32:2a7ba690-3b2a-486d-888b-bc4bb19962ee):
missed heartbeats from client, timeout: 60s
2022-04-20 16:40:36.537 [error] <0.31800.280> closing AMQP connection <0.31800.280> (1.1.1.33:60648 -> 1.1.1.33:5672 - cinder-volume:32:c7fddb16-bb8c-4646-8535-980ba5900508):
missed heartbeats from client, timeout: 60s
2022-04-20 16:40:43.139 [error] <0.5296.525> closing AMQP connection <0.5296.525> (1.1.1.33:47884 -> 1.1.1.33:5672 - nova-conductor:24:71983386-ad60-4608-8186-a4aef8644d9d):
missed heartbeats from client, timeout: 60s
========== many more of the above "[info]" / "[error]" "missed heartbeats" lines, until the node-34 rabbitmq process was restarted
2022-04-20 16:46:19.528 [info] <0.16487.538> connection <0.16487.538> (1.1.1.34:51432 -> 1.1.1.33:5672 - nova-scheduler:59:d19d2307-c177-490f-9f36-9709f6f86345): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:46:19.562 [info] <0.20786.0> Mirrored queue 'q-l3-plugin.node-35' in vhost '/': Master <rabbit#node-33.1.1271.0> saw deaths of mirrors <rabbit#node-34.1.1245.0>
2022-04-20 16:46:19.563 [info] <0.23194.0> Mirrored queue 'q-plugin_fanout_3e006483c59744de91c4607550a2ea75' in vhost '/': Master <rabbit#node-33.1.3305.0> saw deaths of mirrors <rabbit#node-34.1.3374.0>
node-34 rabbitmq
2022-04-20 16:20:48.095 [info] <0.20912.3993> Connection <0.20912.3993> (1.1.1.31:55326 -> 1.1.1.34:5672) has a client-provided name: nova-compute:7:a480ff3a-1e36-4797-b1dc-cc1d7eff8d8f
2022-04-20 16:20:49.000 [warning] lager_file_backend dropped 1 messages in the last second that exceeded the limit of 50 messages/sec
2022-04-20 16:25:25.676 [info] <0.22127.3978> closing AMQP connection <0.22127.3978> (1.1.1.32:52030 -> 1.1.1.34:5672 - nova-compute:7:310dcd66-11e7-485e-8099-1e4ab9e1c05d, vhost: '/', user: 'openstack')
2022-04-20 16:27:36.116 [info] <0.19371.3997> accepting AMQP connection <0.19371.3997> (1.1.1.33:58880 -> 1.1.1.34:5672)
2022-04-20 16:27:36.138 [info] <0.19371.3997> Connection <0.19371.3997> (1.1.1.33:58880 -> 1.1.1.34:5672) has a client-provided name: nova-compute:7:012d81c4-a5eb-4b38-8a39-d577cef8c12a
2022-04-20 16:27:36.142 [info] <0.19371.3997> connection <0.19371.3997> (1.1.1.33:58880 -> 1.1.1.34:5672 - nova-compute:7:012d81c4-a5eb-4b38-8a39-d577cef8c12a): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:32.645 [error] <0.25171.3412> closing AMQP connection <0.25171.3412> (1.1.1.35:60358 -> 1.1.1.34:5672 - nova-conductor:23:7f5f17b1-e86e-476a-9047-b57317c02723):
missed heartbeats from client, timeout: 60s
2022-04-20 16:34:33.653 [info] <0.23357.4653> accepting AMQP connection <0.23357.4653> (1.1.1.35:44456 -> 1.1.1.34:5672)
2022-04-20 16:34:33.657 [info] <0.23357.4653> Connection <0.23357.4653> (1.1.1.35:44456 -> 1.1.1.34:5672) has a client-provided name: nova-conductor:23:7f5f17b1-e86e-476a-9047-b57317c02723
2022-04-20 16:34:33.658 [info] <0.23357.4653> connection <0.23357.4653> (1.1.1.35:44456 -> 1.1.1.34:5672 - nova-conductor:23:7f5f17b1-e86e-476a-9047-b57317c02723): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:34.484 [error] <0.3180.3713> closing AMQP connection <0.3180.3713> (1.1.1.33:41126 -> 1.1.1.34:5672 - nova-conductor:22:6de6e8f9-10c8-48c6-8aac-55489aa24d9b):
missed heartbeats from client, timeout: 60s
2022-04-20 16:34:35.492 [info] <0.19068.4664> accepting AMQP connection <0.19068.4664> (1.1.1.33:48246 -> 1.1.1.34:5672)
2022-04-20 16:34:35.493 [info] <0.19068.4664> Connection <0.19068.4664> (1.1.1.33:48246 -> 1.1.1.34:5672) has a client-provided name: nova-conductor:22:6de6e8f9-10c8-48c6-8aac-55489aa24d9b
2022-04-20 16:34:35.494 [info] <0.19068.4664> connection <0.19068.4664> (1.1.1.33:48246 -> 1.1.1.34:5672 - nova-conductor:22:6de6e8f9-10c8-48c6-8aac-55489aa24d9b): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:37.617 [error] <0.19797.3640> closing AMQP connection <0.19797.3640> (1.1.1.34:38380 -> 1.1.1.34:5672 - nova-conductor:24:af6de2d2-5fb4-43b8-aac7-eb363d60315c):
missed heartbeats from client, timeout: 60s
========== many more of the above "[info]" and "[error]" "missed heartbeats" lines, until this (node-34) rabbitmq process was restarted
2022-04-20 16:45:54.306 [info] <0.7671.7632> accepting AMQP connection <0.7671.7632> (1.1.1.31:38548 -> 1.1.1.34:5672)
2022-04-20 16:45:54.307 [info] <0.7671.7632> Connection <0.7671.7632> (1.1.1.31:38548 -> 1.1.1.34:5672) has a client-provided name: nova-compute:7:355e2e03-3d83-4d95-a8bf-0165643a40fd
2022-04-20 16:45:54.307 [info] <0.7671.7632> connection <0.7671.7632> (1.1.1.31:38548 -> 1.1.1.34:5672 - nova-compute:7:355e2e03-3d83-4d95-a8bf-0165643a40fd): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:45:55.359 [error] <0.22496.6560> closing AMQP connection <0.22496.6560> (1.1.1.33:46352 -> 1.1.1.34:5672 - nova-conductor:24:6341a486-5f31-4180-865e-49d9e6fef1fd):
missed heartbeats from client, timeout: 60s
2022-04-20 16:45:56.367 [info] <0.31031.7657> accepting AMQP connection <0.31031.7657> (1.1.1.33:38992 -> 1.1.1.34:5672)
2022-04-20 16:45:56.368 [info] <0.31031.7657> Connection <0.31031.7657> (1.1.1.33:38992 -> 1.1.1.34:5672) has a client-provided name: nova-conductor:24:6341a486-5f31-4180-865e-49d9e6fef1fd
2022-04-20 16:45:56.368 [info] <0.31031.7657> connection <0.31031.7657> (1.1.1.33:38992 -> 1.1.1.34:5672 - nova-conductor:24:6341a486-5f31-4180-865e-49d9e6fef1fd): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:45:58.115 [warning] <0.32550.7440> closing AMQP connection <0.32550.7440> (1.1.1.35:58740 -> 1.1.1.34:5672 - nova-conductor:22:bbe54f83-b7c4-452f-a120-9a220afc6c59, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2022-04-20 16:45:59.123 [info] <0.5691.7687> accepting AMQP connection <0.5691.7687> (1.1.1.35:38818 -> 1.1.1.34:5672)
2022-04-20 16:45:59.124 [info] <0.5691.7687> Connection <0.5691.7687> (1.1.1.35:38818 -> 1.1.1.34:5672) has a client-provided name: nova-conductor:22:bbe54f83-b7c4-452f-a120-9a220afc6c59
2022-04-20 16:45:59.125 [info] <0.5691.7687> connection <0.5691.7687> (1.1.1.35:38818 -> 1.1.1.34:5672 - nova-conductor:22:bbe54f83-b7c4-452f-a120-9a220afc6c59): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:46:05.456 [warning] <0.1973.7449> closing AMQP connection <0.1973.7449> (1.1.1.36:52852 -> 1.1.1.34:5672 - nova-compute:7:37b137ea-7671-4ea9-ae86-f243e9a13606, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2022-04-20 16:46:06.643 [warning] <0.2825.7450> closing AMQP connection <0.2825.7450> (1.1.1.33:33106 -> 1.1.1.34:5672 - nova-conductor:21:7bd7d6dc-7367-4690-aed8-81c3989f5c74, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2022-04-20 16:46:08.968 [warning] <0.16541.7452> closing AMQP connection <0.16541.7452> (1.1.1.35:59814 -> 1.1.1.34:5672 - nova-conductor:25:2e3dd6d2-13a8-4661-96ad-d4fa6bbd2e72, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2022-04-20 16:46:13.000 [warning] lager_file_backend dropped 13 messages in the last second that exceeded the limit of 50 messages/sec
2022-04-20 16:46:13.038 [info] <0.19552.7774> RabbitMQ is asked to stop...
2022-04-20 16:46:13.806 [info] <0.27176.7775> accepting AMQP connection <0.27176.7775> (1.1.1.35:40350 -> 1.1.1.34:5672)
2022-04-20 16:46:13.807 [info] <0.27176.7775> Connection <0.27176.7775> (1.1.1.35:40350 -> 1.1.1.34:5672) has a client-provided name: nova-conductor:22:99215d08-ba56-44a1-be0e-2cfa9935a4c7
2022-04-20 16:46:13.808 [info] <0.27176.7775> connection <0.27176.7775> (1.1.1.35:40350 -> 1.1.1.34:5672 - nova-conductor:22:99215d08-ba56-44a1-be0e-2cfa9935a4c7): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:46:14.112 [info] <0.19552.7774> Stopping RabbitMQ applications and their dependencies in the following order:
rabbitmq_management
amqp_client
rabbitmq_web_dispatch
cowboy
cowlib
rabbitmq_management_agent
rabbit
mnesia
rabbit_common
os_mon
2022-04-20 16:46:14.113 [info] <0.19552.7774> Stopping application 'rabbitmq_management'
2022-04-20 16:46:14.223 [warning] <0.8143.0> RabbitMQ HTTP listener registry could not find context rabbitmq_management_tls
2022-04-20 16:46:14.237 [info] <0.33.0> Application rabbitmq_management exited with reason: stopped
2022-04-20 16:46:14.237 [info] <0.19552.7774> Stopping application 'amqp_client'
2022-04-20 16:46:14.265 [info] <0.33.0> Application amqp_client exited with reason: stopped
2022-04-20 16:46:14.265 [info] <0.19552.7774> Stopping application 'rabbitmq_web_dispatch'
2022-04-20 16:46:14.282 [info] <0.33.0> Application rabbitmq_web_dispatch exited with reason: stopped
2022-04-20 16:46:14.282 [info] <0.19552.7774> Stopping application 'cowboy'
2022-04-20 16:46:14.293 [warning] <0.25152.7431> closing AMQP connection <0.25152.7431> (1.1.1.34:42738 -> 1.1.1.34:5672 - nova-conductor:25:b31b82a2-e738-4cd9-805e-36c7b520531e, vhost: '/', user: 'openstack'):
client unexpectedly closed TCP connection
2022-04-20 16:46:14.301 [info] <0.19552.7774> Stopping application 'cowlib'
2022-04-20 16:46:14.301 [info] <0.19552.7774> Stopping application 'rabbitmq_management_agent'
2022-04-20 16:46:14.301 [info] <0.33.0> Application cowboy exited with reason: stopped
2022-04-20 16:46:14.302 [info] <0.33.0> Application cowlib exited with reason: stopped
2022-04-20 16:46:14.324 [info] <0.19552.7774> Stopping application 'rabbit'
2022-04-20 16:46:14.324 [info] <0.33.0> Application rabbitmq_management_agent exited with reason: stopped
2022-04-20 16:46:14.326 [info] <0.260.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping unregistration.
2022-04-20 16:46:14.327 [info] <0.8135.0> stopped TCP listener on 1.1.1.34:5672
2022-04-20 16:46:14.337 [error] <0.19192.0> Error on AMQP connection <0.19192.0> (1.1.1.34:53718 -> 1.1.1.34:5672 - barbican-keystone-listener:7:b59ea871-cbb6-462e-b8e7-3454536978dd, vhost: '/', user: 'openstack', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
========== a lot of these "[error]" and "operation none caused" log
2022-04-20 16:46:14.338 [error] <0.18651.0> Error on AMQP connection <0.18651.0> (1.1.1.35:49664 -> 1.1.1.34:5672 - magnum-conductor:112:0a1e307e-90cb-4e5f-bc1c-a721cdb7f83e, vhost: '/', user: 'openstack', state: running), channel 0:
operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2022-04-20 16:46:21.680 [info] <0.33.0> Application lager started on node 'rabbit#node-34'
2022-04-20 16:46:21.685 [info] <0.5.0> Log file opened with Lager
2022-04-20 16:46:25.645 [info] <0.33.0> Application mnesia started on node 'rabbit#node-34'
2022-04-20 16:46:25.649 [info] <0.33.0> Application mnesia exited with reason: stopped
2022-04-20 16:46:25.988 [info] <0.33.0> Application recon started on node 'rabbit#node-34'
2022-04-20 16:46:25.989 [info] <0.33.0> Application inets started on node 'rabbit#node-34'
2022-04-20 16:46:25.989 [info] <0.33.0> Application jsx started on node 'rabbit#node-34'
2022-04-20 16:46:25.989 [info] <0.33.0> Application os_mon started on node 'rabbit#node-34'
2022-04-20 16:46:25.989 [info] <0.33.0> Application crypto started on node 'rabbit#node-34'
2022-04-20 16:46:25.989 [info] <0.33.0> Application cowlib started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application mnesia started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application xmerl started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application asn1 started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application public_key started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application ssl started on node 'rabbit#node-34'
2022-04-20 16:46:26.078 [info] <0.33.0> Application ranch started on node 'rabbit#node-34'
2022-04-20 16:46:26.085 [info] <0.33.0> Application cowboy started on node 'rabbit#node-34'
2022-04-20 16:46:26.085 [info] <0.33.0> Application rabbit_common started on node 'rabbit#node-34'
2022-04-20 16:46:26.088 [info] <0.33.0> Application amqp_client started on node 'rabbit#node-34'
2022-04-20 16:46:26.088 [info] <0.247.0>
Starting RabbitMQ 3.7.10 on Erlang 20.2.2
Copyright (C) 2007-2018 Pivotal Software, Inc.
Licensed under the MPL. See http://www.rabbitmq.com/
2022-04-20 16:46:26.089 [info] <0.247.0>
node : rabbit#node-34
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : i***Q==
log(s) : /var/log/kolla/rabbitmq/rabbit#node-34.log
: /var/log/kolla/rabbitmq/rabbit#node-34_upgrade.log
database dir : /var/lib/rabbitmq/mnesia/rabbit#node-34
2022-04-20 16:46:26.385 [info] <0.331.0> Memory high watermark set to 618721 MiB (648776025702 bytes) of 1546802 MiB (1621940064256 bytes) total
2022-04-20 16:46:26.389 [info] <0.333.0> Enabling free disk space monitoring
2022-04-20 16:46:26.389 [info] <0.333.0> Disk free limit set to 50MB
2022-04-20 16:46:26.392 [info] <0.336.0> Limiting to approx 1048476 file handles (943626 sockets)
2022-04-20 16:46:26.392 [info] <0.337.0> FHC read buffering: OFF
2022-04-20 16:46:26.392 [info] <0.337.0> FHC write buffering: ON
2022-04-20 16:46:26.400 [info] <0.247.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-04-20 16:46:26.410 [info] <0.247.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-04-20 16:46:26.450 [info] <0.247.0> Waiting for Mnesia tables for 30000 ms, 9 retries left
2022-04-20 16:46:26.450 [info] <0.247.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping registration.
2022-04-20 16:46:26.451 [info] <0.247.0> Priority queues enabled, real BQ is rabbit_variable_queue
2022-04-20 16:46:26.477 [info] <0.454.0> Starting rabbit_node_monitor
2022-04-20 16:46:26.504 [info] <0.247.0> Management plugin: using rates mode 'basic'
2022-04-20 16:46:26.556 [info] <0.486.0> Making sure data directory '/var/lib/rabbitmq/mnesia/rabbit#node-34/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L' for vhost '/' exists
2022-04-20 16:46:26.574 [info] <0.486.0> Starting message stores for vhost '/'
2022-04-20 16:46:26.574 [info] <0.490.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_transient": using rabbit_msg_store_ets_index to provide index
2022-04-20 16:46:26.616 [info] <0.486.0> Started message store of type transient for vhost '/'
2022-04-20 16:46:26.617 [info] <0.493.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": using rabbit_msg_store_ets_index to provide index
2022-04-20 16:46:26.617 [warning] <0.493.0> Message store "628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent": rebuilding indices from scratch
2022-04-20 16:46:26.618 [info] <0.486.0> Started message store of type persistent for vhost '/'
2022-04-20 16:46:26.627 [info] <0.486.0> Mirrored queue 'q-agent-notifier-port-update_fanout_9407d5931f8a498cb6c0268d585ed732' in vhost '/': Adding mirror on node 'rabbit#node-34': <0.512.0>
2022-04-20 16:46:26.627 [info] <0.486.0> Mirrored queue 'magnum-conductor_fanout_dd3536ae0b8e4efe8329be0454ba75b6' in vhost '/': Adding mirror on node 'rabbit#node-34': <0.516.0>
========== a lot of different "Mirrored queue" log
node-35 rabbitmq
2022-04-20 16:34:17.295 [error] <0.13948.7714> closing AMQP connection <0.13948.7714> (1.1.1.34:43322 -> 1.1.1.35:5672 - nova-conductor:23:3ca11891-5442-48ef-9b0f-f616ba13c1e3):
missed heartbeats from client, timeout: 60s
========== many more of the above "[info]" / "[error]" "missed heartbeats" lines, until the node-34 rabbitmq process was restarted
2022-04-20 16:34:57.656 [info] <0.12329.2581> accepting AMQP connection <0.12329.2581> (1.1.1.33:42474 -> 1.1.1.35:5672)
2022-04-20 16:34:57.657 [info] <0.12329.2581> Connection <0.12329.2581> (1.1.1.33:42474 -> 1.1.1.35:5672) has a client-provided name: nova-conductor:25:53cc4527-fa4b-41c0-b9e1-f3d24da7f31b
2022-04-20 16:34:57.658 [info] <0.12329.2581> connection <0.12329.2581> (1.1.1.33:42474 -> 1.1.1.35:5672 - nova-conductor:25:53cc4527-fa4b-41c0-b9e1-f3d24da7f31b): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:34:58.874 [error] <0.7604.103> closing AMQP connection <0.7604.103> (1.1.1.33:38718 -> 1.1.1.35:5672 - nova-conductor:21:b701bc54-c826-47bc-8c7d-803e27265e5f):
missed heartbeats from client, timeout: 60s
2022-04-20 16:34:59.531 [error] <0.24920.2100> closing AMQP connection <0.24920.2100> (1.1.1.35:41208 -> 1.1.1.35:5672 - nova-conductor:23:615ea986-452b-4943-ba4f-7d36d3b1536c):
missed heartbeats from client, timeout: 60s
========== fewer "[error]" but a lot of "[info]" lines, until the node-34 rabbitmq process was restarted
2022-04-20 16:46:19.344 [info] <0.17721.2593> connection <0.17721.2593> (1.1.1.33:34162 -> 1.1.1.35:5672 - mod_wsgi:32:06e32cf4-2d5f-468e-9a50-a6c16b5f16bb): user 'openstack' authenticated and granted access to vhost '/'
2022-04-20 16:46:19.489 [info] <0.17996.2593> accepting AMQP connection <0.17996.2593> (1.1.1.33:34172 -> 1.1.1.35:5672)
2022-04-20 16:46:19.511 [info] <0.4731.0> Mirrored queue 'magnum-conductor_fanout_7af2012136fe49e88f5e561d2f03650f' in vhost '/': Slave <rabbit#node-35.3.4731.0> saw deaths of mirrors <rabbit#node-34.1.2887.0>
2022-04-20 16:46:19.511 [info] <0.25105.9> Mirrored queue 'scheduler_fanout_0a2d4e8a46b249018178164758b6736d' in vhost '/': Slave <rabbit#node-35.3.25105.9> saw deaths of mirrors <rabbit#node-34.1.14025.203>
node-33 nova-conduct
2022-04-20 16:34:20.689 23 ERROR oslo.messaging._drivers.impl_rabbit [req-30ed34ff-70e5-4e6f-a09f-a00de95385a3 - - - - -] [5b4dd218-31d1-4eb6-bab1-2b5bc8f1bc11] AMQP server on 1.1.1.35:5672 is unreachable: Server unexpectedly closed connection. Trying again in 1 seconds.: OSError: Server unexpectedly closed connection
2022-04-20 16:34:21.700 23 INFO oslo.messaging._drivers.impl_rabbit [req-30ed34ff-70e5-4e6f-a09f-a00de95385a3 - - - - -] [5b4dd218-31d1-4eb6-bab1-2b5bc8f1bc11] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 38842.
========== duplicate above log
2022-04-20 16:46:12.294 21 INFO oslo.messaging._drivers.impl_rabbit [req-24421232-3a85-4ff4-a6fe-e2bb58188e65 - - - - -] [7bd7d6dc-7367-4690-aed8-81c3989f5c74] Reconnected to AMQP server on 1.1.1.34:5672 via [amqp] client with port 40340.
2022-04-20 16:46:14.356 25 ERROR oslo.messaging._drivers.impl_rabbit [req-beeac432-5509-4d6f-8709-9a8ea9b3d0ad - - - - -] [58b08ac8-7061-4969-8fae-de5acad3c23b] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-20 16:46:14.357 23 ERROR oslo.messaging._drivers.impl_rabbit [req-bd979e87-5a7f-4f6a-af79-9d1ccb99f944 - - - - -] [d5bff161-b9df-44e8-9fe3-292aec5b13f7] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-20 16:46:14.357 21 ERROR oslo.messaging._drivers.impl_rabbit [req-ca57569e-3328-48f8-9469-e78d6def839c - - - - -] [98444783-6694-4688-873e-066abf61932c] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-20 16:46:14.358 21 ERROR oslo.messaging._drivers.impl_rabbit [req-24421232-3a85-4ff4-a6fe-e2bb58188e65 - - - - -] [7bd7d6dc-7367-4690-aed8-81c3989f5c74] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-20 16:46:14.359 23 ERROR oslo.messaging._drivers.impl_rabbit [req-a69ce54a-1739-44b2-b169-12f030a744b1 - - - - -] [84c7fa9d-14d9-4664-be6c-7e6d43ae7e83] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
2022-04-20 16:46:14.360 22 ERROR oslo.messaging._drivers.impl_rabbit [-] [09269999-5a4f-419a-899c-69deb49689fb] AMQP server on 1.1.1.34:5672 is unreachable: [Errno 104] Connection reset by peer. Trying again in 1 seconds.: ConnectionResetError: [Errno 104] Connection reset by peer
========== duplicate above log
2022-04-20 16:46:16.422 23 INFO oslo.messaging._drivers.impl_rabbit [req-a69ce54a-1739-44b2-b169-12f030a744b1 - - - - -] [84c7fa9d-14d9-4664-be6c-7e6d43ae7e83] Reconnected to AMQP server on 1.1.1.33:5672 via [amqp] client with port 49066.
2022-04-20 16:46:16.425 25 ERROR oslo.messaging._drivers.impl_rabbit [req-50d0d15a-96a4-48a4-b60e-f2103ca9aa59 - - - - -] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-04-20 16:46:16.556 21 INFO oslo.messaging._drivers.impl_rabbit [req-beeac432-5509-4d6f-8709-9a8ea9b3d0ad - - - - -] [3d76c8aa-8e05-4ce0-a26e-db670ff2e48c] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33400.
2022-04-20 16:46:16.571 24 INFO oslo.messaging._drivers.impl_rabbit [req-beeac432-5509-4d6f-8709-9a8ea9b3d0ad - - - - -] [c7a51125-c7b5-4439-b5d5-36e82309b943] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33418.
2022-04-20 16:46:16.573 22 INFO oslo.messaging._drivers.impl_rabbit [req-30ed34ff-70e5-4e6f-a09f-a00de95385a3 - - - - -] [6de6e8f9-10c8-48c6-8aac-55489aa24d9b] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33410.
2022-04-20 16:46:16.634 21 INFO oslo.messaging._drivers.impl_rabbit [req-24421232-3a85-4ff4-a6fe-e2bb58188e65 - - - - -] [7bd7d6dc-7367-4690-aed8-81c3989f5c74] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33406.
2022-04-20 16:46:19.829 23 INFO oslo.messaging._drivers.impl_rabbit [-] [3132b985-89a4-4c5a-82e2-a0050d14fddf] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33412.
2022-04-20 16:46:20.373 22 INFO oslo.messaging._drivers.impl_rabbit [-] [09269999-5a4f-419a-899c-69deb49689fb] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33376.
2022-04-20 16:46:20.409 24 INFO oslo.messaging._drivers.impl_rabbit [-] [cd614243-a1db-4e57-996d-b17e9b3aea28] Reconnected to AMQP server on 1.1.1.35:5672 via [amqp] client with port 33426.
2022-04-20 16:46:26.326 21 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-04-20 16:46:26.334 21 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: [Errno 104] Connection reset by peer
2022-04-20 16:46:26.340 21 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-04-20 16:46:26.347 21 ERROR oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED (retrying in 0 seconds): ConnectionRefusedError: [Errno 111] ECONNREFUSED
2022-04-20 16:46:32.784 23 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2022-04-20 16:46:47.707 24 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2022-04-20 16:46:47.851 25 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2022-04-20 16:47:02.833 25 INFO oslo.messaging._drivers.impl_rabbit [-] A recoverable connection/channel error occurred, trying to reconnect: Server unexpectedly closed connection
2022-04-20 16:49:16.486 25 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Server unexpectedly closed connection: kombu.exceptions.OperationalError: Server unexpectedly closed connection
2022-04-20 16:49:16.494 25 ERROR oslo.messaging._drivers.impl_rabbit [-] Unable to connect to AMQP server on 1.1.1.35:5672 after inf tries: Server unexpectedly closed connection: kombu.exceptions.OperationalError: Server unexpectedly closed connection
2022-04-20 16:49:16.495 25 ERROR oslo_messaging._drivers.amqpdriver [-] Failed to process incoming message, retrying..: oslo_messaging.exceptions.MessageDeliveryFailure: Unable to connect to AMQP server on 1.1.1.35:5672 after inf tries: Server unexpectedly closed connection
I saw a similar issue with RabbitMQ 3.7.9. After upgrading to 3.7.19 and erlang 21.X the issue went away.
If the upgrade doesn't work, you can use quorum queues instead of classic mirrored queues in RabbitMQ.
They are recommended for large-scale deployments.
You can get a deeper understanding at the link below:
https://www.rabbitmq.com/quorum-queues.html
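As a rough sketch of what that looks like: a queue becomes a quorum queue when it is declared with the x-queue-type argument, for example in an exported definitions.json (the queue name here is a placeholder, and a real definitions file also contains vhosts, users and so on):
{
  "queues": [
    {
      "name": "my-queue",
      "vhost": "/",
      "durable": true,
      "auto_delete": false,
      "arguments": { "x-queue-type": "quorum" }
    }
  ]
}
Note that quorum queues require RabbitMQ 3.8 or later, so this only applies after the upgrade mentioned above.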

Keycloak and Docker OutOfMemoryError - process/resource limits reached

I have rented a virtual ubuntu server. Various applications run on it in Docker containers and natively:
Plesk
Wordpress
Flarum
MySQL
Wiki.js (in Docker container)
Keycloak (in Docker container)
MariaDB (in Docker container)
I use Keycloak as SSO for Wordpress, Wiki.js and Flarum. Now I have the problem that Keycloak simply crashes after a while and I can't restart it in Docker. I get the following error message:
keycloak_1 | 17:22:06,447 DEBUG [org.jboss.as.config] (MSC service thread 1-3) VM Arguments: -D[Standalone] -Xms512m -Xmx2048m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -XX:+UseAdaptiveSizePolicy -XX:MaxMetaspaceSize=1024m -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true-Djava.net.preferIPv4Stack=true --add-exports=java.base/sun.nio.ch=ALL-UNNAMED --add-exports=jdk.unsupported/sun.misc=ALL-UNNAMED --add-exports=jdk.unsupported/sun.reflect=ALL-UNNAMED -Dorg.jboss.boot.log.file=/opt/jboss/keycloak/standalone/log/server.log -Dlogging.configuration=file:/opt/jboss/keycloak/standalone/configuration/logging.properties
keycloak_1 | 17:22:19,493 ERROR [org.jboss.as.controller.management-operation] (Controller Boot Thread) WFLYCTL0013: Operation ("add") failed - address: ([
keycloak_1 | ("subsystem" => "infinispan"),
keycloak_1 | ("cache-container" => "keycloak"),
keycloak_1 | ("thread-pool" => "transport")
keycloak_1 | ]) - failure description: {"WFLYCTL0080: Failed services" => {"org.wildfly.clustering.infinispan.cache-container.keycloak" => "org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.commons.CacheException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
keycloak_1 | Caused by: org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.commons.CacheException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
keycloak_1 | Caused by: org.infinispan.commons.CacheException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
keycloak_1 | Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached"}}
keycloak_1 | 17:22:19,505 INFO [org.jboss.as.server] (ServerService Thread Pool -- 46) WFLYSRV0010: Deployed "keycloak-server.war" (runtime-name : "keycloak-server.war")
keycloak_1 | 17:22:19,507 INFO [org.jboss.as.controller] (Controller Boot Thread) WFLYCTL0183: Service status report
keycloak_1 | WFLYCTL0186: Services which failed to start: service org.wildfly.clustering.infinispan.cache.ejb.http-remoting-connector: org.infinispan.commons.CacheConfigurationException: Error starting component org.infinispan.expiration.impl.InternalExpirationManager
keycloak_1 | service org.wildfly.clustering.infinispan.cache-container.keycloak: org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.commons.CacheException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
keycloak_1 | WFLYCTL0448: 32 additional services are down due to their dependencies being missing or failed
keycloak_1 | 17:22:19,599 INFO [org.jboss.as.server] (Controller Boot Thread) WFLYSRV0212: Resuming server
keycloak_1 | 17:22:19,606 ERROR [org.jboss.as] (Controller Boot Thread) WFLYSRV0026: Keycloak 12.0.4 (WildFly Core 13.0.3.Final) started (with errors) in 15455ms - Started 558 of 926 services (44 services failed or missing dependencies, 684 services are lazy, passive or on-demand)
keycloak_1 | 17:22:19,614 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0060: Http management interface listening on http://127.0.0.1:9990/management
keycloak_1 | 17:22:19,614 INFO [org.jboss.as] (Controller Boot Thread) WFLYSRV0051: Admin console listening on http://127.0.0.1:9990
The critical error seems to be the following:
keycloak_1 | 17:48:15,196 ERROR [org.jboss.msc.service.fail] (ServerService Thread Pool -- 60) MSC000001: Failed to start service org.wildfly.clustering.infinispan.cache-container.keycloak: org.jboss.msc.service.StartException in service org.wildfly.clustering.infinispan.cache-container.keycloak: org.infinispan.manager.EmbeddedCacheManagerStartupException: org.infinispan.commons.CacheException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
At first I thought that Keycloak in Docker needed more memory. Unfortunately, that change didn't bring the desired success. After some research, I read that there are sometimes problems with threads on virtual servers. Unfortunately, I don't know much about this topic. I hope someone can help me. :)
Am I right that it can be due to the thread limit of the virtual server?
Attached is my docker-compose file:
version: '3'
services:
  mariadb:
    image: mariadb:latest
    restart: always
    environment:
      MYSQL_ROOT_PASSWORD: ******
      MYSQL_DATABASE: app_keycloak
      MYSQL_USER: ******
      MYSQL_PASSWORD: ******
    ports:
      - 3308:3306
    # Copy-pasted from https://github.com/docker-library/mariadb/issues/94
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "--silent"]
  keycloak:
    image: jboss/keycloak:latest
    restart: always
    environment:
      DB_VENDOR: mariadb
      DB_ADDR: mariadb
      DB_DATABASE: ******
      DB_USER: ******
      DB_PASSWORD: ******
      KEYCLOAK_USER: ******
      KEYCLOAK_PASSWORD: ******
      JGROUPS_DISCOVERY_PROTOCOL: JDBC_PING
      JAVA_OPTS: "-server -Xms512m -Xmx2048m -XX:MetaspaceSize=96M -XX:MaxMetaspaceSize=256m -XX:+UseAdaptiveSizePolicy -XX:MaxMetaspaceSize=1024m -Djboss.modules.system.pkgs=org.jboss.byteman -Djava.awt.headless=true-Djava.net.preferIPv4Stack=true"
    ports:
      - 8080:8080
    depends_on:
      - mariadb
Update 1:
It does not seem to be due to the thread limit.
systemctl show --property=DefaultTasksMax
I looked to see if there was a limit. I read that Ubuntu sets DefaultTasksMax to 15%.
cat /proc/user_beancounters
Overall, my provider gives me a limit of 700 threads.
Additionally, I looked at how many threads the current services were using, Docker in particular.
systemctl status *.service | grep -e Tasks
systemctl status docker.service | grep -e Tasks --> 75
With the findings I set DefaultTasksMax to 200.
nano /etc/systemd/system.conf
systemctl daemon-reload
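The edit in /etc/systemd/system.conf amounts to one line in the [Manager] section:
[Manager]
DefaultTasksMax=200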
In the end, I restarted the Docker Compose.
docker-compose down
docker-compose up
Unfortunately, I still get the same error. :(
Update 2:
An update to version 13 of Keycloak has apparently fixed the problem. I will continue to monitor the behavior.
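If you want to pin that version rather than keep tracking :latest, the image line in the compose file would change along these lines (the exact 13.x tag is an assumption, check which tags are available):
keycloak:
  image: jboss/keycloak:13.0.0   # hypothetical pinned tag instead of jboss/keycloak:latest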

Docker userns remap give permission issues within defined range

/etc/subuid
ubuntu:1000:1
ubuntu:165533:65536
/etc/subgid
ubuntu:999:1
ubuntu:165536:65536
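For context, userns remapping is switched on in the Docker daemon configuration; a minimal sketch, assuming the daemon is remapped to the ubuntu user whose ranges are listed above (your actual daemon.json may differ):
/etc/docker/daemon.json:
{
  "userns-remap": "ubuntu"
}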
So I am expecting files created by root in the container to map to my username on the host, which avoids permission issues with bind-mounted directories on the host.
This works fine, except when I docker-compose up anchore-engine
This creates a named volume whose _data directory ends up owned by uid 166531 on the host.
The anchore services immediately terminate and exit unless I manually correct the permissions with chown to ubuntu:docker on the _data directory.
I was expecting 166531 to be within the range defined in the subuid file. What's wrong?
docker-compose.yaml
version: '2.1'
volumes:
  anchore-db-volume:
    # Set this to 'true' to use an external volume. In which case, it must be created manually with "docker volume create anchore-db-volume"
    external: false
services:
  # The primary API endpoint service
  api:
    image: anchore/anchore-engine:v0.8.2
    depends_on:
      - db
      - catalog
    ports:
      - "8228:8228"
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    environment:
      - ANCHORE_ENDPOINT_HOSTNAME=api
      - ANCHORE_DB_HOST=db
      - ANCHORE_DB_PASSWORD=mysecretpassword
    command: ["anchore-manager", "service", "start", "apiext"]
  # Catalog is the primary persistence and state manager of the system
  catalog:
    image: anchore/anchore-engine:v0.8.2
    depends_on:
      - db
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    expose:
      - 8228
    environment:
      - ANCHORE_ENDPOINT_HOSTNAME=catalog
      - ANCHORE_DB_HOST=db
      - ANCHORE_DB_PASSWORD=mysecretpassword
    command: ["anchore-manager", "service", "start", "catalog"]
  queue:
    image: anchore/anchore-engine:v0.8.2
    depends_on:
      - db
      - catalog
    expose:
      - 8228
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    environment:
      - ANCHORE_ENDPOINT_HOSTNAME=queue
      - ANCHORE_DB_HOST=db
      - ANCHORE_DB_PASSWORD=mysecretpassword
    command: ["anchore-manager", "service", "start", "simplequeue"]
  policy-engine:
    image: anchore/anchore-engine:v0.8.2
    depends_on:
      - db
      - catalog
    expose:
      - 8228
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    environment:
      - ANCHORE_ENDPOINT_HOSTNAME=policy-engine
      - ANCHORE_DB_HOST=db
      - ANCHORE_DB_PASSWORD=mysecretpassword
    command: ["anchore-manager", "service", "start", "policy_engine"]
  analyzer:
    image: anchore/anchore-engine:v0.8.2
    depends_on:
      - db
      - catalog
    expose:
      - 8228
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    environment:
      - ANCHORE_ENDPOINT_HOSTNAME=analyzer
      - ANCHORE_DB_HOST=db
      - ANCHORE_DB_PASSWORD=mysecretpassword
    volumes:
      - /analysis_scratch
    command: ["anchore-manager", "service", "start", "analyzer"]
  db:
    image: "postgres:9"
    volumes:
      - anchore-db-volume:/var/lib/postgresql/data
    environment:
      - POSTGRES_PASSWORD=mysecretpassword
    expose:
      - 5432
    logging:
      driver: "json-file"
      options:
        max-size: 100m
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
Logs from one of the stopped containers:
/usr/local/lib/python3.6/site-packages/yosai/core/conf/yosaisettings.py:100: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(stream)
Traceback (most recent call last):
File "/usr/local/bin/twistd", line 11, in <module>
sys.exit(run())
File "/usr/local/lib64/python3.6/site-packages/twisted/scripts/twistd.py", line 31, in run
app.run(runApp, ServerOptions)
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 674, in run
runApp(config)
File "/usr/local/lib64/python3.6/site-packages/twisted/scripts/twistd.py", line 25, in runApp
runner.run()
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 383, in run
self.logger.start(self.application)
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 184, in start
observer = self._observerFactory()
File "/usr/local/lib/python3.6/site-packages/anchore_engine/subsys/twistd_logger.py", line 14, in logger
f = logfile.LogFile(thefile, '/var/log/', rotateLength=10000000, maxRotatedFiles=10)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 170, in __init__
BaseLogFile.__init__(self, name, directory, defaultMode)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 45, in __init__
self._openFile()
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 175, in _openFile
BaseLogFile._openFile(self)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 85, in _openFile
self._file = open(self.path, "wb+", 0)
PermissionError: [Errno 13] Permission denied: '/var/log/anchore/anchore-api.log'
[MainThread] [anchore_manager.cli.service/start()] [INFO] Loading DB routines from module (anchore_engine)
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB params: {"db_connect_args": {"connect_timeout": 86400}, "db_pool_size": 30, "db_pool_max_overflow": 100, "db_echo": false, "db_engine_args": null}
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB connection configured: True
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB attempting to connect...
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB connected: True
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB compatibility check: running...
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB compatibility check success
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB post actions: running...
[MainThread] [anchore_manager.cli.service/start()] [INFO] DB version and code version in sync.
[MainThread] [anchore_manager.cli.service/start()] [INFO] Starting services: ['anchore-api']
[MainThread] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 0/30
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] cleaning up service: anchore-api
[anchore-api] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] starting service: anchore-api
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] /usr/local/bin/twistd --logger=anchore_engine.subsys.twistd_logger.logger --pidfile /var/run/anchore/anchore-api.pid -n anchore-api --config /config
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 1/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 2/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 3/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 4/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 5/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 6/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 7/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 8/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 9/30
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/anchore_manager/cli/service.py", line 165, in startup_service
raise Exception("process exited: " + str(rc))
Exception: process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [ERROR] service process exited at (Tue Dec 1 16:30:54 2020): process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [FATAL] Could not start service due to: process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] exiting service thread
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 10/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] service thread has stopped anchore-api
[MainThread] [anchore_manager.cli.service/start()] [INFO] auto_restart_services setting: False
[MainThread] [anchore_manager.cli.service/start()] [INFO] checking for startup failure pidfile=False, is_alive=False
[MainThread] [anchore_manager.cli.service/start()] [WARN] service start failed - exception: service thread for (anchore-api) failed to start
[MainThread] [anchore_manager.cli.service/start()] [FATAL] one or more services failed to start. cleanly terminating the others
[MainThread] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
/usr/local/lib/python3.6/site-packages/yosai/core/conf/yosaisettings.py:100: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
config = yaml.load(stream)
Traceback (most recent call last):
File "/usr/local/bin/twistd", line 11, in <module>
sys.exit(run())
File "/usr/local/lib64/python3.6/site-packages/twisted/scripts/twistd.py", line 31, in run
app.run(runApp, ServerOptions)
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 674, in run
runApp(config)
File "/usr/local/lib64/python3.6/site-packages/twisted/scripts/twistd.py", line 25, in runApp
runner.run()
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 383, in run
self.logger.start(self.application)
File "/usr/local/lib64/python3.6/site-packages/twisted/application/app.py", line 184, in start
observer = self._observerFactory()
File "/usr/local/lib/python3.6/site-packages/anchore_engine/subsys/twistd_logger.py", line 14, in logger
f = logfile.LogFile(thefile, '/var/log/', rotateLength=10000000, maxRotatedFiles=10)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 170, in __init__
BaseLogFile.__init__(self, name, directory, defaultMode)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 45, in __init__
self._openFile()
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 175, in _openFile
BaseLogFile._openFile(self)
File "/usr/local/lib64/python3.6/site-packages/twisted/python/logfile.py", line 85, in _openFile
self._file = open(self.path, "wb+", 0)
PermissionError: [Errno 13] Permission denied: '/var/log/anchore/anchore-api.log'
[MainThread] [anchore_manager.cli.service/start()] [INFO] Loading DB routines from module (anchore_engine)
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB params: {"db_connect_args": {"connect_timeout": 86400}, "db_pool_size": 30, "db_pool_max_overflow": 100, "db_echo": false, "db_engine_args": null}
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB connection configured: True
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB attempting to connect...
[MainThread] [anchore_manager.util.db/connect_database()] [INFO] DB connected: True
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB compatibility check: running...
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB compatibility check success
[MainThread] [anchore_manager.util.db/init_database()] [INFO] DB post actions: running...
[MainThread] [anchore_manager.cli.service/start()] [INFO] DB version and code version in sync.
[MainThread] [anchore_manager.cli.service/start()] [INFO] Starting services: ['anchore-api']
[MainThread] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] cleaning up service: anchore-api
[anchore-api] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] starting service: anchore-api
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] /usr/local/bin/twistd --logger=anchore_engine.subsys.twistd_logger.logger --pidfile /var/run/anchore/anchore-api.pid -n anchore-api --config /config
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 0/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 1/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 2/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 3/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 4/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 5/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 6/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 7/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 8/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 9/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 10/30
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/anchore_manager/cli/service.py", line 165, in startup_service
raise Exception("process exited: " + str(rc))
Exception: process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [ERROR] service process exited at (Tue Dec 1 16:32:07 2020): process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [FATAL] Could not start service due to: process exited: 1
[anchore-api] [anchore_manager.cli.service/startup_service()] [INFO] exiting service thread
[MainThread] [anchore_manager.cli.service/start()] [INFO] waiting for service pidfile /var/run/anchore/anchore-api.pid to exist 11/30
[MainThread] [anchore_manager.cli.service/start()] [INFO] service thread has stopped anchore-api
[MainThread] [anchore_manager.cli.service/start()] [INFO] auto_restart_services setting: False
[MainThread] [anchore_manager.cli.service/start()] [INFO] checking for startup failure pidfile=False, is_alive=False
[MainThread] [anchore_manager.cli.service/start()] [WARN] service start failed - exception: service thread for (anchore-api) failed to start
[MainThread] [anchore_manager.cli.service/start()] [FATAL] one or more services failed to start. cleanly terminating the others
[MainThread] [anchore_manager.cli.service/terminate_service()] [INFO] Looking for pre-existing service (anchore-api) pid from pidfile (/var/run/anchore/anchore-api.pid)
I was expecting 166531 to be within the range defined in the subuid file. What's wrong?
166531 is within the subuid range: it maps to uid 999 inside the container (165533 - 1 + 999 = 166531), which matches the uid of the postgres user in the postgres image:
$ docker run -it --rm --entrypoint /bin/bash postgres:9
root@d99b2bbb3d48:/# ls -al /var/lib/postgresql/data
total 8
drwxrwxrwx 2 postgres postgres 4096 Nov 18 08:39 .
drwxr-xr-x 1 postgres postgres 4096 Nov 18 08:39 ..
root@d99b2bbb3d48:/# id postgres
uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)
By default, docker will initialize a new/empty named volume to match the contents of the image, including file ownership and permissions. This is expected behavior for postgres. We'd need to see logs of the failing containers to give more details of why you're seeing containers exit.
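As a minimal sketch of that difference (the service name, volume name, and password value below are made up for illustration, not taken from the question), a named volume lets docker copy the image's contents and ownership (uid 999) into the volume on first use, whereas a host bind mount keeps whatever ownership the host directory already has:
version: '3.8'
services:
  db:
    image: postgres:9
    environment:
      - POSTGRES_PASSWORD=example   # placeholder value
    volumes:
      - pgdata:/var/lib/postgresql/data   # named volume: initialized from the image, owned by postgres (uid 999)
volumes:
  pgdata:
With a bind mount such as ./data:/var/lib/postgresql/data, the directory keeps its host-side ownership, which is where uid/subuid mapping surprises usually show up.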

Connect consul agent to consul

I've been trying for two or three days to set up a consul server and connect an agent to it. I'm using docker-compose.
But after performing a join operation, the agent shows the message "Agent not live or unreachable".
Here are the logs:
root@e33a6127103f:/app# consul agent -join 10.1.30.91 -data-dir=/tmp/consul
==> Starting Consul agent...
==> Joining cluster...
Join completed. Synced with 1 initial agents
==> Consul agent running!
Version: 'v1.0.1'
Node ID: '0e1adf74-462d-45a4-1927-95ed123f1526'
Node name: 'e33a6127103f'
Datacenter: 'dc1' (Segment: '')
Server: false (Bootstrap: false)
Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 172.17.0.2 (LAN: 8301, WAN: 8302)
Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2017/12/06 10:44:43 [INFO] serf: EventMemberJoin: e33a6127103f 172.17.0.2
2017/12/06 10:44:43 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2017/12/06 10:44:43 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2017/12/06 10:44:43 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
2017/12/06 10:44:43 [INFO] agent: (LAN) joining: [10.1.30.91]
2017/12/06 10:44:43 [INFO] serf: EventMemberJoin: consul1 172.19.0.2
2017/12/06 10:44:43 [INFO] consul: adding server consul1 (Addr: tcp/172.19.0.2:8300) (DC: dc1)
2017/12/06 10:44:43 [INFO] agent: (LAN) joined: 1 Err: <nil>
2017/12/06 10:44:43 [INFO] agent: started state syncer
2017/12/06 10:44:43 [WARN] manager: No servers available
2017/12/06 10:44:43 [ERR] agent: failed to sync remote state: No known Consul servers
2017/12/06 10:44:54 [INFO] memberlist: Suspect consul1 has failed, no acks received
2017/12/06 10:44:55 [ERR] consul: "Catalog.NodeServices" RPC failed to server 172.19.0.2:8300: rpc error getting client: failed to get conn: dial tcp <nil>->172.19.0.2:8300: i/o timeout
2017/12/06 10:44:55 [ERR] agent: failed to sync remote state: rpc error getting client: failed to get conn: dial tcp <nil>->172.19.0.2:8300: i/o timeout
2017/12/06 10:44:58 [INFO] memberlist: Marking consul1 as failed, suspect timeout reached (0 peer confirmations)
2017/12/06 10:44:58 [INFO] serf: EventMemberFailed: consul1 172.19.0.2
2017/12/06 10:44:58 [INFO] consul: removing server consul1 (Addr: tcp/172.19.0.2:8300) (DC: dc1)
2017/12/06 10:45:05 [INFO] memberlist: Suspect consul1 has failed, no acks received
2017/12/06 10:45:06 [WARN] manager: No servers available
2017/12/06 10:45:06 [ERR] agent: Coordinate update error: No known Consul servers
2017/12/06 10:45:12 [WARN] manager: No servers available
2017/12/06 10:45:12 [ERR] agent: failed to sync remote state: No known Consul servers
2017/12/06 10:45:13 [INFO] serf: attempting reconnect to consul1 172.19.0.2:8301
2017/12/06 10:45:28 [WARN] manager: No servers available
2017/12/06 10:45:28 [ERR] agent: failed to sync remote state: No known Consul servers
2017/12/06 10:45:32 [WARN] manager: No servers available
My settings are:
docker-compose SERVER:
consul1:
image: "consul.1.0.1"
container_name: "consul1"
hostname: "consul1"
volumes:
- ./consul/config:/config/
ports:
- "8400:8400"
- "8500:8500"
- "8600:53"
- "8300:8300"
- "8301:8301"
command: "agent -config-dir=/config -ui -server -bootstrap-expect 1"
Please help me solve this problem.
I think you are using the wrong IP address for the consul server:
"consul agent -join 10.1.30.91 -data-dir=/tmp/consul"
10.1.30.91 is not a docker container IP; it is more likely your host or VirtualBox address.
Get the consul server container's IP and use that in the consul agent -join command.
For more info about how the consul server and agent work together, follow this link:
https://dzone.com/articles/service-discovery-with-docker-and-consul-part-1
Try to get the right IP address by executing this command:
docker inspect <container id> | grep "IPAddress"
where <container id> is the container ID of the consul server.
Then use the obtained address instead of "10.1.30.91" in the command:
consul agent -join <IP ADDRESS CONSUL SERVER> -data-dir=/tmp/consul
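As a rough sketch of that answer (the container name consul1 comes from the compose file; the network name consul-net and the consul:1.0.1 image tag are my own assumptions), you can either look up the server's address and join it, or put both containers on the same user-defined network and join by name so no IP has to be hard-coded:
# look up the consul server's container IP, then join it
SERVER_IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' consul1)
consul agent -retry-join "$SERVER_IP" -data-dir=/tmp/consul
# or: run the agent on the same docker network as the server and join by container name
docker network create consul-net
docker network connect consul-net consul1
docker run --rm --network consul-net consul:1.0.1 agent -retry-join=consul1 -data-dir=/tmp/consul
Joining by name over a shared docker network sidesteps the original problem entirely, since 10.1.30.91 (the host address) is not where the server's gossip port actually lives.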

Consul Empty reply from server

I'm trying to get a consul server cluster up and running. I have 3 dockerized consul servers running, but I can't access the Web UI, the HTTP API nor the DNS.
$ docker logs net-sci_discovery-service_consul_1
==> WARNING: Expect Mode enabled, expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
Version: 'v0.8.5'
Node ID: 'ccd38897-6047-f8b6-be1c-2aa0022a1483'
Node name: 'consul1'
Datacenter: 'dc1'
Server: true (bootstrap: false)
Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600)
Cluster Addr: 172.20.0.2 (LAN: 8301, WAN: 8302)
Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
==> Log data will now stream in as it occurs:
2017/07/07 23:24:07 [INFO] raft: Initial configuration (index=0): []
2017/07/07 23:24:07 [INFO] raft: Node at 172.20.0.2:8300 [Follower] entering Follower state (Leader: "")
2017/07/07 23:24:07 [INFO] serf: EventMemberJoin: consul1 172.20.0.2
2017/07/07 23:24:07 [INFO] consul: Adding LAN server consul1 (Addr: tcp/172.20.0.2:8300) (DC: dc1)
2017/07/07 23:24:07 [INFO] serf: EventMemberJoin: consul1.dc1 172.20.0.2
2017/07/07 23:24:07 [INFO] consul: Handled member-join event for server "consul1.dc1" in area "wan"
2017/07/07 23:24:07 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
2017/07/07 23:24:07 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
2017/07/07 23:24:07 [INFO] agent: Started HTTP server on 127.0.0.1:8500
2017/07/07 23:24:09 [INFO] serf: EventMemberJoin: consul2 172.20.0.3
2017/07/07 23:24:09 [INFO] consul: Adding LAN server consul2 (Addr: tcp/172.20.0.3:8300) (DC: dc1)
2017/07/07 23:24:09 [INFO] serf: EventMemberJoin: consul2.dc1 172.20.0.3
2017/07/07 23:24:09 [INFO] consul: Handled member-join event for server "consul2.dc1" in area "wan"
2017/07/07 23:24:10 [INFO] serf: EventMemberJoin: consul3 172.20.0.4
2017/07/07 23:24:10 [INFO] consul: Adding LAN server consul3 (Addr: tcp/172.20.0.4:8300) (DC: dc1)
2017/07/07 23:24:10 [INFO] consul: Found expected number of peers, attempting bootstrap: 172.20.0.2:8300,172.20.0.3:8300,172.20.0.4:8300
2017/07/07 23:24:10 [INFO] serf: EventMemberJoin: consul3.dc1 172.20.0.4
2017/07/07 23:24:10 [INFO] consul: Handled member-join event for server "consul3.dc1" in area "wan"
2017/07/07 23:24:14 [ERR] agent: failed to sync remote state: No cluster leader
2017/07/07 23:24:17 [WARN] raft: Heartbeat timeout from "" reached, starting election
2017/07/07 23:24:17 [INFO] raft: Node at 172.20.0.2:8300 [Candidate] entering Candidate state in term 2
2017/07/07 23:24:17 [INFO] raft: Election won. Tally: 2
2017/07/07 23:24:17 [INFO] raft: Node at 172.20.0.2:8300 [Leader] entering Leader state
2017/07/07 23:24:17 [INFO] raft: Added peer 172.20.0.3:8300, starting replication
2017/07/07 23:24:17 [INFO] raft: Added peer 172.20.0.4:8300, starting replication
2017/07/07 23:24:17 [INFO] consul: cluster leadership acquired
2017/07/07 23:24:17 [INFO] consul: New leader elected: consul1
2017/07/07 23:24:17 [WARN] raft: AppendEntries to {Voter 172.20.0.3:8300 172.20.0.3:8300} rejected, sending older logs (next: 1)
2017/07/07 23:24:17 [WARN] raft: AppendEntries to {Voter 172.20.0.4:8300 172.20.0.4:8300} rejected, sending older logs (next: 1)
2017/07/07 23:24:17 [INFO] raft: pipelining replication to peer {Voter 172.20.0.3:8300 172.20.0.3:8300}
2017/07/07 23:24:17 [INFO] raft: pipelining replication to peer {Voter 172.20.0.4:8300 172.20.0.4:8300}
2017/07/07 23:24:18 [INFO] consul: member 'consul1' joined, marking health alive
2017/07/07 23:24:18 [INFO] consul: member 'consul2' joined, marking health alive
2017/07/07 23:24:18 [INFO] consul: member 'consul3' joined, marking health alive
2017/07/07 23:24:20 [INFO] agent: Synced service 'consul'
2017/07/07 23:24:20 [INFO] agent: Synced service 'messaging-service-kafka'
2017/07/07 23:24:20 [INFO] agent: Synced service 'messaging-service-zookeeper'
$ curl http://127.0.0.1:8500/v1/catalog/service/consul
curl: (52) Empty reply from server
$ dig @127.0.0.1 -p 8600 consul.service.consul
; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 consul.service.consul
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
$ dig @127.0.0.1 -p 8600 messaging-service-kafka.service.consul
; <<>> DiG 9.8.3-P1 <<>> @127.0.0.1 -p 8600 messaging-service-kafka.service.consul
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
I can't get my services to register via the HTTP API either; those shown above are registered using a config script when the container launches.
Here's my docker-compose.yml:
version: '2'
services:
consul1:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_1"
hostname: "consul1"
ports:
- "8400:8400"
- "8500:8500"
- "8600:8600"
volumes:
- ./etc/consul.d:/etc/consul.d
command: "agent -server -ui -bootstrap-expect 3 -config-dir=/etc/consul.d -bind=0.0.0.0"
consul2:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_2"
hostname: "consul2"
command: "agent -server -join=consul1"
links:
- "consul1"
consul3:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_3"
hostname: "consul3"
command: "agent -server -join=consul1"
links:
- "consul1"
I'm relatively new to both docker and consul. I've had a look around the web and the above options are my understanding of what is required. Any suggestions on the way forward would be very welcome.
Edit:
Result of docker container ps --all:
$ docker container ps --all
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
e0a1c3bba165 consul:latest "docker-entrypoint..." 38 seconds ago Up 36 seconds 8300-8302/tcp, 8500/tcp, 8301-8302/udp, 8600/tcp, 8600/udp net-sci_discovery-service_consul_3
7f05555e81e0 consul:latest "docker-entrypoint..." 38 seconds ago Up 36 seconds 8300-8302/tcp, 8500/tcp, 8301-8302/udp, 8600/tcp, 8600/udp net-sci_discovery-service_consul_2
9e2dedaa224b consul:latest "docker-entrypoint..." 39 seconds ago Up 38 seconds 0.0.0.0:8400->8400/tcp, 8301-8302/udp, 0.0.0.0:8500->8500/tcp, 8300-8302/tcp, 8600/udp, 0.0.0.0:8600->8600/tcp net-sci_discovery-service_consul_1
27b34c5dacb7 messagingservice_kafka "start-kafka.sh" 3 hours ago Up 3 hours 0.0.0.0:9092->9092/tcp net-sci_messaging-service_kafka
0389797b0b8f wurstmeister/zookeeper "/bin/sh -c '/usr/..." 3 hours ago Up 3 hours 22/tcp, 2888/tcp, 3888/tcp, 0.0.0.0:2181->2181/tcp net-sci_messaging-service_zookeeper
Edit:
Updated docker-compose.yml to include long format for ports:
version: '3.2'
services:
consul1:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_1"
hostname: "consul1"
ports:
- target: 8400
published: 8400
mode: host
- target: 8500
published: 8500
mode: host
- target: 8600
published: 8600
mode: host
volumes:
- ./etc/consul.d:/etc/consul.d
command: "agent -server -ui -bootstrap-expect 3 -config-dir=/etc/consul.d -bind=0.0.0.0 -client=127.0.0.1"
consul2:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_2"
hostname: "consul2"
command: "agent -server -join=consul1"
links:
- "consul1"
consul3:
image: "consul:latest"
container_name: "net-sci_discovery-service_consul_3"
hostname: "consul3"
command: "agent -server -join=consul1"
links:
- "consul1"
From the Consul Web GUI docs page: make sure you have launched an agent with the -ui parameter.
The UI is available at the /ui path on the same port as the HTTP API.
By default this is http://localhost:8500/ui
I do see 8500 mapped to your host on all interfaces (0.0.0.0).
Check also (as in this answer) whether setting client_addr can help (at least for testing): your agent logs show the HTTP and DNS servers listening only on 127.0.0.1 inside the container, which is not reachable through the published ports.
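A minimal sketch of that change (everything except the -client flag and the extra udp mapping is copied from your compose file; those two additions are suggestions, not verified against this exact setup): bind the client interfaces to 0.0.0.0 so the HTTP and DNS listeners are reachable through the published ports, and publish 8600/udp as well, since dig queries over UDP by default and your docker ps output shows only 8600/tcp published:
consul1:
  image: "consul:latest"
  ports:
    - "8500:8500"
    - "8600:8600"
    - "8600:8600/udp"
  command: "agent -server -ui -bootstrap-expect 3 -config-dir=/etc/consul.d -client=0.0.0.0"
Then, from the host:
$ curl http://127.0.0.1:8500/v1/catalog/service/consul
$ dig @127.0.0.1 -p 8600 consul.service.consul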

Resources