Corruption of Portainer's DB - docker-swarm

I have a deployment of Portainer 2.14.2 and Docker Engine 20.10.7.
It has been working fine for quite a few months. Today I had some problems: the Portainer container (the one in charge of the UI, not the agent) kept restarting, and during one of those restarts, for an unknown reason, the database was corrupted.
Logs:
time="2022-10-19T10:59:15Z" level=info msg="Encryption key file `portainer` not present"
time="2022-10-19T10:59:15Z" level=info msg="Proceeding without encryption key"
time="2022-10-19T10:59:15Z" level=info msg="Loading PortainerDB: portainer.db"
panic: page 8 already freed
goroutine 35 [running]:
go.etcd.io/bbolt.(*freelist).free(0xc000728600, 0xb175, 0x7f104c311000)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/freelist.go:175 +0x2c8
go.etcd.io/bbolt.(*node).spill(0xc000152070)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/node.go:359 +0x216
go.etcd.io/bbolt.(*node).spill(0xc000152000)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/node.go:346 +0xaa
go.etcd.io/bbolt.(*Bucket).spill(0xc00013e018)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/bucket.go:570 +0x33f
go.etcd.io/bbolt.(*Tx).Commit(0xc00013e000)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/tx.go:160 +0xe7
go.etcd.io/bbolt.(*DB).Update(0xc0001f1000?, 0xc000134ef8)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:748 +0xe5
go.etcd.io/bbolt.(*batch).run(0xc00031c000)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:856 +0x126
sync.(*Once).doSlow(0x0?, 0x1?)
	/opt/hostedtoolcache/go/1.18.3/x64/src/sync/once.go:68 +0xc2
sync.(*Once).Do(...)
	/opt/hostedtoolcache/go/1.18.3/x64/src/sync/once.go:59
go.etcd.io/bbolt.(*batch).trigger(0xc000321a00?)
	/tmp/go/pkg/mod/go.etcd.io/bbolt@v1.3.6/db.go:838 +0x45
created by time.goFunc
	/opt/hostedtoolcache/go/1.18.3/x64/src/time/sleep.go:176 +0x32
My hypothesis is that during one of those restarts the container was stopped in the middle of a write (although I am not 100% sure).
This is the first time this has happened to me, so I don't know how to recover from this state without deploying a new Portainer stack or erasing the whole database, which would be a really drastic solution.
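Before wiping anything, a best-effort salvage with bbolt's own CLI might be worth a try. This is only a sketch: it assumes a Go toolchain on the host, the data path from the compose file below, and that the stack was deployed under the name portainer. Since the panic happened while committing a write transaction, read-only tooling may still be able to open the file.

# Stop Portainer so nothing holds the database open, then work on a copy
docker service scale portainer_portainer=0
cp /var/volumes/portainer/data/portainer.db ./portainer.db.bak

# Install bbolt's CLI and sanity-check the copy
go install go.etcd.io/bbolt/cmd/bbolt@latest
bbolt check ./portainer.db.bak

# 'compact' rewrites the database page by page and can sometimes salvage
# a file with a corrupted freelist; best-effort, not guaranteed
bbolt compact -o ./portainer.recovered.db ./portainer.db.bak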
If it helps, this is the docker-compose file:
version: "3.8"

networks:
  net:
    external: true

services:
  agent:
    image: portainer/agent:2.14.2-alpine
    environment:
      AGENT_CLUSTER_ADDR: tasks.agent
      AGENT_PORT: 9001
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      - net
    deploy:
      mode: global
      restart_policy:
        condition: on-failure

  portainer:
    image: portainer/portainer-ce:2.14.2-alpine
    command: -H tcp://tasks.agent:9001 --tlsskipverify --admin-password-file=/run/secrets/portainer_secret
    ports:
      - "9000:9000"
      - "8000:8000"
    volumes:
      - "/var/volumes/portainer/data:/data"
    networks:
      - net
    secrets:
      - portainer_secret
      - source: ca_cert_secret
        target: /etc/ssl/certs/localCA.pem
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.labels.stateful == true
      labels:
        - "traefik.enable=true"
        - "traefik.passHostHeader=true"
        - "traefik.http.routers.portainer.rule=Host(`portainer`)"
        - "traefik.http.services.portainer.loadbalancer.server.port=9000"
        - "traefik.http.routers.portainer.entrypoints=web"
        - "traefik.http.routers.portainer.service=portainer"
        - "traefik.http.routers.portainer.tls=true"
        - "traefik.http.routers.portainer.entrypoints=web-secure"

secrets:
  portainer_secret:
    external: true
  ca_cert_secret:
    external: true
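If a salvage attempt like the sketch above yields a usable file, restoring it is just a copy plus a rescale (again assuming the stack name portainer):

# Put the recovered file back on the node labelled stateful, then restart
cp ./portainer.recovered.db /var/volumes/portainer/data/portainer.db
docker service scale portainer_portainer=1
docker service logs --follow portainer_portainer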

Related

Traefik 2 Gateway timeout when attempting to proxy to a container that is on two networks

There are plenty of questions on this front, but they use a more complex docker-compose.yml than I do, so I fear they may have misconfigurations in their compose files, such as this one:
Traefik 2 Gateway Timeout
Within a single docker-compose.yml, I am trying to keep the database container on its own network, the app container on both the database network and the Traefik network, and the Traefik network itself managed elsewhere by Traefik.
version: '3.9'

services:
  wordpress:
    image: wordpress:6.1
    container_name: dev-wp1
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 256M
    restart: always
    environment:
      WORDPRESS_DB_HOST: db
      WORDPRESS_DB_USER: dev
      WORDPRESS_DB_PASSWORD: dev
      WORDPRESS_DB_NAME: dev
    volumes:
      - /opt/container_config/exampledomain.local/wp:/var/www/html
    networks:
      - traefik-network
      - db-network
    labels:
      - traefik.enable=true
      - traefik.http.routers.dev-wp1.rule=Host(`exampledomain.local`)
      - traefik.http.routers.dev-wp1.entrypoints=websecure
      - traefik.http.routers.dev-wp1.tls=true

  db:
    image: mariadb:10.10
    container_name: dev-db1
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 256M
    restart: always
    environment:
      MYSQL_DATABASE: dev
      MYSQL_USER: dev
      MYSQL_PASSWORD: dev
      MYSQL_RANDOM_ROOT_PASSWORD: '1'
    volumes:
      - /opt/container_config/exampledomain.local/db:/var/lib/mysql
    networks:
      - db-network

networks:
  db-network:
    name: db-network
  traefik-network:
    name: traefik-network
    external: true
Attempting to hit exampledomain.local fails.
If I eliminate db-network and place the database on the Traefik network, resolution to exampledomain.local works fine. I do not wish to expose the ports of the wp1 container, and would prefer Traefik's ports to be the only ones exposed on the host. I would also prefer not to have the db container on traefik-network. What am I missing?
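A hedged guess based on these symptoms: when a container is attached to several networks, Traefik can pick the wrong network's IP to proxy to, which surfaces as a gateway timeout. Traefik v2 lets you pin the network per container with the traefik.docker.network label; a sketch (the override-file approach is an assumption, the label itself is standard Traefik):

# Add the network hint via an override file; 'docker compose' merges
# override files and merges the label sets by key
cat > docker-compose.override.yml <<'EOF'
services:
  wordpress:
    labels:
      - traefik.docker.network=traefik-network
EOF
docker compose up -d wordpress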

Elasticsearch, Logstash and Kibana (ELK) stack docker-compose on EC2 failed status checks

I'm running an ELK stack on a t4g.medium box (ARM, 4 GB RAM) on AWS. When using the official Kibana image I see weird behaviour where, after approximately 4 hours running, the CPU spikes (50-60%) and the EC2 box becomes unreachable until restarted; 1 out of 2 status checks fails as well. Once restarted, it runs for another 4 or so hours, then the same happens again. The instance is not under heavy load, and it goes down in the middle of the night when there is no load. I'm 99.9% sure it's Kibana causing the issue, as gagara/kibana-oss-arm64:7.6.2 ran for months without issue. It's not an ARM issue or Kibana 7.13 either, as I've encountered the same with x86 on older versions of Kibana. My config is:
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.13.0
    configs:
      - source: elastic_config
        target: /usr/share/elasticsearch/config/elasticsearch.yml
    environment:
      ES_JAVA_OPTS: "-Xmx2g -Xms2g"
    networks:
      - internal
    volumes:
      - /mnt/data/elasticsearch:/usr/share/elasticsearch/data
    deploy:
      mode: replicated
      replicas: 1

  logstash:
    image: docker.elastic.co/logstash/logstash:7.13.0
    ports:
      - "5044:5044"
      - "9600:9600"
    configs:
      - source: logstash_config
        target: /usr/share/logstash/config/logstash.yml
      - source: logstash_pipeline
        target: /usr/share/logstash/pipeline/logstash.conf
    environment:
      LS_JAVA_OPTS: "-Xmx1g -Xms1g"
    networks:
      - internal
    deploy:
      mode: replicated
      replicas: 1

  kibana:
    image: docker.elastic.co/kibana/kibana:7.13.0
    configs:
      - source: kibana_config
        target: /usr/share/kibana/config/kibana.yml
    environment:
      NODE_OPTIONS: "--max-old-space-size=300"
    networks:
      - internal
    deploy:
      mode: replicated
      replicas: 1
      labels:
        - "traefik.enable=true"

  load-balancer:
    image: traefik:v2.2.8
    ports:
      - 5601:443
    configs:
      - source: traefik_config
        target: /etc/traefik/traefik.toml
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      restart_policy:
        condition: any
      mode: replicated
      replicas: 1
    networks:
      - internal

configs:
  elastic_config:
    file: ./config/elasticsearch.yml
  logstash_config:
    file: ./config/logstash/logstash.yml
  logstash_pipeline:
    file: ./config/logstash/pipeline/pipeline.conf
  kibana_config:
    file: ./config/kibana.yml
  traefik_config:
    file: ./config/traefik.toml

networks:
  internal:
    driver: overlay
And I've disabled a pile of stuff in kibana.yml to see if that helped:
server.name: kibana
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://elasticsearch:9200"]
xpack.monitoring.ui.enabled: false
xpack.graph.enabled: false
xpack.infra.enabled: false
xpack.canvas.enabled: false
xpack.ml.enabled: false
xpack.uptime.enabled: false
xpack.maps.enabled: false
xpack.apm.enabled: false
timelion.enabled: false
Has anyone encountered similar problems with a single node ELK stack running on Docker?
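One avenue that may be worth ruling out (the figures and the stack name elk are illustrative, not from the question): on a 4 GB host, a 2 GB Elasticsearch heap plus a 1 GB Logstash heap leaves very little for Kibana and the OS, so even a slow Kibana leak can take the whole box down. Capping container memory turns that into a container-level OOM kill instead of an unreachable instance.

# Check whether the kernel has been OOM-killing or the box is swapping
dmesg -T | grep -iE 'out of memory|killed process'
free -m
docker stats --no-stream

# Cap Kibana's memory via an extra compose file; docker stack deploy
# merges multiple --compose-file arguments (stack name 'elk' assumed)
cat > limits.yml <<'EOF'
version: '3.8'
services:
  kibana:
    deploy:
      resources:
        limits:
          memory: 600M
EOF
docker stack deploy -c docker-compose.yml -c limits.yml elk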

Netdata in a docker swarm environment

I'm quite new to Netdata and also to Docker Swarm. I ran Netdata for a while on single hosts and am now trying to stream Netdata from the workers to a manager node in a swarm environment, where the manager should also act as a central Netdata instance. I'm aiming to only monitor the data from the manager.
Here's my compose file for the stack:
version: '3.2'

services:
  netdata-client:
    image: titpetric/netdata
    hostname: "{{.Node.Hostname}}"
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_STREAM_DESTINATION=control:19999
      - NETDATA_STREAM_API_KEY=1x214ch15h3at1289y
      - PGID=999
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - netdata
    deploy:
      mode: global
      placement:
        constraints: [node.role == worker]

  netdata-central:
    image: titpetric/netdata
    hostname: control
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_API_KEY_ENABLE_1x214ch15h3at1289y=1
    ports:
      - '19999:19999'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - netdata
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

networks:
  netdata:
    driver: overlay
    attachable: true
Netdata on the manager works fine, and the container runs on the one worker node I'm testing on. According to the log output it seems to run well and picks up the names of the running Docker containers, as it does in a local environment.
Problem is that it can't connect to the netdata-central service running on the manager.
This is the error message:
2019-01-04 08:35:28: netdata INFO : STREAM_SENDER[7] : STREAM 7 [send to control:19999]: connecting...
2019-01-04 08:35:28: netdata ERROR : STREAM_SENDER[7] : Cannot resolve host 'control', port '19999': Name or service not known
I'm not sure why it can't resolve the hostname; I thought it would work that way on the overlay network. Maybe there's a better way to connect that doesn't rely on the hostname?
Any help is appreciated.
EDIT: as this question might come up: the firewall (ufw) on the control host is inactive, and I think the error message clearly points to a problem with name resolution.
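One thing worth checking (a debugging sketch; the container name is a placeholder): on an overlay network, Swarm's DNS registers the service name (netdata-central here) and tasks.netdata-central, but not a task's hostname, so control may simply not exist in DNS. Pointing the stream destination at the service name instead may already fix it.

# From inside a client task, see what actually resolves on the overlay
# network (nslookup may need to be installed in the image)
docker exec -it <netdata-client-container> nslookup control
docker exec -it <netdata-client-container> nslookup netdata-central        # service VIP
docker exec -it <netdata-client-container> nslookup tasks.netdata-central  # per-task IPs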
Your API key is in the wrong format; it has to be a GUID. You can generate one with the uuidgen command:
https://github.com/netdata/netdata/blob/63c96aa96f96f3aea10bdcd2ecd92c889f26b3af/conf.d/stream.conf#L7
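For example:

# Generate a GUID-formatted key to use as the stream API key
uuidgen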
In the latest image, the environment variables do not work.
The solution is to create a configuration file for the stream.
My working compose file is:
version: '3.7'

configs:
  netdata_stream_master:
    file: $PWD/stream-master.conf
  netdata_stream_client:
    file: $PWD/stream-client.conf

services:
  netdata-client:
    image: netdata/netdata:v1.21.1
    hostname: "{{.Node.Hostname}}"
    depends_on:
      - netdata-central
    configs:
      - mode: 444
        source: netdata_stream_client
        target: /etc/netdata/stream.conf
    security_opt:
      - apparmor:unconfined
    environment:
      - PGID=999
    volumes:
      - /proc:/host/proc:ro
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: global

  netdata-central:
    image: netdata/netdata:v1.21.1
    hostname: control
    configs:
      - mode: 444
        source: netdata_stream_master
        target: /etc/netdata/stream.conf
    security_opt:
      - apparmor:unconfined
    environment:
      - PGID=999
    ports:
      - '19999:19999'
    volumes:
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /var/run/docker.sock:/var/run/docker.sock
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
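For completeness, a sketch of what the two stream.conf files referenced above typically contain (the GUID is a placeholder; the section syntax follows Netdata's stream.conf):

# stream-client.conf: workers push their metrics to the central instance
cat > stream-client.conf <<'EOF'
[stream]
    enabled = yes
    destination = control:19999
    api key = 11111111-2222-3333-4444-555555555555
EOF

# stream-master.conf: the central instance accepts that API key
cat > stream-master.conf <<'EOF'
[11111111-2222-3333-4444-555555555555]
    enabled = yes
EOF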

Docker Stack Swarm - Service Replicas are not spread for Multi Service Stack

I have deployed a stack with 4 services on two hosts (docker compose version 3).
The services are Elasticsearch, Kibana, Redis, Visualiser and finally my Web App. I haven't set any resource restrictions yet.
I spun up two virtual hosts via docker-machine, one with 2GB and one with 1GB of RAM.
Then I increased the replicas of my web app to 2, which resulted in the following distribution:
Host1 (Master):
Kibana, Redis, Web App, Visualiser, WebApp
Host2 (Worker):
Elasticsearch
Why is the Swarm manager distributing both Web App containers to the same host? Wouldn't it be smarter if the Web App were distributed to both hosts?
Besides node tagging, I couldn't find any other way in the docs to influence the distribution.
Am I missing something?
Thanks
Bjorn
docker-compose.yml
version: "3"

services:
  visualizer:
    image: dockersamples/visualizer:stable
    ports:
      - "8080:8080"
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"
    deploy:
      placement:
        constraints: [node.role == manager]
    networks:
      - webnet

  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:5.4.3
    environment:
      ES_JAVA_OPTS: -Xms1g -Xmx1g
    ulimits:
      memlock: -1
      nofile:
        hard: 65536
        soft: 65536
      nproc: 65538
    deploy:
      resources:
        limits:
          cpus: "0.5"
          memory: 1g
    volumes:
      - esdata:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - webnet

  web:
    # replace username/repo:tag with your name and image details
    image: bjng/workinseason:swarm
    deploy:
      replicas: 2
      restart_policy:
        condition: on-failure
    ports:
      - "80:6000"
    networks:
      - webnet

  kibana:
    image: docker.elastic.co/kibana/kibana:5.4.3
    deploy:
      placement:
        constraints: [node.role == manager]
    ports:
      - "5601:5601"
    networks:
      - webnet

  redis:
    image: "redis:alpine"
    networks:
      - webnet

volumes:
  esdata:
    driver: local

networks:
  webnet:
Docker schedules tasks (containers) based on available resources; if two nodes have enough resources, the container can be scheduled on either one.
Recent versions of Docker use "HA" scheduling by default, which means that tasks of the same service are spread over multiple nodes where possible; see this pull request: https://github.com/docker/swarmkit/pull/1446
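To see how the scheduler actually spread the tasks, something like this should do (the service name assumes the stack was deployed as app):

# One line per task, with the node it landed on
docker service ps app_web --format '{{.Name}} {{.Node}} {{.CurrentState}}'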

Docker Swarm connection between containers refused for some containers

simplified swarm:
manager1 node
- consul-agent
worker1 node
- consul-client1
- web-app:80
- web-network:9000
database1 node
- consul-client2
- redis:6379
- mongo:27017
The web-app and web-network services can connect to redis and mongo through their service names correctly, e.g. redis.createClient('6379', 'redis') and mongoose.connect('mongodb://mongo').
However, the web-app container cannot connect to web-network. I'm trying to make a request like so:
request('http://web-network:9000')
But get the error:
errorno: ECONNREFUSED
address: 10.0.1.9
port: 9000
Request to web-network using a private IP does work:
request('http://11.22.33.44:9000')
What am I missing? Why can they connect to redis and mongo but not to each other? When moving redis/mongo to the same node as web-app, it still works, so I don't think the issue is that services cannot talk to a service on the same node.
Can we make docker network use private IP instead of the pre-configured subnet?
docker stack deploy file
version: '3'

services:
  web-app:
    image: private-repo/private-image
    networks:
      - swarm-network
    ports:
      - "80:8080"
    deploy:
      placement:
        constraints:
          - node.role==worker

  web-network:
    image: private-repo/private-image2
    networks:
      - swarm-network
    ports:
      - "9000:8080"
    deploy:
      placement:
        constraints:
          - node.role==worker

  redis:
    image: redis:latest
    networks:
      - swarm-network
    ports:
      - "6739:6739"
    deploy:
      placement:
        constraints:
          - engine.labels.purpose==database

  mongo:
    image: mongo:latest
    networks:
      - swarm-network
    ports:
      - "27017:27017"
    deploy:
      placement:
        constraints:
          - engine.labels.purpose==database

networks:
  swarm-network:
    driver: overlay
docker stack deploy app -c docker-compose.yml
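A hedged reading of the compose file above: on the overlay network, a service name resolves to a virtual IP that forwards to the container port, not the published one. web-network maps 9000:8080, so http://web-network:9000 hits the VIP on a port nothing listens on (hence ECONNREFUSED), while node-IP:9000 works because published ports go through the routing mesh. That also matches redis and mongo working: 6379 and 27017 are their container ports. A quick check (the container name is a placeholder):

# From inside the web-app task: talk to the service VIP on the
# container port (8080), not the published port (9000);
# wget/curl availability depends on the image
docker exec -it <web-app-container> wget -qO- http://web-network:8080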
