How to achieve zero downtime with docker stack

Docker updates the container, but network registration takes about 10 minutes to complete, so while the new container is being registered the page returns 502 because the internal network is still pointing at the old container. How can I delay the removal of the old container by 10 minutes or so after the update to the new one? Ideally I would like to push this config with docker stack, but I'll do whatever it takes. I should also note that I am unable to use replicas right now due to certain limitations of a security package I'm being forced to use.
version: '3.7'
services:
  xxx:
    image: ${xxx}/com.xxx:${xxx}
    environment:
      - SERVICE_NAME=xxx
      - xxx
      - _xxx
      - SPRING_PROFILES_ACTIVE=${xxx}
    networks:
      - xxx${xxx}
    healthcheck:
      interval: 1m
    deploy:
      mode: replicated
      replicas: 1
      resources:
        limits:
          cpus: '3'
          memory: 1024M
        reservations:
          cpus: '0.50'
          memory: 256M
      labels:
        - com.docker.lb.hosts=xxx${_xxx}.xxx.com
        - jenkins.url=${xxx}
        - com.docker.ucp.access.label=/${xxx}/xxx
        - com.docker.lb.network=xxx${_xxx}
        - com.docker.lb.port=8080
        - com.docker.lb.service_cluster=${xxx}
        - com.docker.lb.ssl_cert=xxx.cert
        - com.docker.lb.ssl_key=xxx.key
        - com.docker.lb.redirects=http://xxx${_xxx}.xxx.com/xxx,https://xxx${_xxx}.xxx.com/xxx
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        order: start-first
        failure_action: rollback
      rollback_config:
        parallelism: 0
        order: stop-first
    secrets:
      - ${xxx}
networks:
  xxx${_xxx}:
    external: true
secrets:
  ${xxx}:
    external: true
  xxx.cert:
    external: true
  xxx.key:
    external: true

Use a proper healthcheck - see the reference here: https://docs.docker.com/compose/compose-file/#healthcheck
So:
You need to define a proper test so Docker knows when your new container is fully up (that goes inside the test instruction of your healthcheck).
Use the start_period instruction to cover your 10 (or so) minute wait - otherwise, Docker Swarm would just kill your new container and never let it start.
Basically, once you get the healthcheck right, this should solve your issue.
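A minimal sketch of such a healthcheck, assuming curl exists in the image and the Spring app eventually answers on port 8080 (the /actuator/health path is an assumption - use whatever endpoint only responds once registration is done):

```yaml
healthcheck:
  # Only report healthy once the app actually answers; hypothetical endpoint.
  test: ["CMD-SHELL", "curl --silent --fail http://localhost:8080/actuator/health || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  # Failing probes during the first 10 minutes are ignored, so Swarm
  # won't kill the new task while the network registration completes.
  start_period: 10m
```

Combined with the order: start-first already present in update_config, the old task should only be removed once the new one reports healthy.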

Related

Docker swarm stack service replicas zero down time

I have been trying to fine-tune the docker compose settings, but I am not satisfied with the result, and the docs are unspecific about the healthcheck and update_config options.
The scenario is React apps which need to run build and start during entrypoint execution. The builds cannot be done in the Dockerfile, because then I would need to tag redundant images for each environment (amongst other inconveniences).
Because of the build and run steps, after the container is deployed it takes about 30 seconds until the healthcheck gets a positive response from the node server.
Now, in a rolling-update zero-downtime scenario, what settings would I use? The thing is, I don't need more than 1 replica. The ideal config option would be wait_rolling_update_delay or something that would stop Docker from replacing containers before this wait time. I am playing around with healthcheck.start_period, but I am not seeing a difference.
deploy:
  mode: replicated
  replicas: 1
  placement:
    constraints:
      - node.role == worker
  labels:
    - "APP=name"
    - "traefik.http.services.name.loadbalancer.server.port=1338"
  restart_policy:
    condition: any
    delay: 10s
    max_attempts: 3
    window: 60s
  update_config:
    parallelism: 1
    delay: 10s
    monitor: 10s
    order: start-first
    failure_action: rollback
healthcheck:
  test: "curl --silent --fail http://localhost:1338/_health || exit 1"
  interval: 10s
  timeout: 120s
  retries: 10
  start_period: 120s
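As far as I can tell there is no dedicated wait knob: with order: start-first, Swarm starts the replacement task first and only stops the old one once the new task reports healthy, so the effective levers are the healthcheck timings plus update_config.monitor. Note that start_period only discards failing probes; a passing probe still flips the task to healthy immediately, which may be why changing it shows no visible difference. A sketch with assumed values (tune them to the ~30 s build-and-start time):

```yaml
deploy:
  update_config:
    parallelism: 1
    order: start-first      # bring the new task up before stopping the old one
    failure_action: rollback
    monitor: 60s            # keep watching the new task after the update
healthcheck:
  test: "curl --silent --fail http://localhost:1338/_health || exit 1"
  interval: 5s              # probe often so "healthy" (and the cut-over) happens quickly
  timeout: 5s
  retries: 3
  start_period: 60s         # ~2x the observed 30 s; failures in this window don't count
```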

Docker deploy swarm instance on specific node matching instance index

Using docker swarm, I am trying to deploy N instances of my app on N nodes in a way that each app is deployed on the node with the corresponding index. E.g.: app1 must be deployed on node1, app2 on node2, ...
The below is not working, as it complains Error response from daemon: rpc error: code = Unknown desc = value 'node{{.Task.Slot}}' is invalid.
Any suggestion on how to achieve this?
I also have the impression, as a long shot, that I could use something with node labels, but I cannot wrap my head around it yet. Anyhow, please advise.
version: "3.8"
services:
  app:
    image: app:latest
    hostname: "app{{.Task.Slot}}"
    networks:
      - app-net
    volumes:
      - "/data/shared/app{{.Task.Slot}}/config:/app/config"
    deploy:
      replicas: 5
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: any
      placement:
        constraints:
          - "node.hostname==node{{.Task.Slot}}" # <========= ERROR
Service template parameters are documented as only resolving in:
the hostname: directive,
volume definitions,
labels, and
environment variables.
Placement preferences / constraints are not supported, but support there would be brilliant, as it would allow simple deployments of Minio, etcd, Consul and other clustered services where you need to pin replicas to nodes.
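For reference, the places where the templates do resolve can be combined in one service. This sketch sticks to hostname, environment and volume templating; the SLOT variable name is made up, and the constraint shown is deliberately a literal, since templates are rejected there:

```yaml
version: "3.8"
services:
  app:
    image: app:latest
    hostname: "app{{.Task.Slot}}"        # resolves to app1, app2, ...
    environment:
      - "SLOT={{.Task.Slot}}"            # resolves inside env values
    volumes:
      - "/data/shared/app{{.Task.Slot}}/config:/app/config"  # resolves in mounts
    deploy:
      replicas: 5
      placement:
        constraints:
          - "node.labels.tier == app"    # constraints only take literal values
```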

How to reduce the amount of chunks to prevent running out of disk space for Loki/Promtail?

I'm currently evaluating Loki and facing issues with running out of disk space due to the amount of chunks.
My instance is running in Docker containers using a docker-compose setup (Loki, Promtail, Grafana) from the official documentation (see docker-compose.yml below).
I'm more or less using the default configuration of Loki and Promtail, except for some tweaks to the retention period (I need 3 months) plus a higher ingestion rate and ingestion burst size (see configs below).
I bind-mounted a volume containing 1 TB of log files (MS Exchange logs) and set up a job in Promtail using only one label.
The resulting chunks are constantly eating up disk space, and I had to expand the VM disk incrementally up to 1 TB.
Currently, I have 0.9 TB of chunks. Shouldn't this be far less (like 25% of the initial log size)? Over the last weekend, I stopped the Promtail container to prevent running out of disk space. Today I started Promtail again and got the following warning:
level=warn ts=2022-01-24T08:54:57.763739304Z caller=client.go:349 component=client host=loki:3100 msg="error sending batch, will retry" status=429 error="server returned HTTP status 429 Too Many Requests (429): Ingestion rate limit exceeded (limit: 12582912 bytes/sec) while attempting to ingest '2774' lines totaling '1048373' bytes, reduce log volume or contact your Loki administrator to see if the limit can be increased"
I had this warning before, and increasing ingestion_rate_mb to 12 and ingestion_burst_size_mb to 24 fixed it...
I'm kind of at a dead end here.
Docker Compose
version: "3"
networks:
  loki:
services:
  loki:
    image: grafana/loki:2.4.1
    container_name: loki
    restart: always
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    volumes:
      - ${DATADIR}/loki/etc:/etc/loki:rw
    networks:
      - loki
  promtail:
    image: grafana/promtail:2.4.1
    container_name: promtail
    restart: always
    volumes:
      - /var/log/exchange:/var/log
      - ${DATADIR}/promtail/etc:/etc/promtail
    ports:
      - "1514:1514" # for syslog-ng
      - "9080:9080" # for http web interface
    command: -config.file=/etc/promtail/config.yml
    networks:
      - loki
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: always
    volumes:
      - grafana_var:/var/lib/grafana
    ports:
      - "3000:3000"
    networks:
      - loki
volumes:
  grafana_var:
Loki Config:
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
ruler:
  alertmanager_url: http://localhost:9093
# https://grafana.com/docs/loki/latest/configuration/#limits_config
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 24
  per_stream_rate_limit: 12MB
chunk_store_config:
  max_look_back_period: 336h
table_manager:
  retention_deletes_enabled: true
  retention_period: 2190h
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_encoding: snappy
Promtail Config
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: exchange
    static_configs:
      - targets:
          - localhost
        labels:
          job: exchangelog
          __path__: /var/log/*/*/*log
The issue was solved: the logs were stored on ZFS with compression enabled and were therefore listed as much smaller on the file system than they really were. The chunk size was actually accurate. My bad.

no suitable node - Unable to match constraints services with docker swarm nodes

I execute
sudo docker node ls
And this is my output
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
smlsbj3r6qjq7s22cl9cgi1f1 * ip-172-30-0-94 Ready Active Leader 20.10.2
b5e3w8nvrw3kw8q3sg1188439 ip-172-30-0-107 Ready Active 20.10.2
phjkfj09ydvgzztaib2zxcfv9 ip-172-30-0-131 Ready Active 20.10.2
m73z9ikte16klds06upruifji ip-172-30-0-193 Ready Active 20.10.2
So far so good: I know I have one manager and 3 workers. So, if I have a service with a constraint that matches the node.role property of the worker nodes, some of them will be elected by Docker Swarm to execute the containers related to the service itself.
The info of my current service is this:
ID: 5p4hpxmvru9kbwz9y5oymoeq0
Name: elasbit_relay1
Service Mode: Replicated
Replicas: 1
Placement:
Constraints: [node.role!=manager]
UpdateConfig:
Parallelism: 1
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Update order: stop-first
RollbackConfig:
Parallelism: 1
On failure: pause
Monitoring Period: 5s
Max failure ratio: 0
Rollback order: stop-first
ContainerSpec:
Image: inputoutput/cardano-node:latest#sha256:02779484dc23731cdbea6388920acc6ddd8e40c03285bc4f9c7572a91fe2ee08
Args: run --topology /configuration/testnet-topology.json --database-path /db --socket-path /db/node.socket --host-addr 0.0.0.0 --port 3001 --config /configuration/testnet-config.json
Init: false
Mounts:
Target: /configuration
Source: /home/ubuntu/cardano-docker-run/testnet
ReadOnly: true
Type: bind
Target: /db
Source: db
ReadOnly: false
Type: volume
Resources:
Endpoint Mode: vip
Ports:
PublishedPort = 12798
Protocol = tcp
TargetPort = 12798
PublishMode = ingress
The key part is [node.role!=manager]. It gives me no suitable node (unsupported platform on 3 nodes; scheduling constraints ….
I tried a lot of ways:
Use the docker-compose format (yml) with a constraints list:
deploy:
  replicas: 1
  placement:
    constraints: [node.role==worker]
  restart_policy:
    condition: on-failure
Use labels on the nodes.
In all of them I failed. The funny part is that if I point some constraint at the manager node, it works! Do I have a typo somewhere? Well, I don't see it.
I'm using Docker version 20.10.2, build 20.10.2-0ubuntu1~18.04.2.
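A hedged side note on the error text: "unsupported platform" usually points at the image not providing a build for the CPU architecture (or OS) of those three nodes, rather than at the constraint syntax, which would also explain why scheduling onto the manager works. A sketch that makes the requirement explicit (the x86_64 value is an assumption - compare it with what docker node inspect reports for each node):

```yaml
deploy:
  replicas: 1
  placement:
    constraints:
      - node.role == worker           # stay off the manager
      - node.platform.arch == x86_64  # only nodes whose arch the image supports
```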

How to check Docker Swarm resources reservation via cli

I introduced Docker Swarm resource limits on a cluster (24 GB RAM and 12 VCPUs) and specified service resource reservations with the following configuration:
redis:
  image: redis
  deploy:
    replicas: 1
    resources:
      reservations:
        cpus: '1'
        memory: 300m
  ports:
    - "6379:6379"
Now the problem is that I get the error no suitable node (insufficient resources on 3 nodes), and I can't understand which resources are exhausted and where exactly. Is there a way to see the overall resource reservations?
