Bad gateway as Traefik fails to point to a new service instance after a failed health check - docker

I have a Docker Swarm web application that can deploy fine and can be reached from outside by a browser.
But it was not passing the client IP address on to the HTTP service.
So I decided to add a Traefik service to the Docker Compose file to expose the client IP to the HTTP service.
For that reason I use mode: host for the published ports and driver: overlay for the network.
The complete configuration is described in two Docker Compose files that I run in sequence.
First I run the docker stack deploy --compose-file docker-compose-dev.yml common command on the file:
version: "3.9"
services:
  traefik:
    image: traefik:v2.5
    networks:
      common:
    ports:
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
    command:
      - "--providers.docker.endpoint=unix:///var/run/docker.sock"
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.network=common"
      - "--entrypoints.web.address=:80"
      # Set a debug level custom log file
      - "--log.level=DEBUG"
      - "--log.filePath=/var/log/traefik.log"
      - "--accessLog.filePath=/var/log/access.log"
      # Enable the Traefik dashboard
      - "--api.dashboard=true"
    deploy:
      placement:
        constraints:
          - node.role == manager
      labels:
        # Expose the Traefik dashboard
        - "traefik.enable=true"
        - "traefik.http.routers.dashboard.service=api@internal"
        - "traefik.http.services.traefik.loadbalancer.server.port=888" # A port number required by Docker Swarm but not actually used
        - "traefik.http.routers.dashboard.rule=Host(`traefik.learnintouch.com`)"
        - "traefik.http.routers.traefik.entrypoints=web"
        # Basic HTTP authentication to secure the dashboard access
        - "traefik.http.routers.traefik.middlewares=traefik-auth"
        - "traefik.http.middlewares.traefik-auth.basicauth.users=stephane:$$apr1$$m72sBfSg$$7.NRvy75AZXAMtH3C2YTz/"
    volumes:
      # So that Traefik can listen to the Docker events
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
      - "~/dev/docker/projects/common/volumes/logs/traefik.service.log:/var/log/traefik.log"
      - "~/dev/docker/projects/common/volumes/logs/traefik.access.log:/var/log/access.log"
networks:
  common:
    name: common
    driver: overlay
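As an aside, a hash like the one in the basicauth label above can be generated with openssl (htpasswd from apache2-utils works just as well). The user name stephane matches the config above, while the password secret below is only a placeholder. Note that every $ of the hash must be doubled to $$ in a compose file so that variable interpolation leaves it alone:

```shell
# Generate an apr1 (htpasswd-style) hash and escape it for a compose file.
# "secret" is a placeholder password, not the one used in the question.
hash=$(openssl passwd -apr1 secret)
escaped=$(printf '%s' "stephane:$hash" | sed -e 's/\$/$$/g')
echo "$escaped"
```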
Then I run the docker stack deploy --compose-file docker-compose.yml www_learnintouch command on the file:
version: "3.9"
services:
  www:
    image: localhost:5000/www.learnintouch
    networks:
      common:
    volumes:
      - "~/dev/docker/projects/learnintouch/volumes/www.learnintouch/account/data:/usr/local/learnintouch/www/learnintouch.com/account/data"
      - "~/dev/docker/projects/learnintouch/volumes/www.learnintouch/account/backup:/usr/local/learnintouch/www/learnintouch.com/account/backup"
      - "~/dev/docker/projects/learnintouch/volumes/engine:/usr/local/learnintouch/engine"
      - "~/dev/docker/projects/common/volumes/letsencrypt/certbot/conf/live/thalasoft.com:/usr/local/learnintouch/letsencrypt"
      - "~/dev/docker/projects/common/volumes/logs:/usr/local/apache/logs"
      - "~/dev/docker/projects/common/volumes/logs:/usr/local/learnintouch/logs"
    user: "${CURRENT_UID}:${CURRENT_GID}"
    deploy:
      replicas: 1
      restart_policy:
        condition: any
        delay: 5s
        max_attempts: 3
        window: 10s
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.www.rule=Host(`dev.learnintouch.com`)"
        - "traefik.http.routers.www.entrypoints=web"
        - "traefik.http.services.www.loadbalancer.server.port=80"
    healthcheck:
      test: curl --fail http://127.0.0.1:80/engine/ping.php || exit 1
      interval: 10s
      timeout: 3s
      retries: 3
networks:
  common:
    external: true
    name: common
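A note on the healthcheck above: curl's --fail flag makes it exit non-zero on HTTP 4xx/5xx responses as well as on connection errors; without it, a 500 from ping.php would still count as healthy. The || exit 1 normalizes every failure to status 1 for Docker. A quick sketch of the failure path, probing a port where nothing is listening (127.0.0.1:1 is just an example):

```shell
# Run the same shape of command as the healthcheck against a closed port.
rc=0
sh -c 'curl --fail --silent --max-time 2 http://127.0.0.1:1/engine/ping.php || exit 1' || rc=$?
echo "health check exit code: $rc"
```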
Here are the networks:
stephane@stephane-pc:~$ docker network ls
NETWORK ID     NAME                      DRIVER    SCOPE
6beaf0c3a518   bridge                    bridge    local
ouffqdmdesuy   common                    overlay   swarm
17et43c5tuf0   docker-registry_default   overlay   swarm
1ae825c8c821   docker_gwbridge           bridge    local
7e6b4b7733ca   host                      host      local
2ui8s1yomngt   ingress                   overlay   swarm
460aad21ada9   none                      null      local
tc846a14ftz5   verdaccio                 overlay   swarm
The docker ps command shows that all containers are healthy.
But a request to http://dev.learnintouch.com/ responds with a Bad Gateway error most of the time; only rarely does it succeed and the web application display fine.
As a side note, I would like any unhealthy service to be restarted and seen by Traefik. Just like Docker Swarm restarts unhealthy services, I would like Traefik to restart unhealthy services too.
The service log:
{"level":"debug","msg":"Configuration received from provider docker: {\"http\":{\"routers\":{\"dashboard\":{\"service\":\"api#internal\",\"rule\":\"Host(`traefik.learnintouch.com`)\"},\"nodejs\":{\"entryPoints\":[\"web\"],\"service\":\"nodejs\",\"rule\":\"Host(`dev.learnintouch.com`)\"},\"traefik\":{\"entryPoints\":[\"web\"],\"middlewares\":[\"traefik-auth\"],\"service\":\"traefik\",\"rule\":\"Host(`common-reverse-proxy`)\"},\"www\":{\"entryPoints\":[\"web\"],\"service\":\"www\",\"rule\":\"Host(`dev.learnintouch.com`)\"}},\"services\":{\"nodejs\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://10.0.14.17:9001\"}],\"passHostHeader\":true}},\"traefik\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://10.0.14.8:888\"}],\"passHostHeader\":true}},\"www\":{\"loadBalancer\":{\"servers\":[{\"url\":\"http://10.0.14.18:80\"}],\"passHostHeader\":true}}},\"middlewares\":{\"traefik-auth\":{\"basicAuth\":{\"users\":[\"stephane:$apr1$m72sBfSg$7.NRvy75AZXAMtH3C2YTz/\"]}}}},\"tcp\":{},\"udp\":{}}","providerName":"docker","time":"2021-07-04T10:25:01Z"}
{"level":"info","msg":"Skipping same configuration","providerName":"docker","time":"2021-07-04T10:25:01Z"}
I also tried to have Docker Swarm do the load balancing by adding the - "traefik.docker.lbswarm=true" property to my service, but the Bad Gateway error remained.
I also restarted the Swarm manager:
docker swarm leave --force
docker swarm init
but the Bad Gateway error remained.
I also added the two labels:
- "traefik.backend.loadbalancer.sticky=true"
- "traefik.backend.loadbalancer.stickiness=true"
but the Bad Gateway error remained.
It feels like Traefik hits the web service before it has had a chance to be ready. Is there any way to tell Traefik to wait a given number of seconds before hitting the web service?
UPDATE: I found, not a solution, but a workaround for the issue, by splitting the first common stack file above into two files, one of them dedicated to the traefik stack. I could then start the 3 stacks in the following order: common, www_learnintouch and traefik.
The important thing was to start the traefik stack after the others. If I then had to remove and start the www_learnintouch stack again, for example, I had to follow that by removing and starting the traefik stack again.
Also, if I removed the container of www_learnintouch with a docker rm -f CONTAINER_ID command, then I also needed to remove and start the traefik stack again.
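For reference, the workaround boils down to this deploy order (the stack names are the ones from the question; the file name docker-compose-traefik.yml for the split-out traefik stack is an assumption):

```shell
# Deploy the application stacks first, traefik last.
docker stack deploy --compose-file docker-compose-dev.yml common
docker stack deploy --compose-file docker-compose.yml www_learnintouch
docker stack deploy --compose-file docker-compose-traefik.yml traefik

# After re-deploying (or docker rm -f'ing) a routed service,
# cycle the traefik stack again:
docker stack rm traefik
docker stack deploy --compose-file docker-compose-traefik.yml traefik
```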

Related

Docker Swarm - Requests fail to reach a service on a different node

I've set up a Docker Swarm with Traefik v2 as the reverse proxy, and have been able to access the dashboard with no issues.
I am having an issue where I cannot get a response from any service that runs on a different node than the node Traefik is running on. I've been testing and researching, and I presume it's a network issue of some type.
I've done some quick testing with an empty Nginx image and was able to deploy another stack and get a response if the image was on the same node. Other stacks on the swarm which deploy across multiple nodes (but not including the Traefik node) are able to communicate with each other without issues.
Here is the test stack to provide some context of what I was using.
version: '3.8'
services:
  test:
    image: nginx:latest
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role==worker
      labels:
        - "traefik.enable=true"
        - "traefik.docker.network=uccser-dev-public"
        - "traefik.http.services.test.loadbalancer.server.port=80"
        - "traefik.http.routers.test.service=test"
        - "traefik.http.routers.test.rule=Host(`TEST DOMAIN`) && PathPrefix(`/test`)"
        - "traefik.http.routers.test.entryPoints=web"
    networks:
      - uccser-dev-public
networks:
  uccser-dev-public:
    external: true
The uccser-dev-public network is an overlay network across all nodes, with no encryption.
If I added a constraint to pin the test service to the Traefik node, the requests worked with no issues. However, if I switched it to a different node, I got the Traefik 404 page.
The Traefik dashboard is showing it sees the service.
However the access logs show the following:
proxy_traefik.1.6fbx58k4n3fj@SWARM_NODE | IP_ADDRESS - - [21/Jul/2021:09:03:02 +0000] "GET / HTTP/2.0" - - "-" "-" 1430 "-" "-" 0ms
It's just blank, and I don't know where to proceed from here. The normal log shows no errors that I can see.
Traefik stack file:
version: '3.8'
x-default-opts:
  &default-opts
  logging:
    options:
      max-size: '1m'
      max-file: '3'
services:
  # Custom proxy to secure docker socket for Traefik
  docker-socket:
    <<: *default-opts
    image: tecnativa/docker-socket-proxy
    networks:
      - traefik-docker
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      NETWORKS: 1
      SERVICES: 1
      SWARM: 1
      TASKS: 1
    deploy:
      placement:
        constraints:
          - node.role == manager
  # Reverse proxy for handling requests
  traefik:
    <<: *default-opts
    image: traefik:2.4.11
    networks:
      - uccser-dev-public
      - traefik-docker
    volumes:
      - traefik-public-certificates:/etc/traefik/acme/
    ports:
      - target: 80 # HTTP
        published: 80
        protocol: tcp
        mode: host
      - target: 443 # HTTPS
        published: 443
        protocol: tcp
        mode: host
    command:
      # Docker
      - --providers.docker
      - --providers.docker.swarmmode
      - --providers.docker.endpoint=tcp://docker-socket:2375
      - --providers.docker.exposedByDefault=false
      - --providers.docker.network=uccser-dev-public
      - --providers.docker.watch
      - --api
      - --api.dashboard
      - --entryPoints.web.address=:80
      - --entryPoints.websecure.address=:443
      - --log.level=DEBUG
      - --global.sendAnonymousUsage=false
    deploy:
      placement:
        constraints:
          - node.role==worker
      # Dynamic Configuration
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.dashboard.rule=Host(`SWARM_NODE`) && (PathPrefix(`/api`) || PathPrefix(`/dashboard`))"
        - "traefik.http.routers.dashboard.service=api@internal"
        - "traefik.http.services.dummy-svc.loadbalancer.server.port=9999" # Dummy service for Swarm port detection. The port can be any valid integer value.
volumes:
  traefik-public-certificates: {}
networks:
  # This network is used by other services
  # to connect to the proxy.
  uccser-dev-public:
    external: true
  # This network is used for Traefik to talk to
  # the Docker socket.
  traefik-docker:
    driver: overlay
    driver_opts:
      encrypted: 'true'
Any ideas?
Further testing showed other services were working on different nodes, so I figured it must be an issue with my application. It turns out my Django application still had a bunch of settings configured for its previous hosting location regarding HTTPS. As the required settings weren't being passed, it denied the requests before they were processed. I also needed to lower the logging level for gunicorn (WSGI) to see more information.
In summary, Traefik and Swarm were fine.
Another reason for this can be that the Docker Swarm ports haven't been opened on all of the nodes. If you're using UFW, that means running the following on every machine participating in the swarm:
ufw allow 2377/tcp
ufw allow 7946/tcp
ufw allow 7946/udp
ufw allow 4789/udp

Traefik Docker Swarm Unifi setup is not reachable from domain

I have been trying to set up a Docker Swarm with Traefik:v2.x for some time and have searched far and wide on Google, but I still cannot connect to my reverse proxy from my outside domain.
My setup is as following:
Hardware (from outer to inner):
Technicolor MediaAccess TG799vac Xtream (modem)
|
Unifi Security Gateway (Unifi Controller is a Raspberry Pi)
|
x86_64 server where my (currently) single docker swarm node is running
Both the domain and the wildcard domain point at my system, and if I run a single container with port 80 exposed, it works from the domain. As soon as I set it up for Traefik, I can't reach my containers from outside, but my test container can be reached with curl commands from inside my network, even if I curl the USG.
On the server I have installed Docker + Docker Swarm and running the following 2 stacks:
version: '3'
services:
  reverse-proxy:
    image: traefik:v2.3.4
    command:
      - "--providers.docker.endpoint=unix:///var/run/docker.sock"
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.exposedbydefault=false"
      - "--providers.docker.network=traefik-public"
      - "--entrypoints.web.address=:80"
    ports:
      - 80:80
      - 8080:8080
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - traefik-public
    deploy:
      placement:
        constraints:
          - node.role == manager
networks:
  traefik-public:
    external: true
and
version: '3'
services:
  helloworld:
    image: nginx
    networks:
      - traefik-public
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.helloworld.rule=Host(`test.mydomain.com`)"
        - "traefik.http.routers.helloworld.entrypoints=web"
        - "traefik.http.services.helloworld.loadbalancer.server.port=80"
networks:
  traefik-public:
    external: true
A little update: it is possible to access a regular container with port 80 exposed on my domain, but as soon as I spin the container up with Docker Swarm, it is no longer exposed to the internet.
The network is created as follows, which I also used for the regular container:
docker network create -d overlay --attachable test
and the yml:
version: '3'
services:
  nginx:
    image: nginx
    ports:
      - 80:80
    networks:
      - test-swarm-network
networks:
  test:
    external: true
So the above does not work but the following is visible on my domain from the outside:
docker run -d -p 80:80 nginx

Hadoop Cluster on Docker Swarm - Datanodes Failed to Connect to Namenode

I am new to Docker and trying to build a Hadoop cluster with Docker Swarm. I tried to build it with Docker Compose and it worked perfectly. However, I would like to add other services like Hive, Spark and HBase to it in the future, so a Swarm seems a better idea.
When I tried to run it with a version 3.7 yaml file, the namenode and datanodes started successfully. But when I visited the web UI, it showed that there are no nodes available at the "Datanodes" tab (nor at the "Overview" tab). It seems the datanodes failed to connect to the namenode. I checked the ports of each node with netstat -tuplen and both 7946 and 4789 worked fine.
Here is the yaml file I used:
version: "3.7"
services:
  namenode:
    image: flokkr/hadoop:latest
    hostname: namenode
    networks:
      - hbase
    command: ["hdfs","namenode"]
    ports:
      - target: 50070
        published: 50070
      - target: 9870
        published: 9870
    environment:
      - NAMENODE_INIT=hdfs dfs -chmod 777 /
      - ENSURE_NAMENODE_DIR=/tmp/hadoop-hadoop/dfs/name
    env_file:
      - ./compose-config
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == manager
  datanode:
    image: flokkr/hadoop:latest
    networks:
      - hbase
    command: ["hdfs","datanode"]
    env_file:
      - ./compose-config
    deploy:
      mode: global
      restart_policy:
        condition: on-failure
volumes:
  namenode:
  datanode:
networks:
  hbase:
    name: hbase
Basically I just updated the yaml file from this repo to version 3.7 and tried to run it on GCP. And here is my repo in case you want to replicate the case.
And this is the status of the ports on the manager node:
and on the worker node:
Thank you for your help!
It seems to be a network-related issue: the pods are up and running, but they are not registering in your web GUI, so perhaps the network communication is not getting through between them. Check your internal firewall rules and the OS firewall, and run some network tests on the specific ports.

Multiple docker networks on a single container

I have a docker stack configuration with an overlay network called traefik. It contains my traefik reverse-proxy container and then several containers that serve various subdomains. I'm adding a new one that'll need to access a database server that I've created in another container, so I added something like this:
networks:
  traefik:
    driver: overlay
  database:
    driver: overlay
services:
  traefik:
    image: traefik
    networks:
      - traefik
    ports:
      - 80:80
      - 443:443
      - 8080:8080
    volumes:
      # ...
  # ...
  database:
    image: postgres:9.6-alpine # match what's being used by heroku
    networks:
      - database
  staging:
    image: staging
    networks:
      - traefik
      - database
    deploy:
      labels:
        traefik.enable: "true"
        traefik.frontend.rule: Host:lotto-ticket.example.com
        traefik.docker.network: traefik
        traefik.port: 3000
When I did this, the reverse proxy started returning Gateway Timeout codes, indicating that the staging container was no longer available to the traefik network. If I remove the database network from the staging container, the subdomain serves up as expected (although it obviously fails to access the database), and when I add it back in, the reverse-proxy starts getting timeouts from it.
Is there some kind of conflict between the two networks? Do I need to configure IP routing in order to use multiple docker networks on a single container?
EDIT: added more detail to the config excerpt
You need to tell traefik on which network to connect to your service using the label traefik.docker.network. In your case, that label would be set to ${stack_name}_traefik where ${stack_name} is the name of your stack.
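A minimal sketch of that fix, assuming the stack was deployed as docker stack deploy ... mystack (the stack name mystack is a placeholder):

```yaml
services:
  staging:
    deploy:
      labels:
        traefik.enable: "true"
        # The overlay network "traefik" is created as "mystack_traefik"
        # at deploy time, so the label must reference the prefixed name:
        traefik.docker.network: mystack_traefik
```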

Docker swarm containers connection issues

I am trying to use Docker Swarm to create a simple nodejs service that sits behind HAProxy and connects to MySQL. So, I created this docker compose file:
And I have several issues:
The backend service can't connect to the database using localhost or 127.0.0.1, so I managed to connect to the database using the private IP (10.0.1.4) of the database container.
The backend tries to connect to the database too soon, even though it depends on it.
The application can't be reached from outside.
version: '3'
services:
  db:
    image: test_db:01
    ports:
      - 3306
    networks:
      - db
  test:
    image: test-back:01
    ports:
      - 3000
    environment:
      - SERVICE_PORTS=3000
      - DATABASE_HOST=localhost
      - NODE_ENV=development
    deploy:
      replicas: 1
      update_config:
        parallelism: 1
        delay: 5s
      restart_policy:
        condition: on-failure
        max_attempts: 3
        window: 60s
    networks:
      - web
      - db
    depends_on:
      - db
    extra_hosts:
      - db:10.0.1.4
  proxy:
    image: dockercloud/haproxy
    depends_on:
      - test
    environment:
      - BALANCE=leastconn
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - 80:80
    networks:
      - web
    deploy:
      placement:
        constraints: [node.role == manager]
networks:
  web:
    driver: overlay
  db:
    driver: bridge
I am running the following:
docker stack deploy --compose-file=docker-compose.yml prod
All the services are running.
curl http://localhost/api/test <-- Not working
But, as I mentioned above the issues I have.
Docker version 18.03.1-ce, build 9ee9f40
docker-compose version 1.18.0, build 8dd22a9
What am I missing?
The backend service can't connect to the database using localhost or 127.0.0.1, so I managed to connect to the database using the private IP (10.0.1.4) of the database container.
Don't use IP addresses for the connection. Use just the DNS name.
So you must change the connection to DATABASE_HOST=db, because that is the service name you've defined.
localhost is wrong, because the service is running in a different container than your test service.
The backend tries to connect to the database too soon even though it depends on it.
depends_on does not work as you expected. Please read https://docs.docker.com/compose/compose-file/#depends_on and the info box "There are several things to be aware of when using depends_on:"
TL;DR: depends_on option is ignored when deploying a stack in swarm mode with a version 3 Compose file.
The application can't be reached from outside.
Where is your HAProxy configuration telling it to forward to http://test:3000 when something requests /api/test on HAProxy?
For DATABASE_HOST=localhost: the name localhost means the local container. You need to use the name of the service where the db is hosted. localhost is a special DNS name that always points to the application host; when using containers, that is the container itself. In cloud development you need to forget about using localhost (it will point to the container) or IP addresses (they can change every time you run the container, and you will not be able to use load balancing), and simply use service names.
As for readiness: Docker has no way of knowing whether the application you started in a container is ready. You need to make the service aware of the database's unavailability and code/script some mechanism of polling/fault tolerance.
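One way to script such a polling mechanism is a small entrypoint wrapper. This is only a sketch, not from the question: the wait_for helper, the db host/port, and the final node command are all assumptions.

```shell
# wait_for HOST PORT [RETRIES]: poll a TCP port once per second until it
# accepts connections, giving up after RETRIES attempts (default 30).
wait_for() {
  host="$1"; port="$2"; retries="${3:-30}"; i=0
  until nc -z "$host" "$port" 2>/dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$retries" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Hypothetical usage as the test service's entrypoint:
# wait_for db 3306 30 && exec node server.js
```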
Markus is correct, so follow his advice.
Here is a compose/stack file that should work, assuming your app listens on port 3000 in the container and the db is set up with the proper password, database, etc. (you usually set these things as environment vars in compose, based on their Docker Hub readme).
Your app should be designed to crash/restart/wait if it can't find the DB. That's the nature of all distributed computing: anything "remote" (another container, host, etc.) can't be assumed to always be available. If your app just crashes, that's fine and a normal process for Docker, which will re-create the Swarm service task each time.
If you could attempt to make this with public Docker Hub images, I can try to test it for you.
Note that in Swarm it's likely easier to use Traefik for the proxy (Traefik on Swarm Mode Guide), which will auto-update and route incoming requests to the correct container based on the hostname you give in the labels. But note that you should first test just the app and db; then, once you know that works, try adding in a proxy layer.
Also, in Swarm all your networks should be overlay, and you don't need to specify that, as it is the default in stacks.
Below is a sample using traefik with your above settings. I didn't give the test service a specific traefik hostname so it should accept all traffic coming in on 80 and forward to 3000 on the test service.
version: '3'
services:
  db:
    image: test_db:01
    networks:
      - db
  test:
    image: test-back:01
    environment:
      - SERVICE_PORTS=3000
      - DATABASE_HOST=db
      - NODE_ENV=development
    networks:
      - web
      - db
    deploy:
      labels:
        - traefik.port=3000
        - traefik.docker.network=web
  proxy:
    image: traefik
    networks:
      - web
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "80:80"
      - "8080:8080" # traefik dashboard
    command:
      - --docker
      - --docker.swarmMode
      - --docker.domain=traefik
      - --docker.watch
      - --api
    deploy:
      placement:
        constraints: [node.role == manager]
networks:
  web:
  db:
