Timeouts accessing services on swarm published ports - docker

We're using Docker in swarm mode to host a number of services. Recently we've hit an issue where we get intermittent connection timeouts (sometimes on as many as every other request) when trying to access some services.
We've upgraded the environment to the latest version of Docker (currently Docker version 17.03.0-ce, build 3a232c8), done a staggered reboot of all servers (trying to maintain uptime where possible, even though this environment is technically a test environment), and tried stopping and starting the services as well, but the issue still persists.
I'm confident the issue is not related to the services running in Docker, as we're seeing it on various services which until recently had been running without issue. I think it's more likely an environmental problem, or some problem with Docker's internal routing in the overlay network, but I'm not sure how to prove or solve this.
Any advice on how to diagnose or solve this would be greatly appreciated!
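A sketch of the checks that might narrow this down (the service name, node names, and published port below are placeholders):

# Confirm every task for the service is running, and note which nodes they're on
docker service ps my_service

# Hit the published port on each node in turn; with the routing mesh, every
# node should answer, so a node that times out points at that node's networking
for node in node1 node2 node3; do
  curl -m 5 -s -o /dev/null -w "$node: %{http_code}\n" "http://$node:8080/" || echo "$node: timeout"
done

# Look for stale peers or containers on the ingress overlay network
docker network inspect ingress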

Related

All Docker stacks are restarting automatically

I have a multi-service environment hosted with Docker Swarm, with multiple stacks deployed. Every running container hosts a Spring Boot application. The issue is that all my stacks get restarted on their own. I know that the compose file sets restart_policy to on-failure, which explains the automatic restart; the problem is that when the services restart, I get errors from one particular service, and this breaks everything.
I am not able to figure out what is actually happening.
I did quite a lot of research and found out the following:
The Docker daemon is not being restarted. I double-checked this against the daemon's uptime.
I checked docker service ps <Service_ID>, where I can see the service showing shutdown and starting, but no other information.
I checked docker service logs <Service_ID>, but there are no errors there either.
I checked for a resource crunch; there were ample resources available at the host level as well as for each container.
Can someone help with where exactly to find the logs for this event? Any other thoughts on this?
My host is actually a VM hosted on VMWare Vcenter.
After a lot of research and going through all the Docker logs, I could not find a solution. Later on, I discovered that a memory snapshot was being taken for backup every 24 hours.
Here is what I observe:
Whenever we take a snapshot, all Docker services running on the host restart automatically. There are no errors; they just restart gracefully.
I found existing questions describing this problem with VMware snapshots.
As far as I know, when a snapshot is taken, the VM's memory is redirected to a different location and the previous state is saved. I have not been able to find out exactly why this restarts the containers, but this was the root cause of the problem. If anyone is a VMware snapshots expert, please let us know.
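One way to confirm the correlation (the service name and time window below are placeholders):

# Full task history with untruncated error messages, including restart times
docker service ps --no-trunc my_stack_my_service

# Compare the restart timestamps against the daemon's logs around the snapshot window
journalctl -u docker --since "02:00" --until "03:00"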

What's the purpose of the officially provided Docker image of the Zabbix agent?

I used the official Zabbix docker-compose YAML to set up a Zabbix system, and I found that the server, as a monitoring target, was not available. I searched the Internet and found other people who had encountered the same problem. Someone said the agent container's IP or DNS name should be used as the server's address; I tried that and found it works. But I'm confused by the agent. Does it monitor the server container, the agent container, or the host machine? If it only monitors the agent container itself, what's the purpose of it?
Does it monitor the server container, the agent container, or the host machine?
The agent container.
If it only monitors the agent container itself, what's the purpose of it?
For testing, and for monitoring external things with custom commands. Or you can connect things from the host and monitor them; in short, it's for all the cases where you do not want to or cannot install the agent on the host.
Everybody who configures a Dockerized Zabbix installation like yours bumps into this issue, and of course finds themselves on Stack Exchange looking for the answers that should have been in the documentation.
The reason the Zabbix agent in the docker-compose install you're referring to can't initially connect is that both it and the server it monitors run in isolated containers. Separate containers cannot talk to each other on 127.0.0.1 (localhost) addresses. And that is actually a good thing!
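A quick sketch shows the name resolution that does work (the network name below is a placeholder, and the server's required database settings are omitted for brevity):

# Containers on the same user-defined network resolve each other by name
docker network create zbx-net
docker run -d --name zabbix-agent --network zbx-net zabbix/zabbix-agent
docker run -d --name zabbix-server --network zbx-net zabbix/zabbix-server-mysql

# From any container on zbx-net, "zabbix-agent" resolves to the agent container;
# 127.0.0.1 only ever reaches the container doing the asking
docker run --rm --network zbx-net busybox ping -c 1 zabbix-agent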
I've reviewed the documentation in the repo you're talking about and it's sparse, to say the least; it certainly could be better. But to be fair to Zabbix, their docker-compose install DOES work great once you get it running, and you can achieve pretty fair results quickly with little effort (and a bit of Googling ;-> ).
I actually found FURTHER pain connecting to containerized Zabbix agents raised on different hosts, outside of the docker-compose install you're referring to. Connectivity was broken because the host the docker-compose install was raised on was NAT'ing out the traffic and presenting the wrong IP address. I've documented this issue HERE.
Dockerized Zabbix is a good thing; there is a purpose to it. I agree with you that the documentation could be better, though. Stick with it!

HTTP 503 errors from Cloud Run app in one GCP project but not the other

The issue
I am using the same container (with similar resources) in 2 projects: production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. Container builds are done in a completely different project, and IAM is used to handle access to these containers. In both projects the services have a concurrency of 80 and a 300-second timeout, for all 5 services.
Everything was working fine 3 days back, but since yesterday almost all Cloud Run services on staging (thankfully) started throwing 503s randomly, for most requests. Some services hadn't even been deployed in a week. The same containers are running fine in the production project, no issues.
Ruled out causes
Anything to do with Cloudflare (I tried the URL Cloud Run provides directly; it has the same 503 issue)
Anything with the build or containers (I tried the demo "hello world" Go container; it has the issue too)
Resources: I tried giving it 1 GB RAM and 2 CPUs, but the problem persisted
Issues on deployment (I deployed multiple branches; that didn't help)
Issue in the code (I routed traffic to a 2-3 day old revision, but the issue was still there)
Issue at the service level (I used the same container to create a completely new service; it also had the issue)
Possible causes
Something on Cloud Run or the Cloud Run load balancer
Maybe some env vars, but that also doesn't seem to be the issue
Response Codes
I just ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production for a static file path. The full response-code breakdowns aren't reproduced here, but staging returned a large share of 503s while production returned only 200s.
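For reference, the check was along these lines (the URL is a placeholder); vegeta report prints the status-code counts:

echo "GET https://staging.example.com/static/app.css" | vegeta attack -duration=30s -rate=10 | vegeta report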
If anyone has any insights on this, it would help greatly.
Based on your explanation, I cannot tell what's going on. You explained what doesn't work, but didn't point out what does work (does your app run locally? are you able to run a "hello world" sample application?).
So I'll recommend some debugging tips.
If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs for the request? Was your application built and deployed with a "verbose" logging setting?
Try hitting your *.run.app domain directly. If that's not working, then it's not a domain, DNS, or Cloudflare issue; try debugging and/or redeploying your app, and deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.
Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not Cloudflare), as there's a known issue right now around certificate issuance/renewals when domains are behind Cloudflare.
Beyond these, if a redeploy seems to solve your problem, try redeploying. It's quite possible that some configuration recently diverged between the two projects.
See Cloud Run Troubleshooting
https://cloud.google.com/run/docs/troubleshooting
Do you see 503 errors under high load?
The Cloud Run (fully managed) load balancer strives to distribute incoming requests over the necessary number of container instances. However, if your container instances are using a lot of CPU to process requests, they will not be able to keep up with all of the requests, and some requests will be returned with a 503 error code.
To mitigate this, try lowering the concurrency. Start from concurrency = 1 and gradually increase it to find an acceptable value. Refer to Setting concurrency for more details.
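For example, with a placeholder service name and region, concurrency can be lowered with:

# Serve one request per container instance while testing, then raise gradually
gcloud run services update my-service --concurrency=1 --region=us-central1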

Why do Kubernetes containers fail when run on runsc (gVisor) as the runtime in Docker?

I am running a single-master Kubernetes cluster with Docker. I wanted to try runsc (gVisor) on Kubernetes, starting each container in a separate sandbox, so I set runsc as the default runtime and restarted the Docker service. To my surprise, all of the Kubernetes containers were failing (checked with docker ps). What is the error that causes this? Is there any other way to use gVisor + Docker + Kubernetes?
I have the right requirements in place to run each of them.
PS: I am just a beginner.
Thanks for trying gVisor! Sorry it isn't working for you.
Running a Kubernetes Pod inside gVisor is still fairly experimental. It can be made to work, but is a bit difficult to configure right now. We are working to make this easier.
Can you run gVisor with Docker (not Kubernetes)? See the instructions here:
https://github.com/google/gvisor#configuring-docker
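Roughly, that Docker-only setup looks like this (the runsc binary path is an assumption; merge with rather than overwrite any existing daemon.json):

# Register runsc as an *additional* runtime, keeping runc as the default so
# existing containers (including Kubernetes system containers) keep working
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc"
    }
  }
}
EOF
sudo systemctl restart docker

# Run a single container under gVisor to verify the runtime works
docker run --rm --runtime=runsc hello-world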
If that fails, please file a bug report:
https://github.com/google/gvisor/issues
If you can include debug logs, that will help us diagnose any failure.
https://github.com/google/gvisor#debugging

Windows 10 Docker Network DNS doesn't work after reboot

I'm not sure if this is an issue with the current version of Docker networking on Windows or poor configuration and misunderstanding on my part, but I have the following setup:
2 Docker containers (built using the Microsoft/ASP.NET image as a base) running a .NET MVC application in each.
1 Docker container running SQL server (built using the Microsoft/mssql-server-windows image)
When I create all 3 containers everything works great: I can attach to any container and ping all of the other containers by name without any issue. The applications run and can communicate with each other as I hoped.
However, when I reboot my machine and start all the containers again they can no longer ping/communicate with each other using their names (using IP addresses is fine).
I've tried this on the default NAT network and also tried replacing the NAT network with my own custom NAT network.
To resolve the issue I have to run a forced network disconnect for each container, like so:
docker network disconnect --force nat <containername>
And then I have to reconnect each container to the network before starting them up. All containers can then ping/communicate with each other using their names as well as their IP addresses.
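For reference, the workaround amounts to something like this for each container (container names are placeholders):

docker network disconnect --force nat web1
docker network connect nat web1
docker start web1
# ...repeat for each remaining container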
FYI, this is a development environment, but I was hoping to do something similar in Azure using a Windows Server 2016 VM. I don't quite know what the best network configuration for live production is yet, as I need to have multiple applications (in separate containers) on the same node, accessed via their own subdomains.
Any help or guidance would be great.
I'm not sure, in part because this question was asked several months before any other example I've run into, but this sounds very similar to the problem described at https://github.com/docker/for-win/issues/1038.
Basically, there appears to be a problem introduced with the 1709 update to Windows 10 which results in a scenario where Hyper-V networking doesn't work the way it ought to.
There appear to be two common ways of working around this problem: turning off "Fast Startup" under Control Panel => Power Options => System Settings, or restarting Docker for Windows and any containers after booting. I also thought I saw a Microsoft blog post indicating that the underlying problem has been resolved and will be included in an update to Windows 10, but alas I can no longer find that information or the specific version in which the problem was (theoretically) resolved. It may well be the delayed 1803 "Spring Creators Update" release.
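For reference, fast startup depends on hibernation, so one way to turn it off is from an elevated prompt:

# Disabling hibernation also disables fast startup (run as administrator)
powercfg /hibernate off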
