In our environment we use Docker Swarm to run jobs that heavily consume resources. These get started and end after some time.
We start the containers using:
docker run --rm --memory=18000m --memory-swap=18000m --oom-score-adj=900 -v /some/mount/point:/opt/workspace/mount-point --network someOverlayNetwork --name someName privateRegistry/imagename:version -c "do what to do"
Sometimes this call fails with
docker: Error response from daemon: Could not attach to network someOverlayNetwork: context deadline exceeded.
As far as I found out, context deadline exceeded is Go's generic way of saying that some timeout occurred. I also see that the error happens almost exactly 20 s after the docker run command. And that makes sense: there might be quite a lot going on in the cluster in terms of load and network load.
I have no problem waiting longer before the next job run starts, but having the job start break causes problems for us.
So the question: is it possible to increase the timeout docker run applies when attaching to a network?
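In case it is not, a retry wrapper around the call would be acceptable for us. A minimal sketch reusing the command from above (the retry count and delay are arbitrary, and note that as written this also retries when the job itself exits non-zero):
for attempt in 1 2 3 4 5; do
  docker run --rm --memory=18000m --memory-swap=18000m --oom-score-adj=900 \
    -v /some/mount/point:/opt/workspace/mount-point \
    --network someOverlayNetwork --name someName \
    privateRegistry/imagename:version -c "do what to do" && break
  # Remove any half-created container so the name is free for the next attempt.
  docker rm -f someName 2>/dev/null
  echo "attempt $attempt failed, retrying in 30s" >&2
  sleep 30
done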
Related
I would like to stop my running docker container after a specific time, let's say 2 hrs after startup. So far my research has led to the following solutions. I just wanted to know if there were better ways to do it.
Use a cron job to stop the container by calling the docker stop command.
Use an entry point like sleep 5000, but this does not suit my use case.
Using --stop-timeout in the docker run command; I believe this is just the maximum time the container is given to shut down gracefully. Am I missing something here?
You can use the timeout command that is part of the coreutils package, which is already installed in the Debian images (and probably many others).
This will run the container for 30 seconds and then stop it:
docker run debian timeout 30 tail -f /dev/null
Basically, add timeout 7200 in front of the command you want to run in the container, and it'll be killed after 2 hours.
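Applied to the two-hour case from the question, that would look something like this (the image name and command are placeholders):
docker run your-image timeout 7200 your-long-running-command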
I'm running nvprof to profile GPU usage of a TensorRT server-client model.
Here's what I'm doing:
Run nvprof on terminal 1, within a docker container with TensorRT enabled: nvprof --profile-all-processes -o results%p.nvvp
Run the TensorRT server on terminal 2, within the same docker container as the first step.
Request a service on terminal 3, within a different docker container from the first two steps.
When the third step finishes, the client exits normally, but the server and nvprof keep running. So naturally, I closed the TensorRT server with Ctrl-C. When I do this, terminal 1 (running nvprof) tells me that the application has had an internal profiling error, and the resulting output file contains no timeline information. (It is only about 380 KB, whereas other files from runs of about the same duration, 2-3 minutes, are at least a few MB.)
It seemed like ending the TensorRT server with Ctrl-C was the problem, so I tried giving nvprof a timeout option, namely nvprof --profile-all-processes -o results%p.nvvp --timeout 200, in the first step (200 seconds is more than enough for the whole process to finish). But while this does make nvprof print the message Execution timeout, stopping the application..., it does not actually stop the TensorRT server.
Basically, I'd like to know if there's any way to make a running TensorRT server exit normally without using Ctrl-C, or if there is a workaround for this issue when using nvprof and TensorRT together.
Any help or push in the right direction would be greatly appreciated. Thanks!
P.S. The original question was posted here about 3 hours ago.
So it turns out, TensorRT was not the problem.
When creating and first running the docker container for the server, I had not added the --privileged option.
Running the container with docker run --rm -it -d --gpus all --privileged ... lets nvprof profile the server's behavior even when the server program is killed with Ctrl-C.
I started running a celery beat worker in a dedicated container. This sometimes works fine, but now I get the following error when trying to remove or re-deploy my containers:
An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).
Also, I cannot access the container anymore and the following commands just get stuck:
docker restart beat
docker logs beat
docker exec beat bash
I am running Docker in swarm mode with several nodes in the cluster.
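The error message itself points at one knob to try. As a sketch, if these containers are deployed with docker-compose, raising the client timeout before re-deploying would look like this (300 is an arbitrary value):
export COMPOSE_HTTP_TIMEOUT=300
docker-compose up -d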
According to the documentation here: https://docs.docker.com/engine/reference/commandline/service_update/ and here: https://docs.docker.com/engine/reference/commandline/service_create/, the --stop-grace-period flag sets the time to wait before force-killing a container.
Expected behavior -
My expectation was that, during a rolling update, Docker would wait this period of time before trying to stop a running container.
Actual behavior -
Docker sends the termination signal a few seconds after the new container, with the new version of the image, starts.
Steps to reproduce the behavior
docker service create --replicas 1 --stop-grace-period 60s --update-delay 60s --update-monitor 5s --update-order start-first --name nginx nginx:1.15.8
Wait for the service to start up the container (approx. 2 minutes)
docker service update --image nginx:1.15.9 nginx
docker ps -a
As you can see, the new container started, and after a second the old one was killed by Docker.
Any idea why?
I also opened an issue on GitHub, here: https://github.com/docker/for-linux/issues/615
The --stop-grace-period value is the amount of time that Docker will wait after sending a SIGTERM before it gives up waiting for the container to exit gracefully. Once the grace period is over, it kills the container with a SIGKILL.
Based on your description of your setup, the sequence of events happens as designed. Your container exits cleanly and quickly when it gets its SIGTERM, so Docker never needs to send a SIGKILL.
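To see the SIGTERM-then-SIGKILL sequence in isolation, here is a small sketch using a standalone container (for plain docker run the analogous flag is --stop-timeout; the name and values are illustrative):
# PID 1 ignores SIGTERM, so docker stop must wait out the grace period
# and then resort to SIGKILL.
docker run -d --name term-demo --stop-timeout 10 busybox sh -c 'trap "" TERM; sleep 600'
time docker stop term-demo   # returns after roughly 10 seconds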
I see you also specified --update-delay 60s, but that won't take effect since you only have one replica. The update delay tells Docker to wait 60 seconds after cycling the first task, so it is only helpful with 2 or more replicas.
It seems like you want your single-replica service to run a new task and an old task concurrently for 60 seconds, but swarm mode is happy to get rid of old containers with SIGTERM as soon as the new container is up.
I think you can close the issue on GitHub.
--stop-grace-period is the period between stop (SIGTERM) and kill (SIGKILL).
Of course, you can change SIGTERM to another signal by using the --stop-signal switch. How the application inside the container behaves when it receives the stop signal is your responsibility.
Here is a good article explaining these inner workings.
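For example, a sketch of changing the stop signal for a service (nginx treats SIGQUIT as a graceful shutdown, which makes it a common choice there):
docker service create --name web --stop-signal SIGQUIT --stop-grace-period 30s nginx:1.15.8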
Docker's documentation says that --rm and -d cannot be used together: https://docs.docker.com/engine/reference/run/#detached-d
Why? I seem to be misunderstanding what "detached" means; it seems entirely orthogonal to what --rm does. Why are they mutually exclusive?
By way of analogy, if I start a process in the background (e.g. start my-service), and the process exits, the process's resources are freed automatically (by init). It doesn't stick around, waiting for me to manually remove it. Why doesn't docker allow me to combine -d with --rm so that my container works in an analogous way?
I think that would address a very common use case. Seems that it would very nicely obviate the following work around: https://thraxil.org/users/anders/posts/2015/11/03/Docker-and-Upstart/
What am I missing???
Because --rm is implemented as a client-side option: when you specify --rm, the docker client waits around for the container to exit, and then removes it.
When you specify -d, the docker client exits. The container is running and is managed by the Docker server. There is no longer any client running to implement the --rm functionality.
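For those older versions, a rough client-side equivalent for a detached container would be the following sketch (container and image names are placeholders):
docker run -d --name my_app my_image
docker wait my_app && docker rm my_app   # block until the container exits, then remove it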
As a way of answering, let's imagine I launch a container using -d and --rm, and this is allowed:
docker run -d --rm --name=my_app my_container
If my app works as expected, it will run, and when the time comes to die, it dies and quietly removes itself, meaning I can rerun this command with little hassle. This seems ideal, and your question was one I faced myself while setting up some docker automation for my project.
What if, however, something goes wrong: the process running in the container hits a fatal error and crashes, causing the container to die. The problem is that any outside observer, be it a human or monitoring software, will not be able to tell the difference between these two scenarios, except maybe by how long the container was alive.
In cases where -d is not used (running the command in a CLI, or via upstart/initd/systemd/other), the container writes output that remains even if the container was given --rm, allowing an error or crash to be noticed and resolved.
In cases where -d is used, container output is not bound to any output stream or file, so --rm was not allowed, to ensure that evidence of an error or crash is left behind in the form of a dead container.
To wrap up/TL;DR: I believe the Docker developers made this conscious choice to prevent cases in which containers were completely unaccounted for, with the trade-off being the need to add two more commands to your automation script.
As of version 1.13, you can now use --rm with background jobs. The remove operation has been moved from the client to the server to enable this.
See the following pull request for more details: https://github.com/docker/docker/pull/20848
Here's an example of this new behavior:
$ docker run -d --rm --name sleep busybox sleep 10
be943302d668f6416b083b8b6fa74e254b8e4200d14f6d7743d48691db1a4b18
$ docker ps | grep sleep
be943302d668 busybox "sleep 10" 4 seconds ago Up 3 seconds sleep
$ sleep 10; docker ps | grep sleep
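(The second docker ps prints nothing: once sleep 10 finished, the daemon removed the container automatically.)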