I'm running two binary services (each written in C++) through docker-compose 1.22.0. There's nothing particularly special about the docker-compose.yml. Both services use host networking, have init: true, use user: "0", and mount large volumes from the host, but otherwise their commands pretty much just run binary files. (Unfortunately, for software licensing reasons, I can't share much more, but I know of nothing else relevant.)
What's happening is that occasionally and inconsistently (I'd say at least 75% of the time, it works fine), both services will stop printing to the console after perplexingly outputting the following:
first_service_1 | second_service_1 |
first_service_1 is colored differently from second_service_1, and the two prefixes look identical to those on other log lines, except that there is no newline between them.
After that, nothing gets printed. The services (which listen on ports) still respond to requests, but they no longer print what they normally would.
After some time (it's unclear how much exactly), first_service will quit like so:
my-dockerfiles_first_service_1 exited with code 1
If I could assume docker-compose isn't causing the exit (which I'm unsure of, because the service seems to crash mostly when the printing fails), this would be easier to debug. I know of no reason printing should ever fail, or why a log line would ever be printed without a trailing newline.
Can anyone advise me on why docker compose might print that odd line or stop printing entirely?
The host environment is Ubuntu 16.04 on an aarch64 NVidia TX-2 running L4T 28.2. docker-compose was installed recently (within the last couple of weeks) through pip. Docker is on version 18.06.0-ce.
I've got a docker container running a service, and I need that service to send logs to rsyslog. It's an Ubuntu image running a set of services in the container. However, the rsyslog service cannot start inside this container, and I cannot determine why.
Running service rsyslog start (this image uses upstart, not systemd) returns only the output start: Job failed to start. There is no further information provided, even when I use --verbose.
Furthermore, there are no error logs from this failed startup. Because rsyslog is the service that can't start, it's obviously not running, so nothing is getting logged by it. I'm not finding anything relevant in Upstart's logs either: /var/log/upstart/ only contains the logs of a few things that successfully started, plus dmesg.log, which simply contains dmesg: klogctl failed: Operation not permitted. From what I can tell that's due to a docker limitation that can't really be fixed, and it's unclear whether it's even related to this issue.
Here's the interesting bit: I have the exact same container running on a different host, and it doesn't suffer from this issue. Rsyslog is able to start and run in the container just fine on that host. So the cause is obviously some difference between the hosts, but I don't know where to start: there are LOTS of differences between them (the working one is my local Windows system, the failing one is a virtual machine running in a cloud environment), so I can't tell which differences could plausibly cause this issue and which couldn't.
I've exhausted everything I know to check. My only option left is to come to Stack Overflow and ask for ideas.
Two questions here, really:
Is there any way to get more information out of the failure to start? start itself is a binary file, not a script, so I can't open it up and edit it. I'm reliant solely on the output of that command, and it's not logging anything anywhere useful.
What could possibly be different between these two hosts that could cause this issue? Are there any smoking guns or obvious candidates to check?
Regarding the container itself, unfortunately it's a container provided by a third party that I'm simply modifying. I can't really change anything fundamental about the container, such as the fact that its entrypoint is /sbin/init (which is a very bad practice for docker containers, and is the root cause of all of my troubles). This is also causing some issues with the docker logging driver, which is why I'm stuck using syslog as the logging solution instead.
Technical details:
Docker versions: This has happened on docker 18, 19 and 20, at several minor versions.
19.03.12 and 20.10.3 are the most recent installs I've got running.
OS: Ubuntu 16.04/18.04/20.04, all display this behavior.
All installs use overlay2.
Description:
Running lsof | grep -i deleted reveals deleted log files still being held open by processes running inside docker containers. Grepping for the PID in the output of ps -aef --forest confirms the processes are running inside a docker container.
Running the same process outside docker has thus far not reproduced this behavior.
This started to become apparent when tens of gigabytes of disk space could not be accounted for.
This happens for both Java and Node.js processes, using logback and winston respectively.
These logging libraries take care of their own log rotation and are not managed by a separate logrotate service.
I have no clue what causes this behavior, and why it doesn't happen when not dockerized.
Googling "docker logrotate files being held" just gives results on logrotating the log files of docker itself.
I'm not finding what I'm looking for with general searching, so I came here, to ask if maybe someone recognizes this and it's actually an easy fix? Some simple thing I forgot but has to be pointed out?
Here's hoping someone knows what's going on.
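To be clear about what lsof is showing, the "deleted but still held open" effect itself is ordinary Unix behavior: a process that unlinks a log file but keeps the descriptor open continues to consume disk space until the descriptor is closed. A minimal Go sketch of just that effect (the file name is only illustrative):

package main

import (
    "fmt"
    "os"
    "time"
)

func main() {
    // Create a file and keep the descriptor open.
    f, err := os.Create("/tmp/held-open.log")
    if err != nil {
        panic(err)
    }
    defer f.Close()
    // Unlink it, the way a rotation step might remove an "old" log file.
    if err := os.Remove("/tmp/held-open.log"); err != nil {
        panic(err)
    }
    // The name is gone from the directory, but every write still consumes
    // disk space; lsof lists the file as "(deleted)" until this process exits.
    for i := 0; i < 60; i++ {
        fmt.Fprintln(f, "still writing to a deleted file")
        time.Sleep(time.Second)
    }
}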
That container is built when deploying the application.
Looks like its purpose is to share dependencies across modules.
It looks like it is started as a container but nothing is apparently running, a bit like an init container.
The console says it starts/stops that component when using the respective wolkenkit start and wolkenkit stop commands.
On startup:
On shutdown:
When you docker ps, that container cannot be found:
Can someone explain these components?
When starting a wolkenkit application, the application is boxed in a number of Docker containers, and these containers are then started along with a few other containers that provide the infrastructure, such as databases, a message queue, ...
The reason the application is split into several Docker containers is that wolkenkit builds on the CQRS pattern, which suggests separating an application's read side from its write side; hence there is one container for the read side and one for the write side (actually there are a few more, but you get the picture).
Now, since you may develop on an operating system other than Linux, the wolkenkit application may run under a different operating system than the one you develop on, as within Docker it's always Linux. This means the start command cannot simply copy the node_modules folder from the host into the containers, as it may contain binary modules, which would then be incompatible (imagine installing on Windows on the host, but running on Linux within Docker).
To avoid issues here, wolkenkit runs an npm install when starting the application inside of the containers. The problem now is that if wolkenkit did this in every single container, the start would be super slow (it's not the fastest thing on earth anyway, due to all the Docker building and starting that's happening under the hood). So wolkenkit tries to optimize this as much as possible.
One concept here is to run npm install only once, inside of a container of its own. This is the node-modules container you encountered. This container is then linked as a volume to all the containers that contain the application's code. This way you only have to run npm install once, but multiple containers can use the outcome of this command.
Since this container contains only data, not code, it just has to exist; it doesn't actually do anything. This is why it gets created but never run.
I hope this makes it a little bit clearer, and I was able to answer your question :-)
PS: Please note that I am one of the core developers of wolkenkit, so take my answer with a grain of salt.
I wrote a simple Go application and added a flock-based lock to prevent it from running twice at the same time:
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "github.com/nightlyone/lockfile"
)

func main() {
    lock, err := lockfile.New(filepath.Join(os.TempDir(), "pagerduty-read-api.lock")) // lock file in the shared temp dir
    if err != nil {
        panic(err)
    }
    if err = lock.TryLock(); err != nil { // someone else already holds the lock
        fmt.Println("Already running.")
        return
    }
    defer lock.Unlock()
}
It works well on my host. In Docker, I tried to run it with /tmp shared as a volume:
docker run --rm -it -v /tmp:/tmp my-go-binary
But it does not work. I suppose it's because the flock mechanism doesn't work across the shared volume.
My question: does Docker have an option to make flock work between running instances? If not, what are my other options to get the same behavior?
Thanks.
This morning I wrote a little Python test program that just writes one million consecutive integers to a file, with flock() locking, obtaining and releasing the lock once for each number appended. I started up 5 containers, each running that test program, and each writing to the same file in a docker volume.
With the locking enabled, the numbers were all written without interfering with each other, and there were exactly 5 million integers in the file. They weren't consecutive when written this way, but that's expected and consistent with flock() working.
Without locking, many of the numbers were written in a way that shows the writers were interfering with each other in the absence of locking. There were only 3,167,546 numbers in the file, plus 13,357 blank lines. That adds up to 3,180,903 lines in the file - substantially different from the desired 5,000,000.
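For reference, here is a minimal sketch of that kind of test in Go (the original was a Python program; the file path and iteration count are just illustrative). Each append takes an exclusive flock() and releases it immediately afterwards:

package main

import (
    "fmt"
    "os"
    "syscall"
)

func main() {
    // Append to a file that all containers share via a mounted volume.
    f, err := os.OpenFile("/tmp/flock-test.txt", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    for i := 0; i < 1000000; i++ {
        // Exclusive advisory lock around each individual append.
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
            panic(err)
        }
        fmt.Fprintln(f, i)
        if err := syscall.Flock(int(f.Fd()), syscall.LOCK_UN); err != nil {
            panic(err)
        }
    }
}

Each of the five containers would mount the same host directory (for example with docker run -v /tmp:/tmp ...) so that they all lock the same underlying file.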
While a program cannot definitively prove that there will never be problems just by testing many times, to me that's a pretty convincing argument that Linux flock() works across containers.
Also, it just kinda makes sense that flock would work across containers; containers pretty much just share the kernel, with a distinct PID namespace, a distinct filesystem (other than volumes), and a distinct network/port space.
I ran my test on a Linux Mint 19.1 system with Linux kernel 4.15.0-20-generic, Docker 19.03.0 - build aeac949 and CPython 3.6.8.
Go is a cool language, but why flock() didn't appear to be working across volumes in your Go program I do not know.
HTH.
I suppose you want to use a Docker volume, or you may need some other Docker volume plugin.
According to the article Docker Volume File permission and locking, Docker volumes only provide a way to define a volume that can be used by multiple containers, or by a container after restarting.
Among Docker volume plugins, Flocker may meet your requirements.
Flocker is an open-source Container Data Volume Manager for your Dockerized applications.
BTW, if you are using Kubernetes, you may want to learn more about persistent volumes, persistent volume claims, and storage classes.
I've been doing some research on this myself recently, and the issue is that nightlyone/lockfile doesn't actually use the flock syscall. Instead, the lockfile it writes is a PID file: a file that just contains the PID (Process IDentifier) of the process that created it.
When checking if a lock is locked, lockfile checks the PID stored in the lockfile, and if it's different to the PID of the current process, it sees it as locked.
The issue is that lockfile doesn't have any special logic to know whether it's in a docker container or not, and PIDs get a little muddled when working with docker; a process's PID as seen from inside the container will be different from its PID as seen from outside the container.
Where this often ends up is that we have two containers running your code above, and they both have PIDs of 1 within their containers. They'll each try to create a lockfile, writing what they think their PID is (1). They then both think they hold the lock - after all, their PID is the one that wrote it! So the lockfile is ignored.
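To make that concrete, here's a simplified illustration of PID-file locking in Go (this is not the library's actual code, and the path is just an example): the check treats the lock as ours whenever the stored PID equals our own, so two containers that both run as PID 1 and share the lock file will both believe they own it.

package main

import (
    "fmt"
    "os"
    "strconv"
    "strings"
)

// tryPIDLock reports whether we "hold" the lock according to PID-file logic.
func tryPIDLock(path string) (bool, error) {
    data, err := os.ReadFile(path)
    if os.IsNotExist(err) {
        // No lock file yet: claim it by writing our own PID.
        return true, os.WriteFile(path, []byte(strconv.Itoa(os.Getpid())), 0644)
    }
    if err != nil {
        return false, err
    }
    stored, _ := strconv.Atoi(strings.TrimSpace(string(data)))
    // Inside two different containers this comparison can be 1 == 1.
    return stored == os.Getpid(), nil
}

func main() {
    ok, err := tryPIDLock("/tmp/pagerduty-read-api.lock")
    if err != nil {
        panic(err)
    }
    fmt.Println("lock acquired:", ok)
}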
My advice is to switch to a locking implementation that uses flock. I've switched to flock, and it seems to work okay.
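If it helps, this is a minimal sketch of that approach using the flock(2) syscall directly (Linux-only; the lock path mirrors the one from the question and the error handling is deliberately simple):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

func main() {
    lockPath := filepath.Join(os.TempDir(), "pagerduty-read-api.lock")
    f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0644)
    if err != nil {
        panic(err)
    }
    defer f.Close()
    // LOCK_NB makes the call fail immediately if another process already holds
    // the lock; the kernel releases the lock automatically when this process exits.
    if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX|syscall.LOCK_NB); err != nil {
        fmt.Println("Already running.")
        return
    }
    // ... do the actual work while holding the lock ...
}

Because the lock lives on the file itself rather than in a recorded PID, two containers sharing the same /tmp volume contend on the same lock no matter what their PIDs look like inside their own namespaces.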
I am trying to troubleshoot a Dockerfile I found on the web. Since it is failing in a weird way, I am wondering whether failed docker build or docker run attempts (from various subsets of that file, or from other files I've been experimenting with) might have corrupted some part of Docker's own state.
In other words, would it possibly help to restart Docker itself, reboot the computer, or run some other Docker command, to eliminate that possibility?
Sometimes just rebooting things helps, and it's not wrong to try restarting Docker for Mac or doing a full reboot, but I can't think of a specific symptom either would fix, and it's not something I need to do routinely.
I've only really run into two classes of problems that sound like what you're describing.
If you have a Dockerfile step that consistently succeeds, but produces inconsistent results:
RUN curl http://may.not.exist.example.com/ || true
You can wind up in a situation where the underlying command failed or produced the wrong output, but the RUN step as a whole succeeded and was cached. docker build --no-cache will re-run the build ignoring that cache, and an extremely aggressive docker rmi sequence (deleting every build, current and past, of the image in question) will clean it up too.
The other class of problem I've encountered involves some level of corruption in /var/lib/docker. This usually has very obvious symptoms generally involving "file not found" or "failed mounting directory" type errors on a setup that you otherwise know works. I've encountered it more on native Linux than Docker for Mac, probably because the DfM Linux installation is a little more controlled and optimized for Docker (it definitely isn't running a 3-year-old kernel with arbitrary vendor patches). On Linux you can work around this by stopping Docker, deleting everything in /var/lib/docker, and starting Docker again; in Docker for Mac, on the preferences window, there's a "Reset" page with various destructive cleanup options and "Reset to factory defaults" is closest to this.
I would first attempt using Docker's 'Diagnose and Feedback' option. This generally runs tests on the health of Docker and the Docker engine.
If you're using Docker Desktop, it also has options for various troubleshooting scenarios under 'Preferences' > 'Reset', which have helped me in the past.
A brief look through previous Docker release notes suggests it has certainly been possible in the past to corrupt the Docker engine; there is evidence that such issues have been iteratively fixed since.