Our application runs in a Docker container. It retrieves files from a directory outside our network, processes them, and then needs to keep the results somewhere for a year.
What is the best approach? Should we use the writable layer of the container or a directory on the host system? What are the advantages and drawbacks of each?
Preferably, keep your container as dumb (stateless) as possible. If your container gets killed or crashes, you would lose any data stored inside it.
A volume mount would be a possible solution; it would let you store the files directly on the host. But then my question would be: why use Docker at all?
Another advantage of keeping the container dumb is that you can scale more easily. With a queue in front, you could scale the number of containers up to a certain threshold to work through the queue. :-)
Therefore I would advise you to store the data somewhere else, for example on a separate FTP server that you host and manage yourself.
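If you do go with a host directory instead, a bind mount is the usual mechanism. A minimal sketch, with placeholder paths and a placeholder image name:

    # Bind-mount a host directory into the container so processed files
    # survive container restarts and removals.
    # /srv/archive, /data and my-processing-image are placeholders.
    docker run -d \
      --name file-processor \
      -v /srv/archive:/data \
      my-processing-image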
I'm trying to find a way to identify which container created a volume, where that volume is meant to be mounted, and whether it will be reused when the container restarts, for volumes that are currently not in use.
I know I can see which container is currently using a volume, and where it's mounted in that container, but that isn't enough. I need to identify containers that are no longer running.
The Situation
I've noticed a frequently recurring problem with Docker: I create a container to test something out, make some adjustments, restart it, make some more, restart it, until I get it working how I want.
In the process I often end up with containers that create worthless volumes. After the fact, I can identify these as 8 KB volumes not currently in use and just delete them.
But many times these volumes aren't even persistent, as the container will create a new one each time it runs.
At times I look at my volumes list and see over 100 volumes, none of which are currently in use. The 8 KB ones I'll delete without a second thought, but the ones that are 12 KB, 24 KB, 100 KB, 5 MB and so on, I don't want to just delete.
I use a Portainer agent inside Portainer solely to quickly browse these volumes and decide whether each one should be kept, transferred to a bind mount, or just discarded. But it's becoming more and more of a problem, and I figure there has to be some way to identify the container a volume came from. I'm sure it will require some digging, but where? Is there no tool to do this? If I knew where the information lives, I could write a script, or even build a container just for this purpose; I just don't know where to begin.
The most annoying case is when a container creates a second container that I have no control over, and that second container uses a volume but creates a new one each time it starts.
Some examples
adoring_hellman, created by the VS Code Server container linuxserver/code-server
datadog/agent creates a container I believe is called st-vector or something similar
Both of which have access to /var/run/docker.sock.
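The closest I've gotten so far is a loop like the one below, which shows, for every volume, the containers Docker still knows about (running or stopped) that reference it; the problem is that containers that were already removed leave no trace here:

    # For each volume, show every container (running or stopped) that mounts it.
    # A volume that prints nothing is not referenced by any remaining container.
    # Containers started with --rm and already gone can't be traced this way.
    for v in $(docker volume ls -q); do
      echo "== volume: $v"
      docker ps -a --filter volume="$v" \
        --format 'container={{.Names}} image={{.Image}} status={{.Status}}'
    done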
In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming run in one Docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full-resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
A shared volume. Potential downside: one extra write and read per frame might be too slow?
A message queue or e.g. Redis. Potential downside: yet another component in the architecture.
Merging the two containers. Potential downsides: not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Most communication between Docker containers happens over network sockets. That's fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
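Locally, the wiring for that setup might look roughly like this; the stitcher and detector image names and the AMQP_URL variable are placeholders for whatever your own images expect:

    # Shared named volume plus a RabbitMQ broker on a user-defined network.
    # stitcher-image, detector-image and AMQP_URL are placeholders.
    docker network create pipeline
    docker volume create frames
    docker run -d --name broker --network pipeline rabbitmq:3
    docker run -d --name stitcher --network pipeline \
      -v frames:/frames -e AMQP_URL=amqp://broker stitcher-image
    docker run -d --name detector --network pipeline \
      -v frames:/frames -e AMQP_URL=amqp://broker detector-image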
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
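For example, just to look inside or archive a named volume you end up running a short-lived helper container, along these lines:

    # Browse the contents of a named volume called "frames".
    docker run --rm -v frames:/frames alpine ls -lh /frames

    # Archive it into the current host directory as a crude backup.
    docker run --rm -v frames:/frames -v "$PWD":/backup alpine \
      tar czf /backup/frames.tgz -C /frames .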
Alright, let's unpack this:
Shared volume: IMHO this works just fine, but it gets way too messy over time, especially if you're handling stateful services.
Message queue: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes more sense than maintaining messy shared volumes or handling a massive container image (if you manage to combine the two images).
Merging the containers: you could potentially do this, but it's not a good idea. Considering your use case, I'm going to assume you have a long list of dependencies that could conflict with each other. Also, more dependencies = larger image = larger attack surface, which from a security perspective is not a good thing.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that; I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/
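If you do go that route, a minimal setup might look roughly like the sketch below; the program names, commands, and image name are placeholders for your own services:

    # A supervisord.conf along these lines (placeholders for your programs):
    #
    #   [supervisord]
    #   nodaemon=true
    #
    #   [program:stitcher]
    #   command=/usr/local/bin/stitcher
    #
    #   [program:detector]
    #   command=/usr/local/bin/detector
    #
    # is copied into the image, and the container starts supervisord as its
    # only top-level process, which then supervises both programs:
    docker run -d --name combined my-combined-image \
      supervisord -c /etc/supervisord.conf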
I want to take a holistic approach to backing up multiple machines running multiple Docker containers. Some might run, for example, Postgres databases. I want to back up this system without having to have specific backup commands for different types of volumes.
It is fine to have a custom external script that sends e.g. signals to containers or runs Docker commands, but I strongly want to avoid anything specific to a certain image or type of image. In the example of Postgres, the documentation suggests running postgres-specific commands to backup databases, which goes against the design goals for the backup solution I am trying to create.
It is OK if I have to impose restrictions on the Docker images, as long as it is reasonably easy to implement by starting from existing Docker images and extending.
Any thoughts on how to solve this?
I just want to stress that I am not looking for a solution for how to back up Postgres databases under Docker; there are already many answers explaining how to do that. I am specifically looking for a way to back up any volume, without having to know what it is or having to run commands specific to its data.
(I considered whether this question belonged on SO or Serverfault, but I believe this is a problem to be solved by developers, hence it belongs here. Happy to move it if consensus is otherwise)
EDIT: To clarify, I want to do something similar to what is explained in this question
How to deal with persistent storage (e.g. databases) in docker
but the approach in the accepted answer is not going to work with Postgres (and, I am sure, other database containers), according to the documentation.
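(For reference, the generic technique I keep coming back to is a throwaway container that tars up the volume, roughly as below; its limitation is exactly the consistency problem above, since copying files out from under a running database only gives a crash-consistent snapshot.)

    # Image-agnostic backup of one named volume via a throwaway container.
    # VOLUME is a placeholder; for databases this is only crash-consistent
    # unless the container using the volume is stopped first.
    VOLUME=my_volume
    docker run --rm \
      -v "$VOLUME":/source:ro \
      -v "$PWD":/backup \
      alpine tar czf "/backup/$VOLUME.tgz" -C /source .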
I'm skeptical that there is a holistic, multi-machine, multi-container, application/container-agnostic solution. From my point of view, a lot of orchestration activity is necessary in the first place, and I wonder whether you wouldn't end up using something like Kubernetes anyway, which supposedly comes with its own backup solutions.
For a single-machine, multi-container setup, I suggest storing your containers' data, configuration, and any build scripts within one directory tree (e.g. /docker/) and using a standard file-based backup program to back up that root directory.
Use docker-compose to manage your containers. This lets you store the configuration and even build options in one or more files. I have an individual compose file for each service, but a single one would also work.
Have a subdirectory for each service. Put the container's bind-mounted directories (volumes) there. If you need to adapt the build process more thoroughly, you can easily store scripts, sources, Dockerfiles, etc. in there as well.
Since containers are supposed to be ephemeral, all persistent data ends up in bind mounts and therefore inside the main docker directory.
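A sketch of what that tree and its backup might look like; the service names and paths are only illustrative:

    # Illustrative layout; service names and paths are made up.
    # /docker/
    #   nextcloud/
    #     docker-compose.yml
    #     data/          <- bind-mounted into the container
    #   postgres/
    #     docker-compose.yml
    #     data/          <- bind-mounted into the container
    #
    # Back up the whole tree with any standard file-based tool, e.g.:
    tar czf /backups/docker-$(date +%F).tgz /docker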
I'm new to Docker, and as I understand it, Docker uses the same libs/bins for multiple containers where possible.
How can I tell Docker not to do that, i.e. to use a new lib or bin even if the same lib/bin already exists?
To be concrete:
I use this image, and I want to start multiple instances of geth-testnet, but each of them should use its own blockchain.
I don't believe you need to worry about this. Docker hashes the layers under the image to maximize reuse. These layers are all read-only and are mounted with a union filesystem under a container-specific read-write layer. The result is very efficient on the filesystem and transparent to the user, who sees the files as writable in their isolated container. However, if you modify a file in one container, the change will not be visible in any other container, and it will be lost when the container is removed and replaced with a new instance.
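Each instance therefore already gets its own writable layer; if you also want each blockchain to survive container replacement, give every instance its own volume. A rough sketch, where the image name and data path are assumptions you should adjust to the image you are actually using:

    # Each container automatically gets its own read-write layer; a per-instance
    # named volume additionally keeps the chain data when a container is removed.
    # your-geth-image and /root/.ethereum are assumptions.
    docker volume create chain-1
    docker volume create chain-2
    docker run -d --name geth-testnet-1 -v chain-1:/root/.ethereum your-geth-image
    docker run -d --name geth-testnet-2 -v chain-2:/root/.ethereum your-geth-image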
I'm really struggling to grasp the workflow of Docker. The issue is: where exactly are the deliverables?
One would expect the developer's image to be the same one used for testing and production.
But how can one develop with auto-reload and the like (probably via some shared volumes) without building the image again and again?
The image for testers should be "just fire it up and you're ready to go". How are the images split?
I heard something about data containers, which probably hold the app deliverables. So does that mean I will have one container for the DB, one for the app server, and one versioned image for my code itself?
The issue is: where exactly are the deliverables?
Static deliverables (which never change) are copied directly into the image.
Dynamic deliverables (which are generated during a docker run session, or which get updated) live in volumes (either a host-mounted volume or a data-container volume), in order to persist across the container life cycle.
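A minimal sketch of that distinction, with placeholder names and paths:

    # Static deliverables are baked into the image at build time, e.g. a
    # Dockerfile (names and paths are placeholders) containing:
    #
    #   FROM nginx:alpine
    #   COPY dist/ /usr/share/nginx/html/
    #
    docker build -t myapp-web .

    # Dynamic deliverables live in a volume so they outlive any one container:
    docker run -d --name web -v myapp-uploads:/var/www/uploads myapp-web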
Does that mean I will have one container for the DB, one for the app server?
Yes: in addition to your application container (which is what Docker primarily is about: putting applications in containers), you would have a data container in order to isolate the data that needs to persist.