I'm getting going with Docker, and I've found that I can put the main image repository on a different disk (symlink /var/lib/docker to some other location).
However, now I'd like to see if there is a way to split that across multiple disks.
Specifically, I have an old SSD that is blazingly fast to read from, but doesn't have many writes left before it kicks the can. It would be awesome if I could store the immutable images there, then keep my writable layers somewhere else that can handle the writes.
Is this something that is possible? How do you split up the repository?
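As an aside, the relocation part doesn't strictly require a symlink: assuming a reasonably recent daemon, you can point the whole repository at another disk via the data-root key in /etc/docker/daemon.json (older daemons used the graph key); the path below is just a placeholder:

    {
      "data-root": "/mnt/bigdisk/docker"
    }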
Maybe you could do this with the AUFS driver and some trickery, such as moving layers to the SSD after they're initially created and pointing symlinks at them (see the speculative sketch at the end of this answer). I'm not sure, though; I never had a proper look at how that storage driver works.
With devicemapper thin provisioning, btrfs and OverlayFS this isn't possible AFAICT:
The Docker dm-thinp and btrfs drivers both build layers one on top of the other using block-device snapshot mechanisms. Your best bet here would be to include the SSD in the storage pool and rely on some ability to migrate the read-only snapshots to a specific block device that is part of the pool. I doubt that exists, though.
The OverlayFS driver stacks layers by hard-linking files in independent directory structures, and hard links only work within a single filesystem.
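If you do want to experiment with the AUFS idea, a very rough and untested sketch might look like the following; it assumes the aufs driver, a stopped daemon, and that <layer-id> is one of the read-only layer directories, and it may well break on daemon upgrades:

    # purely speculative: move a read-only AUFS layer to the SSD and symlink it back
    systemctl stop docker
    mkdir -p /mnt/ssd/aufs-diff
    mv /var/lib/docker/aufs/diff/<layer-id> /mnt/ssd/aufs-diff/<layer-id>
    ln -s /mnt/ssd/aufs-diff/<layer-id> /var/lib/docker/aufs/diff/<layer-id>
    systemctl start docker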
How can I measure the efficiency of a container image, in terms of what portion of its contents is actually used (accessed) by the processes therein?
There are various forms of wastage that could contribute to excessively large images, such as layers storing files that are superseded in later layers (which can be analysed using dive), or binaries interlaced with unstripped debug information, or the inclusion of extraneous files (or data) that are simply not needed for the process which executes in the container. Here I'm asking about the latter.
Are there docker-specific tools (analogous to dive) for estimating/measuring this kind of wastage/efficiency, or should I just apply general Linux techniques? Can the filesystem access time (atime) be relied upon inside a container (to distinguish which files have/haven't been read since the container was instantiated) or do I need to instrument the image with tools like the Linux auditing system (auditd)?
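For the atime idea, a rough sketch of the general Linux technique might look like this; the container and image names are placeholders, it assumes GNU find (or another find supporting -anewer) is available in the image, and relatime/noatime mount options can make the results incomplete, which is part of what this question is asking about:

    # start the workload and drop a timestamp marker as a reference point
    docker run -d --name probe myimage
    docker exec probe touch /tmp/atime-marker
    # ... exercise the container for a while ...
    # list files whose access time is newer than the marker, i.e. read since then
    # (files read before the marker was created won't show up)
    docker exec probe sh -c 'find / -xdev -type f -anewer /tmp/atime-marker' > accessed.txt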
In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming run in one Docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched, full-resolution panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
A shared volume. Potential downside: one extra write and read per frame might be too slow?
A message queue or something like Redis. Potential downside: yet another component in the architecture.
Merging the two containers. Potential downside(s): not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Usually most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message-queueing system like RabbitMQ. For local work this could be a Docker named volume or a bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this (a minimal compose sketch follows the list):
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
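A minimal compose sketch of that wiring might look like the following; the stitcher/detector image names, the /data mount point and the AMQP_URL variable are all placeholders, and the queue/exchange setup itself happens in the application code:

    version: "3.8"
    services:
      rabbitmq:
        image: rabbitmq:3-management
      stitcher:
        image: stitcher:latest            # placeholder image
        environment:
          AMQP_URL: amqp://rabbitmq
        volumes:
          - frames:/data                  # writes full-resolution panoramas here
      detector:
        image: detector:latest            # placeholder image
        environment:
          AMQP_URL: amqp://rabbitmq
        volumes:
          - frames:/data                  # reads the files named in queue messages
    volumes:
      frames:                             # the shared storage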
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
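For example, backing up a named volume typically means running a throwaway container that mounts it (the volume and archive names here are arbitrary):

    # tar the contents of volume "frames" into the current directory on the host
    docker run --rm -v frames:/data:ro -v "$(pwd)":/backup alpine \
      tar czf /backup/frames-backup.tgz -C /data .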
Alright, let's unpack this:
Shared volume: IMHO this works just fine, but it gets way too messy over time, especially if you're handling stateful services.
Message queue: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes sense to have it rather than maintaining messy shared volumes or handling a massive container image (if you manage to combine the two images).
Merging the two containers: you could potentially do this, but it's not a good idea. Considering your use case, I'm going to go ahead and assume that you have a massive list of dependencies which could potentially lead to conflicts. Also, lots of dependencies = larger image = larger attack surface, which is not a good thing from a security perspective.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that; I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/
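A minimal supervisord.conf along those lines might look like this; the program names and commands are placeholders, and the linked Docker docs page shows the full pattern, including the Dockerfile that copies this file in and runs supervisord as the container's main process:

    ; minimal sketch - program names and commands are placeholders
    [supervisord]
    nodaemon=true

    [program:stitcher]
    command=/usr/local/bin/stitcher

    [program:detector]
    command=/usr/local/bin/detector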
According to the Docker docs, every Dockerfile instruction creates a layer, and all the layers are kept when you create a new image based on an old one. So when I create my own image, it might involve hundreds of layers because of the recursive inheritance of layers from the base image.
In my understanding, file lookup in a container works this way:
A process wants to access file a; the lookup starts from the container layer (the thin read/write layer).
UnionFS checks whether this layer has a record for the file (either present or marked as deleted). If yes, it returns the file or reports "not found" respectively, ending the lookup. If not, it passes the task to the layer below.
The lookup ends at the bottom layer.
If that is how it works, consider a file that resides in the bottom layer and is unchanged by the other layers, /bin/sh maybe: looking it up would mean going through all the layers down to the bottom. Even though the layers might be very lightweight, such a lookup would still take 100x the time of a regular one, which should be noticeable. But from my experience, Docker is pretty fast, almost the same as a native OS. Where am I wrong?
This is all thanks to UnionFS and Union mounts!
Straight from wikipedia:
It allows files and directories of separate file systems, known as branches, to be transparently overlaid, forming a single coherent file system.
And from an interesting article:
In the kernel, the filesystems are stacked in order of their mount sequence, the first mounted filesystem is at the bottom of the mount stack, and the latest mount is at the top of the stack. Only the files and directories of the top of the mount stack are visible. With union mounts, directory entries from the lower filesystems are merged with the directory entries of upper filesystem, thus making a logical combination of all mounted filesystems. Files with the same name in a lower filesystem are masked, as the upper one takes precedence.
So it doesn't "go through layers" in the conventional sense (i.e. one at a time); rather, the union mount knows (at any given time) which layer each file resides in.
Doing this in the filesystem layer also means none of the software has to worry about where the file resides, it knows to ask for /bin/sh and the filesystem knows where to get it.
More info can be found in this webinar.
So to answer your question:
Where am I wrong?
You are thinking that it has to look through the layers one at a time while it doesn't have to do that. (UnionFS is awesome!)
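You can see this concretely on the host, assuming the overlay2 driver (other drivers will look different): every read-only layer shows up as a lowerdir of a single merged mount, so resolving /bin/sh is one lookup in one mount, not a walk through N layers:

    docker run -d --name demo alpine sleep 300     # "demo" is just a throwaway container
    mount | grep 'overlay2.*merged'
    # overlay on /var/lib/docker/overlay2/<id>/merged type overlay
    #   (rw,relatime,lowerdir=...:...:...,upperdir=...,workdir=...)
    docker rm -f demo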
To add to the correct prior answer: copy-on-write (CoW) and union filesystem implementors want near-native performance, so of course they have tuned their implementations and "API" for the best possible lookup/filesystem performance.
That said, it's good to be aware that Docker does not operate on top of only a single 'type' of union/CoW filesystem, but has a small array of available options, with defaults depending on the Linux distro on which it is installed.
AUFS and overlay(fs) are the most common, but Docker also supports devicemapper (Red Hat contributed and supported on Fedora/RHEL/CentOS), btrfs, and zfs. I have a blog post comparing and contrasting the various options that may be of interest.
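If you're unsure which of these your installation ended up with, the daemon reports it:

    docker info --format '{{.Driver}}'
    # prints e.g. overlay2, aufs, devicemapper, btrfs or zfs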
Can we share a common/single named volume across multiple hosts in Docker Engine swarm mode? What's the easiest way to do it?
If you have an NFS server set up, you can use an NFS folder as a volume from Docker Compose like this:
    volumes:
      grafana:
        driver: local
        driver_opts:
          type: nfs
          o: addr=192.168.xxx.xx,rw
          device: ":/PathOnServer"
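The same volume can be created outside Compose with the plain CLI (the volume name is arbitrary):

    docker volume create --driver local \
      --opt type=nfs \
      --opt o=addr=192.168.xxx.xx,rw \
      --opt device=:/PathOnServer \
      grafana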
In the grand scheme of things
The other answers are definitely correct. If you feel like you're still missing something or are coming to the conclusion that things might never really improve in this space, then you might want to reconsider the use of the typical POSIX-like hierarchical filesystem abstraction. Not all applications really need it (I might go as far as to say that few do). Maybe yours doesn't either.
In defense of filesystems
Relying on (distributed) filesystems is still very common in many circles, but usually those people know their remote/distributed filesystems very well and know how to set them up and leverage them properly (and they might be very good systems too, though often not with existing Docker volume drivers). Sometimes it's also partly because they're simply forced to (codebases that can't or shouldn't be rewritten to support other storage backends). Using, configuring or even writing arbitrary Docker volume drivers would be a secondary concern only.
Alternatives
If you have the option however, then evaluate other persistence solutions for your applications. Many implementations won't use POSIX filesystem interfaces but network interfaces instead, which pose no particular infrastructure-level difficulties in clusters such as Docker Swarm.
Solutions managed by third-parties (e.g. cloud providers)
Should you succeed in removing all dependencies on filesystems for persistent and shared data (it's still fine for transient local state), then you might claim to have fully "stateless" applications. Of course there is almost always some state persisted somewhere still, but the idea is that you don't handle it yourself. Many cloud providers (if that's where you're hosting things) offer fully managed solutions for handling persistent state, so that you don't have to care about it at all. If you're going this route, do consider managed services that use APIs compatible with implementations you can run locally for testing (for example by running a Docker container based on an image for that implementation, provided by a third party or maintained by yourself).
DIY solutions
If you do want to manage persistent state yourself within a Docker Swarm cluster, then the filesystem abstraction is often inevitable (and you'd probably have more difficulties targeting block devices directly anyway). You'll want to play with node and service constraints to ensure the requirements of whatever you use to persist data are fulfilled. For certain things like a central DBMS server it could be easy ("always run the task on that specific node only"), for others it could be way more involved.
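As a sketch of what playing with constraints can look like (the node, label, volume and service names here are all made up), you can pin a stateful service to the node that actually holds its data:

    # label the node that has the data disk
    docker node update --label-add storage=primary node-1
    # pin the database service to that node
    docker service create --name db \
      --constraint 'node.labels.storage == primary' \
      --mount type=volume,source=dbdata,target=/var/lib/postgresql/data \
      postgres:15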
The task of setting up, scaling and monitoring such a setup is definitely not trivial, which is why many application developers are happy to let somebody else (e.g. cloud providers) do it. It's still a very cool space to explore; however, given that you had to ask this question, it's likely not something you should focus on if you're on a deadline.
Conclusion
As always, use the right abstraction for the job, and pause to think about what your strengths are and where to spend your resources.
Out of the box, Docker does not support this by itself. You must use additional components: either a Docker volume plugin, which would provide you with a new driver for your volumes, or a sync tool running directly on your filesystem which will sync the data for you.
From my point of view, the easiest solution is rsync, or more accurately lsyncd, the daemon version of rsync. But I have never tried it for Docker volumes, so I can't tell whether it handles them fine.
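For reference, a one-shot, one-way rsync of a local named volume's data directory to a second host could look like this (the volume name and host are placeholders; lsyncd essentially keeps running something similar in response to filesystem events):

    # push the contents of volume "myvol" from host A to host B (one-way only)
    rsync -az --delete \
      /var/lib/docker/volumes/myvol/_data/ \
      user@hostB:/var/lib/docker/volumes/myvol/_data/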
Other solutions are offered by Infinit.sh. It basically does the same thing as lsyncd: a one-way sync. So if your Docker containers are read-write on their volumes, it won't match your expectations. I tried this solution and it works pretty well for read-only operations, though not in production; it's still an alpha version. Infinit is also on the way to providing a Docker driver, but it's not released yet, so I didn't even try it. Too risky.
Other solutions I found but was unable to install (and therefore to try) are Flocker and GlusterFS. Both are designed to create filesystem volumes based on several disks from several machines. But neither of their repositories was working these past weeks.
Sorry for giving you only weak solutions, but I'm facing the same problem and haven't found a perfect one yet.
Cheers,
Olivier
The Docker vfs storage backend is mentioned in several places as not being a production backend (see for example this Docker GitHub issue comment by Michael Crosby). What makes it unsuitable for production?
Project Atomic's description of storage backends says:
The vfs backend is a very simple fallback that has no copy-on-write support. Each layer is just a separate directory. Creating a new layer based on another layer is done by making a deep copy of the base layer into a new directory.
Since this backend doesn’t share diskspace use between layers, and since creating a new layer is a slow operation this is not a very practical backend. However, it still has its uses, for instance to verify other backends against, or if you need a super robust (if slow) backend that works everywhere.
According to that description it sounds like the only downside is that more disk space might be used and creating layers might be slower. But there are no mentions of downsides during runtime when accessing files, and it is even described as "robust". The disk space issue alone does not seem like a blocker for production use.
Indeed, you could use the vfs driver in production; however, be aware that since it makes 'regular' full copies, you won't benefit from the features that devicemapper or btrfs can provide, and you rely exclusively on the underlying file system.
The runtime downside is that it is much slower to create and start containers, since every layer is a deep copy. Once started, if you have the same underlying file system, file access will perform the same.
In short, I would recommend against because:
It was implemented first for tests and then used for volumes; it was never meant to be used at runtime.
It relies on the underlying file system, so you give Docker less control over your files. It might (or might not) cause issues with future upgrades. The very purpose of Docker is to abstract the host away, so you are better off delegating this kind of thing to Docker.
It takes a lot of disk space
It takes a lot of time to run or commit
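For completeness: the driver is normally chosen in /etc/docker/daemon.json (followed by a daemon restart), which is where you would explicitly opt in to, or move away from, vfs; note that switching drivers makes previously pulled images and existing containers invisible until you switch back:

    {
      "storage-driver": "overlay2"
    }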