I have a node app that allows people to upload their profile picture. Profile pictures are stored on the file system.
I now want to turn my node app into a Docker container.
I would like to be able to deploy it pretty much anywhere (Amazon, etc.) and realise that storing files within the container's file system is a no-go.
So:
Option 1: store files on Amazon's S3 (or something equivalent)
Option 2: creating a "data volume". This makes me wonder: if I deploy this remotely, will this work? Would this be a long-term way to go about it?
Are volumes what I want to do here? Is this how you use docker volumes in Amazon?
(Damn this stuff is hard to crack...)
The answer is: that depends hehehe
Option 1 is good and resilient, and works just "out of the box", but it creates vendor lock-in: if you ever decide to stop using AWS, you'll have some code to refactor. Plus, the S3 bill can get high if you perform lots and LOTS of requests.
Option 2 only works partially: your Docker containers will likely run on VMs on AWS, Azure, etc., which are themselves ephemeral. That means your data can disappear just as quickly as if it were inside the containers, unless you back it up.
Other options I know:
Option 3: AWS has an NFS service, EFS (Elastic File System), which seems VERY interesting. In theory, it would be like plugging a USB drive into the VM instances, which can then be mounted as a volume inside the containers. I have never done this myself, but it seems a viable way to reduce S3 costs. However, this also creates vendor lock-in.
Option 4: AWS S3 with a caching mechanism for files in the VMs.
If you are just testing, option 1 sounds good! But later on you might have to revisit that choice.
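For reference, a named data volume (option 2) on a single host looks roughly like this; the data lives on that host's disk, which is exactly why it disappears with the VM unless you back it up (the image and volume names are examples):

```sh
# create a named volume and mount it at the app's upload directory (names are examples)
docker volume create profile-pics
docker run -d --name myapp -v profile-pics:/app/uploads my-node-app:latest
```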
Related
In a current project I have to perform the following tasks (among others):
capture video frames from five IP cameras and stitch a panorama
run machine learning based object detection on the panorama
stream the panorama so it can be displayed in a UI
Currently, the stitching and the streaming runs in one docker container, and the object detection runs in another, reading the panorama stream as input.
Since I need to increase the input resolution for the object detector while maintaining the stream resolution for the UI, I have to look for alternative ways of getting the stitched (full-resolution) panorama (~10 MB per frame) from the stitcher container to the detector container.
My thoughts regarding potential solutions:
A shared volume. Potential downside: one extra write and read per frame might be too slow?
Using a message queue or e.g. Redis. Potential downside: yet another component in the architecture.
Merging the two containers. Potential downside(s): not only does it not feel right, but the two containers have completely different base images and dependencies. Plus I'd have to worry about parallelization.
Since I'm not the sharpest knife in the docker drawer, what I'm asking for are tips, experiences and best practices regarding fast data exchange between docker containers.
Usually most communication between Docker containers is over network sockets. This is fine when you're talking to something like a relational database or an HTTP server. It sounds like your application is a little more about sharing files, though, and that's something Docker is a little less good at.
If you only want one copy of each component, or are still actively developing the pipeline: I'd probably not use Docker for this. Since each container has an isolated filesystem and its own user ID space, sharing files can be unexpectedly tricky (every container must agree on numeric user IDs). But if you just run everything on the host, as the same user, pointing at the same directory, this isn't a problem.
If you're trying to scale this in production: I'd add some sort of shared filesystem and a message queueing system like RabbitMQ. For local work this could be a Docker named volume or bind-mounted host directory; cloud storage like Amazon S3 will work fine too. The setup is like this:
Each component knows about the shared storage and connects to RabbitMQ, but is unaware of the other components.
Each component reads a message from a RabbitMQ queue that names a file to process.
The component reads the file and does its work.
When it finishes, the component writes the result file back to the shared storage, and writes its location to a RabbitMQ exchange.
In this setup each component is totally stateless. If you discover that, for example, the machine-learning component of this is slowest, you can run duplicate copies of it. If something breaks, RabbitMQ will remember that a given message hasn't been fully processed (acknowledged); and again because of the isolation you can run that specific component locally to reproduce and fix the issue.
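As a rough illustration of one stage of that loop, here is a minimal worker sketch in Python using pika; the queue name, exchange name, and the /shared mount point are assumptions for the example, not something from your setup:

```python
# worker.py - minimal sketch of one pipeline stage (all names are hypothetical)
import json
import os

import pika

SHARED_DIR = "/shared"             # the shared volume / filesystem mount
IN_QUEUE = "frames.stitched"       # queue this stage consumes from
OUT_EXCHANGE = "frames.processed"  # exchange this stage publishes results to


def process(name):
    """Placeholder for the real work (e.g. object detection on the named file)."""
    out_name = name + ".result"
    with open(os.path.join(SHARED_DIR, name), "rb") as src, \
         open(os.path.join(SHARED_DIR, out_name), "wb") as dst:
        dst.write(src.read())      # stand-in for the actual processing
    return out_name


def on_message(channel, method, properties, body):
    msg = json.loads(body)                       # e.g. {"file": "frame-0001.png"}
    result = process(msg["file"])
    channel.basic_publish(exchange=OUT_EXCHANGE, routing_key="",
                          body=json.dumps({"file": result}))
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the work is done


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue=IN_QUEUE, durable=True)
channel.exchange_declare(exchange=OUT_EXCHANGE, exchange_type="fanout")
channel.basic_qos(prefetch_count=1)              # one frame at a time per worker copy
channel.basic_consume(queue=IN_QUEUE, on_message_callback=on_message)
channel.start_consuming()
```

The same skeleton works for the stitcher and the detector; only process() and the queue/exchange names change, which is what makes it easy to run extra copies of the slow stage.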
This model also translates well to larger-scale Docker-based cluster-computing systems like Kubernetes.
Running this locally, I would absolutely keep separate concerns in separate containers (especially if individual image-processing and ML tasks are expensive). The setup I propose needs both a message queue (to keep track of the work) and a shared filesystem (because message queues tend to not be optimized for 10+ MB individual messages). You get a choice between Docker named volumes and host bind-mounts as readily available shared storage. Bind mounts are easier to inspect and administer, but on some platforms are legendarily slow. Named volumes I think are reasonably fast, but you can only access them from Docker containers, which means needing to launch more containers to do basic things like backup and pruning.
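Locally, the named-volume variant of that shared storage might look roughly like this in Compose; the service names, images, and volume name are made up for the sketch:

```yaml
# docker-compose.yml - sketch only; images and names are placeholders
version: "3.8"
services:
  stitcher:
    image: example/stitcher:latest
    volumes:
      - frames:/shared            # writes full-resolution panoramas here
  detector:
    image: example/detector:latest
    volumes:
      - frames:/shared            # reads the same files
  rabbitmq:
    image: rabbitmq:3-management
volumes:
  frames:                         # Docker named volume shared by both containers
```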
Alright, let's unpack this:
Shared volume: IMHO it works just fine, but gets way too messy over time, especially if you're handling stateful services.
Message queue: this seems like the best option in my opinion. Yes, it's another component in your architecture, but it makes more sense to have it than to maintain messy shared volumes or handle a massive container image (if you manage to combine the two images).
Merging the containers: yes, you could potentially do this, but it's not a good idea. Considering your use case, I'm going to assume you have a massive list of dependencies, which could lead to conflicts. Also, lots of dependencies = larger image = larger attack surface, which from a security perspective is not a good thing.
If you really want to run multiple processes in one container, it's possible. There are multiple ways to achieve that; I prefer supervisord.
https://docs.docker.com/config/containers/multi-service_container/
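For completeness, a minimal supervisord setup could look like this; the program names, commands, and config path are placeholders:

```ini
; supervisord.conf - sketch only; program names and commands are placeholders
[supervisord]
nodaemon=true                      ; keep supervisord in the foreground so the container stays up

[program:stitcher]
command=python /app/stitcher.py
autorestart=true

[program:detector]
command=python /app/detector.py
autorestart=true
```

The image's CMD would then be something like supervisord -c /etc/supervisord.conf.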
I have this idea for what I think should happen with my project, but I want to check in and see if it works on a theoretical level first. Basically I am working on a Django site that runs on Kubernetes, but am struggling a little bit with how I should set up my ReplicaSet/StatefulSet to manage uploaded content (images).
My biggest concern is trying to find out how to scale and maintain uploaded content. My first idea is that I need to have a single volume that has these files written to it, but can I have multiple pods write to the same volume that way while scaling?
From what I have gathered, it doesn't seem to work that way. It sounds more like each pod, or at least each node, would have its own volume. But would a request for an image reach the volume it is stored on? Or should I create a custom backend program to move things around so that it is served off an NGINX server like my other static content?
FYI - this is my first scalable project lol. But I am really just trying to find the best way to manage uploads... or a way in general. I would appreciate any explanations, thoughts, or fancy diagrams on how something like this might work!
Hello, I think you should forget Kubernetes for a bit and think about the architecture and capabilities of your Django application. I guess you have built a web app that offers some 'upload image' functionality, and then you have code that 'stores' this image somewhere. In the simplest scenario, you run the app on your laptop and the web app is configured to save this content to a local folder. A slightly more advanced example is deploying the application to a VM or a cloud VM, e.g. an AWS EC2 instance, where the app saves the files to that instance's local storage.
The question then becomes: what happens if you have two instances of your web app deployed - can they be configured and run so that they 'share' the same folder to save the images? I guess this is what you want; otherwise your app would not scale horizontally, and each user would have to hit one specific instance in order to upload or retrieve specific images. With that in mind, this is a design decision of your application, which I am pretty sure you have already worked out. You then need to think: how can I share a folder or a bucket so that all instances of my web app can save files to it? If you spun up three different VMs, on any cloud, you would need some kind of shared storage so that all three instances point to the same physical storage location - an NFS drive, or a cloud storage service such as S3.
Having all the above in mind, and clearly understanding that you need to decouple your application from the notion of local storage (especially if you want to make it as stateless as it gets, whatever that means to you): a web app packaged as a Docker container and deployed in a Kubernetes cluster as a pod, saving files to local storage, is not going to get you far. Each pod, each Docker container, will use the underlying Kubernetes worker's (VM's) storage to save files, so another instance will be saving its files on some other VM, and so on.
Kubernetes provides an abstraction for applications (pods) that want to 'share' some storage within the cluster and, of course, persist it. Something I did not add above: if you save files in the Kubernetes worker or in the pod, you will lose that data once the VM/instance is restarted. So you want something durable.
To cut a long story short,
1) You can deploy your application / pod along with a PersistentVolumeClaim, assuming your Kubernetes cluster supports it. What happens is that you mount into your pod some kind of folder / storage which is backed by whatever is available to your cluster - some kind of NFS store, for example (see the first sketch after this list). https://kubernetes.io/docs/concepts/storage/persistent-volumes/
2) You can 'outsource' the need for shared storage to some external provider, e.g. in the common case an S3 bucket, and not tackle the problem in Kubernetes at all - just keep provisioning the app within Kubernetes (see the second sketch after this list).
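A minimal sketch of option 1, a PersistentVolumeClaim mounted into the pod; the names, size, and storage class are placeholders and depend on what your cluster offers:

```yaml
# pvc-and-pod.yaml - sketch only; names, size and storageClassName are placeholders
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uploads-pvc
spec:
  accessModes: ["ReadWriteMany"]   # needed if several pods must write the same files
  resources:
    requests:
      storage: 5Gi
  storageClassName: nfs-client     # whatever shared-storage class your cluster provides
---
apiVersion: v1
kind: Pod
metadata:
  name: django-app
spec:
  containers:
    - name: django
      image: example/django-app:latest
      volumeMounts:
        - name: uploads
          mountPath: /app/media    # e.g. Django's MEDIA_ROOT
  volumes:
    - name: uploads
      persistentVolumeClaim:
        claimName: uploads-pvc
```

And a sketch of option 2 on the Django side, using the django-storages package so uploads go to S3 instead of local disk; the bucket and region are placeholders, and credentials would normally come from the environment or an IAM role:

```python
# settings.py - sketch only; assumes django-storages and boto3 are installed
INSTALLED_APPS += ["storages"]

DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"
AWS_STORAGE_BUCKET_NAME = "my-upload-bucket"   # placeholder
AWS_S3_REGION_NAME = "eu-west-1"               # placeholder
```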
I hope I gave you some basic ideas.
Note: Kubernetes 1.14 (March 2019) now comes with Durable Local Storage Management in GA, which:
Makes locally attached (non-network attached) storage available as a persistent volume source.
Allows users to take advantage of the typically cheaper and improved performance of persistent local storage. (kubernetes/kubernetes: #73525, #74391, #74769; kubernetes/enhancements: #121 (KEP))
That might help you secure truly persistent storage for your case.
As noted by x-yuri in the comments:
See more with "Kubernetes 1.14: Local Persistent Volumes GA", from Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber).
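As a rough idea of what that looks like, a local PersistentVolume is declared per node; the path, capacity, and node name below are placeholders:

```yaml
# local-pv.yaml - sketch only; path, capacity and node name are placeholders
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-uploads
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/uploads
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node-1"]
```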
You could use IPFS: https://pypi.org/project/django-ipfs-storage/
By creating a container with this image https://hub.docker.com/r/ipfs/go-ipfs/ in the same pod, you can refer to it as 'localhost'.
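A sketch of that layout as a pod with a go-ipfs sidecar; the image tags and names are examples, and django-ipfs-storage would then be pointed at the API on http://localhost:5001:

```yaml
# pod sketch - go-ipfs sidecar next to the Django container (names/tags are examples)
apiVersion: v1
kind: Pod
metadata:
  name: django-with-ipfs
spec:
  containers:
    - name: django
      image: example/django-app:latest
    - name: ipfs
      image: ipfs/go-ipfs:latest
      ports:
        - containerPort: 5001      # IPFS HTTP API
```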
I want to take a holistic approach to backing up multiple machines running multiple Docker containers. Some might run, for example, Postgres databases. I want to back up this system without having to have specific backup commands for different types of volumes.
It is fine to have a custom external script that sends e.g. signals to containers or runs Docker commands, but I strongly want to avoid anything specific to a certain image or type of image. In the example of Postgres, the documentation suggests running postgres-specific commands to backup databases, which goes against the design goals for the backup solution I am trying to create.
It is OK if I have to impose restrictions on the Docker images, as long as it is reasonably easy to implement by starting from existing Docker images and extending.
Any thoughts on how to solve this?
I just want to stress that I am not looking for a solution for how to back up Postgres databases under Docker, there are already many answers explaining how to do so. I am specifically looking for a way to back up any volume, without having to know what it is or having to run specific commands for its data.
(I considered whether this question belonged on SO or Serverfault, but I believe this is a problem to be solved by developers, hence it belongs here. Happy to move it if consensus is otherwise)
EDIT: To clarify, I want to do something similar to what is explained in this question
How to deal with persistent storage (e.g. databases) in docker
but using the approach in the accepted answer is not going to work with Postgres (and I am sure other database containers) according to documentation.
I'm skeptical that there is a holistic, multi-machine, multi-container, application/container-agnostic approach. From my point of view, a lot of orchestration activity would be necessary in the first place. And I wonder if you wouldn't end up using something like Kubernetes anyway, which - supposedly - comes with its own backup solutions.
For a single-machine, multi-container setup, I suggest storing your containers' data, configuration, and any build scripts within one directory tree (e.g. /docker/) and using a standard file-based backup program to back up that root directory.
Use docker-compose to manage your containers. This lets you store the configuration and even build options in a file (or files). I have an individual compose file for each service, but a single one would also work.
Have a subdirectory for each service. Bind-mount the container's data directories (its volumes) there. If you need to adapt the build process more thoroughly, you can easily store scripts, sources, Dockerfiles, etc. in there as well.
Since containers are supposed to be ephemeral, all persistent data should live in bind mounts and therefore under the main docker directory (see the sketch below).
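A sketch of that layout; the service names and mount paths are just examples:

```
/docker/
  service-a/
    docker-compose.yml
    data/                          # bind-mounted into the container
  service-b/
    docker-compose.yml
    Dockerfile
    data/
```

```yaml
# /docker/service-a/docker-compose.yml - sketch; image and mount path are placeholders
version: "3.8"
services:
  service-a:
    image: example/service-a:latest
    volumes:
      - ./data:/var/lib/service-a  # persistent data stays under /docker/service-a/data
```

A file-based backup of /docker/ then captures configuration and data together.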
Okay, another question.
In AWS I have EBS, which allows me to create volumes, define IOPS/size for them, mount them to the desired EC2 machines, and take snapshots.
How can I achieve the same features in Jelastic? I have the option to create a "Storage Container", but it belongs to only one environment. How can I back up this volume?
Also, what's the best practice of managing storage devices for things like databases? Use separate storage container?
I have the option to create a "Storage Container", but it belongs to only one environment.
Yes the Storage Container belongs to 1 environment (either part of one of your other environments, or its own), but you can mount it in 1+ other containers (i.e. inside containers of other environments).
You can basically consider a storage container to be similar to AWS EBS: it can be mounted anywhere you like (multiple times even) in containers within environments in the same region.
How can I back up this volume?
Check your hosting provider's backup policy. In our case we perform backups of all containers for our customers for free. Customers do not need to take additional backups themselves. No need for those extra costs and steps... It might be different at some other Jelastic providers so please check this with your chosen provider(s).
If you wish to make your own backups, you can define a script to do it and set it in cron for example. That script can transfer archives to S3 or anything you wish.
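For example, a very small cron-driven script along those lines; the paths, bucket, and schedule are placeholders, and it assumes the AWS CLI is installed and configured:

```sh
#!/bin/sh
# backup.sh - sketch only; archive the mounted data directory and ship it to S3
STAMP=$(date +%Y%m%d-%H%M%S)
tar czf "/tmp/data-$STAMP.tar.gz" /data
aws s3 cp "/tmp/data-$STAMP.tar.gz" "s3://my-backups/data-$STAMP.tar.gz"
rm "/tmp/data-$STAMP.tar.gz"

# crontab entry (run daily at 03:00):
# 0 3 * * * /root/backup.sh
```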
what's the best practice of managing storage devices for things like databases?
Just like with AWS, you may experience performance issues if you use remote storage for database access. Jelastic should generally give you lower latency than EBS, but even so I recommend keeping your database storage local (not via storage containers).
Unlike AWS EC2, you do not have the general risk of local storage disappearing (i.e. your Jelastic containers local storage is not ephemeral; you can safely write data there and expect it to be persistent).
If you need multiple database nodes, it is recommended to use database software level clustering features (master-master or master-slave replication for example) instead of sharing the filesystem.
Remember that any shared filesystem is a shared (single) point of failure. What you gain in application / software convenience you may also lose in reliability / high availability. It's often worth making the extra steps in your application to handle this issue another way, or perhaps consider using lsyncd (there are Jelastic marketplace addons for this) to replicate parts of your filesystem instead of mounting a shared storage container.
I was wondering if it is possible to offer Docker images, but not allow any access to the internals of the built containers. Basically, the user of the container images can use the services they provide, but can't dig into any of the code within the containers.
Call it a way to obfuscate the source code, but also offer a service (the software) to someone on the basis of the container, instead of offering the software itself. Something like "Container as a Service", but with the main advantage that the developer can use these container(s) for local development too, but with no access to the underlying code within the containers.
My first thought is that whoever controls the Docker host controls everything, down to root access. So no, it isn't possible. But I am new to Docker and am not aware of all of its possibilities.
Is this idea in any way possible?
An obfuscation-only solution would not be enough, as "Encrypted and secure docker containers" details.
You would need full control of the host your containers run on in order to prevent any "poking". And that is not the case in your scenario, where the developer does have access to the host (i.e. his/her local development machine) where said container would run.
What is sometimes done is to run some piece of "core" code in a remote location (a remote server, a USB device), in such a way that this external piece of code can, on the one hand, do some client authentication, but also - and more importantly - run some core business code, guaranteeing that the externally located code has to be executed for the job to get done. If it were only a check rather than actual core code, a cracker could just override it and avoid calling it on the client side. But if the code is actually required to run and it doesn't, the software won't be able to finish its processing. Of course there is overhead in all of this, both in complexity and probably in computation time, but that's one way you could deploy something that will unfailingly be required to contact your server/external device.
Regards,
Eduardo