How to separate application and data syncing implementations in Kubernetes?

I want to build an application/service which uses static application data that can get updated over time. Currently, I implement this by keeping both the application and the data in the same container. However, this requires a redeployment whenever either the application or the data changes. I want to separate the application and the data volume so that I can update each independently, meaning I won't have to rebuild the application layer when the application data is updated, and vice versa.
Here are the characteristics of the Application Data and its Usage:
Data is not frequently updated, but read very frequently by the application
Data is not a database; it's a collection of file objects ranging from 100 MB to 4 GB in size, initially stored in cloud storage
Data stored in the cloud storage serves as the single source of truth for the application data
The application will only read from the data. The process of updating the data in cloud storage is an external process outside the application's scope.
So here, we are interested in syncing the data from cloud storage to a volume in a Kubernetes deployment. What's the best way to achieve this in Kubernetes?
I have several options in mind:
Using one app container in one deployment, where the app also includes the logic for loading and updating data, pulling it from cloud storage into the container --> simple, but tightly coupled to the storage read-write implementation
Using the cloud storage directly from the app --> this doesn't require a container volume, but I am concerned about the large file sizes because the app is an interactive service which requires quick responses
Using two containers in one deployment sharing the same volume --> allows great flexibility in the storage read-write implementation
one container for the application service, reading from the shared volume
one container for updating the data, listening for data-update requests and writing to the shared volume --> this process pulls data from cloud storage into the shared volume
Using one container with a persistent disk
an external process which writes to the persistent disk (not sure how to do this yet with cloud storage/file objects; I need to find a way to sync GCS to a persistent disk)
one container for the application service, reading from the mounted volume
Using FUSE mounts
an external process which writes to cloud storage
a container which uses FUSE mounts
I am currently leaning towards option 3, but I am not sure if it's common practice for achieving my objective. Please let me know if you have better solutions.

Yes, option 3 is the most common approach, but make sure you use an initContainer to copy the data from your cloud storage to a local volume. That local volume can be any of the types supported by Kubernetes.
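For illustration, here is a minimal sketch of that pattern, assuming the data lives in a GCS bucket; the bucket name, images, and mount paths are placeholders rather than anything from the question:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      volumes:
      # Pod-local scratch volume shared by the init and app containers.
      - name: app-data
        emptyDir: {}
      initContainers:
      # Runs to completion before the app starts, seeding the volume
      # from the cloud-storage source of truth.
      - name: fetch-data
        image: google/cloud-sdk:slim
        command: ["gsutil", "-m", "rsync", "-r", "gs://my-data-bucket", "/data"]
        volumeMounts:
        - name: app-data
          mountPath: /data
      containers:
      - name: app
        image: my-app:latest        # placeholder application image
        volumeMounts:
        - name: app-data
          mountPath: /data
          readOnly: true            # the app only reads the data

If you later want updates without a pod restart (option 3 in the question), the same volume can also be mounted read-write by a second regular container that re-runs the sync on demand.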

Related

What purpose do ephemeral volumes serve in Kubernetes?

I've recently started learning Kubernetes and I've noticed that among the various tutorials online there's almost no mention of Volumes. Tutorials cover Pods, ReplicaSets, Deployments, and Services - but they usually end there with some example microservice app built using a combination of those four. When it comes to databases they simply deploy a pod with the "mongo" image, give it a name and a service so that other pods can see it, and leave it at that. There's no discussion of how the data is written to disk.
Because of this I'm left to assume that with no additional configuration, containers are allowed to write files to disk. I don't believe this implies files are persistent across container restarts, but if I wrote a simple NodeJS application like so:
const fs = require("fs");
// Write a file to the container's local filesystem...
fs.writeFileSync("test.txt", "blah");
// ...then read it back.
const value = fs.readFileSync("test.txt", "utf8");
console.log(value); // expected to print "blah"
I suspect this would properly output "blah" and not crash due to an inability to write to disk. (Note that I haven't tested this because, as I'm still learning Kubernetes, I haven't gotten to the point where I know how to put my own custom images in my cluster yet -- I've only loaded images already on Docker Hub so far.)
When reading up on Kubernetes Volumes, however, I came upon the Ephemeral Volume -- a volume that:
get[s] created and deleted along with the Pod
The existence of ephemeral volumes leads me to one of two conclusions:
Containers can't write to disk without being granted permission (via a Volume), and so every tutorial I've seen so far is bunk because mongo will crash when you try to store data
Ephemeral volumes make no sense because you can already write to disk without them, so what purpose do they serve?
So what's up with these things? Why would someone create an ephemeral volume?
Container processes can always write to the container-local filesystem (Unix permissions permitting); but any content that goes there will be lost as soon as the pod is deleted. Pods can be deleted fairly routinely (if you need to upgrade the image, for example) or outside your control (if the node it was on is terminated).
In the documentation, the list of ephemeral volume types highlights two major uses:
emptyDir volumes, which are generally used to share content between containers in a single pod (and more specifically to publish data from an init container to the main container); and
injecting data from a ConfigMap, the downward API, or another data source that might be totally artificial
In both of these cases the data "acts like a volume": you specify where it comes from, and where it gets mounted, and it hides any content that was in the underlying image. The underlying storage happens to not be persistent if a pod is deleted and recreated, unlike persistent volumes.
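As a minimal sketch of the ConfigMap case (the ConfigMap name, image, and mount path are assumptions for illustration):

apiVersion: v1
kind: Pod
metadata:
  name: config-demo
spec:
  volumes:
  - name: settings
    configMap:
      name: app-settings       # an existing ConfigMap, assumed here
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "cat /etc/app/* && sleep 3600"]
    volumeMounts:
    - name: settings
      mountPath: /etc/app      # hides whatever the image had at this path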
Generally prepackaged versions of databases (like Helm charts) will include a persistent volume claim (or create one per replica in a stateful set), so that data does get persisted even if the pod gets destroyed.
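A hedged sketch of that stateful-set pattern, with illustrative names and sizes - volumeClaimTemplates gives each replica its own PersistentVolumeClaim that survives pod deletion:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo
spec:
  serviceName: mongo
  replicas: 1
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      containers:
      - name: mongo
        image: mongo:6
        volumeMounts:
        - name: data
          mountPath: /data/db      # where mongod keeps its files
  volumeClaimTemplates:
  # One PVC per replica; it outlives pod deletion and rescheduling.
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi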
So what's up with these things? Why would someone create an ephemeral volume?
Ephemeral volumes are more of a conceptual thing. The main need for this concept is driven by microservices and orchestration practices, and also guided by the twelve-factor app methodology. But why?
One major use case: when you are deploying a number of microservices (and their replicas) as containers across multiple machines in a cluster, you don't want a container to be reliant on its own storage. If containers rely on their own storage, shutting them down and starting new ones affects the way your app behaves, and this is something everyone wants to avoid. Everyone wants to be able to start and stop containers whenever they want, because this allows easy scaling, updates, etc.
When you actually need a service with persistent data (like a DB) you need a special type of configuration, especially if you are running on a cluster of machines. If you are running on one machine, you could use a mounted volume, just to be sure that your data will persist even after the container is stopped. But if you want to just load balance across hundreds of stateless API services, ephemeral volumes are what you actually want.

Kubernetes: Managing uploaded user content

I have this idea for what I think should happen with my project, but I want to check in and see if this works on a theoretical level first. Basically I am working on a Django site that runs on Kubernetes, but am struggling a little with how I should set up my ReplicaSet/StatefulSet to manage uploaded content (images).
My biggest concern is trying to find out how to scale and maintain uploaded content. My first idea is that I need to have a single volume that has these files written to it, but can I have multiple pods write to the same volume that way while scaling?
From what I have gathered, it doesn't seem to work that way. It sounds more like each pod, or at least each node, would have its own volume. But would a request for an image reach the volume it is stored on? Or should I create a custom backend program to move things around so that it is served from an NGINX server like my other static content?
FYI - this is my first scalable project lol. But I am really just trying to find the best way to manage uploads... or a way in general. I would appreciate any explanations, thoughts, or fancy diagrams on how something like this might work!
Hello. I think you should forget Kubernetes for a bit and think about the architecture and capabilities of your Django application. I guess you have built a web app that offers some 'upload image' functionality, and then you have code that 'stores' this image somewhere. In the simplest scenario, where you run your app on your laptop, the web app is configured to save this content to a local folder. A more advanced example: you deploy your application to a VM or a cloud VM, e.g. an AWS EC2 instance, and your app saves the files to the local storage of that instance.
The question is: what happens if we have 2 instances of your web app deployed - can they be configured and run so that they 'share' the same folder to save the images? I guess this is what you want; otherwise your app would not scale horizontally - each user would have to hit one specific instance in order to upload or retrieve specific images. So, keeping in mind that this is a design decision of your application (which I am pretty sure you have already worked out), you need to think: how can I share a folder - a bucket - so that all instances of my web app can save files? If you spun up 3 different VMs on any cloud, you would have to use some kind of cloud storage so that all three instances point to the same physical storage location - an NFS drive, or a cloud storage service like S3.
Having all the above in mind, and clearly understanding that you need to decouple your application from the notion of local storage, especially if you want to make it as stateless as it gets (whatever that means to you): if your web app is packaged as a docker container and deployed in a kubernetes cluster as a pod, saving files to local storage is not going to get you far, since each pod, each docker container, will use the underlying kubernetes worker's (VM's) storage to save files, so another instance will be saving files on some other VM, etc.
Kubernetes provides this kind of abstraction for applications (pods) that want to 'share', within the kubernetes cluster, some local storage and of course persist it. Something I did not add above: if you save files on the kubernetes worker or in the pod, once that VM/instance is restarted you will lose your data. So you want something durable.
To cut a long story short,
1) You can deploy your application/pod along with a Persistent Volume Claim, assuming that your kubernetes cluster supports it. What happens is that you mount into your pod some kind of folder/storage which will be backed by whatever is available to your cluster - some kind of NFS store (see the manifest sketch below). https://kubernetes.io/docs/concepts/storage/persistent-volumes/
2) You can 'outsource' this need to share common storage to some external provider, e.g. a common case is to use an S3 bucket, and not tackle the problem in kubernetes - just keep and provision the app within kubernetes.
I hope I gave you some basic ideas.
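As a rough sketch of option 1 (claim name, size, image, and mount path are all placeholder assumptions; ReadWriteMany access generally requires an NFS-like backing store when several replicas mount the claim at once):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-uploads
spec:
  accessModes:
  - ReadWriteMany          # needed when several pods mount it simultaneously
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: django
spec:
  replicas: 3
  selector:
    matchLabels:
      app: django
  template:
    metadata:
      labels:
        app: django
    spec:
      volumes:
      - name: media
        persistentVolumeClaim:
          claimName: media-uploads
      containers:
      - name: django
        image: my-django-app:latest   # placeholder image
        volumeMounts:
        - name: media
          mountPath: /app/media       # point Django's MEDIA_ROOT here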
Note: Kubernetes 1.14 (March 2019) now comes with "Durable Local Storage Management is Now GA", which:
Makes locally attached (non-network attached) storage available as a persistent volume source.
Allows users to take advantage of the typically cheaper and improved performance of persistent local storage (kubernetes/kubernetes: #73525, #74391, #74769; kubernetes/enhancements: #121 (KEP))
That might help in securing truly persistent storage for your case.
As noted by x-yuri in the comments:
See more with "Kubernetes 1.14: Local Persistent Volumes GA", from Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber).
You could use IPFS: https://pypi.org/project/django-ipfs-storage/
Creating a container with this image https://hub.docker.com/r/ipfs/go-ipfs/ in the same pod, you can refer to it as 'localhost'.
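A minimal sketch of that same-pod arrangement (the Django image is a placeholder; the port is the go-ipfs HTTP API default):

apiVersion: v1
kind: Pod
metadata:
  name: django-with-ipfs
spec:
  containers:
  - name: django
    image: my-django-app:latest   # placeholder; django-ipfs-storage is
                                  # configured to talk to localhost
  - name: ipfs
    image: ipfs/go-ipfs
    ports:
    - containerPort: 5001         # default go-ipfs HTTP API port

Containers in a pod share a network namespace, which is why the Django container can reach the IPFS API at localhost:5001.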

Is there a way to have a shared (temp) folder between apps or multiple instances of apps on Bluemix?

I am running a Rails app on Bluemix and want to use carrierwave for file uploads. So far no problem, as I am using external storage to persist the files (FTP, S3, WebDAV, etc.). However, in order to keep performance up I need to enable caching with carrierwave_backgrounder - and here it starts to get tricky. The thing is that I need to specify a temp folder for backgrounding the upload process (the temp folder where the file remains before it is persisted to the actual storage) which is shared between all possible workers and app instances. How, if at all, can this be achieved?
Check out Object Storage - you can store files and then delete them when you no longer have a need for them. Redis is another option, as are any of the noSQL databases available on Bluemix.
Typically, in any cloud, you never store on the file system of your VM or PaaS environment - the reason being that when you scale out, you have multiple VMs, and a file written on one VM will not be available when hundreds of VMs come up. The recommended practice is to look for storage services that the cloud platform provides. In Bluemix you have storage options such as Cloud Object Storage, File Storage and Block Storage.
As suggested before, you can take a look at Cloud Object Storage and utilize the service. Here is the documentation for Cloud Object Storage: https://ibm-public-cos.github.io/crs-docs/. This contains a quick start guide, plus storing, retrieving and API usage. Hope this helps.

How to properly manage storage in Jelastic

Okay, another question.
In AWS I have EBS, which allows me to create volumes, define IOPS/size for them, mount them to desired EC2 machines, and take snapshots.
How can I achieve the same features in Jelastic? I have the option to create a "Storage Container", but it belongs to only one environment. How can I back up this volume?
Also, what's the best practice for managing storage devices for things like databases? Use a separate storage container?
I have the option to create a "Storage Container", but it belongs to only one environment.
Yes, the Storage Container belongs to one environment (either part of one of your other environments, or its own), but you can mount it in one or more other containers (i.e. inside containers of other environments).
You can basically consider a storage container to be similar to AWS EBS: it can be mounted anywhere you like (multiple times even) in containers within environments in the same region.
How can I back up this volume?
Check your hosting provider's backup policy. In our case we perform backups of all containers for our customers for free. Customers do not need to take additional backups themselves. No need for those extra costs and steps... It might be different at some other Jelastic providers so please check this with your chosen provider(s).
If you wish to make your own backups, you can define a script to do it and set it in cron for example. That script can transfer archives to S3 or anything you wish.
What's the best practice for managing storage devices for things like databases?
Just like with AWS, you may experience performance issues if you use remote storage for database access. Jelastic should generally give you lower latency than EBS, but even so I recommend keeping your database storage local (not via storage containers).
Unlike AWS EC2, you do not have the general risk of local storage disappearing (i.e. your Jelastic containers local storage is not ephemeral; you can safely write data there and expect it to be persistent).
If you need multiple database nodes, it is recommended to use database software level clustering features (master-master or master-slave replication for example) instead of sharing the filesystem.
Remember that any shared filesystem is a shared (single) point of failure. What you gain in application/software convenience you may also lose in reliability/high availability. It's often worth taking the extra steps in your application to handle this issue another way, or perhaps consider using lsyncd (there are Jelastic marketplace addons for this) to replicate parts of your filesystem instead of mounting a shared storage container.

How NuoDB manages storage size increases

Say my data store is going to increase in size; as the data grows, how would the storage manager handle it? Does the storage manager split the data across different domain machines (definitely that is not the case)?
How exactly would the process work? What is the recommendation in this area - a key-value store?
If you have a storage manager that is soon to run out of disk space, you can start up a new storage manager with a larger disk subsystem, or one that points to extensible cloud storage such as Amazon S3. Once the new storage manager is up to date, the old one can be taken offline. This entire operation can be done while the database is running. Generally, we also recommend that you always run with at least 2 storage managers for redundancy.
If you have more questions, feel free to direct them to the NuoDB forum:
http://www.nuodb.com/community
NuoDB supports multiple back-end storage devices, including the Hadoop Distributed File System (HDFS). If you start a broker configured for HDFS, you can use HDFS tools to expand distributed storage on-the-fly and there's no need for any NuoDB admin operations. As Duke described, you can transition from a file-based Storage Manager to an HDFS one without interrupting services.
NuoDB interfaces with the storage layer using filesystem semantics for discrete storage units called "atoms". These map easily into the HDFS directory structure, simplifying administration on that end.
