I am trying to build an application which in essence consists of two parts.
Django-based API
SQLite database.
The API interacts with the SQLite database, but has read-only access. However, the SQLite database needs to be updated every N minutes. So my plan was to make two Docker containers: the first one for the API, and the second one for the script that is executed every N minutes using cron (Ubuntu) to update the database.
I am using Kubernetes to serve my applications. So I was wondering if there is a way to achieve what I want here?
I've researched about Persistent Volumes in Kubernetes, but still do not see how I can make it work.
EDIT:
So I have figured out that I can use one pod to share two containers on Kubernetes and this way make use of emptyDir. My question is then, how do I define the path to this directory in my Python files?
Thanks,
Lukas
Take into account that emptyDir is erased every time the pod is stopped or killed (you do a deployment, a node crashes, etc.). See the doc for more details: https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
With that in mind, if it solves your problem, then you just need to set the mountPath to the directory you want, as the example in the link above shows. Your Python code then uses that same path.
Also note that the whole directory will be empty, so if you have other things there they won't be visible once you set up an emptyDir on that path (just typical Unix mount semantics, nothing k8s-specific here).
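As a minimal sketch of how the Python side could locate the shared directory: the path in the code is simply whatever mountPath is set to in the pod spec. The example below assumes the path is passed in through an environment variable (`DB_DIR` is a hypothetical name, as is the file name `data.sqlite3`), so both containers agree on it without hard-coding. The API opens the file read-only; the updater builds a new file next to it and swaps it in.

```python
import os
import sqlite3
import tempfile

# Hypothetical names: DB_DIR should match the mountPath of the emptyDir volume.
DB_FILENAME = "data.sqlite3"


def db_path() -> str:
    """Return the database path inside the shared volume."""
    return os.path.join(os.environ.get("DB_DIR", "/data"), DB_FILENAME)


def open_readonly() -> sqlite3.Connection:
    """API container: open the database in read-only mode via a SQLite URI."""
    return sqlite3.connect(f"file:{db_path()}?mode=ro", uri=True)


def publish(rows) -> None:
    """Updater container: build a fresh file, then replace the old one."""
    target = db_path()
    tmp = target + ".tmp"
    if os.path.exists(tmp):
        os.remove(tmp)
    conn = sqlite3.connect(tmp)
    with conn:  # commits on success
        conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany("INSERT INTO items (name) VALUES (?)", [(r,) for r in rows])
    conn.close()
    os.replace(tmp, target)  # atomic when both paths are on the same filesystem


# Local demonstration: a temporary directory stands in for the mounted volume.
os.environ["DB_DIR"] = tempfile.mkdtemp()
publish(["a", "b"])
reader = open_readonly()
names = [row[0] for row in reader.execute("SELECT name FROM items ORDER BY id")]
reader.close()
```

In a Django project the same path would typically go into `DATABASES["default"]["NAME"]`. A reader that keeps a connection open across a swap keeps seeing the old file, so the API would need to reopen its connection to pick up new data.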
Related
I have a cronjob which is inside my golang application code.
Now, this code is inside a container which is inside a pod.
What happens:
Suppose I have a cronjob to send emails every Sunday.
The application starts to run and the cronjobs are created as soon as the application starts.
Now, if I have 3 such pods, the application starts three times, once in each pod, and each instance has its own cronjob, so the emails will be sent three times.
What I want:
The email should be sent only once i.e. all cronjobs should run only once independent of how many replicas I create
How can I achieve this?
Preferably: I would like to have the jobs inside the application because if I separate them out, I will have to call the API endpoint instead of the service directly.
TL;DR: Perhaps you need to rethink the value of co-locating the cronjobs with the function exposed via the API.
i.e. put the cronjobs in a separate pod with no replicas.
From the information available, that would seem to solve your problem most easily.
The question then arises, what value was gained or problem solved (other than convenience) by co-locating the cronjobs in the first place?
If there was no other problem, or that problem is more easily solved in other ways than the additional complexity involved in solving the problem that the co-location solution has created, then you have your answer.
Another test to apply would be which solution architecture would be easier for someone to understand (and in the future extend, modify or maintain):
separate cronjobs, running only once, and doing their work via an API
multiple cronjobs seemingly intentionally placed in a replicaset but with some complex coordination mechanism that contrives to ensure that of these multiple jobs only one instance is actually effective and the others rendered essentially inoperative
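A rough sketch of the first option, for illustration only: a small standalone script, run by a single scheduler (one cron entry, or one non-replicated pod), that triggers the work through the API. The base URL and the `/internal/send-weekly-emails` endpoint are hypothetical; the question does not describe the actual API.

```python
import urllib.request

# Hypothetical values; the real service name and endpoint depend on your API.
API_BASE_URL = "http://my-app:8080"
EMAIL_JOB_PATH = "/internal/send-weekly-emails"


def build_job_request(base_url: str = API_BASE_URL) -> urllib.request.Request:
    """Build the POST request that asks the API to run the weekly email job."""
    url = base_url.rstrip("/") + EMAIL_JOB_PATH
    return urllib.request.Request(url, data=b"", method="POST")


def run_job() -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_job_request(), timeout=30) as response:
        return response.status
```

Because only one scheduler calls `run_job()`, the request reaches a single replica behind the service and the job runs once, however many replicas of the application exist.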
I'm working on a scientific computing project. For this work, I need many Python modules as well as C++ packages. The C++ packages require specific versions of other software, so the environment has to be set up carefully, and after the setup the dependencies should not be updated. So I thought it would be good to make a Docker container and work inside it, in order to make the work reproducible in the future. However, I didn't understand why people on the internet recommend using different Docker containers for different processes. To me it seems more natural to set up the environment once, which is a pain, and then use it for the entire project. Can you please explain what I should be worried about in this case?
It's important that you differentiate between a Docker image and a Docker container.
People recommend using one process per container because this results in a more flexible, scalable environment: if you need to scale out your frontend web servers, or upgrade your database, you can do that without bringing down your entire stack. Running a single process per container also allows Docker to manage those processes in a sane fashion, e.g. by restarting things that have unexpectedly failed. If you're running multiple processes in a container, you end up having to hide this information from Docker by running some sort of process manager, which complicates your containers and can make it difficult to orchestrate a complex application.
On the other hand, it's quite common for someone to use a single image as the basis for a variety of containers all running different services. This is particularly true if you're building a project where a single source tree results in several commands; in that case, it makes sense to bundle all of that into a single image, and then choose which command to run when you start the container.
The only time this becomes a problem is when someone decides to do something like bundle, say, MySQL and Apache into a single image. That is a problem because there are already well-maintained official images for those projects, and by building your own you're taking on the burden of properly configuring those services and maintaining the images going forward.
To summarize:
One process/service per container tends to make your life easier
Bundling things together in a single image can be okay
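To illustrate the "one image, several commands" point, here is a minimal sketch of a single entry point that dispatches to different commands. The command names `preprocess` and `simulate` and their arguments are made up for the example; each container started from the image would be given one of them.

```python
import argparse


def preprocess(args) -> str:
    """Placeholder for a data-preparation step."""
    return f"preprocessing {args.input}"


def simulate(args) -> str:
    """Placeholder for a long-running computation."""
    return f"simulating {args.steps} steps"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="project")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("preprocess")
    p.add_argument("input")
    p.set_defaults(func=preprocess)

    s = sub.add_parser("simulate")
    s.add_argument("--steps", type=int, default=10)
    s.set_defaults(func=simulate)
    return parser


def main(argv) -> str:
    """Parse the command line and run the selected command."""
    args = build_parser().parse_args(argv)
    return args.func(args)
```

If the image's ENTRYPOINT were a script calling `main(sys.argv[1:])`, the arguments given after the image name in `docker run` would select the command, so one carefully built environment can serve several one-process containers.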
Right now I have a Python Application which runs 50 threads to process data. It takes an xlsx file and will process a list of values, and will output a simple csv.
I said to myself, since this is a simple Python App with 50 threads, How can I create a cluster to distribute data-processing even more? FOR EXAMPLE: Have each Worker node process a subset given to it by the master. Well that sounds easy, just take the master app slice up the dataset generated and then push it to the workers with load balancing.
How do I get the results though? I would want to take all results (out.csv in this case) and return them to the master and merge them to create 1 master_out.csv
At first I was thinking of a Docker swarm, but no one I know uses them; everything beyond a simple Docker container is offloaded to K8s.
Right now, I have a simple file structure:
app/
    __init__.py (everything is in this file)
    dataset.xlsx
    out.csv
I was thinking of creating a Docker image so that I could move this app into the image, update/upgrade, install python3 if it isn't already there, and then just run this application.
I started getting deeper into processing, and realized that there are likely some built-in ways to handle this: create a Flask app to handle ingestion, and then a Flask app on the master to accept files at completion, etc. But then the master needs to know all the workers, etc.
I was thinking to create a cluster.
Master node has access to a volume which contains the file i need to process.
Load balancing pushes parts of each file ( ROWS / NUM_WORKERS) to each node.
After WORKERS FINISH, Master Aggregates the resulting csv files to make a master file.
Master_OUT.csv will exist in the folder for consumption.
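The split and merge steps in this plan can be sketched as follows. This is only an illustration: `split_rows` and `merge_csv` are hypothetical helpers, rows are plain list items, and the worker outputs are assumed to be CSV texts that share a header.

```python
import csv
import io


def split_rows(rows, num_workers):
    """Split rows into num_workers contiguous chunks of near-equal size."""
    size, extra = divmod(len(rows), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks


def merge_csv(parts):
    """Concatenate CSV texts that share a header, keeping the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header_written = False
    for part in parts:
        reader = csv.reader(io.StringIO(part))
        header = next(reader, None)
        if header is None:
            continue  # empty worker output
        if not header_written:
            writer.writerow(header)
            header_written = True
        writer.writerows(reader)
    return out.getvalue()
```

Merging the parts in the same order as the chunks were handed out keeps the rows of the combined file in input order.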
So the cluster would turn on and, when ready, run everything, then tear down at the end. Since they want the cluster to likely be distributed, I am not sure how that would work, though, as the processing has IP address limitations. It seems like this will not work on a local cluster because the machines being used will hit a Cloudflare (or similar) wall after enough requests, so I'm trying to think of a unique-IP solution.
I have an idea for the architecture, but I'm not sure if I should create a Dockerfile for this and then figure out how kube can handle all of this for me. Though I think in the kube config files we can put remote AWS instance login creds so it will spin up all the remote servers.
While I have been doing some stuff with swarms, it seems that kube is where the real work is done, as swarms seem to be better suited for other things.
I'm trying to think of how I would approach this from a kube (or swarm) perspective.
Given the information, this concept reminds me less of load balancing, because of the data aggregation, and more of something like Kubeflow, where you create a cloud specifically for ML, but instead of ML it would be any distributed processing.
The interesting problems in this question have nothing to do with Docker; let's put that aside for now.
You expect you'll have a bunch of computers that are all processing a chunk of this big data set. You've already structured the problem so that you can do work on small pieces of the input and produce small pieces of the output. The main problems you need to design around are:
Where do you keep the input so that the tasks can read it, if they need to?
How do you pass on units of work to the workers? What happens if a worker fails?
How do you communicate the outputs? Where do you store them? Do they need to be in the same order as the input?
A useful tool here is a work queue; RabbitMQ is a popular open-source implementation. You'd run this as a separate server, and workers can connect to it and read and write messages from queues. So long as everyone can contact the RabbitMQ server, none of the individual workers or other processes in the system actually need to know about each other.
For some scales of problem, a straightforward approach is to say the original input and final output are single files on a single system. You break the input up into pieces that are small enough to fit in a message payload, and the responses also fit in message payloads. Run one process to read the input and populate the work queues; run some number of workers; and run a process to read back the outputs.
Input handler        +------+ --> worker --> +------+
dataset.xlsx    ---> +------+ --> worker --> +------+ --> Output handler
                     +------+ --> worker --> +------+     out.csv
                     + ...  +      ...       + ...  +
If you're using Python as an implementation language, also consider Celery as a framework to manage this.
To run this, you need to run three separate processes.
export RABBITMQ_HOST=localhost RABBITMQ_PORT=5672
./input_handler.py dataset.xlsx
./output_handler.py out.csv
./worker.py
You can run multiple workers; RabbitMQ will take care of ensuring that tasks get distributed across the workers, and that a task gets retried if a worker fails. There's no particular requirement that all of these run on the same host, so long as they can all reach the RabbitMQ broker.
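The shape of this pipeline can be tried out locally before involving a broker. The sketch below is not RabbitMQ: it uses Python's standard `queue.Queue` and threads as a stand-in, and `process` is a placeholder for the real per-chunk work. Each task carries an index so the output handler can restore the input order.

```python
import queue
import threading

NUM_WORKERS = 3
STOP = None  # sentinel telling a worker to exit


def process(chunk):
    """Placeholder for the real per-chunk work."""
    return [value * 2 for value in chunk]


def worker(tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull tasks until the sentinel arrives, pushing (index, output) pairs."""
    while True:
        task = tasks.get()
        if task is STOP:
            break
        index, chunk = task
        results.put((index, process(chunk)))


def run_pipeline(chunks):
    tasks, results = queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(NUM_WORKERS)
    ]
    for t in threads:
        t.start()
    # Input handler: enqueue every chunk together with its position.
    for index, chunk in enumerate(chunks):
        tasks.put((index, chunk))
    for _ in threads:
        tasks.put(STOP)  # one sentinel per worker
    for t in threads:
        t.join()
    # Output handler: collect the results and restore the input order.
    collected = [results.get() for _ in range(len(chunks))]
    return [output for _, output in sorted(collected)]
```

With a real broker the two queues would live in RabbitMQ and a client library (or Celery) would replace `queue.Queue`. Unlike this in-process stand-in, the broker can also redeliver a task when a worker dies before acknowledging it.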
If you can't keep the inputs or outputs in the message, you'll need some sort of shared storage that all of the nodes can reach. If you're in a cloud environment an object-store service like Amazon's S3 is a popular choice. In the input and output messages you would then put the path of the relevant file in S3 instead of the data.
How would Docker or Kubernetes fit into this picture? It's important to note that neither technology provides anything like a work queue, and shared filesystems can be spotty. Still, where I referred to the three different processes above, you could package those into three Docker images, and you could deploy those in Kubernetes. Where I said you don't have to run just one worker, a Kubernetes Deployment will let you run 5 or 10 or 50 identical copies of the worker, and RabbitMQ will take responsibility for making sure they all have work to do.
I have an API server which writes some data to the DB and should eventually generate other containers - according to the different parameters it gets.
How should I do that? both in development and in production.
You need to work on a Dockerfile generator. Like here.
That's a lot of work for your case, but worth doing. Friendly advice: keep control of the number of containers you create by reusing them for similar functions.
I've spent a fair amount of time researching and I've not found a solution to my problem that I'm comfortable with. My app is working in a dockerized environment:
one container for the database;
one or more containers for the APP itself. Each container holds a specific version of the APP.
It's a multi-tenant application, so each client (or tenant) may be related to only one version at a time (migration should be handle per client, but that's not relevant).
The problem is I would like to have another container to handle scheduling jobs, like sending e-mails, processing some data, etc. The scheduler would then execute commands in app's containers. Projects like Ofelia offer a great promise but I would have to know the container to execute the command ahead of time. That's not possible because I need to go to the database container to discover which version the client is in, to figure it out what container the command should be executed in.
Is there a tool to help me here? Should I change the structure somehow? Any tips would be welcome.
Thanks.
So your question is that you want to get the app's version info from the database container before scheduling jobs, right?
I think this is related to the business logic, not the dockerized environment. There are a few ways you could solve the problem:
Check the network and make sure the containers can connect to each other.
I think the database should support an RPC function; you can use it to get the version data.
You can use tools that support RPC, like SSH.
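One way to picture the lookup step described in the question: the scheduler first asks the database which version a tenant is on, then derives the target container from that. Everything here is assumed for illustration: the `tenants` table, the `app-v<version>` container naming scheme, and the `./run-job send-emails` command. An in-memory SQLite database stands in for the real one.

```python
import sqlite3


def container_for(conn: sqlite3.Connection, tenant: str) -> str:
    """Look up the tenant's version and map it to a container name."""
    row = conn.execute(
        "SELECT version FROM tenants WHERE name = ?", (tenant,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown tenant: {tenant}")
    return f"app-v{row[0]}"  # hypothetical naming scheme


def exec_command(conn: sqlite3.Connection, tenant: str, job: list) -> list:
    """Build (but do not run) the docker exec invocation for this tenant's job."""
    return ["docker", "exec", container_for(conn, tenant), *job]


# Stand-in data; the real scheduler would connect to the database container.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tenants (name TEXT PRIMARY KEY, version TEXT)")
conn.executemany(
    "INSERT INTO tenants VALUES (?, ?)", [("acme", "1.4"), ("globex", "2.0")]
)
command = exec_command(conn, "acme", ["./run-job", "send-emails"])
```

The resulting list could be passed to `subprocess.run` by a scheduler that has access to the Docker daemon, which removes the need to know the target container ahead of time.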