Job-based cloud processing solution - docker

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool, that keeps track of the compute servers, that receives my jobs and dispatches them to an available server. Advanced priorization, load management or something is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool to dynamically spin up these compute servers in in AWS, Azure or something. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you

I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline which can also deal with increasing complexity of your job. Jenkins has API, lots of plugins, it can be run as container for a spin up in a cloud environment.

Its possible you're looking for something like AWS Batch flows: https://aws.amazon.com/batch/ or google datalflow https://cloud.google.com/dataflow. Out of the box they do scaling, distribution monitoring etc.
But if you want to roll your own ....
Option A: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a Queue supports deliver once semantics. For example
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue based solution can use both with manual or atuomated load balancing as the more workers you spin up, the more instances of your workers you have consuming off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the resut channel to post progress reports back and then your main application would know the status of each worker. Alternatively they could drop status in database. It probably depends on your preference for collecting results and how large the result sets would be. If large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
You could use something quote simple to mange the workers - Jenkins was already suggested is in defintely a solution I have seen used for running multiple instances accross many servers as you just need to install the jenkins agent on each of the workers. This can work quote easily if you own or manage the physical servers its running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetties is probably overkill here, but certiabnly could be used to spin up N nodes and increase/decrease those number of workers. To auto scale you could publish out a single metric - the queue depth - and trigger an increase in the number of workers based on how deep the queue is and a metric you work out based on cost of spinning up new nodes vs. the rate at which they are processed.
You could also look at some of the lightweight managed container solutions like fly.io or Heroku which are both much easier to setup than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so you could set them up so that scaling is fully automated. You would hit the cloud function end point to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a json blob.
Your workload may be too large for these solutions, but if its actually fairly light weight quick it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue distribution workflow where the end point is a cloud function or some other supported worker, including cloud run which uses docker images. I've not actually used it myself but maybe it fits the bill.
https://cloud.google.com/tasks

When I look at a problem like this, I think through the entirity of the data paths. The map between source image and target image and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python, Pyspark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access s3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without 2 extra copy file steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
df = (spark.read
.format("binaryFile")
.option("pathGlobFilter", "*.png")
.load("/path/to/data")
.drop("content")
)
from typing import Iterator
def do_image_xform(path:str):
# Do image transformation, read from dbfs path, write to dbfs path
...
# return xform status
return "success"
#pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
for path in iterator:
yield do_image_xform(path)
df_status = df.withColumn('status',do_image_xform_udf(col(path)))
df_status.saveAsTable("status_table") # triggers execution, saves status.

Related

Should I use a Container/Service Fabric Guest Executable for a scheduled daily workload?

This is a more general question about which types of payloads to host in a Container. In our case we will use Service Fabric guest executables. For this post I will only use the word Container to refer to both. The reason I do this is they have similar properties and think more people may understand a container than a SF Guest Exe.
WebAPIs/Services that needs to scale are a good fit for containers, but this question is related to what we call a "Batch" job. This nomenclature comes out of the old .bat files, but in our case we are using a .NET Framework or Core .exe (console apps).
Currently Windows Task Scheduler kicks off the batch running under a service account on a VM. We want the processing to happen on a certain time of day or day of the week and not before or after. There is not any real scaling here. There is one instance which may or may not be multithreaded and on average they generally run between 2-15 minutes and then stop. Some run longer some run shorter. I understand there are limitations to this approach but this is the type of payload I'm discussing here.
As we modernize the Technology stack we are looking to use the Orchestrator as much as possible. As a technologist I've always tried to understand the different tools in our tool belts and not use a tool just because that's the one I used last, instead use the correct tool for the task.
We started out by not writing any more .net console apps. Instead we put the business logic of these "batches" into WebApi's. Then having the task scheduler call the API when it needed to perform its action. If I put this into Service Fabric and host it my concern is that the system resources are consumed for 23 hours and 45 minutes a day when they are not being used. That seems to be opposite of what you would expect when using a container.
Now if I could spin up a Service Fabric Guest Exe/Container on demand and then after it finishes destroy the instance of the app that could fit the need. Then I could have the benefits of the orchestrator without the determent of having it consume resources all the time. I would hope to retire the Batch Server (VM) as the hardware is usage is not optimized and instead add resources to the cluster.
UPDATE
Looking at Vaclav's Scalability Doco I think there might be a use case in here? https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-concepts-scalability He uses a "Workload Manager Service" combined with CreateServiceAsync, to spin up an instance of the service on demand. I guess I would deploy the app to the image store but not create an instance of the app until needed. Then I need to figure out how to end it, is it as simple as changing the infinite loop in Program.cs? The thing is it doesn't look like there is a Program.cs in a Guest Executable.
This looks like a way to run a package until completion, which was releases as part of 7.1. But how do we start a second execution of the service? I want to execute based on a request coming in.
https://learn.microsoft.com/en-us/azure/service-fabric/run-to-completion
Thoughts?

Trying to distribute data processing across a cluster and then aggregate it in master

Right now I have a Python Application which runs 50 threads to process data. It takes an xlsx file and will process a list of values, and will output a simple csv.
I said to myself, since this is a simple Python App with 50 threads, How can I create a cluster to distribute data-processing even more? FOR EXAMPLE: Have each Worker node process a subset given to it by the master. Well that sounds easy, just take the master app slice up the dataset generated and then push it to the workers with load balancing.
How do I get the results though? I would want to take all results (out.csv in this case) and return them to the master and merge them to create 1 master_out.csv
At first I was thinking a Docker swarm, but no one i know uses them, everything beyond a simple docker container is offloaded to K8.
Right now, i have a simple file structure:
app/
__init__.py (everything is in this file)
dataset.xlxs
out.csv
I was thinking to create a docker image so that way I could move this app into the image, update/upgrade, install python3 if it isnt already, and then just run this application.
I started getting deeper into processing, and realized that there is likely some built in ways to handle this. create a flask app to handle ingestion, and then a flask app on master to accept files at completion, etc.... But then master needs to know all the workers etc.
I was thinking to create a cluster.
Master node has access to a volume which contains the file i need to process.
Load balancing pushes parts of each file ( ROWS / NUM_WORKERS) to each node.
After WORKERS FINISH, Master Aggregates the resulting csv files to make a master file.
Master_OUT.csv will exist in the folder for consumption.
So the cluster would turn on and when ready will run everything, then tare down at the end. Since they want the cluster to likely be distributed, I am not sure how that would work though as processing has IP Address limitations. It seems like this will not work on a local cluster because to machines being used to reference will hit a cloudflare (or similar) wall after enough requests, so im trying to think of a UNIQUE IP Solution.
I have an idea for architecture, but im not sure if i should create a dockerfile for this, and then figure out the way kube can handle all of this for me. Though i think in the kube config files we can put remote aws instance login creds so it will spin up all the remote servers.
While I have been doing some stuff with Swarms, It seems that kube is where the real work is done, as swarms seem to be better suited for other things.
Im trying to think of how I would approach this from a kube (or swarm) perspective.
Given the information, this concept reminds me less of load balancing because of the data aggregation and more of like Kubeflow, where you create a CLOUD specifically for ML, but instead of ML it would be ANY distributed processing.
The interesting problems in this question have nothing to do with Docker; let's put that aside for now.
You expect you'll have a bunch of computers that are all processing a chunk of this big data set. You've already structured the problem so that you can do work on small pieces of the input and produce small pieces of the output. The main problems you need to design around are:
Where do you keep the input so that the tasks can read it, if they need to?
How do you pass on units of work to the workers? What happens if a worker fails?
How do you communicate the outputs? Where do you store them? Do they need to be in the same order as the input?
A useful tool here is a work queue; RabbitMQ is a popular open-source implementation. You'd run this as a separate server, and workers can connect to it and read and write messages from queues. So long as everyone can contact the RabbitMQ server, none of the individual workers or other processes in the system actually need to know about each other.
For some scales of problem, a straightforward approach is to say the original input and final output is single files on a single system. You break this up into pieces that are small enough that they can fit in a message payload, and the responses also fit in message payloads. Run one process to read the input and populate the work queues; run some number of workers, and run a process to read back the outputs.
Input handler +------+ --> worker --> +------+
dataset.xlsx ---> +------+ --> worker --> +------+ --> Output handler
+------+ --> worker --> +------+ out.csv
+ ... + ... + ... +
If you're using Python as an implementation language, also consider Celery as a framework to manage this.
To run this, you need to run three separate processes.
export RABBITMQ_HOST=localhost RABBITMQ_PORT=5672
./input_handler.py dataset.xlsx
./output_handler.py out.csv
./worker.py
You can run multiple workers; RabbitMQ will take care of ensuring that tasks get distributed across the workers, and that a task gets retried if a worker fails. There's no particular requirement that all of these run on the same host, so long as they can all reach the RabbitMQ broker.
If you can't keep the inputs or outputs in the message, you'll need some sort of shared storage that all of the nodes can reach. If you're in a cloud environment an object-store service like Amazon's S3 is a popular choice. In the input and output messages you would then put the path of the relevant file in S3 instead of the data.
How would Docker or Kubernetes fit into this picture? It's important to note that neither technology provides anything like a work queue, and shared filesystems can be spotty. Still, where I referred to the three different processes above, you could package those into three Docker images, and you could deploy those in Kubernetes. Where I said you don't have to run just one worker, a Kubernetes Deployment will let you run 5 or 10 or 50 identical copies of the worker, and RabbitMQ will take responsibility for making sure they all have work to do.

One microservice doing many transformations or many microservices doing one transformation?

I have a service that transforms XML documents.
It receives a messages from N queues and depending on from which queue it gets the message from it runs one of the transformations.
Now I'm refactoring it to be a microservice running in docker container.
I think I can do it in two ways but I don't know which of them would be a better practice when using containers.
I'm doing this in .NET Core 2.2. I don't know yet whether we will use docker-swarm or kubernetes on production.
I could just leave most of the code like it is now and run it as one container. AppSettings.config would contain settings for every queue-transformation pair.
The pros I see:
Less code changes needed and simpler docker-compose - one service instead of N. Only one config file.
The cons:
If there is a higher demand for one of transformations or if it has
higher priority I can't just scale it's container. New replicas
would be pulling messages from all queues instead of just one that I want to scale for.
I could refactor the code and build N different container images.
Each of them would listen to one queue and run one transformation. I can refactor it so that
the only difference in code is in Startup when registering services. Based on configuration I could
register one of IXmlTransformers.
Pros:
I can scale for one queue-transformation.
If there is a problem with one of the transformations and container restarts all other transformations will work normally.
I think it's cleaner - one image is doing one transformation - Single responsibility rule.
Cons:
In some environments there will be around 10 different transformations therefore
There will be a lot of configuration files - probably added to each running container using a volume?
Docker compose will be very long.
What would be a better way to do it? Or maybe there are other better options?
Edit:
I think I wasn't clear with my question so I try to explain it better.
In both options I would have one container image. The difference would be at runtime. Only configuration would be different.
Let's say I'm transforming XML to CSV or XML to XML using XSLT.
CSV format and XSLT are part of configuration. I can have many CSV formats (many configurations) and many different XSLTs (again many configurations)
So I have two ITransformer implementations (CSV, XSLT) and they read configuration to run their transformations.
And my question is whether it is better to have one running container instance for one config or put all those configs into one container which monitors N queues or monitors one queue and reads some kind of metadata to decide which transformation to run.
The cons: If there is a higher demand for one of transformations or if
it has higher priority I can't just scale it's container. New replicas
would be pulling messages from all queues instead of just one that I
want to scale for.
Why is it an issue that all queues are being pulled? Is it because you usually have a large backlog in the queues or are you afraid of providing overcapacity? When all queues are up to date with processing only the one queue that you wanted to scale should be picked up anyway.
In case you can't do that I would still go with generic service that can pull from all queues, but try to incorporate a weighting or other rule set based on queue length, timestamps of queue entries or whatever is appropriate in your case to regulate what queue is pulled with what priority. If the right result parameter of this logic is exposed it can also be used to auto-scale your cluster if are are too behind.
The alternative of manual scaling based on queues sounds like it could turn into an admin nightmare.
When you have the context of which job should run in the name of the queue or event, it shouldn't be a problem to use the same code in multiple containers also with the same config. I don't know your setup, but I would avoid binding one container to one queue. Containers shouldn't be deployed with state, when you can avoid it.
Depending on amount of files and money you have, you could also use an approach with amazon lambda and S3. No need for any message queueing or own infrastructure when you go with S3 and different bucket names, which are triggering a lambda function on upload. It's not a better approach it's just a different one and it hardly depends on your needs. It can be cheaper, or much more expensive than your approach, which also depends on your use case.

Continuous delivery of a docker container to Google Cloud

Goal:
I have an application based inside a docker container. I want to be able to use continuous integration for that with push to deploy using Bitbucket Pipelines to Google Cloud. I need access to an SQL database (MariaDB preferably), and some kind of caching system (be it memcache, redis, or something else).
Problem:
I'm entirely unsure of what services I need from said cloud provider to facilitate this as simply as possible, while still being cost effective. I looked into using Google's AppEngine, but I don't know if I'm doing something weird or odd, but for 1 vCPU w/ 1 GB of RAM and 10GB of storage, it was $55/month USD. Which is far more than I really want to pay. But I'm not even sure this is what I need for this. I don't need much power (these are very small apps, used by a very small amount of people). Also, again, not sure if I'm doing something wrong, but I was unable to find a caching solution with Google that wasn't insanely expensive (MemoryStore). Basically, I'm completely overwhelmed by the # of options, and am just looking for a cost effective / simple solution for continuous delivery of a docker application to Google Cloud
Using the tools discussed here , you can set up an end-to-end continuous delivery pipeline covering code, build, test, deploy, and monitor phases of software development across multi-cloud, hybrid, or on-premise environment. Likely this is a way to start.

Run a large amount of tasks on a cluster

I'm looking for a solution to running a large amount of tasks and monitoring their status on a cluster.
In detail: Each task consists of 3-4 processes which are docker contained (each process is a docker run command). All of the processes have to run on the same server.
The amount of tasks we're talking about is bursts of several hundreds of tasks at a time.
I've looked into several solutions all of them based on Mesos:
Chronos - Seems like it would falter under high load and in any case is more directed towards recurring (cron) jobs. While I need one-time (heavy) job.
Custom Mesos FW - Seems to low-level for my needs would require me to write scheduling and retrying mechanisms, I'd save this for last resort.
Aurora - This seems promising as each task is run on the same node and comprised of several processes. I am missing a couple of this here though: Aurora seems to not be able to run several tasks as a part of a single job. Since my tasks are all similar with different input I could use a single job with many (say 400) instances and the first process of each task (whose role is to download the input from S3) could download a different set based on the instance ID. Which brings me to another problem: I can't find a working example of using {{ mesos.instance }} in .aurora files can anyone give me an example?
Thanks for all the fish people
You could also have a look on Kubernetes (which also can be run as a framework in Mesos). Kubernetes has the concept of Pods which are basically a set of co-located containers. So in your case a pod would consist of your 3-4 processes/containers and then these pods can be scaled up/down.
Short comments regarding the other solutions you mentioned:
Chronos: Not really targeting your use case
Custom FW: Actually not so difficult, but good call to save this as last resort.
Aurora: Very powerful but also complex framework
Marathon (which you didn't mention): targeted for long running applications which can be easily scaled up and down.
In addition to the excellent other answer, you could check out Two Sigma's Cook which they have only recently open sourced but have been using in prod at scale for a while.

Resources