I have a Docker image that will execute different logic depending on the environment variables that it is run with. You can imagine that running it with VAR=A will produce slightly different logic compared to running it with VAR=B.
We have an application that is meant to allow users to kick off tasks defined within this Docker image. Depending on the user's attributes, different environment variables will need to be passed into the Docker container when it is run. The task should run each time an event is generated through user action, and then the container should shut down and be removed.
I'm trying to determine if GCP has any container services that best match what I'm looking for. My understanding of some of the services is:
Cloud Functions - can work well for consuming events and taking specific actions each time an event is triggered, but it is not suited for containerized workloads.
Cloud Run - a serverless way of deploying containers. As I understand it, a deployment on Cloud Run spins up a "service", and the environment variables must be passed in as part of the service definition. Because we may have a large number of combinations of environment variables (many of which may need to be running at once), it seems this would end up creating a large number of services, which feels potentially clunky. This approach seems better suited to deploying a single service with static environment variables that GCP auto-scales.
GKE - another container orchestration platform. This is what I'm considering at the moment. The idea is that we would define a single job definition that can vary according to environment variables that are passed into it. The problem is that these jobs would need to be kicked off dynamically via code. This is a fairly straightforward process with kubectl, but the Kubernetes REST API seems fairly underdeveloped (or at least not that well documented). And the lack of information online on how to start jobs on-demand through the Kubernetes API makes me question whether this is the best approach.
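For reference, what I have in mind looks roughly like this with the official Kubernetes Python client rather than raw REST calls (the image name, namespace, TTL, and variables below are placeholders, not a working setup):

from kubernetes import client, config

def launch_task(task_id, env_vars):
    # Build a one-off Job from the shared image, parameterized by env vars.
    config.load_kube_config()                        # or load_incluster_config() in-cluster
    container = client.V1Container(
        name="task",
        image="my-task-image:latest",                # placeholder image
        env=[client.V1EnvVar(name=k, value=v) for k, v in env_vars.items()],
    )
    spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
        ttl_seconds_after_finished=300,              # Kubernetes removes the Job afterwards
    )
    job = client.V1Job(metadata=client.V1ObjectMeta(name=f"task-{task_id}"), spec=spec)
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

launch_task("event-123", {"VAR": "A"})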
Are there any tools that I'm missing that would be useful in spinning up containers on-demand with dynamic sets of environment variables and removing them when done?
I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain number of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management, or anything like that is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure, or something similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a "simple" one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch (https://aws.amazon.com/batch/) or Google Dataflow (https://cloud.google.com/dataflow). Out of the box they handle scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option A: Queues
For job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example:
ActiveMQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more consumers you have pulling from the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, so your main application would know the status of each worker. Alternatively, workers could drop their status into a database. It probably depends on your preference for collecting results and how large the result sets are. If they are large enough, you might even just drop results into an S3 bucket or some kind of shared filesystem.
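As an illustration of the worker side, a minimal Python consumer using a NATS queue group might look like the following (the subject name, queue group name, and job handling are placeholders; the NodeJS client follows the same pattern):

import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")

    async def handle_job(msg):
        params = msg.data.decode()
        # ... run the actual job for these parameters ...
        await msg.respond(b"success")   # reply semantics carry the status back

    # All workers subscribe with the same queue group name, so each job
    # message is delivered to exactly one member of the group.
    await nc.subscribe("jobs", queue="workers", cb=handle_job)
    await asyncio.Event().wait()        # keep the worker alive

asyncio.run(main())

The submitting side can then use a request/reply call (e.g. nc.request("jobs", payload, timeout=...)) or simply publish and collect results from a separate channel or database.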
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase/decrease the number of workers. To auto-scale you could publish a single metric - the queue depth - and trigger an increase in the number of workers based on how deep the queue is, using a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
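A very rough sketch of that scaling loop, assuming the workers run as a Kubernetes Deployment and you have some way of reading the queue depth (the Deployment name, namespace, thresholds, and get_queue_depth below are all made up):

import time
from kubernetes import client, config

def get_queue_depth() -> int:
    # Placeholder: read the depth from your broker's admin/monitoring API.
    return 0

config.load_kube_config()
apps = client.AppsV1Api()

JOBS_PER_WORKER = 10                   # assumed cost/throughput trade-off

while True:
    depth = get_queue_depth()
    desired = max(1, min(20, depth // JOBS_PER_WORKER + 1))
    apps.patch_namespaced_deployment_scale(
        name="queue-workers",          # hypothetical Deployment of workers
        namespace="default",
        body={"spec": {"replicas": desired}},
    )
    time.sleep(60)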
You could also look at some of the lightweight managed container solutions like fly.io or Heroku, which are both much easier to set up than K8s and would let you scale up easily.
Option B: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so, you could set them up so that scaling is fully automated. You would hit the cloud function endpoint to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response as a JSON blob.
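For instance, with Google Cloud Functions and the Python Functions Framework, a job endpoint could be roughly this small (the actual work and the response shape are placeholders):

import functions_framework
from flask import jsonify

@functions_framework.http
def run_job(request):
    params = request.get_json(silent=True) or {}
    # ... do the (short-lived) unit of work here ...
    return jsonify({"status": "done", "input": params})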
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
Option C: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which runs Docker images. I've not actually used it myself, but maybe it fits the bill.
https://cloud.google.com/tasks
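I've not used it, so treat the following only as a sketch of the enqueue side with the google-cloud-tasks Python client (the project, location, queue name, worker URL, and payload are placeholders):

import json
from google.cloud import tasks_v2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "job-queue")   # placeholder names

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-worker.example.com/run",   # e.g. a Cloud Run service endpoint
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"job": "process-images"}).encode(),
    }
}
created = client.create_task(request={"parent": parent, "task": task})
print("enqueued:", created.name)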
When I look at a problem like this, I think through the entirety of the data paths: the mapping between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer Python and PySpark with pandas UDFs to perform the orchestration and image processing.
S3FS lets me access S3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without two extra file-copy steps.
PySpark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in a batch or an incremental/streaming configuration. This design optimizes for end-to-end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy Docker containers or use cluster libraries or notebook-scoped libraries.
The example below assumes each image is > 32 MB and processes it out of band. If the images are in the KB range, then dropping the content column is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf

# List the input files without reading their contents into the DataFrame.
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
      )

def do_image_xform(path: str) -> str:
    # Do the image transformation: read from the DBFS path, write to a DBFS path.
    ...
    # Return the transformation status.
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Each element of the iterator is a pandas Series of file paths.
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status
Where I work, we've been adding microservices for different purposes, and the local development environment is becoming difficult to set up. Services have too many environment variables to configure, and usually there's not enough memory available to run them all.
We plan to fix these issues. I understand it's mostly a matter of architecture and DevOps. One approach we've thought of is to create a proper service registry that allows easier setup and opens the door to, for example, having some services run locally and others in the cloud, all wired together with the service registry.
Another option could be to stub out some of the dependencies with something like https://wiremock.org/, but it seems too limited and difficult (?).
I wanted to ask, what other strategies are there to manage growing development environments?
I'm working on a scientific computing project. For this work, I need many Python modules as well as C++ packages. The C++ packages require specific versions of other software, so setting up the environment should be done carefully, and after the setup the dependencies should not be updated. So I thought it would be good to make a Docker container and work inside it, in order to make the work reproducible in the future. However, I don't understand why people on the internet recommend using different Docker containers for different processes. For me it seems more natural to set up the environment, which is a pain, and then use it for the entire project. Can you please explain what I have to be worried about in this case?
It's important that you differentiate between a Docker image and a Docker container.
People recommend using one process per container because this results in a more flexible, scalable environment: if you need to scale out your frontend web servers, or upgrade your database, you can do that without bringing down your entire stack. Running a single process per container also allows Docker to manage those processes in a sane fashion, e.g. by restarting things that have unexpectedly failed. If you're running multiple processes in a container, you end up having to hide this information from Docker by running some sort of process manager, which complicates your containers and can make it difficult to orchestrate a complex application.
On the other hand, it's quite common for someone to use a single image as the basis for a variety of containers all running different services. This is particularly true if you're building a project where a single source tree results in several commands; in that case, it makes sense to bundle it all into a single image and then choose which command to run when you start the container.
The only time this becomes a problem is when someone decides to do something like bundle, say, MySQL and Apache into a single image. That is a problem because there are already well-maintained official images for those projects, and by building your own you've taken on the burden of properly configuring those services and maintaining the image going forward.
To summarize:
One process/service per container tends to make your life easier
Bundling things together in a single image can be okay
I have a docker-compose setup consisting of four containers, each of which performs a single function:
An nginx proxy that forwards UI and API requests to the corresponding containers (a Node container and a Flask container).
There is also a separate container that executes long-running Python scripts and works independently of the other containers. I'd now like to create the ability to execute scripts in the "long running scripts" (LRS) container via the API.
What is the best way to do this?
I've seen a few other questions that are somewhat similar to this, but raise more questions than they answer. Amongst the suggestions I've seen are:
Pass docker.sock into the API container; from the API container, exec into LRS and execute the intended script
Doesn't this create serious security vulnerabilities?
Doesn't this require that docker be installed on the API container in order to exec, violating the separation of concerns principle of docker?
HTTP listener on the LRS container, listening for commands from the API in order to execute the script on LRS
Again, doesn't this violate separation of concerns, since I'll now essentially need a lightweight API in the LRS container to listen for actions from the principal API?
None of these solutions seems ideal. Am I missing something? How do I achieve the intended functionality?
Generally, the solution for running long-running scripts has been a pub/sub model. Your API would drop a message onto an execution message queue. The worker instance would subscribe to that queue and, when messages appear, execute your long-running script/query/etc. When the execution is complete, either a message goes back on a different queue, or the results are placed in a predetermined location (URL).
This has a couple of advantages:
The two solutions are effectively isolated from each other
You can scale out the LRS (worker) solution if you need more capacity by adding additional workers
If the LRS instance goes down, the API will not depend on it being up. Work will be queued for when an instance becomes available.
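As one hypothetical way to wire this into the existing compose setup, a Celery task queue backed by Redis could sit between the Flask API container and the LRS container (the Redis service name, script name, and paths below are assumptions, not part of your current stack):

# tasks.py -- importable by both the API container and the LRS worker container.
from celery import Celery

app = Celery("lrs", broker="redis://redis:6379/0", backend="redis://redis:6379/1")

@app.task
def run_long_script(script_name, params):
    # ... invoke the long-running script here ...
    return {"script": script_name, "status": "finished"}

The API container would call run_long_script.delay("optimize_images", {"folder": "/data/in"}) and later poll the returned AsyncResult for status, while the LRS container runs the worker process (celery -A tasks worker).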
I have a Docker container that reads a variable which I provide to it during execution. I then thought I would like to run many of those and pass a different value as the variable for each one. So I created a simple text file that contains all the values I want to pass in (about 20k different ones), and I am using gnu-parallel to spawn multiple Docker containers in parallel.
My question is how I could do something like that in a Kubernetes environment?
It sounds like what you want to do can be achieved using Kubernetes Jobs.
I would advise against using gnu-parallel on Kubernetes unless you can fit all the jobs on one node. If that is the case, I think it's OK; just set the CPU request in the job template.
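A rough sketch with the official Kubernetes Python client, creating one Job per value from the file (the image name, variable name, namespace, and CPU request are placeholders); with ~20k values you may prefer to batch several values per Job, or use an Indexed Job, rather than creating 20k separate Job objects:

from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

with open("values.txt") as f:                        # the file of ~20k values
    values = [line.strip() for line in f if line.strip()]

for i, value in enumerate(values):
    container = client.V1Container(
        name="worker",
        image="my-worker-image:latest",              # placeholder image
        env=[client.V1EnvVar(name="MY_VAR", value=value)],
        resources=client.V1ResourceRequirements(requests={"cpu": "500m"}),
    )
    spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=0,
        ttl_seconds_after_finished=600,              # clean up finished Jobs automatically
    )
    job = client.V1Job(metadata=client.V1ObjectMeta(name=f"worker-{i}"), spec=spec)
    batch.create_namespaced_job(namespace="default", body=job)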