Is there a tool that notifies immediately about new docker image versions

I have a private Docker registry that I use for my own images. I would like the containers that run these images (via docker-compose) to be updated immediately when I push a new version.
I know that there are Watchtower (https://containrrr.dev/watchtower/) and Diun (https://crazymax.dev/diun/), but these tools only poll at a defined interval (I'm using Watchtower now, but it is not as fast as I would like, even with a poll every minute).
I found that the Docker registry sends notifications when an image is updated (https://docs.docker.com/registry/notifications/) and was looking for a service that uses them. But I didn't find any tool, except for a Jenkins plugin (https://github.com/jenkinsci/dockerhub-notification-plugin). Am I looking in the wrong places, or is there just no tool that works with the notifications from the registry?

Since webhooks aren't part of OCI's distribution-spec, any solution that implements this will be registry specific. That means you won't be able to easily change registries in the future without breaking this implementation. Part of the value of registries and the standards they use is allowing portability, both of the images, and of tooling that works with those registries. What works on distribution/distribution specific APIs may not work on Nexus, Artifactory, or any of the cloud/SaaS solutions like Docker Hub, Quay, ECR, ACR, GCR, etc. (I'm leaving Harbor out of this list because they are based on distribution/distribution, so that's the one registry where it will probably work.)
If you want the solution to be portable, then it either needs to be designed into a higher level workflow (e.g. the same CI pipeline that does the push of the image triggers the deploy), or you are left with polling which a lot of GitOps solutions implement. Also realize that the S3 backend many registries use is eventually consistent, so with a distributed/HA registry implementation, trying to pull the image before the data store has had a chance to replicate may trigger race conditions.
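For the distribution/distribution case specifically, a receiver only has to pick the push events out of the notification body. The following is a minimal sketch, assuming the envelope layout described in the registry notifications documentation (an `events` list whose entries carry an `action` and a `target` with `repository` and, for tagged manifest pushes, `tag`). The function name `pushed_images` and the sample payload are hypothetical.

```python
import json

def pushed_images(envelope: dict) -> list[str]:
    """Return "repository:tag" for every tagged push event in an envelope."""
    images = []
    for event in envelope.get("events", []):
        target = event.get("target", {})
        # Layer (blob) pushes also produce events; they carry no tag, so skip them.
        if event.get("action") == "push" and target.get("tag"):
            images.append(f'{target["repository"]}:{target["tag"]}')
    return images

# Hypothetical sample body, trimmed to the fields used above.
sample = json.loads("""
{"events": [
  {"action": "push", "target": {"repository": "myteam/app", "tag": "1.4.2"}},
  {"action": "push", "target": {"repository": "myteam/app"}},
  {"action": "pull", "target": {"repository": "myteam/app", "tag": "1.4.2"}}
]}
""")
print(pushed_images(sample))
```

A small HTTP service wrapping this function could then run `docker compose pull` followed by `docker compose up -d` for the affected service. As noted above, this only works against registries that emit this particular notification format.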

I think you have to look at the problem from a different angle. If you shift your focus away from containers, you will notice that GitOps might be the perfect fit. You can achieve the same thing with a CI/CD pipeline that triggers a redeployment.
If you want to stick with containers only, I can recommend taking a look at Harbor, which can call a webhook after a push. See the docs (https://container-registry.com/docs/user-manual/projects/configuration/webhooks/).

Related

Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management, and the like are not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure, or something similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline, which can also deal with the increasing complexity of your jobs. Jenkins has an API and lots of plugins, and it can be run as a container so that it can be spun up in a cloud environment.
It's possible you're looking for something like AWS Batch (https://aws.amazon.com/batch/) or Google Dataflow (https://cloud.google.com/dataflow). Out of the box they do scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue delivers each message to only one consumer. For example:
Active MQ: https://activemq.apache.org/how-does-a-queue-compare-to-a-topic
NATS: https://docs.nats.io/using-nats/developer/receiving/queues
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more consumers you have pulling jobs off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, and then your main application would know the status of each worker. Alternatively, the workers could write their status to a database. It probably depends on your preference for collecting results and how large the result sets would be. If they are large enough, you might even just drop the results in an S3 bucket or some kind of filesystem.
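The competing-consumers pattern described above can be sketched with the Python standard library alone. This is only an in-process illustration: `queue.Queue` and threads stand in for a real broker such as ActiveMQ or NATS and for separate worker machines, and the job names and the `run_workers` helper are hypothetical.

```python
import queue
import threading

def run_workers(jobs, num_workers=3):
    """Dispatch each job to exactly one of several competing workers."""
    job_queue = queue.Queue()
    results = queue.Queue()  # stands in for the reply/result channel

    def worker(name):
        while True:
            job = job_queue.get()
            if job is None:  # sentinel: no more work for this worker
                break
            # A real worker would start the job's Docker container here.
            results.put((name, job, "done"))

    threads = [threading.Thread(target=worker, args=(f"worker-{i}",))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return [results.get() for _ in range(results.qsize())]

statuses = run_workers(["resize-001", "resize-002", "train-007"])
print(sorted(job for _, job, _ in statuses))
```

Which worker handles which job varies between runs, but every job is handled exactly once, and adding workers requires no change to the producer.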
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but it certainly could be used to spin up N nodes and increase or decrease the number of workers. To auto-scale, you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes vs. the rate at which jobs are processed.
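One way to turn queue depth into a worker count is sketched below. It is a rough illustration, not a tuned policy: the function `desired_workers`, its parameters, and the numbers are all assumptions, with `target_drain_min` standing in for the cost trade-off mentioned above (a longer target means fewer, cheaper scale-ups).

```python
import math

def desired_workers(queue_depth, jobs_per_worker_per_min, target_drain_min,
                    min_workers=1, max_workers=5):
    """Workers needed to drain the current backlog within the target time."""
    capacity_per_worker = jobs_per_worker_per_min * target_drain_min
    needed = math.ceil(queue_depth / capacity_per_worker)
    # Clamp to the size of the cluster you are willing to pay for.
    return max(min_workers, min(max_workers, needed))

# 40 queued jobs, each worker finishes 2 jobs/min, backlog should clear in 10 min.
print(desired_workers(queue_depth=40, jobs_per_worker_per_min=2, target_drain_min=10))
```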
You could also look at some of the lightweight managed container solutions like fly.io or Heroku which are both much easier to setup than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so you could set them up so that scaling is fully automated. You would hit the cloud function end point to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a json blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
https://workers.cloudflare.com/
https://cloud.google.com/functions
https://aws.amazon.com/lambda/
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially GCP has a queue distribution workflow where the end point is a cloud function or some other supported worker, including cloud run which uses docker images. I've not actually used it myself but maybe it fits the bill.
https://cloud.google.com/tasks
When I look at a problem like this, I think through the entirety of the data paths: the map between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python, Pyspark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access s3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without 2 extra copy file steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
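import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf

# `spark` is the SparkSession that Databricks provides in a notebook.
# List the input files; drop the binary content so only metadata (incl. path) is kept.
df = (spark.read
      .format("binaryFile")
      .option("pathGlobFilter", "*.png")
      .load("/path/to/data")
      .drop("content")
)

def do_image_xform(path: str) -> str:
    # Do image transformation, read from dbfs path, write to dbfs path
    ...
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # Each element of the iterator is a batch (pd.Series) of file paths.
    for paths in iterator:
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status.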

Are there problems with using a big Docker container for multiple tasks?

I'm working on a scientific computing project. For this work, I need many Python modules as well as C++ packages. The C++ packages require specific versions of other software, so the environment has to be set up carefully, and after the setup the dependencies should not be updated. So I thought it would be good to make a Docker container and work inside it, in order to make the work reproducible in the future. However, I don't understand why people on the internet recommend using different Docker containers for different processes. To me it seems more natural to set up the environment once, which is a pain, and then use it for the entire project. Can you please explain what I have to be worried about in this case?
It's important that you differentiate between a Docker image and a Docker container.
People recommend using one process per container because this results in a more flexible, scalable environment: if you need to scale out your frontend web servers, or upgrade your database, you can do that without bringing down your entire stack. Running a single process per container also allows Docker to manage those processes in a sane fashion, e.g. by restarting things that have unexpectedly failed. If you're running multiple processes in a container, you end up having to hide this information from Docker by running some sort of process manager, which complicates your containers and can make it difficult to orchestrate a complex application.
On the other hand, it's quite common for someone to use a single image as the basis for a variety of containers all running different services. This is particularly true if you're building a project where a single source tree results in several commands; in that case, it makes sense to bundle them all into a single image, and then choose which command to run when you start the container.
The only time this becomes a problem is when someone decides to do something like bundle, say, MySQL and Apache into a single image. That is a problem because there are already well-maintained official images for those projects, and by building your own you're taking on the burden of properly configuring those services and maintaining the images going forward.
To summarize:
One process/service per container tends to make your life easier
Bundling things together in a single image can be okay
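For a Python-heavy project, "one image, several commands" can be as simple as a small dispatcher used as the image's entrypoint. The sketch below is only an illustration: the subcommands `preprocess` and `simulate`, their arguments, and the program name `project` are hypothetical.

```python
import argparse

def preprocess(args):
    return f"preprocessing {args.input}"

def simulate(args):
    return f"simulating {args.steps} steps"

def main(argv=None):
    """Pick one task per container run; every task shares the same environment."""
    parser = argparse.ArgumentParser(prog="project")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("preprocess")
    p.add_argument("input")
    p.set_defaults(func=preprocess)

    s = sub.add_parser("simulate")
    s.add_argument("--steps", type=int, default=100)
    s.set_defaults(func=simulate)

    args = parser.parse_args(argv)
    return args.func(args)

# Inside the image, call main() with no arguments so it reads the
# command passed to the container; an explicit list is used here as a demo.
print(main(["simulate", "--steps", "10"]))
```

With a script like this as the entrypoint, each task runs in its own container started from the same carefully built image, so the pinned dependencies are set up once and each process is still managed separately by Docker.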

Github actions docker caching

I think this will be useful for many others.
I am using https://github.com/phips28/gh-action-bump-version to automatically bump NPM versions in Github Actions.
Is there a way to cache the docker image of this action so it doesn't have to build each time? It takes a long time to run and it runs upfront before the rest of the steps. I am sure this is common for similar types of Github Actions that pull docker images.
The docker image looks pretty slim so I am not sure there will be any benefit of trying to optimise the image itself. More to do with how to configure Github Actions.
Any suggestions?
TLDR
Somewhat! You can change the GitHub workflow file to pull an image from a repository instead of building each run. While this doesn't cache the image, it is significantly faster. This can be achieved by editing your flow to look like the following:
- name: 'Automated Version Bump'
  id: version-bump
  uses: 'docker://phips28/gh-action-bump-version:master'
  with:
    tag-prefix: 'v'
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Please note the docker:// prefix in the uses: statement, coupled with the change from @master to :master in order to convert the name to a valid image name.
I have opened up a PR on that repository with a suggested fix :^)
Original Response
A very good question, and one that little information can be found on in the official documents (Though GitHub acknowledge the delay in their docs).
GitHub Staff response
I had a search for you and managed to find an article from September 2019 on the GitHub community forum about this exact topic. It inevitably linked to this article from July 2019.
There is a wonderful explanation about how building each time will still utilise the docker build cache, reducing times, yet allow for flexibility in terms of using the latest version of the base image etc, etc.
There is a proposed solution if you aren't bothered about the flexibility of updates, and just want the shortest build time possible, though I am not certain if that syntax is still valid currently:
But let’s say that I don’t want my Action to even evaluate the Dockerfile every time the Action runs, because I want the absolute fastest runtime possible. I want a predefined Docker container to be spun up and get right to work. You can have that too! If you create a Docker image and upload it to Docker Hub or another public registry, you can instruct your Action to use that Docker image specifically using the docker:// form in the uses key. See the GitHub Actions documentation for specifics. This requires a bit more up-front work and maintenance, but if you don’t need the flexibility that the above Dockerfile evaluation system provides, then it may be worth the tradeoff.
Unfortunately the link to the Github actions documentation is broken, but this does suggest that the author of the action could allow this behaviour if they modified their action
Alternate Ideas
If you require the ability to control the cache on the executor host (to truly cache an image), then you may want to consider hosting your own GitHub runner, as you would have total control over the images there. Though I imagine this is potentially a deterrent, given that GitHub Actions is largely a free service (with limits, and this is perhaps one of them!)
You might want to consider adding a task that utilises the file cache action, and attempting to export the gh-action-bump-version image to a file through docker commit or docker save, and then reinflate it on your next run. However, this introduces complexity and may not save you time in the long run. Edit: This is a horrible idea now that we know actions can support pulling images from registries.
I hope this helps you, and anyone else searching for more information 👍
There is an excellent article on the docker blog explaining how to cache docker images with actions/cache and buildx (which allows you to specify a custom cache path).
It can be found here: https://www.docker.com/blog/docker-github-actions/.

What services should I use for autobuild of computationally intensive dockers?

I have a repo with a piece of software, and a Docker image for users who have installation problems. I need to rebuild the image every time I publish a new version, and I also want to run automated tests after it. DockerHub has such functionality, but the builds take too long and are killed by a timeout. Also, I can't run the tests there, as some tests use ~8 GB of RAM.
Are there any other services for these tasks? I'm fine with paying for it, but I don't want to spend time on lengthy configuration and maintenance (e.g. for having my own build server).
TravisCI.
It's a hosted CI service that is fairly easy to get started with and is free as long as you keep the repo public.
It's well known and common, and you will find thousands of helpful questions and answers under the [travisci] tag.
I'm adding a link to their documentation with an example of how to build an image from a Dockerfile.
https://docs.travis-ci.com/user/docker/#building-a-docker-image-from-a-dockerfile
Also, I tried to find memory and time limits but couldn't find any in a quick search.

What to use to orchestrate a few long running web services on few machines?

Having investigated the possibilities, I am quite confused about which tool is best for us.
We want to deploy a few web services, to start with a GitLab and a wiki.
The plan is to use Docker images for these services and to store the data externally.
These services need to be accessible from outside.
I looked into Marathon and kubernetes and both seemed like overkill.
A problem we face as academics is that most people only stay for about three years and it's not our main job to administrate stuff. So we would like an easy to use, easy to maintain solution.
We have 3-4 nodes we want to use for this, we'd like it to be fault tolerant (restarting the service on another node if one dies for example).
So to sum up:
3-4 nodes
gitlab with CI and runners
a wiki
possibly one or two services more
auto deployment, load balancing
as failsafe as possible
What would you recommend?
I would recommend a managed container service like https://aws.amazon.com/ecs/
Running your own container manager swarm/kubernetes comes with a whole host of issues that it sounds like you should avoid.
