Imagine we have FROM python:3.6.4 in our Dockerfile. It appears to be quite specific, so we may expect that every time Docker downloads this image as a part of docker build in a fresh environment, we'll get the same base image.
But that's actually not the case. At the time of writing, the Dockerfile behind this image had been generated two days earlier. The build itself was presumably from around that time too (so the apt-get'd packages in the image date from then), although neither https://hub.docker.com/_/python/ nor https://store.docker.com/images/python shows any build details. Other repositories do, though: https://hub.docker.com/r/aslushnikov/latex-online/builds/, for example, lists its builds.
So two images built from the same Dockerfile may be different. A minor example of why this matters: an image built more than two days ago may generate a warning during pip install (because it has pip 9.0.2 but 9.0.3 is available), while an image built today may not (because it already has 9.0.3). Of course, this concrete issue (which is the discrepancy, not the warning itself) can be fixed using pip install --disable-pip-version-check, but more issues are possible.
As far as I understand, almost the whole point of Docker is repeatability, so it's a bit strange to see non-determinism leak in at something as basic as specifying the base image. Sometimes this may be what we want (when we want the latest fixes), but sometimes not (when we want repeatability).
Theoretically, every image could be tracked in git and so on, but that's a last resort. An ID in a Dockerfile, in docker-compose.yml, or as an argument to docker would obviously be better. The question is: where do we get this ID from, and where do we put it?
Docker has two mechanisms of identifying images.
The first one is the well-known tagging mechanism, which gives no guarantee of deterministic deployments because a tag can be reused to point to other images.
The second one is the much less familiar immutable identifier mechanism. The immutable identifier is essentially a SHA256 hash digest. Tags can be reused, but an immutable identifier never is.
See how to pull an image by its immutable identifier: https://docs.docker.com/engine/reference/commandline/pull/#pull-an-image-by-digest-immutable-identifier
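For example, you can look up the digest of an image you have pulled and then pin it, both on the command line and in a Dockerfile (the digest below is a placeholder; use the value reported for your image):

    # Show the digests of locally pulled python images
    docker images --digests python

    # Pull by digest instead of by tag
    docker pull python@sha256:<digest>

    # Pin the base image in the Dockerfile the same way
    FROM python@sha256:<digest>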
Related
I'm new to Docker, so I want to find the best practices for my specific problem.
PROBLEM:
I have 6 Python web-scraping scripts that use the same libraries (same requirements.txt).
My scripts need frequent updating (a few times per week).
Also, my scripts read from and write to Excel files, and I need to be able to update those Excel files from time to time.
SOLUTIONS?
Do I really need 6 images and 6 containers even though my containers will have the same libraries? I find it time-consuming to delete the container and image every time I update my code.
For accessing my Excel files, I read about VOLUMES and I intend to implement them. Is that a good solution?
Do I really need 6 images and 6 containers even though my containers will have the same libraries?
It depends on what is technically possible and on personal preference. If you find a good, maintainable way to run all scripts in one Docker container, there's no reason you cannot do it. You could easily use a cron-like solution such as this image.
There are advantages to keeping Docker images single-purpose, though. One of them is clear isolation. If one of your scripts fails to run, you'll have one failing container only and five others that still run successfully. Plus you have full transparency over what exactly fails where.
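A middle ground that keeps that isolation is to build a single image containing all six scripts and their shared requirements.txt, then run one container per script by overriding the command (the image, container, and script names below are hypothetical):

    # Build one image from one Dockerfile / requirements.txt
    docker build -t scraper .

    # Start one container per script, all from the same image
    docker run -d --name scraper-site1 scraper python scrape_site1.py
    docker run -d --name scraper-site2 scraper python scrape_site2.py
    # ...and so on for the remaining scripts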
I find it time-consuming to delete the container and image every time I update my code.
I would propose using a CI pipeline for things like this. The pipeline would automatically build the images on every push, publish them to a registry, and recreate the containers/services on your server.
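As a rough sketch of the kind of commands such a pipeline might run (the registry address, image, and container names are placeholders):

    # In CI, after every push
    docker build -t registry.example.com/scraper:latest .
    docker push registry.example.com/scraper:latest

    # On the server, to roll out the new image
    docker pull registry.example.com/scraper:latest
    docker rm -f scraper-site1
    docker run -d --name scraper-site1 registry.example.com/scraper:latest python scrape_site1.py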
For accessing my Excel files, I read about VOLUMES and I intend to implement them. Is that a good solution?
Yes, that's exactly what volumes are made for: accessing and storing data that isn't part of your image.
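For example, you can mount a host directory containing the Excel files into each container, so that editing a file on the host is immediately visible to the script (the paths and names below are placeholders):

    # The script then reads and writes its Excel files under /app/data
    docker run -d --name scraper-site1 \
      -v /home/me/scraper-data:/app/data \
      scraper python scrape_site1.py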
Fairly new Docker user here.
Say I have several versions of a docker image, with the latest two being myimg:1.0.0 and myimg:2.0.0. myimg:2.0.0 is also tagged as myimg:latest, whereas before that one was created, myimg:1.0.0 was also tagged as myimg:latest.
Is there a way to point to the second-to-last :latest tag, without specifying the exact version (i.e. :1.0.0 in this example)? Or, for that matter, is there some history of which image corresponded to :latest that one can go through to find the image that was :latest N versions ago?
The reason I am asking is that I would like to run two successive versions of an image alongside each other, so that the latest and second-to-latest are always used. I know this is possible by specifying the exact versions, but I'm hoping to make use of :latest somehow.
:latest is just a name; there is absolutely nothing special about it. It only gets updated when the tag of the Docker image you built was latest or was omitted. There is NO guarantee that :latest always points to the latest version of an image.
So it's easy to see that there is no built-in way to get the second-to-last image the way git's HEAD~1 gives you the previous commit. The only way you could do it is by versioning your tags yourself.
My personal opinion: tag your images with a version tag.
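With explicit version tags, the original goal of running the latest and the previous version side by side becomes straightforward (the myimg tags come from the question; the container names are illustrative):

    # Tag every build with an explicit version, optionally also as latest
    docker build -t myimg:2.0.0 -t myimg:latest .

    # Run two specific versions alongside each other
    docker run -d --name myimg-current  myimg:2.0.0
    docker run -d --name myimg-previous myimg:1.0.0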
Is it possible to list all files that get copied into the image from the build context, or affect the final contents of the image in any other way?
I need this for dependency tracking; I am sculpting a build system for a project that involves building multiple images and running containers from them in the local dev environment. I need this to be optimized for a rapid code-build-debug cycle, and therefore I need to avoid invoking docker build as much as possible. Knowing the exact set of files in the build context that end up affecting the image will allow me to specify those as tracked dependencies for the build step that invokes docker build, and avoid unnecessary rebuilds.
I don't need to have this file list generated in advance, though that is preferable. If no tool exists to generate it in advance, but there is a way to obtain it from a built image, then that's OK too; the build tool I use is capable of recording dynamic dependencies discovered by a post-build step.
Things that I am acutely aware of, and despite which I still consider pursuing this avenue worthwhile:
I know that the number of dependencies thus tracked can be huge-ish. I believe the build tool can handle it.
I know that there are other kinds of dependencies for a docker image besides files in the build context. This is solved by also tracking those dependencies outside of docker build. Unlike files from the build context, those dependencies are either much fewer in number (i.e. files that the Dockerfile's RUN commands explicitly fetch from the internet), or the problem of obtaining an exhaustive list of such dependencies is already solved (e.g. dependencies obtained using a package manager like apt-get are modeled separately, and the installing RUNs are generated into the Dockerfile from the model).
Nothing is copied into the image unless you specifically say so. So check your Dockerfile for COPY (and ADD) statements and you will know which files from the build context are added to the image.
Note that if you have a COPY . ., you might also have a .dockerignore file in the build context listing files you don't want to copy.
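For example, a .dockerignore might look like this (the entries are purely illustrative); anything it matches never reaches the build context, so even COPY . . cannot add it to the image:

    # .dockerignore
    .git
    __pycache__/
    *.pyc
    .env
    docs/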
I don't think what you're looking for would be useful even if it were possible. A list of all files in the previously built image wouldn't factor in new files, and it would be difficult to differentiate new files that affect the build from new files that would be ignored.
It's possible that you could parse the Dockerfile, extract every COPY and ADD command, run the current files through a hashing process to identify if they changed from the hash in the image history (you would need to match docker's hashing algorithm which includes details like file ownership and permissions), and then when that hash doesn't match you would know the build needs to run again. You could look at creating a custom buildkit syntax parser, or reuse the low level buildkit code to build your own context processor.
But before you spend too much time trying to implement the above code, realize that it already exists, as docker build. Rather than trying to avoid running a build, I'd focus on getting the build to utilize the build cache so new builds skip all unchanged steps, possibly generating the exact same image id.
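A common way to get good cache hits is to order the Dockerfile so that inputs that change rarely come before the frequently changing source. This sketch assumes a Python project; requirements.txt and main.py are placeholders for your own files:

    FROM python:3.11-slim
    WORKDIR /app

    # This layer is rebuilt only when requirements.txt itself changes
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Only the layers from here on are rebuilt when the source changes
    COPY . .
    CMD ["python", "main.py"]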
This Dockerfile's goal is to:
Goal: provide a thrift-compiler Docker image
I was just wondering why this image needs to install golang.
It appears to download the Golang binary package but only copies over gofmt. Looking at https://github.com/apache/thrift/blob/19baeefd8c38d62085891d7956349601f79448b3/compiler/cpp/src/thrift/generate/t_go_generator.cc it seems that at one point they were running gofmt on the Golang generated code.
The comment for that part of code links to https://issues.apache.org/jira/browse/THRIFT-3893 which references pull request https://github.com/apache/thrift/pull/1061 where the feature was actually removed.
The specific commit (https://github.com/apache/thrift/commit/2007783e874d524a46b818598a45078448ecc53e) appears to be in 0.10 but not 0.9. So, when gofmt was disabled, they probably either forgot to remove it from the image or decided it was worth leaving, since the feature could be fixed and re-enabled at a later date.
It might be worth opening an issue to ask the Thrift team about it and if it can be removed.
I'm working on creating some Docker images to be used for testing on dev machines. I plan to build one for our main app as well as one for each of our external dependencies (postgres, elasticsearch, etc). For the main app, I'm struggling with the decision between writing a Dockerfile and committing an image to be hosted.
On one hand, a Dockerfile is easy to share and modify over time. On the other hand, I expect that advanced configuration (customizing application property files) will be much easier to do in vim before simply committing a new image.
I understand that I can get to the same result either way, but I'm looking for PROS, CONS, and gotchas with either direction.
As a side note, I plan on wrapping this all together using Fig. My initial impression of this tool has been very positive.
Thanks!
Using a Dockerfile:
You have an 'audit log' that describes how the image is built. For me this is fundamental if the image is going to be used in a production pipeline where more people are involved and maintainability should be a priority.
You can automate the build process of your image, which makes it easy to pick up system updates or to make the image part of a continuous delivery pipeline.
It is a cleaner way of creating the layers of your image (each Dockerfile command creates a separate layer).
Changing a container and committing the changes is great for testing purposes and for quickly trying out a concept. But if you plan to use the resulting image for some time, I would definitely use Dockerfiles.
Apart from this, if modifying a file with bash tools (awk, sed, ...) turns out to be very tedious, you can simply add any file you wish from outside during the build process.
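A minimal sketch of that (the base image and paths are hypothetical): keep the customised file next to the Dockerfile and copy it in at build time instead of editing it inside a running container:

    # mybase:latest is a hypothetical base image
    FROM mybase:latest
    # Each instruction becomes its own layer; this one just adds the customised config file
    COPY application.properties /opt/myapp/config/application.properties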
I totally agree with Javier, but you need to understand that an image created from a Dockerfile can differ from an image built one day later from the very same version of that Dockerfile.
Maybe your build process automatically retrieves the latest updates of an app, or of the OS, etc.
And in that case, if you need to reproduce a crash or something similar, you can't rely on the Dockerfile alone.
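If you need that kind of reproducibility, a common mitigation, in addition to pinning the base image by digest as discussed earlier, is to pin the versions of everything the build installs (a sketch; the package names and versions below are placeholders):

    # Pinned versions: a rebuild tomorrow installs exactly the same packages
    RUN pip install --no-cache-dir somepackage==1.2.3 otherpackage==4.5.6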