How to use a big file only to build the container, without adding it? - docker

I have a big tarball/executable (over 30GB). I COPY/ADD it, but it is only used during installation; once the application is installed I don't need it anymore. What can I do? I am trying to use it, but:
Every time I run a build, it takes minutes to define the build context.
I'd like to share this image; if I create a tar with docker save, is only the final version included, or each layer?
I found some solutions saying I can use RUN wget ... && tar ... && rm ..., but I don't want to set up a web server for that.
Why isn't it possible to mount a volume during the build process?! It would be very useful.

Use Docker's multi-stage builds. This mechanism allows you to drop intermediate artifacts and therefore achieve a lightweight image.
Example:
FROM alpine:latest AS build
# copy the large file
# run the build/installation, producing ./app
FROM alpine:latest AS output
# copy only the necessary files built in the previous stage
COPY --from=build app /app
Anything built in the build stage will not be included in the final image unless you explicitly COPY it.
Docs: https://docs.docker.com/develop/develop-images/multistage-build/
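A more concrete sketch for the 30GB case (all names here are hypothetical: the installer tarball, its install.sh, and the /opt/app install prefix):
FROM alpine:latest AS build
# the large installer exists only in this stage
COPY app-installer.tar.gz /tmp/
RUN tar -xzf /tmp/app-installer.tar.gz -C /tmp \
    && /tmp/app-installer/install.sh --prefix /opt/app
FROM alpine:latest
# only the installed application is copied; the 30GB installer and its
# extracted files never reach the final image
COPY --from=build /opt/app /opt/app
ENTRYPOINT ["/opt/app/bin/app"]
Note that the big file still has to enter the build context, so the slow context transfer remains; the multi-stage split only keeps it out of the final image.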

This is solvable using 2 different contexts.
Please follow the steps below. The objective is to create:
a docker image that will hold your large build file, and
a docker image that will hold your real codebase/executables.
For this, create 2 folders (BUILD & CodeBase) as follows:
Application
|---> BUILD
|     |---> Large-File
|     |---> Dockerfile
|---> CodeBase
|     |---> SRC + other stuff
|     |---> Dockerfile
The BUILD and CodeBase folders each have their own Dockerfile; arrange the files accordingly.
Dockerfile (BUILD)
FROM <base-image>
COPY Large-File /tmp/Large-File
Build this and tag it with a name like base-build-app-image:
#> cd Application                                # the application root folder shown above
#> docker build -t base-build-app-image BUILD    # the path of your BUILD folder
Dockerfile (CodeBase)
FROM base-build-app-image
RUN <installation commands>
RUN rm -f /tmp/Large-File
RUN rm -f <other installation files that are no longer required>
CMD <command>
ENTRYPOINT <entrypoint>
Build the CodeBase image; base-build-app-image is already in your local docker repository, and your large file is not in the current build context:
#> cd Application                                # the application root folder shown above
#> docker build CodeBase                         # the path of your CodeBase folder
This time, since the context contains only your code base and does not include the large file, it will definitely reduce your build time.
You can also take advantage of docker-compose to do both operations together, so you don't have to execute 2 separate commands; a minimal sketch follows.
If anything is not clear, then leave a comment or come over to chat to work through the issue.
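For example, a docker-compose.yml sketch matching the folder layout above (the service names and the codebase image tag are assumptions):
version: "3.8"
services:
  base:
    image: base-build-app-image
    build:
      context: ./BUILD
  app:
    image: codebase-app-image
    build:
      context: ./CodeBase
Because the CodeBase Dockerfile starts FROM base-build-app-image, build the base service first so the tag exists locally: docker-compose build base && docker-compose build app.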


Share go modules with docker builder stage

[EDIT - added clarity]
Here is my current env setup:
$GOPATH = /home/fzd/go
projectDir = /home/fzd/go/src/github.com/fzd/amazingo
amazingo has a go.mod file that lists several (let's say thousands of) dependencies.
So far, I have used go build -o bin/amazingo cmd/main.go, but I want to share this with other people and have a build command that is environment-independent. Using go build has the advantage of downloading each dependency once -- and then using those in ${GOPATH}/pkg/mod, which saves time and bandwidth.
I want to build in a multistage docker image, so I go with
> cat /home/fzd/go/src/github.com/fzd/amazingo/Dockerfile
FROM golang:1.17 as builder
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /bin/amazingo cmd/main.go
FROM alpine:latest
COPY --from=builder /bin/amazingo /amazingo
ENTRYPOINT ["/amazingo"]
As you can expect, the builder is "naked" when I start it, so it has to download all my dependencies when I docker build -t amazingo:0.0.1 . . But it will do so every time I call it, which can be several times a day.
Fortunately, I already have most of these dependencies on my disk. I would be happy to share these files (that are located in my $GOPATH/pkg/mod) with the builder, and help it build faster on my machine.
So the question is: how can I share my ${GOPATH} (or ${GOPATH}/pkg/mod) with the builder?
I tried adding the following to the builder
ARG SRC_GOPATH
COPY ${SRC_GOPATH} /go
and call docker build --build-arg SRC_GOPATH=${GOPATH} -t amazingo:0.0.1 ., but it wasn't good enough - I got an error (COPY failed: file not found in build context or excluded by .dockerignore: stat home/fzd/go: file does not exist)
I hope this update brings a bit more clarity to the problem.
=======
I have a project with a go.mod file.
I want to build that project using a multistage docker image.
(this article is a perfect example)
The issue is that I have "lots" of dependencies, and each of them will be downloaded inside my Docker builder stage.
Is there a way to "share" my GOPATH/pkg/mod with the docker build... command (in some way, having a local cache)?
Your end goal isn't completely clear, but the way I use a multi-stage build would look something like this for a (dirt-simple) Go app, assuming that you ultimately want the docker container to run your Go app. You will need to get your source into the build container somehow as well - that is not shown here:
FROM golang:1.17.2-alpine3.14 as builder
WORKDIR /my/app/source/dir
RUN go get && go build -o /path/to/my/app/binary
FROM alpine:3.14 AS release
# install runtime deps, if any
# create necessary files and folders, if any
COPY --from=builder /path/to/my/app/binary /usr/local/bin
ENTRYPOINT /usr/local/bin/binary --options
In this way, the source of your application and all dependencies will not be present in the released image, only the compiled binary.
Of course you don't have to specify an output path for that; I think it just makes it a little clearer in this example. And of course you can use whatever base image(s) you want - I'm treating this as though you don't need the Go toolchain in your release image.
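This still re-downloads every dependency on each build, though. COPY can only read from the build context, which is why COPYing ${GOPATH} from outside it failed. A common pattern (a sketch, assuming the standard go.mod/go.sum layout) is to copy only the module files first, so the download gets its own cacheable layer:
FROM golang:1.17 AS builder
WORKDIR /src
# copy only the module manifests; this layer and the download below
# stay cached as long as go.mod/go.sum are unchanged
COPY go.mod go.sum ./
RUN go mod download
# source changes no longer invalidate the dependency download layer
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /bin/amazingo cmd/main.go
With BuildKit enabled, RUN --mount=type=cache,target=/go/pkg/mod can additionally persist the module cache across builds on the same machine.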

Creating a dockerfile to compile source code

I am trying to follow the 2 steps mentioned below:
1) Download the source code from
https://sourceforge.net/projects/hunspell/files/Hyphen/2.8/hyphen-2.8.8.tar.gz/download
2) Compile it and you will get a binary named example:
hyphen-2.8.8$ ./example ~/dev/smc/hyphenation/hi_IN/hyph_hi_IN.dic ~/hi_sample.text
I have downloaded and uncompressed the tar file. My question is: how do I create a Dockerfile to automate this?
There are only 3 commands involved:
./configure
make all-recursive
make install
I can select the official python image as the base image. But how do I write the commands in a Dockerfile?
You can do that with a RUN command:
FROM python:<version number here>
RUN ./configure && make all-recursive && make install
CMD ["<some command here>"]
What you use for <some command here> depends on what the image is meant to do. Remember that docker containers only run as long as that command is executing, so if you put the configure/make/install steps in a script and use that as your entry point, it will build your program and then the container will halt.
Also you need to get the downloaded files into the container. That can be done using a COPY or an ADD directive (before the RUN of course). If you have the tar.gz file saved locally, then ADD will both copy the file into the container and expand it into a directory automatically. COPY will not expand it, so if you do that, you'll need to add a tar -zxvf or similar to the RUN.
If you want to download the file directly into the container, that could be done with ADD <source URL>, but in that case it won't expand it, so you'll have to do that in the RUN. COPY doesn't allow sourcing from a URL. This post explains COPY vs ADD in more detail.
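Putting this together, a minimal sketch for this particular tarball (the Python tag, the installed build tools, and the CMD arguments are assumptions):
FROM python:3.9-slim
# configure/make need a compiler toolchain, which the slim base image lacks
RUN apt-get update && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*
# ADD copies the local tarball into the image and expands it automatically
ADD hyphen-2.8.8.tar.gz /src/
WORKDIR /src/hyphen-2.8.8
RUN ./configure && make all-recursive && make install
# hypothetical invocation; point it at your own dictionary and text files
CMD ["./example", "hyph_hi_IN.dic", "hi_sample.text"]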
You can put the three commands in a shell script (sketched below) and then use the following Dockerfile instructions:
COPY ./<path to your script>/<script-name>.sh /
ENTRYPOINT ["/<script-name>.sh"]
CMD ["run"]
For reference, you can model your Dockerfile on the one created for a project I worked on, Apache ActiveMQ Artemis:
https://github.com/apache/activemq-artemis/blob/master/artemis-docker/Dockerfile-ubuntu

Docker build not using cache when copying Gemfile while using --cache-from

On my local machine, I have built the latest image, and running another docker build uses the cache everywhere it should.
Then I upload the image to the registry as latest, and on my CI server I pull that latest image of my app in order to use it as the build cache for the new version:
docker pull $CONTAINER_IMAGE:latest
docker build --cache-from $CONTAINER_IMAGE:latest \
--tag $CONTAINER_IMAGE:$CI_COMMIT_SHORT_SHA \
.
From the build output we can see that the COPY of the Gemfile is not using the cache from the latest image, even though I haven't updated that file:
Step 15/22 : RUN gem install bundler -v 1.17.3 && ln -s /usr/local/lib/ruby/gems/2.2.0/gems/bundler-1.16.0 /usr/local/lib/ruby/gems/2.2.0/gems/bundler-1.16.1
---> Using cache
---> 47a9ad7747c6
Step 16/22 : ENV BUNDLE_GEMFILE=$APP_HOME/Gemfile BUNDLE_JOBS=8
---> Using cache
---> 1124ad337b98
Step 17/22 : WORKDIR $APP_HOME
---> Using cache
---> 9cd742111641
Step 18/22 : COPY Gemfile $APP_HOME/
---> f7ff0ee82ba2
Step 19/22 : COPY Gemfile.lock $APP_HOME/
---> c963b4c4617f
Step 20/22 : RUN bundle install
---> Running in 3d2cdf999972
Side note: it is working perfectly on my local machine.
The Docker documentation on leveraging the build cache doesn't seem to explain the behaviour here, as neither the Dockerfile nor the Gemfile has changed, so the cache should be used.
What could prevent Docker from using the cache for the Gemfile?
Update
I tried copying the files with the right permissions set, using COPY --chown=user:group source dest, but it still doesn't use the cache.
Opened Docker forum topic: https://forums.docker.com/t/docker-build-not-using-cache-when-copying-gemfile-while-using-cache-from/69186
Let me share some information that helped me fix some issues with docker build and --cache-from while optimizing a CI build.
I had struggled for several days because I didn't have the correct understanding; I was basing myself on incorrect explanations found on the web.
So I'm sharing this here, hoping it will be useful to you.
When providing multiple --cache-from, the order matters
The order is very important: at the first match, Docker will stop looking for other matches, and it will use that image for all the rest of the commands.
This is explained by the person who implemented the feature in the GitHub PR:
When using multiple --cache-from they are checked for a cache hit in the order that user specified. If one of the images produces a cache hit for a command only that image is used for the rest of the build.
There is also a lengthier explanation in the initial ticket proposal:
Specifying multiple --cache-from images is bit problematic. If both images match there is no way(without doing multiple passes) to figure out what image to use. So we pick the first one(let user control the priority) but that may not be the longest chain we could have matched in the end. If we allow matching against one image for some commands and later switch to a different image that had a longer chain we risk in leaking some information between images as we only validate history and layers for cache. Currently I left it so that if we get a match we only use this target image for rest of the commands.
Using --cache-from is exclusive: the local Docker cache won't be used
This means that it doesn't add new caching sources; the image tags you provide will be the only caching source for the Docker build.
Even if you just built the same image locally, the next time you run docker build for it, in order to benefit from the cache, you need to either:
provide the correct tag(s) with --cache-from (and with the correct precedence); or
not use --cache-from at all (so that it will use the local build cache)
If the parent image changes, the cache will be invalidated
For example, if you have an image based on docker:stable, and docker:stable gets updated, the cached builds of your image will not be valid anymore as the layers of the base image were changed.
This is why, if you're configuring a CI build, it can be useful to docker pull the base image as well and include it in the --cache-from list, as mentioned in this comment in yet another GitHub discussion.
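For example (a sketch; ruby:2.6 stands in for whatever base image your Dockerfile's FROM line uses):
docker pull $CONTAINER_IMAGE:latest || true
docker pull ruby:2.6 || true
docker build \
  --cache-from $CONTAINER_IMAGE:latest \
  --cache-from ruby:2.6 \
  --tag $CONTAINER_IMAGE:$CI_COMMIT_SHORT_SHA .
Listing the application image first gives it precedence, per the ordering rule above.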
I struggled with this problem, and in my case I used COPY where the checksum might have changed (but only technically; the content was functionally identical). So, I worked around it this way:
Dockerfile:
ARG builder_image=base-builder
# Compilation/build stage
FROM golang:1.16 AS base-builder
RUN echo "build the app" > /go/app
# This step is here to facilitate the docker cache. By overriding the `builder_image`
# build arg we can essentially skip the build stage and use a prebuilt image directly.
FROM $builder_image AS builder
# myapp docker image
FROM ubuntu:20.04 AS myapp
COPY --from=builder /go/app /opt/my-app/bin/
Then, I can run the following:
# build cache
DOCKER_BUILDKIT=1 docker build --target base-builder -t myapp-builder .
docker push myapp-builder
# use cache
DOCKER_BUILDKIT=1 docker build --target myapp --build-arg=builder_image=myapp-builder -t myapp .
docker push myapp
This way we can force Docker to use a prebuilt image as a cache.
For whoever is fighting with DockerHub automated builds and --cache-from: I realized that images built by DockerHub would always lead to a cache bust on COPY commands when pulled and used as a build cache source. This seems to also be the case for @Marcelo (see his comment).
I investigated by creating a very simple image running a couple of RUN commands and a later COPY. Everything used the cache except the COPY, even though the content and permissions of the file being copied were the same on both the pulled image and the one built locally (verified via sha1sum and ls -l).
The solution for me was to publish the image to the registry from the CI (Travis in my case) rather than letting the DockerHub automated build do it. Let me emphasize that I'm talking about a specific case where the files are definitely the same and should not bust the cache, but DockerHub automated builds are in use.
I'm not sure why that is, but I know, for instance, that old docker-engine versions, e.g. prior to 1.8.0, didn't ignore the file timestamp when deciding whether to use the cache; refs https://docs.docker.com/release-notes/docker-engine/#180-2015-08-11 and https://github.com/moby/moby/pull/12031.
For a COPY command to be cached, the checksum of the source being copied needs to be identical. You can compare the checksums in the docker history output between the cache image and the one you just built. Most importantly, the checksum includes metadata like the file owner and file permissions, in addition to the file contents. Whitespace changes inside a file, like switching between Linux and Windows line endings, will also affect it. If you download the code from a repo, it's likely that metadata, like the owner, will differ from the cached value.
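For example (a sketch; the tags are placeholders), docker history shows the checksum recorded for each COPY step, which you can diff between the two images:
docker history --no-trunc $CONTAINER_IMAGE:latest | grep COPY
# e.g.: /bin/sh -c #(nop) COPY file:8e69bfa... in /usr/src/app/Gemfile
docker history --no-trunc $CONTAINER_IMAGE:$CI_COMMIT_SHORT_SHA | grep COPY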

Selecting different code branches when using a shared base image in Docker

I am containerising a codebase that serves multiple applications. I have created three images:
app-base:
FROM ubuntu
RUN apt-get install package
COPY ./app-code /code-dir
...
app-foo:
FROM app-base:latest
RUN foo-specific-setup.sh
and app-buzz, which is very similar to app-foo.
This works currently, except that I want to be able to build versions of app-foo and app-buzz for specific code branches and versions. It's easy to do that for app-base and tag it appropriately, but app-foo and app-buzz can't dynamically select that tag; they are always pinned to app-base:latest.
Ultimately I want this build process automated by Jenkins. I could just dynamically re-write the Dockerfile, or drop the three-image setup and keep a nearly-but-not-quite identical Dockerfile for each app, which would need to be kept in sync manually (later increasing to 4 or 5). Each of those solutions has obvious drawbacks, however.
I've seen lots of discussions in the past about things such as an INCLUDE statement, or dynamic tags. None seemed to come to anything.
Does anyone have a working, clean(ish) solution to this problem? As long as it means Dockerfile code can be shared across images, I'd be happy. If it also means that the shared layers of images don't need to be rebuilt for each app, then even better.
You could still use build args to do this.
Dockerfile:
FROM ubuntu
ARG APP_NAME
RUN echo $APP_NAME-specific-setup.sh >> /root/test
ENTRYPOINT cat /root/test
Build:
docker build --build-arg APP_NAME=foo -t foo .
Run:
$ docker run --rm foo
foo-specific-setup.sh
In your case you could run the correct script in the RUN step using the argument you just set. You would have a single Dockerfile shared by all app variants and run the correct set-up based on the build argument.
FROM ubuntu
RUN apt-get install package
COPY ./app-code /code-dir
ARG APP_NAME
RUN $APP_NAME-specific-setup.sh
Any layers before setting the ARG would not need to be rebuilt when creating other versions.
You can then push the built images to separate docker repositories for each app.
If your apps need different ENTRYPOINT instructions, you can have an APP_NAME-entrypoint.sh per app and rename it to entrypoint.sh within your APP_NAME-specific-setup.sh (or pass it through as an argument to docker run).
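A build arg can also parameterize the base image tag itself, since an ARG declared before FROM may be referenced in the FROM line (supported since Docker 17.05); a sketch:
ARG BASE_TAG=latest
FROM app-base:${BASE_TAG}
ARG APP_NAME
RUN ${APP_NAME}-specific-setup.sh
Build (hypothetical branch tag):
docker build --build-arg BASE_TAG=my-branch --build-arg APP_NAME=foo -t app-foo:my-branch .
This removes the hard pin to app-base:latest while keeping a single shared Dockerfile.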

I'm trying to make the perfect docker build file, do i need to build it from scratch each time?

For an assignment, the marker requires me to create a Dockerfile to build my project's container. However, I have a fairly complex set of tasks that need to work together in the right way for my Dockerfile to be of any use, so I am currently building a file that takes 30 minutes each time just to see whether minor changes affect the outcome in the right way. My question is: is there a better way of doing this?
The Dockerfile best practices, or an earlier question might help: Creating a Dockerfile - docker starts from scratch on each new build
In my experience, a full build every time means you're working against docker's caching mechanism, usually by having COPY . . early in the Dockerfile.
If the files copied into the image are then used to drive a package manager or to download other sources, try copying just the script or requirements file, then using it, then copying the rest of the sources.
A simplified Python example, restated from the best-practices link:
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
With that structure, as long as requirements.txt does not change, the first COPY and the following RUN use cached layers, and rebuilds are much faster.
The first tip is to use COPY/ADD for artifacts that need to be downloaded when docker builds.
The second tip is that you can create one Dockerfile for each step and reuse it in the next steps.
For example, if you want to install a Postgres DB and WildFly in your image, you can start by creating a Dockerfile for Postgres only, and build it to make your your-postgres docker image.
Then create another Dockerfile which reuses the your-postgres image:
FROM your-postgres
.....
and so on...
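A sketch of that chaining (image names, packages, and paths are hypothetical; these are two separate Dockerfiles shown together):
# Dockerfile.postgres -- build once: docker build -f Dockerfile.postgres -t your-postgres .
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y postgresql
# Dockerfile.wildfly -- reuses the cached your-postgres layers
FROM your-postgres
RUN apt-get update && apt-get install -y openjdk-11-jre-headless
COPY wildfly /opt/wildfly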
