I am facing a simple problem. I am using Docker to build an image that downloads a file with curl or wget after the container starts. The file I am downloading is about 7 GB, and I need roughly 7 GB more to extract it inside the same container. Below is the Dockerfile for the image I am building.
FROM node:alpine3.15 as builder
RUN apk update
RUN apk add wget
RUN apk add unzip
COPY package.json ./
COPY index.js ./
ADD startProcess.sh .
RUN npm install
ENTRYPOINT ["/bin/sh","startProcess.sh"]
And this is my startProcess.sh file code.
wget --continue zip_file_download_link_goes_here;
unzip downloaded_zip_file.zip;
npm start;
When I build the image and start a container, the download begins as expected but stops at arbitrary points, such as 1.8 GB, 2 GB, or 4 GB. I have not yet been able to download the full file inside the container. I have also tried curl and see the same behavior. Any suggestions on how to solve this? The container will later run in a Kubernetes environment where I can assign an ephemeral volume, but for now I am trying to download the full file inside the container itself.
I was able to resolve this by searching the census website further. It turns out census.gov provides the same 7 GB of data broken down by state, so I could download one chunk at a time and process it. As a result, the container's disk usage never exceeded the size of the largest state's data.
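For illustration, here is a minimal sketch of that per-chunk loop (the state names and URL are placeholders, and it assumes index.js can process one extracted chunk at a time):
# download, extract, and process one state at a time, then free the space
for state in state1 state2 state3; do
  wget --continue "https://example.census.gov/data/${state}.zip"
  unzip "${state}.zip" -d ./data
  node index.js ./data
  rm -rf "${state}.zip" ./data
done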
Related
I'm trying to uncompress a file and delete the original compressed archive in my Dockerfile image build instructions. I need to do this because the file in question is larger than the 2 GB limit GitHub sets on large files (see here). The solution I'm pursuing is to compress the file (bringing it under the 2 GB limit) and then decompress it when I build the application. I know it's bad practice to build large images; I plan to integrate an external database into the project but don't have time to do that now.
I've tried various options, but have been unsuccessful.
Compress the file in .zip format, use apt-get to install unzip, and then decompress the file with unzip:
FROM python:3.8-slim
#install unzip
RUN apt-get update && apt-get install unzip
WORKDIR /app
COPY /data/databases/file.db.zip /data/databases
RUN unzip /data/databases/file.db.zip && rm -f /data/databases/file.db.zip
COPY ./ ./
This fails with unzip: cannot find or open /data/databases/file.db.zip, /data/databases/file.db.zip.zip or /data/databases/file.db.zip.ZIP. I don't understand this, as I thought COPY added files to the image.
Following this advice, I compressed the large file with gzip and tried to use the Docker native ADD command to uncompress it, i.e.:
FROM python:3.8-slim
WORKDIR /app
ADD /data/databases/file.db.gz /data/databases/file.db
COPY ./ ./
While this builds without error, it does not decompress the file, which I can see by using docker exec -t -i clean-dash /bin/bash to explore the container's directory structure. Since the large file is a gzip file, my understanding from the docs is that ADD should decompress it.
How can I solve these requirements?
ADD only decompresses local tar files, not necessarily compressed single files. It may work to package the contents in a tar file, even if it only contains a single file. In the Dockerfile:
ADD ./data/databases/file.tar.gz /data/databases/
Then, on the host, create the archive and build the image:
(cd data/databases && tar cvzf file.tar.gz file.db)
docker build .
If you're using the first approach, you must use a multi-stage build here. The problem is that each RUN command generates a new image layer, so the resulting image is always the previous layer plus whatever changes the RUN command makes; RUN rm a-large-file will actually result in an image that's slightly larger than the image that contains the large file.
The BusyBox tool set includes, among other things, an implementation of unzip(1), so you should be able to split this up into a stage that just unpacks the large file and then a stage that copies the result in:
FROM busybox AS unpack
WORKDIR /unpack
COPY data/databases/file.db.zip /
RUN unzip /file.db.zip
FROM python:3.8-slim
COPY --from=unpack /unpack/ /data/databases/
In terms of the Docker image any of these approaches will create a single very large layer. In the past I've run into operational problems with single layers larger than about 1 GiB, things like docker push hanging up halfway through. With the multi-stage build approach, if you have multiple files you're trying to copy, you could have several COPY steps that break the batch of files into multiple layers. (But if it's a single SQLite file, there's nothing you can really do.)
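For example, a rough sketch of what splitting hypothetical unpacked files across layers could look like in the final stage (the file names here are made up):
COPY --from=unpack /unpack/part1.db /data/databases/
COPY --from=unpack /unpack/part2.db /data/databases/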
Based on @David Maze's answer, the following worked, which I post here for completeness.
#unpacks zipped database
FROM busybox AS unpack
WORKDIR /unpack
COPY data/databases/file.db.zip /
RUN unzip /file.db.zip
FROM python:3.8-slim
COPY --from=unpack /unpack/file.db /
WORKDIR /app
COPY ./ ./
#move the unpacked db and delete the original
RUN mv /file.db ./data/databases && rm -f ./data/databases/file.db.zip
I'm new to the Docker world. I'm writing a Dockerfile to install a certain library. The first step is to download the library from a URL, and I'm not sure whether that's possible in Docker.
I need to install the library on a Red Hat system.
http://service.sap.com/download is the URL I need to download the library from. How can I write a Dockerfile for this?
Can someone please help?
Appreciate all your help! Thanks!
You use a RUN command with whichever download program is available in your image. If wget is available, you just put the command in your Dockerfile:
RUN wget http://your.destination/file
If you need to move that file to another location in your image, keep using a RUN command with mv; if the file lives outside your image, use a COPY command instead.
To summarize, downloading the file on the host and copying it into the image:
[CLI] wget http://your.destination/file
[DOCKERFILE] COPY file .
Downloading the file during the image build:
[DOCKERFILE] RUN wget http://your.destination/file
[DOCKERFILE] RUN mv file my/destination/folder
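Putting it together for a Red Hat base, here is a minimal sketch, assuming a UBI base image and the placeholder URL/paths above (the SAP download itself may require authentication, which is not handled here):
FROM registry.access.redhat.com/ubi8/ubi
# download and unpack the library during the build (URL and paths are placeholders)
RUN curl -L -o /opt/library.tar.gz "http://your.destination/file" && \
    tar -xzf /opt/library.tar.gz -C /opt && \
    rm /opt/library.tar.gz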
My Docker image, based on Alpine Linux, cannot get anything from the network, so the command "apk add xxx" does not work. My idea is to download the .apk file and copy it into the Docker container. But how can I install the .apk file?
Let's say you are trying to install glibc in Alpine
Download the packages into your current directory
wget "https://circle-artifacts.com/gh/andyshinn/alpine-pkg-glibc/6/artifacts/0/home/ubuntu/alpine-pkg-glibc/packages/x86_64/glibc-2.21-r2.apk"
wget "https://circle-artifacts.com/gh/andyshinn/alpine-pkg-glibc/6/artifacts/0/home/ubuntu/alpine-pkg-glibc/packages/x86_64/glibc-bin-2.21-r2.apk"
Then, use apk with --allow-untrusted flag
apk add --allow-untrusted glibc-2.21-r2.apk glibc-bin-2.21-r2.apk
And finish the installation (only needed in this example)
/usr/glibc/usr/bin/ldconfig /lib /usr/glibc/usr/lib
The following steps work for me:
Get an "online" Alpine machine and download the packages there. This example uses the "zip" and "rsync" packages:
Update your system: sudo apk update
Download only those packages: apk fetch zip rsync
You will get these files (the versions may be newer by now):
zip-3.0-r8.apk
rsync-3.1.3-r3.apk
Copy these files to the "offline" Alpine machine.
Install apk packages:
sudo apk add --allow-untrusted zip-3.0-r8.apk
sudo apk add --allow-untrusted rsync-3.1.3-r3.apk
More info: https://wiki.alpinelinux.org/wiki/Alpine_Linux_package_management
Please note that the --recursive flag is necessary when you fetch your apk so that all dependencies are downloaded too; otherwise you may get errors about missing packages once you are offline.
sudo apk update
sudo apk fetch --recursive packageName
transfer the files to the offline host
sudo apk add --allow-untrusted <dependency.apk>
sudo apk add --allow-untrusted <package.apk>
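To do the same inside a Docker image build, a minimal sketch, assuming the fetched .apk files sit in a packages/ directory next to the Dockerfile:
FROM alpine
# copy in the .apk files fetched earlier with `apk fetch --recursive`
COPY packages/ /tmp/packages/
RUN apk add --allow-untrusted /tmp/packages/*.apk && rm -rf /tmp/packages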
If it's possible to run Docker commands from a system that's connected to the public Internet, you can do this in a Docker-native way by splitting your image into two parts.
The first image only contains the apk commands, but no actual application code.
FROM alpine
RUN apk add ...
Build that image with docker build -t me/alpine-base . while connected to the network.
You now need to transfer that image into the isolated environment. If it's possible to connect some system to both networks, and run a Docker registry inside the environment, then you can use docker push to send the image to the isolated environment. Otherwise, this is one of the few cases where you need docker save: create a tar file of the image, move that file into the isolated environment (through a bastion host, on a USB key, ...), and docker load it on the target system.
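For the docker save route, the transfer looks roughly like this (the tar file name is arbitrary):
docker save -o alpine-base.tar me/alpine-base
# move alpine-base.tar into the isolated environment, then on the target system:
docker load -i alpine-base.tar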
Now you have that base image on the target system, so you can install the application on top of it without calling apk.
FROM me/alpine-base
WORKDIR /app
COPY . .
CMD ...
This approach will work for any sort of artifact. If you have something like an application's package.json/requirements.txt/Gemfile/go.mod that lists out all of the application's library dependencies, you can run the download-and-install step ahead of time like this, but you'll need to remember to repeat it and manually move the updated base image if these dependencies ever change.
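As a sketch of that idea for a Node.js application (assuming a package.json/package-lock.json and an Alpine base; adjust for your toolchain), the base image bakes in the installed dependencies:
FROM alpine
RUN apk add --no-cache nodejs npm
WORKDIR /app
# install library dependencies while still connected to the network
COPY package.json package-lock.json ./
RUN npm ci
The application image built inside the isolated environment then just starts FROM this base and copies the source on top, as above.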
I'm building a Docker image which also involves a small yum install. I'm currently in a location where firewalls and access controls make docker pull, yum install, etc. extremely slow.
In my case, it's a JRE 8 Docker image using this official image script.
My problem:
Building the image requires just two packages (gzip + tar), which together are only 132 kB + 865 kB. But yum inside the docker build will first download the repository metadata, which is over 80 MB. While 80 MB is generally small, here it took over an hour just to download. If my colleagues need to build, this would be a sheer waste of productive time, not to mention frustration.
Workarounds I'm aware of:
Since this image may not need the full power of yum, I can simply grab the *.rpm files, COPY them in the Dockerfile, and use rpm -i instead of yum
I can save the built image and distribute it locally
I could also find the closest mirror for Docker Hub, but not for yum
My bet:
I have a copy of the Linux CD with about the same version
I can add commands in the Dockerfile to rename the *.repo files to *.repo.old
Add a cdrom.repo in /etc/yum.repos.d/ inside the container
Use yum to install the most common packages from the CD-ROM instead of the internet
My problem:
I'm not able to work out how to make a CD-ROM repo available from inside the container build without using httpd.
In plain linux I do this:
mkdir /cdrom
mount /dev/cdrom /cdrom
cat > /etc/yum.repos.d/cdrom.repo <<EOF
[cdrom]
name=CDROM Repo
baseurl=file:///cdrom
enabled=1
gpgcheck=1
gpgkey=file:///cdrom/RPM-GPG-KEY-oracle
EOF
Any help appreciated.
Docker containers cannot access host devices. I think you will have to write a wrapper script around the docker build command to do the following:
First mount the CD-ROM to a directory within the Docker build context (that would be a sub-directory of where your Dockerfile lives).
Call the docker build command using the contents of this directory.
Unmount the CD-ROM.
so,
cd docker_build_dir
mkdir cdrom
mount /dev/cdrom cdrom
docker build "$#" .
umount cdrom
In the Dockerfile, you would then copy the packages in and install them:
COPY cdrom /cdrom
RUN cd /cdrom && rpm -ivh rpms_you_need
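Alternatively, if the CD carries yum repository metadata, a rough sketch of wiring it up as a file:// repo inside the build (the repo id and package names are only examples; gpgcheck is disabled here for brevity):
COPY cdrom /cdrom
RUN printf '[cdrom]\nname=CDROM Repo\nbaseurl=file:///cdrom\nenabled=1\ngpgcheck=0\n' > /etc/yum.repos.d/cdrom.repo && \
    yum install -y --disablerepo='*' --enablerepo=cdrom gzip tar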
I am using Windows and have boot2docker installed. I've downloaded images from Docker Hub and run basic commands. BUT
How do I take an existing application sitting on my local machine (let's just say it has one file, index.php, for simplicity), put it into a Docker image, and run it?
Imagine you have an existing Python 2 application "hello.py" with the following content:
print "hello"
You have to do the following things to dockerize this application:
Create a folder in which you'd like to store your Dockerfile.
Create a file named "Dockerfile"
The Dockerfile consists of several parts which you have to define as described below:
Like a VM, an image has an operating system. In this example, I use Ubuntu 16.04. Thus, the first part of the Dockerfile is:
FROM ubuntu:16.04
Imagine you have a fresh Ubuntu VM; now you have to install some things to get your application working, right? This is done by the next part of the Dockerfile:
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y python
Next, you create a working directory in the image. The commands you execute later on to start your application will look for files (in our case the Python file) in this directory. Thus, the next part of the Dockerfile creates a directory and defines it as the working directory:
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
As a next step, you copy the content of the folder where the Dockerfile is stored into the image. In our example, the hello.py file is copied to the directory we created in the step above.
COPY . /usr/src/app
Finally, the following line executes the command "python hello.py" in your image:
CMD [ "python", "hello.py" ]
The complete Dockerfile looks like this:
FROM ubuntu:16.04
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y python
RUN mkdir -p /usr/src/app
WORKDIR /usr/src/app
COPY . /usr/src/app
CMD [ "python", "hello.py" ]
Save the file and build the image by typing in the terminal:
$ docker build -t hello .
This will take some time. Afterwards, check whether the image "hello" (as we named it with -t in the build command) has been built successfully:
$ docker images
Run the image:
docker run hello
The output should be "hello" in the terminal.
This is a first start. When you use Docker for web applications, you have to configure ports etc.
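For example, a minimal sketch of exposing a port, assuming a web application that listens on port 8000 inside the container. In the Dockerfile:
EXPOSE 8000
At run time, publish it on the host:
docker run -p 8000:8000 hello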
Your index.php is not really an application. The application is your Apache or nginx or even PHP's own server.
Because Docker uses features not available in the Windows core, you are running it inside an actual virtual machine. The only purpose for that would be training or preparing images for your real server environment.
There are two main concepts you need to understand for Docker: Images and Containers.
An image is a template composed of layers. Each layer contains only the differences between the previous layer and some offline system information. Each layer is in fact an image. You should always make your image from an existing base, using the FROM directive in the Dockerfile (Reference docs at time of edit. Jan Vladimir Mostert's link is now a 404).
A container is an instance of an image that has run or is currently running. When creating a container (a.k.a. running an image), you can map an internal directory from it to the outside. If there are files in both locations, the external directory overrides the one inside the image, but those files are not lost. To recover them, you can commit a container to an image (preferably after stopping it), then launch a new container from the new image without mapping that directory.
You'll need to build a Docker image first, using a Dockerfile. You'd probably set up Apache on it, tell the Dockerfile to copy your index.php file into Apache's document root, and expose a port.
See http://docs.docker.com/reference/builder/
See my other question for an example of a docker file:
Switching users inside Docker image to a non-root user (this is for copying over a .war file into tomcat, similar to copying a .php file into apache)
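For the index.php case specifically, a minimal sketch using the official php:apache image (the tag and host port are just examples):
FROM php:7.4-apache
COPY index.php /var/www/html/
EXPOSE 80
Build and run it with docker build -t my-php-app . and docker run -p 8080:80 my-php-app, then open http://localhost:8080 (or the boot2docker VM's IP) in a browser.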
First off, you need to choose a platform to run your application (for instance, Ubuntu). Then install all the system tools/libraries necessary to run your application. This can be achieved with a Dockerfile. Then push the Dockerfile and the app to GitHub or Bitbucket. Later, you can set up automated builds on Docker Hub from GitHub or Bitbucket. The later part of this tutorial here has more on that; if you know the basics, just fast-forward it to 50:00.