Small change in large file in Docker container produces huge layer - docker

I am using Docker to have a versioned database in my local dev environment (e.g., to be able to snapshot/revert DB state). I need it due to the nature of my work, and I cannot use transactions to achieve what I want (one of the reasons: some of the statements are DDL).
So, I have a Docker container with one large file (a MySQL InnoDB file).
If I change this file a little bit (like updating a row in a table) and then commit the container, a new layer is created, and the size of this layer is the size of the whole huge file, even if only a couple of bytes in the file changed.
I understand this happens because, to Docker, a file is an 'atomic' unit: if a file is modified, a copy of it is created in the new layer, and this layer is later included in the image.
Is there a way to change this behavior and make Docker store diffs below the file level, e.g. if 10 bytes of a 10 GiB file were changed, create a layer smaller than 10 GiB?
Maybe I can use some other storage driver? (Which one?)
I am also not strongly bound to Docker, so I could even switch to rkt. The question is: do you think that would help? (Maybe its image format is different and can store diffs at the file-content level.)
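For illustration, the behavior is easy to reproduce and inspect; a minimal sketch, assuming a hypothetical mydb image that already contains the large InnoDB file (image names, credentials, and the SQL are placeholders):

# start a container from an image that already contains the large database file
docker run -d --name dbtest mydb:latest
# change a few bytes inside the container, e.g. update one row
docker exec dbtest mysql -e "UPDATE mydb.t SET x = 1 WHERE id = 1"
# commit and inspect the layers: the new top layer is roughly the size of the whole data file
docker commit dbtest mydb:snapshot1
docker history mydb:snapshot1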

Related

Why is a new docker image the same size as the original one from which the commit was made?

I downloaded a Docker image and made some changes inside a container based on it. Then I committed those changes to create a new image that I would actually like to use.
docker images says that these images have about the same size. So, it seemed to me that Docker copied everything it needs to the new image.
Yet I can't remove the old image which I no longer need. It seems like I'm getting the worst of both worlds: neither is space conserved by a parenting relationship, nor can I delete the unwanted files.
What gives? Am I interpreting docker images output wrong (maybe it's not reporting the actual on-disk size)?
You may remove the first image with force:
docker image rm -f $IMAGE_ID
As for the sizes being the same: it depends mainly on your changes. You can check whether the images match exactly at the byte level with:
docker image inspect IMAGE_NAME:$TAG --format='{{.Size}}'
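Note that the size reported by docker images includes layers shared with the parent image, so two images built on the same base can each report nearly the full size while sharing most of it on disk. A sketch of how to verify this (image names are placeholders):

# per-layer breakdown; shared layers appear identically in both images
docker history original:latest
docker history modified:latest
# actual disk usage, with shared space accounted for
docker system df -v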

How to add/mount large files kept in SharePoint to Docker Container through Dockerfile

I'm new to using Docker and wanted to understand how to add large folders (~1 GB combined) kept elsewhere (such as in SharePoint) to a Docker container using a Dockerfile. What is the best way to add the files, and can someone explain the commands to be used? For example, one method I have come across is the following:
ADD http://example.com/big.tar.xz /usr/src/things/
Does the /usr/src/things/ specify the location where I want to save the folders (not individual files) with respect to my original repository?
The answer at Adding large files to docker during build covers the question at a high level. Can someone share details/commands for each step involved? One answer mentions not adding the files to the image but mounting them as a volume. Is that a better option than using ADD in the Dockerfile?
Thanks!
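A rough sketch of the two approaches being compared; the URL and paths below are placeholders rather than working examples:

# Option 1: bake the files into the image at build time (in the Dockerfile)
# note that ADD only auto-extracts local tar archives, not ones fetched from a URL
ADD http://example.com/big.tar.xz /usr/src/things/

# Option 2: keep the files on the host and mount them at run time
docker run -v /path/on/host/things:/usr/src/things myimage:latest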

Docker good practices. "data" in a volume vs image

I currently have an application that (for the sake of simplicity) just requires a .csv file. However, this file needs to be constructed with a script, let's call it create_db.py.
I have two images (let's call them API1 and API2) that require the .csv file, so I declared a 2-stage build and both images copy the .csv into their filesystem. This makes the Dockerfiles somewhat ugly, as API1 and API2 share the same first lines of the Dockerfile, plus there is no guarantee that both images have the same .csv, because it is constructed "on the fly".
I have two possible solutions to this problem:
First option:
Make a separate Docker image that executes create_db.py and then tag it as data:latest. Copy the .csv into API1 and API2 with:
FROM data:latest as datapipeline
FROM continuumio/miniconda3:4.7.12
...
...
COPY --from=datapipeline file.csv .
Then I will need to create a bash file to make sure data:latest is built (and up to date) before building API1 and API2 (a sketch of such a script is shown below).
Pros: Data can be pulled from a repository if you are on a different machine; no need to rebuild it again.
Cons: Every time I build API1 and API2 I need to make sure that data:latest is up to date. API1 and API2 require data:latest in order to be built.
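A minimal sketch of such a wrapper script (the Dockerfile names and tags are assumptions, not part of the original setup):

#!/bin/sh
set -e
# rebuild the data image first so API1/API2 copy an up-to-date .csv
docker build -t data:latest -f Dockerfile.data .
docker build -t api1:latest -f Dockerfile.api1 .
docker build -t api2:latest -f Dockerfile.api2 .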
Second option:
Create a data/ volume and an image that runs create_db.py, and mount the volume so the .csv ends up in data/. Then mount the volume into API1 and API2 (roughly as sketched below). I will also need some kind of mechanism that makes sure that data/ contains the required file.
Mounting volumes sounds like the right choice when dealing with shared data, but in this case I am not sure, because my data needs "to be built" before it can be used. Should I go with the first option then?
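A rough sketch of this volume variant (image and volume names are illustrative; assume create-db-image's entrypoint runs create_db.py and writes /data/file.csv):

# build the shared data once into a named volume
docker volume create csvdata
docker run --rm -v csvdata:/data create-db-image
# then mount the same volume into both APIs
docker run -d -v csvdata:/data api1:latest
docker run -d -v csvdata:/data api2:latest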
Chosen solution, thanks to @David Maze
What I ended up doing is separating the data pipeline into its own Docker image and then COPYing from that image in API1 and API2.
To make sure that API1 and API2 always have the latest "data image" version, the data pipeline calculates the hashes of all output files and then tries to do docker pull data:<HASH>. If the pull fails, it means that this version of the data is not in the registry, so the data image is tagged as both data:<HASH> and data:latest and pushed to the registry. This guarantees that data:latest always points to the last data pushed to the registry, and at the same time I can keep track of all the data:<HASH> versions.
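A sketch of that publish step as a shell script (the registry name is a placeholder, and a single combined hash is used here for brevity):

#!/bin/sh
set -e
HASH=$(cat output/*.csv | sha256sum | cut -c1-12)
# only build and push if this data version is not already in the registry
if ! docker pull registry.example.com/data:"$HASH"; then
  docker build -t registry.example.com/data:"$HASH" -f Dockerfile.data .
  docker tag registry.example.com/data:"$HASH" registry.example.com/data:latest
  docker push registry.example.com/data:"$HASH"
  docker push registry.example.com/data:latest
fi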
If it’s manageable size-wise, I’d prefer baking it into the image. There are two big reasons for this: it makes it possible to just docker run the image without any external host dependencies, and it works much better in cluster environments (Docker Swarm, Kubernetes), where sharing files can be problematic.
There are two more changes you can make to improve your proposed Dockerfile. You can pass the specific version of the dataset you’re using as an ARG, which helps in the situation where you need to build two copies of the image and need them to have the same dataset. You can also directly COPY --from= an image, without needing to declare it as a stage.
FROM continuumio/miniconda3:4.7.12
ARG data_version=latest
COPY --from=data:${data_version} file.csv .
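Building with a pinned dataset version would then look something like this (the tag value is only an example):

docker build --build-arg data_version=2021-06-01 -t api1 .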
I’d consider the volume approach only if the data file is really big (gigabytes). Docker images start to get unwieldy at that size, so if you have a well-defined auxiliary data set you can break out, that will help things run better. Another workable approach could be to store the data file somewhere remote, like an AWS S3 bucket, and download it at startup time (this adds some risk of startup-time failure and increases the startup time, but leaves the image able to start autonomously).

Creating MBTiles file with varied levels of detail using existing OpenMapTiles docker tasks?

I'm working hard to get up to speed with OpenMapTiles. The quickstart.sh script usually runs to completion so I've preferred it as a source of truth over the sometimes inconsistent documentation. Time to evolve.
What is the most efficient way to build an MBTiles file that contains, say, planet-level data for zooms 0-6 and bounded data for zooms 7-13, ideally for multiple bounded areas (e.g., a handful of metro areas)? It seems like a common use case during development. Can it be done with the existing Docker tools?
Did you try downloading an OSM file from http://download.geofabrik.de/index.html and placing it in the /data folder, as stated in quickstart.md (https://github.com/openmaptiles/openmaptiles/blob/master/QUICKSTART.md)?
Placing the osm.pbf file in your /data folder and adjusting the .env and openmaptiles.yaml files to your preferred zoom should help you with the next step.
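For example, the zoom range is controlled by variables in .env; a minimal sketch (the variable names are assumed from the openmaptiles repository, and the values are illustrative):

# zoom levels generated into the MBTiles file
MIN_ZOOM=0
MAX_ZOOM=7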
I'm not sure what you mean by the bounds.

What is Docker.qcow2?

I was looking at my disk with DaisyDisk and I have a 30GB something called Docker.qcow2. More specifically, DaisyDisk puts it under ~/Library/Containers/com.docker.docker/Data/vms/0/Docker.qcow2. Is it some kind of cache? Can I delete it? I have a lot of images from old projects that I won't ever use and I'm trying to clear up my disk.
The .qcow2 file is exposed to the VM as a block device with a maximum size of 64GiB by default. As new files are created in the filesystem by containers, new sectors are written to the block device. These new sectors are appended to the .qcow2 file causing it to grow in size, until it eventually becomes fully allocated. It stops growing when it hits this maximum size.
You can stop Docker and delete this file; however, deleting it will also remove all your containers and images. Docker will recreate this file on start.
If you stumbled upon this, you're probably not stoked about a 64 GB file. If you open Docker > Preferences, you can tone it down quite a bit to a more reasonable size. Doing this will delete the old .qcow2 file, and deleting that file will delete your containers, so be careful.
I've had the same issue. Instead of deleting the file or adjusting the size using the settings, simply use the following commands:
docker images
This will show all of the images on your system and the size of each image (you'd be surprised how quickly this can get out of hand).
docker image rm IMAGEID
This will remove the image with the ID that you can get from the images command above.
I use this method and it frees up a lot of disk space.
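For bulk cleanup, Docker's prune commands are another option worth knowing about; they remove data permanently, so review what they report first:

# show what is taking up space inside the VM disk
docker system df
# remove stopped containers, unused networks, dangling images, and build cache
docker system prune
# additionally remove all images not referenced by any container
docker image prune -a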
