How to get large amounts of data into a Docker image?

I want to upload a significant number of images for processing in a Docker instance.
From what I have observed, this is normally done with a download script (where the images are downloaded into the instance).
I have several terabytes of images, so I do not want to download them each time. Is there a way to get my images to a specific location in the Docker instance?
What is the standard way of doing this?
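For what it's worth, the usual way to avoid re-downloading a multi-terabyte dataset is to keep it on the host (or attached storage) and mount it into the container at run time rather than baking it into the image. A minimal sketch, where the host path, container path, and image name are placeholders:
# Mount the host directory read-only at /data inside the container.
docker run --rm -v /mnt/datasets/images:/data:ro my-processing-image python process.py --input /data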

Related

Increasing the storage space of a docker container on Windows to 2-3TB

Working on a Windows computer with 5 TB of available space. Working on building an application that processes large amounts of data and uses Docker containers to create replicable environments. Most of the data processing is done in parallel using many smaller Docker containers, but the final tool/container requires all the data to come together in one place. The output area is mounted to a volume, but most of the data is just copied into the container. This will be multiple TBs of storage space. RAM luckily isn't an issue in this case.
Willing to try any suggestions and make what changes I can.
Is this possible?
I've tried increasing disk space for docker using .wslconfig but this doesn't help.
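One commonly suggested alternative for this situation (a sketch only; the drive path and image name are placeholders, and the exact path syntax differs between PowerShell, cmd, and WSL shells) is to bind-mount the multi-TB data directory from the host drive instead of copying it into the container, so the data never has to fit inside Docker's WSL virtual disk:
# Mount D:\bigdata from the host into the container at /data.
docker run --rm -v D:\bigdata:/data final-tool:latest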

Docker: reducing parent images (how big is too big)?

Big newbie to Docker here, so forgive me if the question is unclear:
I am trying to build my own Jupyter notebook image, at the simplest level this is just:
FROM jupyter/scipy-notebook
RUN pip install keras
When I build this image (let's call it keras-notebook), I understand that I have 2 images locally: the parent image jupyter/scipy-notebook and my keras-notebook.
Unfortunately, this leaves me with two 4 GB+ images stored locally. How can I build my keras-notebook without having to keep a local jupyter/scipy-notebook?
The bigger question I also have is: how big is too big for a Docker image? Most people suggest a recommended image size in the hundreds of MB, and these images are in the range of GB, so is that (almost) defeating the point of containerizing this software?
If your base image is big, your custom image will be bigger.
There is no right or wrong way to build images, only what is optimal. To illustrate: think about the download time of a 4 GB image versus a 100 MB one that does the same thing; of course you will use the 100 MB one.
Making images smaller is a challenge for most image builders.
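Two stock Docker commands help show where the size actually comes from; note that the sizes reported by docker image ls count shared parent layers in each image even though those layers are stored only once on disk, so keras-notebook does not really cost a second 4 GB:
# List local images and the size Docker reports for each
docker image ls
# Break the parent image down by layer to see what is actually taking space
docker history jupyter/scipy-notebook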

Pulling docker images

Is there a way to manually download a Docker image?
I have a pretty slow Internet connection, and for me it is better to get a link to the image and download it elsewhere with a better Internet speed.
How can I get the direct URL of the image managed by docker pull?
It's possible to obtain that, but let me suggest two other ways!
If you can connect to a remote server with a fast connection, and that server can run Docker, you could docker pull on that server, then you can docker save to export an image (and all its layers and metadata) as a tarball, and transfer that tarball any way you like.
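A rough sketch of that save/load approach (the image name and tarball path are just examples):
# On the machine with the fast connection:
docker pull ubuntu:22.04
docker save -o ubuntu-22.04.tar ubuntu:22.04
# Move the tarball however you like (USB drive, rsync, ...), then on the slow side:
docker load -i ubuntu-22.04.tar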
If you want to transfer multiple images sharing a common base, the previous method won't be great, because you will end up transferring multiple tarballs sharing a lot of data. So another possibility is to run a private registry e.g. on a "movable" computer (laptop), connect it to the fast network, pull images, push images to the private registry; then move the laptop to the "slow" network, and pull images from it.
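And a sketch of the private-registry variant using the stock registry image (host name, port, and image tag are placeholders; a plain-HTTP registry like this typically has to be declared as an insecure registry on the client side):
# On the laptop while it is on the fast network:
docker run -d -p 5000:5000 --name registry registry:2
docker pull ubuntu:22.04
docker tag ubuntu:22.04 localhost:5000/ubuntu:22.04
docker push localhost:5000/ubuntu:22.04
# Later, from a machine on the slow network where the laptop is reachable as laptop.local:
docker pull laptop.local:5000/ubuntu:22.04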
If none of those solutions is acceptable for you, don't hesitate to give more details, we'll be happy to help!
You could pull down the individual layers with this:
https://github.com/samalba/docker-registry-debug
Use the curlme option.
Reassembling the layers into an image is left as an exercise for the reader.

Using EC2 to resize images stored on S3 on demand

We need to serve the same image in a number of possible sizes in our app. The library consists of tens of thousands of images which will be stored on S3, so storing the same image in all its possible sizes does not seem ideal. I have seen a few mentions on Google that EC2 could be used to resize S3 images on the fly, but I am struggling to find more information. Could anyone please point me in the direction of some more info or, ideally, some code samples?
Tip
It was not obvious to us at first, but never serve images to an app or website directly from S3; it is highly recommended to use CloudFront instead. There are three reasons:
Cost - CloudFront is cheaper
Performance - CloudFront is faster
Reliability - S3 will occasionally not serve a resource when queried frequently, i.e. more than 10-20 times a second. This took us ages to debug, as resources would randomly not be available.
The above are not necessarily failings of S3, as it is meant to be a storage service rather than a content delivery service.
Why not store all image sizes, assuming you aren't talking about hundreds of different possible sizes? Storage cost is minimal. You would also then be able to serve your images through CloudFront (or directly from S3), so that you don't have to use your application server to resize images on the fly. If you serve a lot of these images, the processing cost you save (i.e. CPU cycles, memory requirements, etc.) by not having to dynamically resize images and process image requests in your web server would likely offset the storage cost.
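A sketch of that pre-generation idea, assuming ImageMagick and the AWS CLI are available (bucket name, file names, and the chosen widths are all placeholders):
# Generate a few fixed widths and upload them alongside the original.
for w in 200 400 800; do
  convert original.jpg -resize "${w}x" "original_${w}.jpg"
  aws s3 cp "original_${w}.jpg" "s3://my-image-bucket/${w}/original.jpg"
done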
What you need is an image server. Yes, it can be hosted on EC2. These links should help you get started: https://github.com/adamdbradley/foresight.js/wiki/Server-Resizing-Images
http://en.wikipedia.org/wiki/Image_server

Cropping and resizing images on the fly with node.js

I run a node.js server on Amazon EC2. I am getting a huge CSV file with data containing links to product images on a remote host. I want to crop the images and store them in different sizes on Amazon S3.
How could this be done, preferably just with streams, without saving anything to disk?
I don't think you can get around saving the full-size image to disk temporarily, since resizing/cropping/etc. would normally require having the full image file. So, I say ImageMagick.
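For reference, an ImageMagick invocation for a combined resize-and-crop looks something like this (file names and the 300x300 target are placeholders):
# Scale so the image fills a 300x300 box, then center-crop to exactly 300x300.
convert input.jpg -resize 300x300^ -gravity center -extent 300x300 output.jpg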
