How to modify 'docker-compose-local.yml' so that it installs all requirements (needed to run an Amazon MWAA environment locally)

I am running aws-mwaa-local-runner in order to run a local Apache Airflow environment (in Docker for Windows).
However, after creating the container using ./mwaa-local-env start, I repeatedly get a Broken DAG ModuleNotFoundError. My /docker/config/requirements.txt file looks like the one in the repository (see here), although my file has a few more requirements that I need in it. When I compare my /docker/config/requirements.txt file with the output of the pip freeze command run in the Airflow container, I can see that the requirements I need for my DAGs are missing.
I tried to pip install my other requirements in the Airflow container, but to no avail.
Is there a way to modify the docker-compose-local.yml file so that it installs everything in my requirements.txt when creating the container (i.e. when running Airflow)?
Is there maybe something I might be missing? Any help or suggestion would be greatly appreciated.

Look at this: https://github.com/aws/aws-mwaa-local-runner. You should install the requirements file located in dags locally:
pip install -r requirements.txt

Add your extra requirements to dags/requirements.txt, not docker/config/requirements.txt. The former is installed every time you start the service, but the latter is only installed when you build or rebuild the image.
Additionally, keeping your added requirements separate is important because you will need to upload the list to your MWAA environment.
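For illustration, a minimal sketch of that workflow (the package name below is just a placeholder, and test-requirements is the helper the repository ships for validating the list, if your version of the local runner includes it):
# add the packages your DAGs import to dags/requirements.txt
echo "some-package==1.0.0" >> dags/requirements.txt
# optionally check that the list installs cleanly
./mwaa-local-env test-requirements
# restart so the requirements are installed at service start-up
./mwaa-local-env start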

Related

Why are dependencies installed via an ENTRYPOINT and tini?

I have a question regarding an implementation of a Dockerfile on dask-docker.
FROM continuumio/miniconda3:4.8.2
RUN conda install --yes \
    -c conda-forge \
    python==3.8 \
    [...]
    && rm -rf /opt/conda/pkgs
COPY prepare.sh /usr/bin/prepare.sh
RUN mkdir /opt/app
ENTRYPOINT ["tini", "-g", "--", "/usr/bin/prepare.sh"]
prepare.sh is just facilitating installation of additional packages via conda, pip and apt.
There are two things I don't get about that:
Why not just place those instructions in the Dockerfile? Possibly indirectly (modularized) by COPYing dedicated files (requirements.txt, environment.yaml, ...)
Why execute this via tini? At the end it does exec "$@", where one can start a scheduler or worker - that's more what I associate with tini.
This way, every time you run a container from the built image, you have to repeat the installation process!?
Maybe I'm overthinking it but it seems rather unusual - but maybe that's a Dockerfile pattern with good reasons for it.
optional bonus questions for Dask insiders:
Why copy prepare.sh to /usr/bin (instead of, e.g., to /tmp)?
What purpose does the created directory /opt/app serve?
It really depends on the nature and usage of the files being installed by the entry point script. In general, I like to break this down into a few categories:
1. Local files that are subject to frequent changes on the host system, and will be rolled into the final image for production release. This is for things like the source code for an application that is under development and needs to be tested in the container. You want these to be copied into the runtime every time the image is rebuilt. Use a COPY in the Dockerfile.
2. Files from other places that change frequently and/or are specific to the deployment environment. This is stuff like secrets from a Hashicorp vault, network settings, server configurations, etc. that will probably be downloaded into the container all the time, even when it goes into production. The entry point script should download these, and it should decide which files to get and from where based on environment variables that are injected by the host.
3. Libraries, executable programs (under /bin, /usr/local/bin, etc.), and things that specifically should not change except during a planned upgrade. Usually anything that is installed using pip, maven or some other program that does dependency management, and anything installed with apt-get or equivalent. These files should not be installed from the Dockerfile or from the entrypoint script. Much, much better is to build your base image with all of the dependencies already installed, and then use that image as the FROM source for further development. This has a number of advantages: it ensures a stable, centrally located starting platform that everyone can use for development and testing (it forces uniformity where it counts); it prevents you from hammering on the servers that host those libraries (constantly re-downloading all of those libraries from pypi.org is really bad form... someone has to pay for that bandwidth); it makes the build faster; and if you have a separate security team, this might help reduce the number of files they need to scan.
You are probably looking at #3, but I'm including all three since I think it's a helpful way to categorize things.
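As a rough sketch of that base-image pattern (the registry path and tag below are hypothetical), the heavy dependencies are baked into a base image once, and everyday builds simply start FROM it:
# Build the dependency-only base image once and push it to your registry
# (e.g. registry.example.com/team/dask-base:2021.05 - the name is made up);
# everyday Dockerfiles then start from it and only add the application:
FROM registry.example.com/team/dask-base:2021.05
COPY . /opt/app
CMD ["python", "/opt/app/app.py"]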

How can I update the composer version that is being used inside my ddev containers?

Currently my docker/ddev setup is running Composer version 1.10.6 2020-05-06 inside the container.
I would like to make the composer version inside the container be 1.10.7 2020-06-03.
I found one way to do it: ddev exec sudo composer self-update, but it's not permanent. The container reverts back to using 1.10.6 after a ddev restart.
In all of my searches, I can't find a way to update the files that create the container so they update composer permanently. I don't need it to attempt to update every time I start my container, I just need to be able to tell it now to permanently change over to the version I want.
An additional piece: adding RUN sudo composer self-update to the .ddev/web-build/Dockerfile makes it attempt to update every time, which is not ideal. I want to update when I'm ready, as I also need to update my test servers to match versions.
I added that command to my Dockerfile and it updated to 1.10.7. I removed the command from my Dockerfile so that it doesn't update every time I restart ddev. When I restarted ddev (without that command in the Dockerfile) it reverted composer back to 1.10.6.
Where is it getting the instructions to use that version? I need to find that and tell it to use 1.10.7 instead. I don't want it to update itself every time I do ddev restart.
It's not normally important, but you can add a .ddev/web-build/Dockerfile with these contents:
ARG BASE_IMAGE
FROM $BASE_IMAGE
RUN composer self-update
And your composer will be updated during the image build process.
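If you would rather pin a specific release than always jump to the latest one, composer self-update also accepts a version argument, so (as a sketch) the Dockerfile could read:
ARG BASE_IMAGE
FROM $BASE_IMAGE
# pin the exact Composer release you also run on your test servers
RUN composer self-update 1.10.7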
Randy's suggestion worked well for me; however, I've also found an alternative solution which involves less typing.
Read the project config.yaml and it explains how the Composer version can be changed.
This file is found in ~/yourprojectname/.ddev/config.yaml.
The first lines of the file are the configuration used and the remaining lines of the file explain the configuration alternatives available. Enjoy :)
# if composer_version:"" it will use the current ddev default composer release.
# It can also be set to "1", to get most recent composer v1
# or "2" for most recent composer v2.
# It can be set to any existing specific composer version.
# After first project 'ddev start' this will not be updated until it changes
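So, as a sketch, pinning the Composer version for the project is a one-line change in ~/yourprojectname/.ddev/config.yaml, applied on the next ddev restart:
# .ddev/config.yaml
composer_version: "2"   # or "1", or a specific release such as "1.10.7"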

Separating Docker files and application source files to optimize production environment

I have a bunch of (Ruby) scripts stored on a server. Up until now, my team has used them by opening an accessor app that launches a list of the script names, and they select the script they want to run in that instance on the files in their working folder. The scripts are run directly from the server, so updates made to the script files are automatically reflected when a user runs the script.
The scripts require a fair amount of specific dependencies, so I'm trying to move to a Docker-based workflow to eliminate the problems we encounter with incongruent computer environments. I've been able to successfully build an image with our script library and run an instance of it on my computer.
However, all of the documentation and tutorials include the application source files when building an image, so that all the files are copied over by the Dockerfile. From my understanding, this means that any time the code in the application files needs to be updated, all the users will need to rebuild the image before trying to run anything. I would very rarely ever need to make changes to the environment settings/dependencies, but the app code is changed relatively frequently, so it seems like having every user rebuild an image every single time a line of app code is changed would actually slow down everyone's workflow considerably.
My question is this: Is it not possible to have Docker simply create the environment that a user must have to run the applications, but have the applications themselves still run directly off the server where they were originally stored? And does a new container need to be created every single time a user wants to run any one of the scripts? (The users are not tech-savvy.)
Generally you'd do this by using a Docker image instead of the checked-out tree of scripts. You can use a Docker registry to store a built copy of the image somewhere on the network; Docker Hub works for this, most large public-cloud providers have some version of this (AWS ECR, Google GCR, Azure ACR, ...), or you can run your own. The workflow for using this would generally look like
# Get any updates to the "latest" version of the image
# (can be run infrequently)
docker pull ourorg/scripts
# Actually run the script, injecting config files and credentials
docker run --rm \
-v $PWD/config:/config \
-v $HOME/.ssh:/config/.ssh \
ourorg/scripts \
some_script.rb
# Nothing in this example actually requires a local copy of the scripts
I'm envisioning a directory that has kind of a mix of scripts and support files and not a lot of organization to it. Still, you could write a simple Dockerfile that looks like
FROM ruby:2.7
WORKDIR /opt/scripts
# As of Bundler 2.1, there is no compatibility between Bundler
# versions; this must match exactly what is in Gemfile.lock
RUN gem install bundler -v 2.1.4
# Copy the scripts in and do basic installation
COPY Gemfile Gemfile.lock ./
RUN bundle install
COPY . .
ENV PATH /opt/scripts:$PATH
# Prefix all commands with...
ENTRYPOINT ["bundle", "exec"]
# The default command to run is...
CMD ["ls"]
On the back end you'd need a continuous integration service (Jenkins is popular, if a little unwieldy; there is a large selection of cloud-hosted ones) that can rebuild the Docker image whenever there's a commit to the source repository. You can generally rig this up so that it happens automatically whenever anybody pushes anything.
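The job the CI service runs boils down to a couple of commands (the tag naming is up to you; ourorg/scripts matches the example above):
# executed by the CI service on every push to the scripts repository
docker build -t ourorg/scripts:latest .
docker push ourorg/scripts:latest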
This process makes more sense if most people are just using the set of scripts and only a few of them are developing them. It's also a little bit difficult to discover what the scripts are (you might be able to docker run --rm ourorg/scripts ls, though).
Is it not possible to have Docker simply create the environment that a user must have to run the applications, but have the applications themselves still run directly off the server where they were originally stored?
This always strikes me as an ineffective use of Docker. You have all of the fiddly steps of your current workflow that require everyone to run a git pull or equivalent routinely, but you also have to inject the host source tree into the container. If there are OS incompatibilities in, for example, native gems in the vendor tree, you have to work around that.
# You still need to do this periodically
git pull
# And you also need to
sudo docker run \
--rm \
-v $PWD:/app \
-v $HOME/config:/config \
-v $HOME/.ssh:/config/.ssh \
-w /app \
ruby:2.7 \
bundle exec ./some_script.rb
Some of these details (especially the config file and credentials) you'd have to deal with even if you did build an image; some other details you could improve by building an image. Inside the image you need to correct the ownership and permissions on the ssh keys and replace the $PWD/vendor tree with something the container can run, without modifying the mounted host directories.
Is it not possible to have Docker simply create the environment that a user must have to run the applications, but have the applications themselves still run directly off the server where they were originally stored?
You can build an image with all the environment already installed then mount the directory with the scripts so the container can read the scripts from the host. Something like
docker run -it --rm -v /opt/myscripts:/myscripts myimage somescript.rb
Then your image Dockerfile would end with:
WORKDIR /myscripts
ENTRYPOINT ["/usr/bin/ruby"]
And does a new container need to be created every single time a user wants to run any one of the scripts?
Of course. A container is just an isolated process managed by Docker, and you could make a wrapper so the users wouldn't need to type the full docker run command.
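For example, a small wrapper script (the name and paths here are just placeholders) could hide the docker invocation from them:
#!/bin/sh
# runscript - lets users type "./runscript somescript.rb" instead of the full docker command
exec docker run -it --rm -v /opt/myscripts:/myscripts myimage "$@"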

Docker-compose for local development, installing dependencies

I am trying to set up a docker-compose architecture for local development and production, and I can't figure out at which point in the container's life it is best to install library dependencies. At the same time, I am not sure whether these should be placed in the container or in an external volume.
All my code is mounted in external volumes, so that changes are immediately taken into account without rebuilding the containers, but I am not sure about libraries that need to be installed by pip (I am running a Python backend) and npm/yarn (for the webpack front-end).
Placing requirements.txt and package.json into the containers and running pip install and yarn install in the container build process means that I have to rebuild the container any time the dependencies change - that is too much overhead.
Putting them in an external volume and running pip install and yarn install as part of the command of each container when it is started seems to solve the issue.
The build process of each container then contains only platform dependencies (e.g. installing Python, webpack or other platform tools), but libraries are installed after the container has started (with the CMD directive).
Is this the correct approach? I have seen a lot of examples doing exactly the opposite and running npm install in the build process of the container - but I don't see any advantage in that, am I missing something?
Installing dependencies is usually part of the build process. Mounting code is a good trick when developing in order to get changes directly reflected.
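As an illustration (the service name and paths are made up), a development docker-compose.yml would mount only the source code, while the installed libraries stay inside the image built from the Dockerfile:
version: "3"
services:
  backend:
    build: ./backend             # dependencies are installed here, at build time
    volumes:
      - ./backend/src:/app/src   # code changes show up without a rebuild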
Concerning adding requirements.txt or package.json: installing dependencies takes time, and for that you need to take advantage of Docker layer caching. In particular, you want to avoid cache invalidation.
For pip I suggest the following in the development phase: for dependencies that you are unlikely to change, install these in a separate RUN instruction. Your Dockerfile will look something like:
FROM ..
RUN pip install package1 package2 package3 ...
ADD requirements.txt requirements.txt
RUN pip install -r requirements.txt
...
Keep only the dependencies that might change in requirements.txt. Once you are done developing, add the other packages back into requirements.txt and build using the requirements file.
A similar approach would be adding two requirements files, and at the end combining them.
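A sketch of that two-file variant (the file name requirements-base.txt is just an example, as is the base image):
FROM python:3.8
# rarely-changing dependencies - this layer stays cached
COPY requirements-base.txt .
RUN pip install -r requirements-base.txt
# frequently-changing dependencies - only this layer is rebuilt when they change
COPY requirements.txt .
RUN pip install -r requirements.txt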

Development dependencies in Dockerfile or separate Dockerfiles for production and testing

I'm not sure if I should create different Dockerfile files for my Node.js app. One for production without the development dependencies and one for testing with the development dependencies included.
Or one file, which is basically the development Dockerfile.dev. The main difference between the two files is the npm install command:
Production:
FROM ...
...
RUN npm install --quiet --production
...
CMD ...
Development/Test:
FROM ...
...
RUN npm install
...
CMD ...
The question arises because I want to be able to run my tests inside the container via docker run command. Therefore I need the test dependencies (typically dev dependencies for me).
It seems a little bit odd to put dependencies that are not needed in production into the image. On the other hand, creating/maintaining a second Dockerfile.dev with just minor differences also doesn't seem right. So what is a good practice for this kind of problem?
No, you don't need to have different Dockerfiles and in fact you should avoid that.
The goal of Docker is to ship your app as an immutable, well-tested artifact (a Docker image) which is identical for production and test and even dev.
Why? Because if you build different artifacts for test and production, how can you guarantee that what you have already tested also works in production? You can't, because they are two different things.
Given all that, if by test you mean unit tests, then you can mount your source code inside the Docker container and run the tests without building any Docker images. And that's fine. Remember, you can build an image for tests, but that is terribly slow and makes development quite difficult, which is not good at all. Then, if your tests pass, you can build your app container safely.
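For instance, a rough sketch of running the unit tests against mounted source code (the image tag and test command are assumptions about your setup):
docker run --rm -v "$PWD":/app -w /app node:14 sh -c "npm install && npm test"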
But if you mean acceptance tests that actually need to run against your running application, then you should create one image for your app (only one), run the tests in another container (mount the test source code, for example), and run them against that first container. This obviously means that what you build for your app differs from the npm install you need for your tests.
I hope this gives you some overview.
Well, then you'll have to support several Dockerfiles that are almost identical. Instead, I recommend using a Node.js feature like the production profile. And one more recommendation regarding
RUN npm install --quiet --production
It is better to create a separate .sh file and do something like this instead:
ADD ./scripts/run.sh /run.sh
RUN chmod +x /*.sh
And also think about starting to use Gulp.
UPD #1
By default, npm install installs devDependencies. In order to get around this, use npm install --production OR set the NODE_ENV environment variable to production.
Putting the script line in a separate file is a good practice so that you don't have to change the Dockerfile often. If you need changes next time, you'll only have to update the script file and you're done. In the future you can also add additional steps to it.
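As a sketch of what such a run.sh might contain (the NODE_ENV switch is an assumption about how you want to pick the profile, not something prescribed above):
#!/bin/sh
# run.sh - install the dependency profile that matches the environment
if [ "$NODE_ENV" = "production" ]; then
    npm install --quiet --production
else
    npm install
fi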

Resources