mounting folder in singularity - which directory to mount - docker

My input files for the Singularity program are not recognized (not found), which I think is because the directory is not mounted within Singularity.
I know that mounts can be set with the command below, but I am not sure which folders to mount.
export SINGULARITY_BIND="/somefolder:/somefolder"
How do I know which folders should be before and after the ":" sign in SINGULARITY_BIND?
I have set:
SINGULARITY_CACHEDIR=/mnt/scratch/username/software
which is where my Singularity images are stored.
My complete script:
export SINGULARITY_CACHEDIR=/mnt/scratch/username/software
export SINGULARITY_BIND="/home/username:/mnt/scratch/username"
OUTPUT_DIR="${PWD}/quickstart-output"
INPUT_DIR="${PWD}/quickstart-testdata"
BIN_VERSION="1.4.0"
# Run DeepVariant.
singularity run \
    docker://google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \ **Replace this string with exactly one of the following [WGS,WES,PACBIO,HYBRID_PACBIO_ILLUMINA]**
    --ref="${INPUT_DIR}"/ucsc.hg19.chr20.unittest.fasta \
    --reads="${INPUT_DIR}"/NA12878_S1.chr20.10_10p1mb.bam \
    --regions "chr20:10,000,000-10,010,000" \
    --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
    --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
    --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \ **Optional.
    --num_shards=1 \ **How many cores the `make_examples` step uses. Change it to the number of CPU cores you have.**
My error:
singularity.clean.sh: line 23: --ref=/home/username/scratch/username/software/quickstart-testdata/ucsc.hg19.chr20.unittest.fasta: No such file or directory

If you want to bind mount your current working directory, you can use --bind $(pwd). If you want to bind mount your home directory, you can use --bind $HOME (note that singularity mounts the home directory by default). See the Singularity documentation for more information.
Based on your INPUT_DIR and OUTPUT_DIR, it seems like you can bind mount your current working directory. To do that, use --bind $(pwd). Note this argument goes before the name of the singularity container.
Just to be safe, also use --pwd $(pwd) to set the working directory in the container as the current working directory on the host.
OUTPUT_DIR="${PWD}/quickstart-output"
INPUT_DIR="${PWD}/quickstart-testdata"
BIN_VERSION="1.4.0"
singularity run --bind $(pwd) --pwd $(pwd) \
    docker://google/deepvariant:"${BIN_VERSION}" \
    /opt/deepvariant/bin/run_deepvariant \
    --model_type=WGS \
    --ref="${INPUT_DIR}/ucsc.hg19.chr20.unittest.fasta" \
    --reads="${INPUT_DIR}/NA12878_S1.chr20.10_10p1mb.bam" \
    --regions "chr20:10,000,000-10,010,000" \
    --output_vcf="${OUTPUT_DIR}/output.vcf.gz" \
    --output_gvcf="${OUTPUT_DIR}/output.g.vcf.gz" \
    --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \
    --num_shards=1
The syntax of the --bind argument is path-on-host:path-in-container. Using --bind path is shorthand for --bind path:path, which means the source path is mounted at the same path in the container. This can be very useful, because you do not have to rewrite paths or think in terms of the container's directories.
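If you do want to bind specific directories rather than just the current one, SINGULARITY_BIND accepts a comma-separated list of host-path[:container-path] pairs, and omitting the container path mounts the host path at the same location inside the container. A sketch using the scratch area from your script (home is already mounted by default):

export SINGULARITY_BIND="/mnt/scratch/username"

With that in place, anything under /mnt/scratch/username on the host is visible at the identical path inside the container, so none of the paths on your command line need rewriting.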

Related

Avoiding duplicated arguments when running a Docker container

I have a TensorFlow training script which I want to run in a Docker container (based on the official TF GPU image). Although everything works just fine, running the container with the script is horribly verbose and ugly. The main problem is that my training script allows the user to specify various directories used during training, for input data, logging, generating output, etc. I don't want to change what my users are used to, so the container needs to be informed of the location of these user-defined directories, so it can mount them. So I end up with something like this:
docker run \
    -it --rm --gpus all -d \
    --mount type=bind,source=/home/guest/datasets/my-dataset,target=/datasets/my-dataset \
    --mount type=bind,source=/home/guest/my-scripts/config.json,target=/config.json \
    -v /home/guest/my-scripts/logdir:/logdir \
    -v /home/guest/my-scripts/generated:/generated \
    train-image \
    python train.py \
        --data_dir /datasets/my-dataset \
        --gpu 0 \
        --logdir ./logdir \
        --output ./generated \
        --config_file ./config.json \
        --num_epochs 250 \
        --batch_size 128 \
        --checkpoint_every 5 \
        --generate True \
        --resume False
In the above I am mounting a dataset from the host into the container, and also mounting a single config file config.json (which configures the TF model). I specify a logging directory logdir and an output directory generated as volumes. Each of these resources is also passed as a parameter to the train.py script.
This is all very ugly, but I can't see another way of doing it. Of course I could put all this in a shell script and provide command-line arguments which set these duplicated values from the outside. But that doesn't seem like a nice solution, because if I wanted to do anything else with the container, for example check the logs, I would have to fall back to the raw docker command.
I suspect this question will likely be tagged as opinion-based, but I've not found a good solution for this that I can recommend to my users.
As user Ron van der Heijden points out, one solution is to use docker-compose in combination with environment variables defined in a .env file. Nice answer.
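A minimal sketch of that approach (the variable names and compose file layout are illustrative, not from the original setup):

# .env -- the only file users need to edit
DATA_DIR=/home/guest/datasets/my-dataset
LOG_DIR=/home/guest/my-scripts/logdir
OUT_DIR=/home/guest/my-scripts/generated

# docker-compose.yml
services:
  train:
    image: train-image
    volumes:
      - ${DATA_DIR}:/datasets/my-dataset
      - ${LOG_DIR}:/logdir
      - ${OUT_DIR}:/generated
    command: python train.py --data_dir /datasets/my-dataset --logdir /logdir --output /generated

docker-compose reads the .env file automatically, so the host paths live in one place while the container-side paths stay fixed, and commands like docker-compose logs keep working without repeating any of them.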

Unable to set environment variable inside docker container when calling sh file from Dockerfile CMD

I am following this link to create a Spark cluster. I am able to run the Spark cluster. However, I have to give an absolute path to start spark-shell. I am trying to set environment variables, i.e. PATH and a few others, in start-spark.sh. However, they are not being set inside the container. I tried printing them using printenv inside the container, but these variables are never reflected.
Am I trying to set environment variables incorrectly? Spark cluster is running successfully though.
I am using docker-compose.yml to build and recreate an image and container.
docker-compose up --build
Dockerfile
# builder step used to download and configure spark environment
FROM openjdk:11.0.11-jre-slim-buster as builder
# Add Dependencies for PySpark
RUN apt-get update && apt-get install -y curl vim wget software-properties-common ssh net-tools ca-certificates python3 python3-pip python3-numpy python3-matplotlib python3-scipy python3-pandas python3-simpy
# JDBC driver download and install
ADD https://go.microsoft.com/fwlink/?linkid=2168494 /usr/share/java
RUN update-alternatives --install "/usr/bin/python" "python" "$(which python3)" 1
# Fix the value of PYTHONHASHSEED
# Note: this is needed when you use Python 3.3 or greater
ENV SPARK_VERSION=3.1.2 \
    HADOOP_VERSION=3.2 \
    SPARK_HOME=/opt/spark \
    PYTHONHASHSEED=1
# Download and uncompress spark from the apache archive
RUN wget --no-verbose -O apache-spark.tgz "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    && mkdir -p ${SPARK_HOME} \
    && tar -xf apache-spark.tgz -C ${SPARK_HOME} --strip-components=1 \
    && rm apache-spark.tgz
My Dockerfile-spark
When SPARK_BIN="${SPARK_HOME}/bin/" is set under ENV in the Dockerfile, the environment variable does get set, and it is visible inside the docker container by using printenv.
FROM apache-spark
WORKDIR ${SPARK_HOME}
ENV SPARK_MASTER_PORT=7077 \
    SPARK_MASTER_WEBUI_PORT=8080 \
    SPARK_LOG_DIR=${SPARK_HOME}/logs \
    SPARK_MASTER_LOG=${SPARK_HOME}/logs/spark-master.out \
    SPARK_WORKER_LOG=${SPARK_HOME}/logs/spark-worker.out \
    SPARK_WORKER_WEBUI_PORT=8080 \
    SPARK_MASTER="spark://spark-master:7077" \
    SPARK_WORKLOAD="master"
COPY start-spark.sh /
CMD ["/bin/bash", "/start-spark.sh"]
start-spark.sh
#!/bin/bash
. "$SPARK_HOME/bin/load-spark-env.sh"
export SPARK_BIN="${SPARK_HOME}/bin/" # This doesn't work here
export PATH="${SPARK_HOME}/bin/:${PATH}" # This doesn't work here
# When the spark work_load is master run class org.apache.spark.deploy.master.Master
if [ "$SPARK_WORKLOAD" == "master" ];
then
export SPARK_MASTER_HOST=`hostname` # This works here
cd $SPARK_BIN && ./spark-class org.apache.spark.deploy.master.Master \
    --ip $SPARK_MASTER_HOST --port $SPARK_MASTER_PORT \
    --webui-port $SPARK_MASTER_WEBUI_PORT >> $SPARK_MASTER_LOG
fi
My File structure is
dockerfile
dockerfile-spark # this uses pre-built image created by dockerfile
start-spark.sh # invoked by dockerfile-spark
docker-compose.yml # uses build parameter to build an image from dockerfile-spark
From inside the master container
root@3abbd4508121:/opt/spark# export
declare -x HADOOP_VERSION="3.2"
declare -x HOME="/root"
declare -x HOSTNAME="3abbd4508121"
declare -x JAVA_HOME="/usr/local/openjdk-11"
declare -x JAVA_VERSION="11.0.11+9"
declare -x LANG="C.UTF-8"
declare -x OLDPWD
declare -x PATH="/usr/local/openjdk-11/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
declare -x PWD="/opt/spark"
declare -x PYTHONHASHSEED="1"
declare -x SHLVL="1"
declare -x SPARK_HOME="/opt/spark"
declare -x SPARK_LOCAL_IP="spark-master"
declare -x SPARK_LOG_DIR="/opt/spark/logs"
declare -x SPARK_MASTER="spark://spark-master:7077"
declare -x SPARK_MASTER_LOG="/opt/spark/logs/spark-master.out"
declare -x SPARK_MASTER_PORT="7077"
declare -x SPARK_MASTER_WEBUI_PORT="8080"
declare -x SPARK_VERSION="3.1.2"
declare -x SPARK_WORKER_LOG="/opt/spark/logs/spark-worker.out"
declare -x SPARK_WORKER_WEBUI_PORT="8080"
declare -x SPARK_WORKLOAD="master"
declare -x TERM="xterm"
root@3abbd4508121:/opt/spark#
There are a couple of different ways to set environment variables in Docker, and a couple of different ways to run processes. A container normally runs one process, which is controlled by the image's ENTRYPOINT and CMD settings. If you docker exec a second process in the container, that does not run as a child process of the main process, and will not see environment variables that are set by that main process.
In the setup you show here, the start-spark.sh script is the main container process (it is the image's CMD). If you docker exec your-container printenv, it will see things set in the Dockerfile but not things set in this script.
Things like filesystem paths will generally be fixed every time you run the container, no matter what command you're running there, so you can specify these in the Dockerfile (split across two ENV instructions, since an ENV assignment cannot reference a variable set earlier in the same instruction):
ENV SPARK_BIN=${SPARK_HOME}/bin
ENV PATH=${SPARK_BIN}:${PATH}
You can specify both an ENTRYPOINT and a CMD in your Dockerfile; if you do, the CMD is passed as arguments to the ENTRYPOINT. This leads to a useful pattern where the CMD is a standard shell command, and the ENTRYPOINT is a wrapper that does first-time setup and then runs it. You can split your script into two:
#!/bin/sh
# spark-env.sh
. "${SPARK_BIN}/load-spark-env.sh"
exec "$@"
#!/bin/sh
# start-spark.sh
spark-class org.apache.spark.deploy.master.Master \
--ip "$SPARK_MASTER_HOST" \
--port "$SPARK_MASTER_PORT" \
--webui-port "$SPARK_MASTER_WEBUI_PORT"
Then in your Dockerfile specify both parts
COPY spark-env.sh start-spark.sh /
# ENTRYPOINT must use the JSON-array syntax
ENTRYPOINT ["/spark-env.sh"]
# the default CMD; any other valid CMD works too
CMD ["/start-spark.sh"]
This is useful for debugging, since it's straightforward to override the CMD in a docker run or docker-compose run instruction while leaving the ENTRYPOINT in place.
docker-compose run spark \
printenv
This launches a new container based on all of the same Dockerfile setup. When it runs, it runs printenv instead of the CMD in the image. This will do the first-time setup in the ENTRYPOINT script, and then the final exec "$@" line will run printenv instead of starting the Spark application. This will show you the environment the application will have when it starts.

Accessing `docker run` arguments from the Dockerfile

I run an image like that:
docker run <image_name> <config_file>
where config_file is the path to a JSON file which contains the configuration of my application.
Inside the Dockerfile, I do
ENTRYPOINT ["uwsgi", \
"--log-encoder", "json {\"msg\":\"${msg}\"}\n", \
"--http", ":80", \
"--master", \
"--wsgi-file", "app.py", \
"--callable", "app", \
"--threads", "10", \
"--pyargv"]
At the same time, I would like to access some of the values in the configuration file in the Dockerfile. For example to configure the JSON log encoder of uWSGI.
How can I do that?
This is impossible; the Dockerfile is only processed when the image is built, while the argument is only known when the container runs. See the comment by @David Maze.
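Anything derived from the runtime config file therefore has to be computed when the container starts, typically in an entrypoint wrapper. A sketch (the jq call and the .msg key are assumptions, not part of the original setup):

#!/bin/sh
# entrypoint.sh -- receives the config path from `docker run <image_name> <config_file>`
CONFIG_FILE="$1"
# pull a value out of the runtime config; assumes jq is installed in the image
MSG="$(jq -r '.msg' "$CONFIG_FILE")"
exec uwsgi \
    --log-encoder "json {\"msg\":\"${MSG}\"}" \
    --http :80 --master \
    --wsgi-file app.py --callable app --threads 10 \
    --pyargv "$CONFIG_FILE"

With ENTRYPOINT ["/entrypoint.sh"] in the Dockerfile, the same docker run invocation keeps working, but the value is resolved at start-up rather than at build time.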

How to export an environment variable to a docker image?

I can define "static" environment variables in a Dockerfile with ENV, but is it possible to pass some value at build time to this variable? I'm attempting something like this, which doesn't work:
FROM phusion/baseimage
RUN mkdir -p /foo/2016/bin && \
    FOOPATH=`ls -d /foo/20*/bin` && \
    export FOOPATH
ENV PATH $PATH:$FOOPATH
Of course, in the real use case I'd be running/unpacking something that creates a directory whose name will change with different versions, dates, etc., and I'd like to avoid modifying the Dockerfile every time the directory name changes.
Edit: Since it appears it's not possible, the best workaround so far is using a symlink:
FROM phusion/baseimage
RUN mkdir -p /foo/2016/bin && \
    FOOPATH=`ls -d /foo/20*/bin` && \
    ln -s $FOOPATH /mypath
ENV PATH $PATH:/mypath
To pass a value in at build time, use an ARG.
FROM phusion/baseimage
RUN mkdir -p /foo/2016/bin && \
    FOOPATH=`ls -d /foo/20*/bin` && \
    export FOOPATH
ARG FOOPATH
ENV PATH $PATH:${FOOPATH}
Then you can run docker build --build-arg FOOPATH=/dir -t myimage .
Edit: from your comment, my answer above won't solve your issue. There's nothing in the Dockerfile you can update from the output of a RUN command; the output isn't parsed, only the resulting filesystem is saved. For this, I think you're best off having your RUN command write the path to a file in the image, and reading it in from your /etc/profile or a custom entrypoint script. That depends on how you want to launch your container and the base image.
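A sketch of that workaround (the /etc/foopath file and the entrypoint script are assumptions, following the suggestion above):

FROM phusion/baseimage
# at build time, record the discovered path in a file inside the image
RUN mkdir -p /foo/2016/bin && \
    ls -d /foo/20*/bin > /etc/foopath
COPY entrypoint.sh /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]

#!/bin/sh
# entrypoint.sh -- read the recorded path back and extend PATH at run time
PATH="$PATH:$(cat /etc/foopath)"
export PATH
exec "$@"

Because the entrypoint ends with exec "$@", whatever command the container is started with still runs as the main process, now with the extended PATH.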

Mount directory to docker container while using CWL (Common Workflow Language)

I'm trying to get a bunch of very big files into a docker container while using CWL. When using the default method of file-inputs via
job.yml:
input_file:
  class: File
  path: /home/ubuntu/data/bigfile.zip
the CWL runner somehow copies the file and gets stuck. Is there an easy way of just mounting a directory directly into a docker container?
task.cwl:
cwlVersion: cwl:draft-3
class: CommandLineTool
baseCommand: run.sh
hints:
  - class: DockerRequirement
    dockerImageId: name123
inputs:
  - id: input_file
    type: File
    inputBinding:
      position: 1
outputs: []
Thanks in advance!
The CWL user guide has an example for how to do this: https://www.commonwl.org/user_guide/15-staging/index.html
You use InitialWorkDirRequirement and add the input file to the list of files to be staged in the working directory, like so:
cwlVersion: v1.0
class: CommandLineTool
baseCommand: cat
hints:
  DockerRequirement:
    dockerPull: alpine
inputs:
  in1:
    type: File
    inputBinding:
      position: 1
      valueFrom: $(self.basename)
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.in1)
outputs:
  out1: stdout
When you run this, say with the CWL reference runner (cwltool), you can see that the input file is mounted directly in the working directory (but safely, in read-only mode):
[job step-staging.cwl] /private/tmp/docker_tmpIaCJQ8$ docker \
run \
-i \
--volume=/private/tmp/docker_tmpIaCJQ8:/XMOiku:rw \
--volume=/private/tmp/docker_tmpW2RR3v:/tmp:rw \
--volume=/Users/kghose/Work/code/conditional/runif-examples/wf1.cwl:/XMOiku/wf1.cwl:ro \
--workdir=/XMOiku \
--read-only=true \
--log-driver=none \
--user=501:20 \
--rm \
--env=TMPDIR=/tmp \
--env=HOME=/XMOiku \
--cidfile=/private/tmp/docker_tmpdV6afe/20190502114327-207989.cid \
alpine \
cat \
wf1.cwl > /private/tmp/docker_tmpIaCJQ8/f3c708b20abf7fbf7f089060ec071c0956eb0cfd
However, as @TheDudeAbides says, the behavior of CWL 1.0 is to mount files rather than copy them. So even if you did not stage them, they would be mounted to make them available to the container, just in a different directory. This is how cwltool, toil, and the SBG platform work.
