Airflow - failing XCOM push when using Alpine image - docker

I want to run a KubernetesPodOperator in Airflow that reads a file and sends its content to XCom.
The definition looks like:
read_file = DefaultKubernetesPodOperator(
    image = 'alpine:3.16',
    cmds = ['bash', '-cx'],
    arguments = ['cat file.json >> /airflow/xcom/return.json'],
    name = 'some-name',
    task_id = 'some_name',
    do_xcom_push = True,
    image_pull_policy = 'IfNotPresent',
)
but I am getting: INFO - stderr from command: cat: can't open '/***/xcom/return.json': No such file or directory
With ubuntu:22.04 it works, but I want to make it faster by using a smaller (Alpine) image. Why does it not work with Alpine, and how can I overcome that?
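A likely cause, though the thread does not confirm it: alpine:3.16 ships only BusyBox sh, not bash, so the ['bash', '-cx'] entrypoint never starts and nothing gets written to /airflow/xcom. A minimal sketch of the same task using sh instead (DefaultKubernetesPodOperator is the wrapper from the question):
read_file = DefaultKubernetesPodOperator(
    image = 'alpine:3.16',
    # Alpine has no bash; BusyBox provides /bin/sh
    cmds = ['sh', '-c'],
    arguments = ['cat file.json > /airflow/xcom/return.json'],
    name = 'some-name',
    task_id = 'some_name',
    do_xcom_push = True,
    image_pull_policy = 'IfNotPresent',
)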

Related

error when pulling a docker container using singularity in nextflow

I am making a very short workflow in which I use a tool called salmon for my analysis.
On the HPC that I am working on, I cannot install this tool, so I decided to pull the container from BioContainers.
On the HPC we do not have Docker installed (I also do not have permission to install it), but we have Singularity instead.
So I have to pull the Docker container (from: quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0) using Singularity.
The workflow management system that I am working with is Nextflow.
This is the short workflow I made (index.nf):
#!/usr/bin/env nextflow
nextflow.preview.dsl=2

container = 'quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0'
shell = ['/bin/bash', '-euo', 'pipefail']

process INDEX {
    script:
    """
    salmon index \
        -t /hpc/genome/gencode.v39.transcripts.fa \
        -i index \
    """
}

workflow {
    INDEX()
}
I run it using this command:
nextflow run index.nf -resume
But I got this error:
salmon: command not found
Do you know how I can fix the issue?
You are so close! All you need to do is move these directives into your nextflow.config or declare them at the top of your process body:
container = 'quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0'
shell = ['/bin/bash', '-euo', 'pipefail']
My preference is to use a process selector to assign the container directive. So for example, your nextflow.config might look like:
process {
    shell = ['/bin/bash', '-euo', 'pipefail']

    withName: INDEX {
        container = 'quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0'
    }
}

singularity {
    enabled = true

    // not strictly necessary, but highly recommended
    cacheDir = '/path/to/singularity/cache'
}
And your index.nf might then look like:
nextflow.enable.dsl=2

params.transcripts = '/hpc/genome/gencode.v39.transcripts.fa'

process INDEX {
    input:
    path fasta

    output:
    path 'index'

    """
    salmon index \\
        -t "${fasta}" \\
        -i index
    """
}

workflow {
    transcripts = file( params.transcripts )
    INDEX( transcripts )
}
If run using:
nextflow run -ansi-log false index.nf
You should see the following results:
N E X T F L O W ~ version 21.04.3
Launching `index.nf` [clever_bassi] - revision: d235de22c4
Pulling Singularity image docker://quay.io/biocontainers/salmon:1.2.1--hf69c8f4_0 [cache /path/to/singularity/cache/quay.io-biocontainers-salmon-1.2.1--hf69c8f4_0.img]
[8a/279df4] Submitted process > INDEX

How to use Bazel rules_docker container_flatten to create a Docker image?

I'd like to slim down a Debian 10 Docker image using Bazel and then flatten the result into a single-layer image.
Here's the code I have:
load("#io_bazel_rules_docker//container:container.bzl", "container_image", "container_flatten", "container_bundle", "container_import")
load("#io_bazel_rules_docker//docker/util:run.bzl", "container_run_and_commit")
container_run_and_commit(
name = "debian10_slimmed_layers",
commands = ["rm -rf /usr/share/man/*"],
image = "#debian10//image",
)
# Create an image just so we can flatten it.
container_image(
name = "debian10_slimmed_image",
base = ":debian10_slimmed_layers",
)
# Flatten the layers to a single layer.
container_flatten(
name = "debian10_flatten",
image = ":debian10_slimmed_image",
)
Where I'm stuck is I can't figure out how to use the output of debian10_flatten to produce a runnable Docker image.
I tried:
container_image(
    name = "debian10",
    base = ":debian10_flatten",
)
That fails with:
2021/06/27 13:16:25 Unable to load docker image from bazel-out/k8-fastbuild/bin/debian10_flatten.tar:
file manifest.json not found in tar
container_flatten gives you the filesystem tarball. You need to add the tarball as tars in debian10, instead of base:
container_image(
    name = "debian10",
    tars = [":debian10_flatten.tar"],
)
base is for another container_image rule (or equivalent). If you had a docker save-style tarball, container_load would be the way to get the container_image equivalent.
I figured this out looking at the implementation in container/flatten.bzl. The docs could definitely use some improvements if somebody wants to open a PR (they're generated from the python-style docstring in that .bzl file).
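For completeness, a minimal sketch of the container_load route mentioned above; container_load is a repository rule, so it lives in the WORKSPACE, and the tarball path here is hypothetical:
# WORKSPACE: import a docker save-style tarball as an image target (hypothetical path)
load("@io_bazel_rules_docker//container:container.bzl", "container_load")

container_load(
    name = "loaded_base",
    file = "//images:saved_image.tar",
)
A BUILD file could then consume it as base = "@loaded_base//image".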

starting container process caused "exec: \"/tmp/installer.sh\": permission denied"

I have a base image (named @release_docker//image) and I'm trying to install some apt packages on it (alongside my built binary). Here is what it looks like:
load("#io_bazel_rules_docker//docker/package_managers:download_pkgs.bzl", "download_pkgs")
load("#io_bazel_rules_docker//docker/package_managers:install_pkgs.bzl", "install_pkgs")
download_pkgs(
name = "downloaded-packages",
image_tar = "#release_docker//image",
packages = [
"numactl",
"pciutils",
"python",
],
)
install_pkgs(
name = "installed-packages",
image_tar = "#release_docker//image",
installables_tar = ":downloaded-packages.tar",
output_image_name = "release_docker_with_packages"
)
cc_image(
name = "my-image",
base = ":installed-packages",
binary = ":built-binary",
)
But when I run bazel build :my-image --action_env DOCKER_HOST=tcp://192.168.1.2:2375 inside the build container (a Docker image in which the build command runs), it errors:
+ DOCKER=/usr/bin/docker
+ [[ -z /usr/bin/docker ]]
+ TO_JSON_TOOL=bazel-out/host/bin/external/io_bazel_rules_docker/docker/util/to_json
+ source external/io_bazel_rules_docker/docker/util/image_util.sh
++ bazel-out/host/bin/external/io_bazel_rules_docker/contrib/extract_image_id bazel-out/k8-fastbuild/bin/external/release_docker/image/image.tar
+ image_id=b55375fc9c651e1eff0428490d01b4883de0fca62b5b18e8ede9f3d812b3fc10
+ /usr/bin/docker load -i bazel-out/k8-fastbuild/bin/external/release_docker/image/image.tar
+++ pwd
+++ pwd
++ /usr/bin/docker run -d -v /opt/bazel-root-directory/...[path-to].../downloaded-packages.tar:/tmp/bazel-out/k8-fastbuild/bin/marzban/downloaded-packages.tar -v /opt/bazel-root-directory/...[path-to].../installed-packages.install:/tmp/installer.sh --privileged b55375fc9c651e1eff0428490d01b4883de0fca62b5b18e8ede9f3d812b3fc10 /tmp/installer.sh
/usr/bin/docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "exec: \"/tmp/installer.sh\": permission denied": unknown.
+ cid=ce62e444aefe1f32a20575750a6ee1cc9c2f79d46f2f60187a8bc23f87b5aa25
I came across the same issue, and it took some time for me to find the actual cause.
As you conjectured, there is a bug in your version of the rules_docker repo. The actual problem is the assumption that a local folder can be directly mounted into the target image. Obviously, that assumption fails in the case of DinD (Docker-in-Docker).
Fortunately, this bug has already been fixed by the commit "install_pkgs uses named volumes to work with DIND". As the title suggests, the fix is to use a named volume instead of a short-form -v src:dst mount.
So, the solution is to upgrade to v0.13.0 or newer.
rules_docker$ git tag --contains 32f12766248bef88358fc1646a3e0a66efd0e502 | head -1
v0.13.0
I was running into the exact same problems as you. If you change "@release_docker//image" to "@release_docker//image:image.tar" it should work.
The rule requires a .tar file (the same format that docker save imageName produces). I didn't look into the code behind the rule, but I'd assume the image also needs access to apt.
Here is a working example
BUILD FILE
load(
    "@io_bazel_rules_docker//docker/package_managers:download_pkgs.bzl",
    "download_pkgs",
)
load(
    "@io_bazel_rules_docker//docker/package_managers:install_pkgs.bzl",
    "install_pkgs",
)

install_pkgs(
    name = "postgresPythonImage",
    image_tar = "@py3_image_base//image:image.tar",
    installables_tar = ":postgresql_pkgs.tar",
    output_image_name = "postgres_python_base",
)

download_pkgs(
    name = "postgresql_pkgs",
    image_tar = "@ubuntu1604//image:image.tar",
    packages = [
        "postgresql",
    ],
)
WORKSPACE
http_archive(
    name = "layer_definitions",
    strip_prefix = "layer-definitions-ade30bae7cb1a8c1fed70e18040936fad75de8a3",
    urls = ["https://github.com/GoogleCloudPlatform/layer-definitions/archive/ade30bae7cb1a8c1fed70e18040936fad75de8a3.tar.gz"],
    sha256 = "af72a1a804934ba154c97c43429ec556eeaadac70336f614ac123b7f5a5db299",
)

load("@layer_definitions//layers/ubuntu1604/base:deps.bzl", ubuntu1604_base_deps = "deps")

ubuntu1604_base_deps()

python docker API containers.run doesn't print output to console

I have pulled the alpine image and built the container.
I am trying to run the image, but I do not see any output on my console. Does anyone know what's wrong?
If I run it using docker run, I can see the output.
The Python version is 2.7.10.
import docker

dockerClient = docker.from_env()
image = dockerClient.images.pull('alpine')
dockerClient.images.build(path = "build/", tag = "alpine_tests")
dockerClient.containers.run('alpine_tests', 'pwd')
If you really want to see the output immediately, you should use docker build in the CLI, or subprocess.call(['docker', 'build']) in Python.
When using the Python SDK, the stdout and stderr messages are usually not needed on success; they mostly matter when the build crashes or fails.
import json
import sys

import docker

# docker build
client = docker.from_env()
try:
    client.images.build(
        path='build/',
        tag='alpine_tests',
    )
except docker.errors.BuildError as exc:
    for log in exc.build_log:
        msg = log.get('stream')
        if msg:
            print(msg, end='')
    raise

# docker push
ret = client.images.push('alpine_tests')
last_line = ret.splitlines()[-1]
info = json.loads(last_line)
error = info.get('error')
if error:
    print(error)
    sys.exit(1)
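Also worth noting, though the answer above doesn't mention it: when detach is not set, containers.run() returns the container's output, so the original example only needs an explicit print. A minimal sketch:
import docker

client = docker.from_env()
# run() blocks until the container exits and returns its stdout as bytes
output = client.containers.run('alpine_tests', 'pwd')
print(output.decode())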

How to get Task ID from within ECS container?

Hello, I am interested in retrieving the Task ID from within a running container which lives inside an EC2 host machine.
AWS ECS documentation states there is an environment variable ECS_CONTAINER_METADATA_FILE with the location of this data, but it will only be set/available if the ECS_ENABLE_CONTAINER_METADATA variable is set to true upon cluster/EC2 instance creation. I don't see where this can be done in the AWS console.
Also, the docs state that this can be done by setting it to true on the host machine, but that would require restarting the ECS agent.
Is there any other way to do this without having to go inside the EC2 instance to set this and restart the agent?
This doesn't work for newer Amazon ECS container agent versions anymore, and in fact it's now much simpler and also enabled by default. Please refer to the documentation, but here's a TL;DR:
If you're using Amazon ECS container agent version 1.39.0 or higher, you can just do this inside the Docker container:
curl -s "$ECS_CONTAINER_METADATA_URI_V4/task" \
| jq -r ".TaskARN" \
| cut -d "/" -f 3
Here's a list of container agent releases, but if you're using :latest, you're definitely fine.
The technique I'd use is to set the environment variable in the container definition.
If you're managing your tasks via Cloudformation, the relevant yaml looks like so:
Taskdef:
  Type: AWS::ECS::TaskDefinition
  Properties:
    ...
    ContainerDefinitions:
      - Name: some-name
        ...
        Environment:
          - Name: AWS_DEFAULT_REGION
            Value: !Ref AWS::Region
          - Name: ECS_ENABLE_CONTAINER_METADATA
            Value: 'true'
This technique helps you keep everything straightforward and reproducible.
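Once the metadata file is enabled this way, the task ID can be read from its TaskARN field inside the container, for example (a sketch; assumes jq is available and the newer long ARN format, as in the curl example above):
# ECS_CONTAINER_METADATA_FILE holds the path to the JSON file the agent writes
jq -r '.TaskARN' "$ECS_CONTAINER_METADATA_FILE" | cut -d '/' -f 3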
If you need metadata programmatically and don't have access to the metadata file, you can query the agent's metadata endpoint:
curl http://localhost:51678/v1/metadata
Note that if you're getting this information as a running task, you may not be able to connect to the loopback device, but you can connect to the EC2 instance's own IP address.
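For example, a task could look up the host's private IP via the EC2 instance metadata service and query the agent there (a sketch; assumes IMDSv1-style access is allowed):
# discover the EC2 host's private IP from the instance metadata service
EC2_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
# then hit the ECS agent's introspection endpoint on the host
curl -s "http://${EC2_IP}:51678/v1/metadata"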
We set it with the so-called user data, which is executed when the machine starts. There are multiple ways to set it, for example: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html#user-data-console
It could look like this:
#!/bin/bash
cat <<'EOF' >> /etc/ecs/ecs.config
ECS_CLUSTER=ecs-staging
ECS_ENABLE_CONTAINER_METADATA=true
EOF
Important: Adjust the ECS_CLUSTER above to match your cluster name, otherwise the instance will not connect to that cluster.
Previous answers are correct; here is another way of doing this.
From the EC2 instance where the container is running, run this command:
curl http://localhost:51678/v1/tasks | python -mjson.tool |less
From the AWS ECS CLI documentation:
Command:
aws ecs list-tasks --cluster default
Output:
{
    "taskArns": [
        "arn:aws:ecs:us-east-1:<aws_account_id>:task/0cc43cdb-3bee-4407-9c26-c0e6ea5bee84",
        "arn:aws:ecs:us-east-1:<aws_account_id>:task/6b809ef6-c67e-4467-921f-ee261c15a0a1"
    ]
}
To list the tasks on a particular container instance
This example command lists the tasks of a specified container instance, using the container instance UUID as a filter.
Command:
aws ecs list-tasks --cluster default --container-instance f6bbb147-5370-4ace-8c73-c7181ded911f
Output:
{
    "taskArns": [
        "arn:aws:ecs:us-east-1:<aws_account_id>:task/0cc43cdb-3bee-4407-9c26-c0e6ea5bee84"
    ]
}
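If you only want the bare task ID from the CLI, a --query filter plus a little text processing trims it down (a sketch using JMESPath and awk):
# print the first task ARN as plain text and keep only the trailing task ID
aws ecs list-tasks --cluster default --query 'taskArns[0]' --output text | awk -F/ '{print $NF}'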
My ECS solution, as bash and Python snippets. The logging calls print debug output by writing to sys.stderr, while print() is used to pass the extracted task ID back to the shell script.
#!/bin/bash
TASK_ID=$(python3.8 get_ecs_task_id.py)
echo "TASK_ID: ${TASK_ID}"
Python script - get_ecs_task_id.py
import json
import logging
import os
import sys

import requests

# logging configuration
# file_handler = logging.FileHandler(filename='tmp.log')
# redirecting to stderr so I can pass back extracted task id in STDOUT
stdout_handler = logging.StreamHandler(stream=sys.stderr)
# handlers = [file_handler, stdout_handler]
handlers = [stdout_handler]

logging.basicConfig(
    level=logging.INFO,
    format="[%(asctime)s] {%(filename)s:%(lineno)d} %(levelname)s - %(message)s",
    handlers=handlers,
    datefmt="%Y-%m-%d %H:%M:%S",
)
logger = logging.getLogger(__name__)


def get_ecs_task_id(host):
    path = "/task"
    url = host + path
    headers = {"Content-Type": "application/json"}
    r = requests.get(url, headers=headers)
    logger.debug(f"r: {r}")
    d_r = json.loads(r.text)
    logger.debug(d_r)
    ecs_task_arn = d_r["TaskARN"]
    ecs_task_id = ecs_task_arn.split("/")[2]
    return ecs_task_id


def main():
    logger.debug("Extracting task ID from $ECS_CONTAINER_METADATA_URI_V4")
    logger.debug("Inside get_ecs_task_id.py, redirecting logs to stderr")
    logger.debug("so that I can pass the task id back in STDOUT")
    host = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    ecs_task_id = get_ecs_task_id(host)
    # This print statement passes the string back to the bash wrapper, don't remove
    logger.debug(ecs_task_id)
    print(ecs_task_id)


if __name__ == "__main__":
    main()
