Kubeflow Pipelines v2 is giving Permission Denied on OutputPath - kubeflow

In Kubeflow Pipelines v2, running on EKS with a default install, I'm getting a "permission denied" error.
The same pipeline ran correctly in KFP v1.
time="2022-04-26T21:53:30.710Z" level=info msg="capturing logs" argo=true
I0426 21:53:30.745547 18 launcher.go:144] PipelineRoot defaults to "minio://mlpipeline/v2/artifacts".
I0426 21:53:30.745908 18 cache.go:120] Connecting to cache endpoint 10.100.244.104:8887
I0426 21:53:30.854201 18 launcher.go:193] enable caching
F0426 21:53:30.979055 18 main.go:50] Failed to execute component: failed to create directory "/tmp/outputs/output_context_path" for output parameter "output_context_path": mkdir /tmp/outputs/output_context_path: permission denied
time="2022-04-26T21:53:30.980Z" level=info msg="/tmp/outputs/output_context_path/data -> /var/run/argo/outputs/artifacts/tmp/outputs/output_context_path/data.tgz" argo=true
time="2022-04-26T21:53:30.981Z" level=info msg="Taring /tmp/outputs/output_context_path/data"
Error: failed to tarball the output /tmp/outputs/output_context_path/data to /var/run/argo/outputs/artifacts/tmp/outputs/output_context_path/data.tgz: stat /tmp/outputs/output_context_path/data: permission denied
failed to tarball the output /tmp/outputs/output_context_path/data to /var/run/argo/outputs/artifacts/tmp/outputs/output_context_path/data.tgz: stat /tmp/outputs/output_context_path/data: permission denied
The code that produces this is here:
import kfp
from kfp.v2.dsl import component, Artifact, Input, InputPath, Output, OutputPath, Dataset, Model
from typing import NamedTuple
def same_step_000_afc67b36914c4108b47e8b4bb316869d_fn(
input_context_path: InputPath(str),
output_context_path: OutputPath(str),
run_info: str ="gAR9lC4=",
metadata_url: str="",
):
from base64 import urlsafe_b64encode, urlsafe_b64decode
from pathlib import Path
import datetime
import requests
import tempfile
import dill
import os
input_context = None
with Path(input_context_path).open("rb") as reader:
input_context = reader.read()
# Helper function for posting metadata to mlflow.
def post_metadata(json):
if metadata_url == "":
return
try:
req = requests.post(metadata_url, json=json)
req.raise_for_status()
except requests.exceptions.HTTPError as err:
print(f"Error posting metadata: {err}")
# Move to writable directory as user might want to do file IO.
# TODO: won't persist across steps, might need support in SDK?
os.chdir(tempfile.mkdtemp())
# Load information about the current experiment run:
run_info = dill.loads(urlsafe_b64decode(run_info))
# Post session context to mlflow.
if len(input_context) > 0:
input_context_str = urlsafe_b64encode(input_context)
post_metadata({
"experiment_id": run_info["experiment_id"],
"run_id": run_info["run_id"],
"step_id": "same_step_000",
"metadata_type": "input",
"metadata_value": input_context_str,
"metadata_time": datetime.datetime.now().isoformat(),
})
# User code for step, which we run in its own execution frame.
user_code = f"""
import dill
# Load session context into global namespace:
if { len(input_context) } > 0:
dill.load_session("{ input_context_path }")
{dill.loads(urlsafe_b64decode("gASVGAAAAAAAAACMFHByaW50KCJIZWxsbyB3b3JsZCIplC4="))}
# Remove anything from the global namespace that cannot be serialised.
# TODO: this will include things like pandas dataframes, needs sdk support?
_bad_keys = []
_all_keys = list(globals().keys())
for k in _all_keys:
try:
dill.dumps(globals()[k])
except TypeError:
_bad_keys.append(k)
for k in _bad_keys:
del globals()[k]
# Save new session context to disk for the next component:
dill.dump_session("{output_context_path}")
"""
# Runs the user code in a new execution frame. Context from the previous
# component in the run is loaded into the session dynamically, and we run
# with a single globals() namespace to simulate top-level execution.
exec(user_code, globals(), globals())
# Post new session context to mlflow:
with Path(output_context_path).open("rb") as reader:
context = urlsafe_b64encode(reader.read())
post_metadata({
"experiment_id": run_info["experiment_id"],
"run_id": run_info["run_id"],
"step_id": "same_step_000",
"metadata_type": "output",
"metadata_value": context,
"metadata_time": datetime.datetime.now().isoformat(),
})
Environment
How did you deploy Kubeflow Pipelines (KFP)?
From manifests
KFP version:
1.8.1
KFP SDK version:
1.8.12
I SUSPECT this is because I'm using Kubeflow's native functionality to write output files to a local temp directory, and I theorize that KFP v2 doesn't auto-create that directory. Do I need to have a bucket created for this purpose on KFP v2 on AWS?
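If the culprit is the non-root default user in the codeserver-python step image (an assumption; nothing below is confirmed), one workaround to try is forcing the step pods to run as root with a pod-level securityContext on the generated Argo Workflow. A minimal sketch of the spec-level addition:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: root-pipeline-compilation-
spec:
  # Hypothetical workaround: PodSecurityContext applied to every step pod so the
  # KFP v2 launcher can create /tmp/outputs/... regardless of the image's default user.
  securityContext:
    runAsUser: 0
  entrypoint: root-pipeline-compilation
  # ...rest of the generated workflow below, unchanged...

If the error disappears with that in place, the problem is file permissions inside the step image rather than a missing artifact bucket.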
EDIT TWO: here's the generated YAML. Line 317 is the one that worries me: it APPEARS to be putting in the literal string of output_context_path, when shouldn't that be a variable? Or is it substituted at runtime? --
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
generateName: root-pipeline-compilation-
annotations:
pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
pipelines.kubeflow.org/pipeline_compilation_time: '2022-04-29T18:04:24.336927'
pipelines.kubeflow.org/pipeline_spec: '{"inputs": [{"default": "", "name": "context",
"optional": true, "type": "String"}, {"default": "", "name": "metadata_url",
"optional": true, "type": "String"}, {"default": "", "name": "pipeline-root"},
{"default": "pipeline/root_pipeline_compilation", "name": "pipeline-name"}],
"name": "root_pipeline_compilation"}'
pipelines.kubeflow.org/v2_pipeline: "true"
labels:
pipelines.kubeflow.org/v2_pipeline: "true"
pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
spec:
entrypoint: root-pipeline-compilation
templates:
- name: root-pipeline-compilation
inputs:
parameters:
- {name: metadata_url}
- {name: pipeline-name}
- {name: pipeline-root}
dag:
tasks:
- name: run-info-fn
template: run-info-fn
arguments:
parameters:
- {name: pipeline-name, value: '{{inputs.parameters.pipeline-name}}'}
- {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
- name: same-step-000-d5554cccadc4445f91f51849eb5f2de6-fn
template: same-step-000-d5554cccadc4445f91f51849eb5f2de6-fn
dependencies: [run-info-fn]
arguments:
parameters:
- {name: metadata_url, value: '{{inputs.parameters.metadata_url}}'}
- {name: pipeline-name, value: '{{inputs.parameters.pipeline-name}}'}
- {name: pipeline-root, value: '{{inputs.parameters.pipeline-root}}'}
- {name: run-info-fn-run_info, value: '{{tasks.run-info-fn.outputs.parameters.run-info-fn-run_info}}'}
- name: run-info-fn
container:
args:
- sh
- -c
- |2
if ! [ -x "$(command -v pip)" ]; then
python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip
fi
PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'kfp' 'dill' 'kfp==1.8.12' && "$0" "$@"
- sh
- -ec
- |
program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
python3 -m kfp.v2.components.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
- |2+
import kfp
from kfp.v2 import dsl
from kfp.v2.dsl import *
from typing import *
def run_info_fn(
run_id: str,
) -> NamedTuple("RunInfoOutput", [("run_info", str),]):
from base64 import urlsafe_b64encode
from collections import namedtuple
import datetime
import base64
import dill
import kfp
client = kfp.Client(host="http://ml-pipeline:8888")
run_info = client.get_run(run_id=run_id)
run_info_dict = {
"run_id": run_info.run.id,
"name": run_info.run.name,
"created_at": run_info.run.created_at.isoformat(),
"pipeline_id": run_info.run.pipeline_spec.pipeline_id,
}
# Track kubernetes resources associated with the run.
for r in run_info.run.resource_references:
run_info_dict[f"{r.key.type.lower()}_id"] = r.key.id
# Base64-encoded as value is visible in kubeflow ui.
output = urlsafe_b64encode(dill.dumps(run_info_dict))
return namedtuple("RunInfoOutput", ["run_info"])(
str(output, encoding="ascii")
)
- --executor_input
- '{{$}}'
- --function_to_execute
- run_info_fn
command: [/kfp-launcher/launch, --mlmd_server_address, $(METADATA_GRPC_SERVICE_HOST),
--mlmd_server_port, $(METADATA_GRPC_SERVICE_PORT), --runtime_info_json, $(KFP_V2_RUNTIME_INFO),
--container_image, $(KFP_V2_IMAGE), --task_name, run-info-fn, --pipeline_name,
'{{inputs.parameters.pipeline-name}}', --run_id, $(KFP_RUN_ID), --run_resource,
workflows.argoproj.io/$(WORKFLOW_ID), --namespace, $(KFP_NAMESPACE), --pod_name,
$(KFP_POD_NAME), --pod_uid, $(KFP_POD_UID), --pipeline_root, '{{inputs.parameters.pipeline-root}}',
--enable_caching, $(ENABLE_CACHING), --, 'run_id={{workflow.uid}}', --]
env:
- name: KFP_POD_NAME
valueFrom:
fieldRef: {fieldPath: metadata.name}
- name: KFP_POD_UID
valueFrom:
fieldRef: {fieldPath: metadata.uid}
- name: KFP_NAMESPACE
valueFrom:
fieldRef: {fieldPath: metadata.namespace}
- name: WORKFLOW_ID
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''workflows.argoproj.io/workflow'']'}
- name: KFP_RUN_ID
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''pipeline/runid'']'}
- name: ENABLE_CACHING
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''pipelines.kubeflow.org/enable_caching'']'}
- {name: KFP_V2_IMAGE, value: 'python:3.7'}
- {name: KFP_V2_RUNTIME_INFO, value: '{"inputParameters": {"run_id": {"type":
"STRING"}}, "inputArtifacts": {}, "outputParameters": {"run_info": {"type":
"STRING", "path": "/tmp/outputs/run_info/data"}}, "outputArtifacts": {}}'}
envFrom:
- configMapRef: {name: metadata-grpc-configmap, optional: true}
image: python:3.7
volumeMounts:
- {mountPath: /kfp-launcher, name: kfp-launcher}
inputs:
parameters:
- {name: pipeline-name}
- {name: pipeline-root}
outputs:
parameters:
- name: run-info-fn-run_info
valueFrom: {path: /tmp/outputs/run_info/data}
artifacts:
- {name: run-info-fn-run_info, path: /tmp/outputs/run_info/data}
metadata:
annotations:
pipelines.kubeflow.org/v2_component: "true"
pipelines.kubeflow.org/component_ref: '{}'
pipelines.kubeflow.org/arguments.parameters: '{"run_id": "{{workflow.uid}}"}'
labels:
pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
pipelines.kubeflow.org/pipeline-sdk-type: kfp
pipelines.kubeflow.org/v2_component: "true"
pipelines.kubeflow.org/enable_caching: "true"
initContainers:
- command: [launcher, --copy, /kfp-launcher/launch]
image: gcr.io/ml-pipeline/kfp-launcher:1.8.7
name: kfp-launcher
mirrorVolumeMounts: true
volumes:
- {name: kfp-launcher}
- name: same-step-000-d5554cccadc4445f91f51849eb5f2de6-fn
container:
args:
- sh
- -c
- |2
if ! [ -x "$(command -v pip)" ]; then
python3 -m ensurepip || python3 -m ensurepip --user || apt-get install python3-pip
fi
PIP_DISABLE_PIP_VERSION_CHECK=1 python3 -m pip install --quiet --no-warn-script-location 'dill' 'requests' 'kfp==1.8.12' && "$0" "$@"
- sh
- -ec
- |
program_path=$(mktemp -d)
printf "%s" "$0" > "$program_path/ephemeral_component.py"
python3 -m kfp.v2.components.executor_main --component_module_path "$program_path/ephemeral_component.py" "$@"
- |2+
import kfp
from kfp.v2 import dsl
from kfp.v2.dsl import *
from typing import *
def same_step_000_d5554cccadc4445f91f51849eb5f2de6_fn(
input_context_path: InputPath(str),
output_context_path: OutputPath(str),
run_info: str = "gAR9lC4=",
metadata_url: str = "",
):
from base64 import urlsafe_b64encode, urlsafe_b64decode
from pathlib import Path
import datetime
import requests
import tempfile
import dill
import os
input_context = None
with Path(input_context_path).open("rb") as reader:
input_context = reader.read()
# Helper function for posting metadata to mlflow.
def post_metadata(json):
if metadata_url == "":
return
try:
req = requests.post(metadata_url, json=json)
req.raise_for_status()
except requests.exceptions.HTTPError as err:
print(f"Error posting metadata: {err}")
# Move to writable directory as user might want to do file IO.
# TODO: won't persist across steps, might need support in SDK?
os.chdir(tempfile.mkdtemp())
# Load information about the current experiment run:
run_info = dill.loads(urlsafe_b64decode(run_info))
# Post session context to mlflow.
if len(input_context) > 0:
input_context_str = urlsafe_b64encode(input_context)
post_metadata({
"experiment_id": run_info["experiment_id"],
"run_id": run_info["run_id"],
"step_id": "same_step_000",
"metadata_type": "input",
"metadata_value": input_context_str,
"metadata_time": datetime.datetime.now().isoformat(),
})
# User code for step, which we run in its own execution frame.
user_code = f"""
import dill
# Load session context into global namespace:
if { len(input_context) } > 0:
dill.load_session("{ input_context_path }")
{dill.loads(urlsafe_b64decode("gASVGAAAAAAAAACMFHByaW50KCJIZWxsbyB3b3JsZCIplC4="))}
# Remove anything from the global namespace that cannot be serialised.
# TODO: this will include things like pandas dataframes, needs sdk support?
_bad_keys = []
_all_keys = list(globals().keys())
for k in _all_keys:
try:
dill.dumps(globals()[k])
except TypeError:
_bad_keys.append(k)
for k in _bad_keys:
del globals()[k]
# Save new session context to disk for the next component:
dill.dump_session("{output_context_path}")
"""
# Runs the user code in a new execution frame. Context from the previous
# component in the run is loaded into the session dynamically, and we run
# with a single globals() namespace to simulate top-level execution.
exec(user_code, globals(), globals())
# Post new session context to mlflow:
with Path(output_context_path).open("rb") as reader:
context = urlsafe_b64encode(reader.read())
post_metadata({
"experiment_id": run_info["experiment_id"],
"run_id": run_info["run_id"],
"step_id": "same_step_000",
"metadata_type": "output",
"metadata_value": context,
"metadata_time": datetime.datetime.now().isoformat(),
})
- --executor_input
- '{{$}}'
- --function_to_execute
- same_step_000_d5554cccadc4445f91f51849eb5f2de6_fn
command: [/kfp-launcher/launch, --mlmd_server_address, $(METADATA_GRPC_SERVICE_HOST),
--mlmd_server_port, $(METADATA_GRPC_SERVICE_PORT), --runtime_info_json, $(KFP_V2_RUNTIME_INFO),
--container_image, $(KFP_V2_IMAGE), --task_name, same-step-000-d5554cccadc4445f91f51849eb5f2de6-fn,
--pipeline_name, '{{inputs.parameters.pipeline-name}}', --run_id, $(KFP_RUN_ID),
--run_resource, workflows.argoproj.io/$(WORKFLOW_ID), --namespace, $(KFP_NAMESPACE),
--pod_name, $(KFP_POD_NAME), --pod_uid, $(KFP_POD_UID), --pipeline_root, '{{inputs.parameters.pipeline-root}}',
--enable_caching, $(ENABLE_CACHING), --, input_context_path=, 'metadata_url={{inputs.parameters.metadata_url}}',
'run_info={{inputs.parameters.run-info-fn-run_info}}', --]
env:
- name: KFP_POD_NAME
valueFrom:
fieldRef: {fieldPath: metadata.name}
- name: KFP_POD_UID
valueFrom:
fieldRef: {fieldPath: metadata.uid}
- name: KFP_NAMESPACE
valueFrom:
fieldRef: {fieldPath: metadata.namespace}
- name: WORKFLOW_ID
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''workflows.argoproj.io/workflow'']'}
- name: KFP_RUN_ID
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''pipeline/runid'']'}
- name: ENABLE_CACHING
valueFrom:
fieldRef: {fieldPath: 'metadata.labels[''pipelines.kubeflow.org/enable_caching'']'}
- {name: KFP_V2_IMAGE, value: 'public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/codeserver-python:v1.5.0'}
- {name: KFP_V2_RUNTIME_INFO, value: '{"inputParameters": {"input_context_path":
{"type": "STRING"}, "metadata_url": {"type": "STRING"}, "run_info": {"type":
"STRING"}}, "inputArtifacts": {}, "outputParameters": {"output_context_path":
{"type": "STRING", "path": "/tmp/outputs/output_context_path/data"}}, "outputArtifacts":
{}}'}
envFrom:
- configMapRef: {name: metadata-grpc-configmap, optional: true}
image: public.ecr.aws/j1r0q0g6/notebooks/notebook-servers/codeserver-python:v1.5.0
volumeMounts:
- {mountPath: /kfp-launcher, name: kfp-launcher}
inputs:
parameters:
- {name: metadata_url}
- {name: pipeline-name}
- {name: pipeline-root}
- {name: run-info-fn-run_info}
outputs:
artifacts:
- {name: same-step-000-d5554cccadc4445f91f51849eb5f2de6-fn-output_context_path,
path: /tmp/outputs/output_context_path/data}
metadata:
annotations:
pipelines.kubeflow.org/v2_component: "true"
pipelines.kubeflow.org/component_ref: '{}'
pipelines.kubeflow.org/arguments.parameters: '{"input_context_path": "", "metadata_url":
"{{inputs.parameters.metadata_url}}", "run_info": "{{inputs.parameters.run-info-fn-run_info}}"}'
pipelines.kubeflow.org/max_cache_staleness: P0D
labels:
pipelines.kubeflow.org/kfp_sdk_version: 1.8.12
pipelines.kubeflow.org/pipeline-sdk-type: kfp
pipelines.kubeflow.org/v2_component: "true"
pipelines.kubeflow.org/enable_caching: "true"
initContainers:
- command: [launcher, --copy, /kfp-launcher/launch]
image: gcr.io/ml-pipeline/kfp-launcher:1.8.7
name: kfp-launcher
mirrorVolumeMounts: true
volumes:
- {name: kfp-launcher}
arguments:
parameters:
- {name: context, value: ''}
- {name: metadata_url, value: ''}
- {name: pipeline-root, value: ''}
- {name: pipeline-name, value: pipeline/root_pipeline_compilation}
serviceAccountName: pipeline-runner
It's DEFINITELY a regression. Here's the same YAML generated with each of the two compiler flags; the first works, the second doesn't.
using the compiler in v1 mode - https://gist.github.com/aronchick/0dfc57d2a794c1bd4fb9bff9962cfbd6
using the compiler in v2 mode - https://gist.github.com/aronchick/473060503ae189b360fbded04d802c80

Related

Folder deleted/not created inside the common dir mounted with emptyDir{} type on EKS Fargate pod

We are facing a strange issue with EKS Fargate pods. We want to push logs to CloudWatch with a sidecar fluent-bit container, and for that we are mounting the separately created /logs/boot and /logs/access folders on both containers using an emptyDir: {} volume. But somehow the access folder is getting deleted. When we tested this setup in local Docker it produced the desired results and things worked fine, but not when deployed on EKS Fargate. Below are our manifest files.
Dockerfile
FROM anapsix/alpine-java:8u201b09_server-jre_nashorn
ARG LOG_DIR=/logs
# Install base packages
RUN apk update
RUN apk upgrade
# RUN apk add ca-certificates && update-ca-certificates
# Dynamically set the JAVA_HOME path
RUN export JAVA_HOME="$(dirname $(dirname $(readlink -f $(which java))))" && echo $JAVA_HOME
# Add Curl
RUN apk --no-cache add curl
RUN mkdir -p $LOG_DIR/boot $LOG_DIR/access
RUN chmod -R 0777 $LOG_DIR/*
# Add metadata to the image to describe which port the container is listening on at runtime.
# Change TimeZone
RUN apk add --update tzdata
ENV TZ="Asia/Kolkata"
# Clean APK cache
RUN rm -rf /var/cache/apk/*
# Setting JAVA HOME
ENV JAVA_HOME=/opt/jdk
# Copy all files and folders
COPY . .
RUN rm -rf /opt/jdk/jre/lib/security/cacerts
COPY cacerts /opt/jdk/jre/lib/security/cacerts
COPY standalone.xml /jboss-eap-6.4-integration/standalone/configuration/
# Set the working directory.
WORKDIR /jboss-eap-6.4-integration/bin
EXPOSE 8177
CMD ["./erctl"]
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vinintegrator
namespace: eretail
labels:
app: vinintegrator
pod: fargate
spec:
selector:
matchLabels:
app: vinintegrator
pod: fargate
replicas: 2
template:
metadata:
labels:
app: vinintegrator
pod: fargate
spec:
securityContext:
fsGroup: 0
serviceAccount: eretail
containers:
- name: vinintegrator
imagePullPolicy: IfNotPresent
image: 653580443710.dkr.ecr.ap-southeast-1.amazonaws.com/vinintegrator-service:latest
resources:
limits:
memory: "7629Mi"
cpu: "1.5"
requests:
memory: "5435Mi"
cpu: "750m"
ports:
- containerPort: 8177
protocol: TCP
# securityContext:
# runAsUser: 506
# runAsGroup: 506
volumeMounts:
- mountPath: /jboss-eap-6.4-integration/bin
name: bin
- mountPath: /logs
name: logs
- name: fluent-bit
image: 657281243710.dkr.ecr.ap-southeast-1.amazonaws.com/fluent-bit:latest
imagePullPolicy: IfNotPresent
env:
- name: HOST_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
memory: 200Mi
requests:
cpu: 200m
memory: 100Mi
volumeMounts:
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
- name: logs
mountPath: /logs
readOnly: true
volumes:
- name: fluent-bit-config
configMap:
name: fluent-bit-config
- name: logs
emptyDir: {}
- name: bin
persistentVolumeClaim:
claimName: vinintegrator-pvc
Below is the /logs folder ownership and permission. Please notice the 's' in drwxrwsrwx
drwxrwsrwx 3 root root 4096 Oct 1 11:50 logs
Below is the content inside the logs folder. Please notice that the access folder was not created (or was deleted).
/logs # ls -lrt
total 4
drwxr-sr-x 2 root root 4096 Oct 1 11:50 boot
/logs #
Below is the configmap of Fluent-Bit
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: eretail
labels:
k8s-app: fluent-bit
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
#INCLUDE application-log.conf
application-log.conf: |
[INPUT]
Name tail
Path /logs/boot/*.log
Tag boot
[INPUT]
Name tail
Path /logs/access/*.log
Tag access
[OUTPUT]
Name cloudwatch_logs
Match *boot*
region ap-southeast-1
log_group_name eks-fluent-bit
log_stream_prefix boot-log-
auto_create_group On
[OUTPUT]
Name cloudwatch_logs
Match *access*
region ap-southeast-1
log_group_name eks-fluent-bit
log_stream_prefix access-log-
auto_create_group On
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
Below is the error log of the Fluent Bit container
AWS for Fluent Bit Container Image Version 2.14.0
Fluent Bit v1.7.4
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/10/01 06:20:33] [ info] [engine] started (pid=1)
[2021/10/01 06:20:33] [ info] [storage] version=1.1.1, initializing...
[2021/10/01 06:20:33] [ info] [storage] in-memory
[2021/10/01 06:20:33] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/01 06:20:33] [error] [input:tail:tail.1] read error, check permissions: /logs/access/*.log
[2021/10/01 06:20:33] [ warn] [input:tail:tail.1] error scanning path: /logs/access/*.log
[2021/10/01 06:20:38] [error] [net] connection #33 timeout after 5 seconds to: 169.254.169.254:80
[2021/10/01 06:20:38] [error] [net] socket #33 could not connect to 169.254.169.254:80
I suggest removing the following from your Dockerfile:
RUN mkdir -p $LOG_DIR/boot $LOG_DIR/access
RUN chmod -R 0777 $LOG_DIR/*
Use the following method to set up the log directories and permissions:
apiVersion: v1
kind: Pod # Deployment
metadata:
name: busy
labels:
app: busy
spec:
volumes:
- name: logs # Shared folder with ephemeral storage
emptyDir: {}
initContainers: # Setup your log directory here
- name: setup
image: busybox
command: ["bin/ash", "-c"]
args:
- >
mkdir -p /logs/boot /logs/access;
chmod -R 777 /logs
volumeMounts:
- name: logs
mountPath: /logs
containers:
- name: app # Run your application and logs to the directories
image: busybox
command: ["bin/ash","-c"]
args:
- >
while :; do echo "$(date): $(uname -r)" | tee -a /logs/boot/boot.log /logs/access/access.log; sleep 1; done
volumeMounts:
- name: logs
mountPath: /logs
- name: logger # Any logger that you like
image: busybox
command: ["bin/ash","-c"]
args: # tail the app logs, forward to CW etc...
- >
sleep 5;
tail -f /logs/boot/boot.log /logs/access/access.log
volumeMounts:
- name: logs
mountPath: /logs
The snippet runs on Fargate as well; run kubectl logs -f busy -c logger to see the tailing. In the real world, the "app" is your Java app and "logger" is whatever log agent you prefer. Note that Fargate has native logging capability using AWS Fluent Bit, so you do not need to run Fluent Bit as a sidecar.
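For completeness, here is a minimal sketch of that built-in Fargate log routing (the log group name and region are placeholders, and note that it only ships container stdout/stderr, so the app would need to log to stdout rather than to files under /logs):

apiVersion: v1
kind: Namespace
metadata:
  name: aws-observability          # fixed name required by EKS Fargate logging
  labels:
    aws-observability: enabled
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-logging                # fixed name required by EKS Fargate logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region ap-southeast-1
        log_group_name eks-fargate-pod-logs
        log_stream_prefix pod-
        auto_create_group On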

What is kubeflow gpu resource node allocation criteria?

I’m curious about Kubeflow GPU resources. I’m running the job below.
The only place where I specify a GPU resource is the first container, with just 1 GPU. However, the event message tells me 0/4 nodes are available: 4 Insufficient nvidia.com/gpu.
Why is this job looking at 4 nodes when I only requested 1 GPU? Is my interpretation wrong? Thanks much in advance.
FYI: I have 3 worker nodes, each with 1 GPU.
apiVersion: batch/v1
kind: Job
metadata:
name: saint-train-3
annotations:
sidecar.istio.io/inject: "false"
spec:
template:
spec:
initContainers:
- name: dataloader
image: <AWS CLI Image>
command: ["/bin/sh", "-c", "aws s3 cp s3://<Kubeflow Bucket>/kubeflowdata.tar.gz /s3-data; cd /s3-data; tar -xvzf kubeflowdata.tar.gz; cd kubeflow_data; ls"]
volumeMounts:
- mountPath: /s3-data
name: s3-data
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef: {key: AWS_ACCESS_KEY_ID, name: aws-secret}
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef: {key: AWS_SECRET_ACCESS_KEY, name: aws-secret}
containers:
- name: trainer
image: <Our Model Image>
command: ["/bin/sh", "-c", "wandb login <ID>; python /opt/ml/src/main.py --base_path='/s3-data/kubeflow_data' --debug_mode='0' --project='kubeflow-test' --name='test2' --gpu=0 --num_epochs=1 --num_workers=4"]
volumeMounts:
- mountPath: /s3-data
name: s3-data
resources:
limits:
nvidia.com/gpu: "1"
- name: gpu-watcher
image: pytorch/pytorch:latest
command: ["/bin/sh", "-c", "--"]
args: [ "while true; do sleep 30; done;" ]
volumeMounts:
- mountPath: /s3-data
name: s3-data
volumes:
- name: s3-data
persistentVolumeClaim:
claimName: test-claim
restartPolicy: OnFailure
backoffLimit: 6
0/4 nodes are available: 4 Insufficient nvidia.com/gpu
This means that none of your nodes expose the nvidia.com/gpu resource to the scheduler. The 4 in 0/4 nodes is simply the total number of nodes the scheduler evaluated, not the number of GPUs requested; every one of them was rejected with Insufficient nvidia.com/gpu.
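To confirm that, look at what each node actually advertises (kubectl describe node <node> or kubectl get node <node> -o yaml). A GPU node that the NVIDIA device plugin has registered should report a fragment like the sketch below; the numbers are illustrative:

# Node status fragment expected when the NVIDIA device plugin is running on the node.
# If nvidia.com/gpu is missing here (or allocatable is 0), the scheduler rejects the
# node with "Insufficient nvidia.com/gpu".
status:
  capacity:
    cpu: "8"
    memory: 32000Mi
    nvidia.com/gpu: "1"
  allocatable:
    cpu: 7910m
    memory: 31000Mi
    nvidia.com/gpu: "1"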

Openshift oc patch not executing initdb.sql from /docker-entrypoint-initdb.d

OpenShift:
I have the below MySQL Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: mysql-master
spec:
selector:
matchLabels:
app: mysql-master
strategy:
type: Recreate
template:
metadata:
labels:
app: mysql-master
spec:
volumes:
- name: mysql-persistent-storage
persistentVolumeClaim:
claimName: ro-mstr-nfs-datadir-claim
containers:
- image: mysql:5.7
name: mysql-master
env:
- name: MYSQL_SERVER_CONTAINER
value: mysql
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: MYSQL_ROOT_PASSWORD
- name: MYSQL_DATABASE
valueFrom:
secretKeyRef:
name: mysql-secret
key: MYSQL_DATABASE
- name: MYSQL_USER
valueFrom:
secretKeyRef:
name: mysql-secret
key: MYSQL_USER
- name: MYSQL_PASSWORD
valueFrom:
secretKeyRef:
name: mysql-secret
key: MYSQL_PASSWORD
ports:
- containerPort: 3306
name: mysql-master
volumeMounts:
- name: mysql-persistent-storage
mountPath: /var/lib/mysql
I created a deployment using this YAML file, which created a deployment and a pod that is running successfully.
And I have a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: ro-mstr-mysqlinitcnfgmap
data:
initdb.sql: |-
CREATE TABLE aadhaar ( name varchar(255) NOT NULL,
sex char NOT NULL, birth DATE NOT NULL, death DATE NULL,
id int(255) NOT NULL AUTO_INCREMENT, PRIMARY KEY (id) );
CREATE USER 'usera'@'%' IDENTIFIED BY 'usera';
GRANT REPLICATION SLAVE ON *.* TO 'usera' IDENTIFIED BY 'usera';
FLUSH PRIVILEGES;
Now I need to patch the above deployment with this ConfigMap. I am using the below command:
oc patch deployment mysql-master -p '{ "spec": { "template": { "spec": { "volumes": [ { "name": "ro-mysqlinitconf-vol", "configMap": { "name": "ro-mstr-mysqlinitcnfgmap" } } ], "containers": [ { "image": "mysql:5.7", "name": "mysql-master", "volumeMounts": [ { "name": "ro-mysqlinitconf-vol", "mountPath": "/docker-entrypoint-initdb.d" } ] } ] } } } }'
The above command is successful: I validated the Deployment description, the initdb.sql file is placed inside the container, and the pod was recreated. But the issue is that it has not created the aadhaar table. I think it has not executed the initdb.sql file from /docker-entrypoint-initdb.d.
If you dive into the entrypoint script in your image (https://github.com/docker-library/mysql/blob/75f81c8e20e5085422155c48a50d99321212bf6f/5.7/docker-entrypoint.sh#L341-L350), you can see that it only runs the initdb.d files if it is also creating the database for the first time. I think maybe you assumed it always ran them on startup?
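In other words, the scripts in /docker-entrypoint-initdb.d only execute when /var/lib/mysql starts out empty, and your PVC already contains a data directory from the earlier rollout. A verification-only sketch (swapping the PVC for emptyDir, which discards data on every pod restart) would look like this:

# Hypothetical verification-only spec fragment: with an empty data directory the
# mysql:5.7 entrypoint initializes the database and executes everything in
# /docker-entrypoint-initdb.d, including initdb.sql from the ConfigMap.
spec:
  template:
    spec:
      volumes:
      - name: mysql-persistent-storage
        emptyDir: {}                    # fresh, empty /var/lib/mysql -> init scripts run
      - name: ro-mysqlinitconf-vol
        configMap:
          name: ro-mstr-mysqlinitcnfgmap
      containers:
      - name: mysql-master
        image: mysql:5.7
        volumeMounts:
        - name: mysql-persistent-storage
          mountPath: /var/lib/mysql
        - name: ro-mysqlinitconf-vol
          mountPath: /docker-entrypoint-initdb.d

For the real deployment you would instead recreate or wipe the PVC once so the entrypoint initializes the database again, or simply run the DDL manually against the running server.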

Is there a sneaky way to run a command before the entrypoint (in a k8s deployment manifest) without having to modify the dockerfile/image? [duplicate]

This official document shows that you can run a command from a YAML config file:
https://kubernetes.io/docs/tasks/configure-pod-container/
apiVersion: v1
kind: Pod
metadata:
name: hello-world
spec: # specification of the pod’s contents
restartPolicy: Never
containers:
- name: hello
image: "ubuntu:14.04"
env:
- name: MESSAGE
value: "hello world"
command: ["/bin/sh","-c"]
args: ["/bin/echo \"${MESSAGE}\""]
If I want to run more than one command, how do I do that?
command: ["/bin/sh","-c"]
args: ["command one; command two && command three"]
Explanation: The command ["/bin/sh", "-c"] says "run a shell, and execute the following instructions". The args are then passed as commands to the shell. In shell scripting a semicolon separates commands, and && conditionally runs the following command if the first succeeds. In the above example, it always runs command one followed by command two, and only runs command three if command two succeeded.
Alternative: In many cases, some of the commands you want to run are probably setting up the final command to run. In this case, building your own Dockerfile is the way to go. Look at the RUN directive in particular.
My preference is to multi-line the args; this is the simplest and easiest to read. Also, the script can be changed without affecting the image; you just need to restart the pod. For example, for a mysql dump, the container spec could be something like this:
containers:
- name: mysqldump
image: mysql
command: ["/bin/sh", "-c"]
args:
- echo starting;
ls -la /backups;
mysqldump --host=... -r /backups/file.sql db_name;
ls -la /backups;
echo done;
volumeMounts:
- ...
The reason this works is that YAML folds all the lines after the "-" into one, and sh runs one long string: "echo starting; ls... ; echo done;".
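In other words, by the time sh sees it, the multi-line args above collapse to the single-line form below (shown only to make the folding visible):

args: ["echo starting; ls -la /backups; mysqldump --host=... -r /backups/file.sql db_name; ls -la /backups; echo done;"]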
If you're willing to use a Volume and a ConfigMap, you can mount ConfigMap data as a script, and then run that script:
---
apiVersion: v1
kind: ConfigMap
metadata:
name: my-configmap
data:
entrypoint.sh: |-
#!/bin/bash
echo "Do this"
echo "Do that"
---
apiVersion: v1
kind: Pod
metadata:
name: my-pod
spec:
containers:
- name: my-container
image: "ubuntu:14.04"
command:
- /bin/entrypoint.sh
volumeMounts:
- name: configmap-volume
mountPath: /bin/entrypoint.sh
readOnly: true
subPath: entrypoint.sh
volumes:
- name: configmap-volume
configMap:
defaultMode: 0700
name: my-configmap
This cleans up your pod spec a little and allows for more complex scripting.
$ kubectl logs my-pod
Do this
Do that
If you want to avoid concatenating all commands into a single command with ; or && you can also get true multi-line scripts using a heredoc:
command:
- sh
- "-c"
- |
/bin/bash <<'EOF'
# Normal script content possible here
echo "Hello world"
ls -l
exit 123
EOF
This is handy for running existing bash scripts, but has the downside of requiring both an inner and an outer shell instance for setting up the heredoc.
I am not sure if the question is still active, but since I did not find this solution in the above answers, I decided to write it down.
I use the following approach:
readinessProbe:
exec:
command:
- sh
- -c
- |
command1
command2 && command3
I know my example is related to readinessProbe, livenessProbe, etc., but I suspect the same applies to the container commands. This provides flexibility, as it mirrors standard script writing in Bash.
IMHO the best option is to use YAML's native block scalars. Specifically in this case, the folded style block.
By invoking sh -c you can pass arguments to your container as commands, but if you want to elegantly separate them with newlines, you'd want to use the folded style block, so that YAML will know to convert newlines to whitespaces, effectively concatenating the commands.
A full working example:
apiVersion: v1
kind: Pod
metadata:
name: myapp
labels:
app: myapp
spec:
containers:
- name: busy
image: busybox:1.28
command: ["/bin/sh", "-c"]
args:
- >
command_1 &&
command_2 &&
...
command_n
Here is my successful run
apiVersion: v1
kind: Pod
metadata:
labels:
run: busybox
name: busybox
spec:
containers:
- command:
- /bin/sh
- -c
- |
echo "running below scripts"
i=0;
while true;
do
echo "$i: $(date)";
i=$((i+1));
sleep 1;
done
name: busybox
image: busybox
Here is one more way to do it, with output logging.
apiVersion: v1
kind: Pod
metadata:
labels:
type: test
name: nginx
spec:
containers:
- image: nginx
name: nginx
volumeMounts:
- name: log-vol
mountPath: /var/mylog
command:
- /bin/sh
- -c
- >
i=0;
while [ $i -lt 100 ];
do
echo "hello $i";
echo "$i : $(date)" >> /var/mylog/1.log;
echo "$(date)" >> /var/mylog/2.log;
i=$((i+1));
sleep 1;
done
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: log-vol
emptyDir: {}
Here is another way to run multi-line commands.
apiVersion: batch/v1
kind: Job
metadata:
name: multiline
spec:
template:
spec:
containers:
- command:
- /bin/bash
- -exc
- |
set +x
echo "running below scripts"
if [[ -f "if-condition.sh" ]]; then
echo "Running if success"
else
echo "Running if failed"
fi
name: ubuntu
image: ubuntu
restartPolicy: Never
backoffLimit: 1
Just to offer another possible option: Secrets can be used, since they are presented to the pod as volumes.
Secret example:
apiVersion: v1
kind: Secret
metadata:
name: secret-script
type: Opaque
data:
script_text: <<your script in b64>>
Yaml extract:
....
containers:
- name: container-name
image: image-name
command: ["/bin/bash", "/your_script.sh"]
volumeMounts:
- name: vsecret-script
mountPath: /your_script.sh
subPath: script_text
....
volumes:
- name: vsecret-script
secret:
secretName: secret-script
I know many will argue this is not what Secrets are meant to be used for, but it is an option.

Named arguments not getting picked up from my kubernetes template

I'm trying to update a kubernetes template that we have so that I can pass in arguments such as --db-config <value> when my container starts up.
This is obviously not right, because the args are not getting picked up:
...
containers:
- name: {{ .Chart.Name }}
...
args: ["--db-config", "/etc/app/cfg/db.yaml", "--tkn-config", "/etc/app/cfg/tkn.yaml"] <-- WHY IS THIS NOT WORKING
Here's an example showing your approach working:
main.go:
package main
import "flag"
import "fmt"
func main() {
db := flag.String("db-config", "default", "some flag")
tk := flag.String("tk-config", "default", "some flag")
flag.Parse()
fmt.Println("db-config:", *db)
fmt.Println("tk-config:", *tk)
}
Dockerfile [simplified]:
FROM scratch
ADD kube-flags /
ENTRYPOINT ["/kube-flags"]
Test:
docker run kube-flags:180906
db-config: default
tk-config: default
docker run kube-flags:180906 --db-config=henry
db-config: henry
tk-config: default
pod.yaml:
apiVersion: v1
kind: Pod
metadata:
name: test
spec:
containers:
- image: gcr.io/.../kube-flags:180906
imagePullPolicy: Always
name: test
args:
- --db-config
- henry
- --tk-config
- turnip
test:
kubectl logs test
db-config: henry
tk-config: turnip
