I have a train.py and, using Docker, I pushed an image from my local machine to AWS ECR. But I am getting this error:
The primary container for production variant variant-1 did not pass the ping
health check. Please check CloudWatch logs for this endpoint.
Here is the complete Dockerfile. What am I missing?
FROM python:3.7
RUN python -m pip install sagemaker-training snowflake-connector-python[pandas] \
pandas scikit-learn boto3 numpy joblib sagemaker flask gevent gunicorn
ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/ml:${PATH}"
# Set up the program in the image
COPY pred_demo_sm/train.py /opt/ml/code/train.py
COPY pred_demo_sm/serve /opt/ml/code/serve
COPY pred_demo_sm/output /opt/ml/output
COPY pred_demo_sm/model /opt/ml/model
WORKDIR /opt/ml
ENTRYPOINT [ "python3.7", "/opt/ml/code/train.py"]
Here is the complete .sh file which builds and pushes the image to AWS ECR
#!/usr/bin/env bash
# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.
# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
image=$1
if [ "$image" == "" ]
then
echo "Usage: $0 <image-name>"
exit 1
fi
chmod +x pred_demo_sm/train.py
chmod +x pred_demo_sm/serve
chmod +x pred_demo_sm/model/*
# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)
if [ $? -ne 0 ]
then
exit 255
fi
# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}
fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest"
# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1
if [ $? -ne 0 ]
then
aws ecr create-repository --repository-name "${image}" > /dev/null
fi
# Get the login command from ECR and execute it directly
aws ecr get-login-password --region "${region}" | docker login --username AWS --password-stdin "${account}".dkr.ecr."${region}".amazonaws.com
# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t ${image} .
docker tag ${image} ${fullname}
docker push ${fullname}
The training job completes successfully in SageMaker, but deploying the model fails.
Within your Dockerfile, could you try replacing the COPY, WORKDIR, and ENTRYPOINT commands with the following?
COPY pred_demo_sm /opt/program
WORKDIR /opt/program
Make sure that your serve file is executable as well.
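A minimal sketch of how that part of the Dockerfile could look, following the standard SageMaker bring-your-own-container layout; the ENV PATH line is my addition (an assumption), so that SageMaker can find the train and serve scripts by name once the ENTRYPOINT is gone:
# Copy the code directory (train and serve scripts, etc.) into the image
COPY pred_demo_sm /opt/program
WORKDIR /opt/program
# Assumption: put /opt/program on PATH so SageMaker can run "train" and "serve" directly
ENV PATH="/opt/program:${PATH}"
Without an ENTRYPOINT, SageMaker starts the container with the argument train for training jobs and serve for endpoints, so both scripts need to be executable and start with a shebang line.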
I have created a Docker image using the AmazonLinux:2 base image in my Dockerfile. This container will run as a Jenkins build agent on a Linux server and has to make certain AWS API calls. In my Dockerfile, I'm copying a shell script called assume-role.sh.
Code snippet:-
COPY ./assume-role.sh .
RUN ["chmod", "+x", "assume-role.sh"]
ENTRYPOINT ["/assume-role.sh"]
CMD ["bash", "--"]
Shell script definition:-
#!/usr/bin/env bash
#echo Your container args are: "${1} ${2} ${3} ${4} ${5}"
echo Your container args are: "${1}"
ROLE_ARN="${1}"
AWS_DEFAULT_REGION="${2:-us-east-1}"
SESSIONID=$(date +"%s")
DURATIONSECONDS="${3:-3600}"
#Temporary loggings starts here
id
pwd
ls .aws
cat .aws/credentials
#Temporary loggings ends here
# AWS STS AssumeRole
RESULT=(`aws sts assume-role --role-arn $ROLE_ARN \
--role-session-name $SESSIONID \
--duration-seconds $DURATIONSECONDS \
--query '[Credentials.AccessKeyId,Credentials.SecretAccessKey,Credentials.SessionToken]' \
--output text`)
# Setting up temporary creds
export AWS_ACCESS_KEY_ID=${RESULT[0]}
export AWS_SECRET_ACCESS_KEY=${RESULT[1]}
export AWS_SECURITY_TOKEN=${RESULT[2]}
export AWS_SESSION_TOKEN=${AWS_SECURITY_TOKEN}
echo 'AWS STS AssumeRole completed successfully'
# Making test AWS API calls
aws s3 ls
echo 'test calls completed'
I'm running the docker container like this:-
docker run -d -v $PWD/.aws:/.aws:ro -e XDG_CACHE_HOME=/tmp/go/.cache test-image arn:aws:iam::829327394277:role/myjenkins
What I'm trying to do here is mount the .aws credentials from the host directory into the container at the root level. The volume mount succeeds, and I can see the log output from these lines of the shell script:
ls .aws
cat .aws/credentials
It tells me there is a .aws folder with credentials inside it at the root level (/). However, the AWS CLI is somehow not picking them up, and as a result the remaining API calls, such as AWS STS assume-role, fail.
Can somebody please advise?
[Output of docker run]
Your container args are: arn:aws:iam::829327394277:role/myjenkins
uid=0(root) gid=0(root) groups=0(root)
/
config
credentials
[default]
aws_access_key_id = AKXXXXXXXXXXXXXXXXXXXP
aws_secret_access_key = e8SYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYxYm
Unable to locate credentials. You can configure credentials by running "aws configure".
AWS STS AssumeRole completed successfully
Unable to locate credentials. You can configure credentials by running "aws configure".
test calls completed
I finally found the issue.
The path was wrong when mounting the .aws volume into the container.
Instead of -v $PWD/.aws:/.aws:ro, it was supposed to be -v $PWD/.aws:/root/.aws:ro
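Applied to the docker run command from the question, that becomes:
docker run -d -v $PWD/.aws:/root/.aws:ro -e XDG_CACHE_HOME=/tmp/go/.cache test-image arn:aws:iam::829327394277:role/myjenkins
The container runs as root (uid=0 in the log output), so the AWS CLI looks for credentials under /root/.aws.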
I need to use the host SSH key inside Docker. For this purpose I build the image like this:
docker build -t example --build-arg ssh_prv_key="$(cat ~/.ssh/id_rsa)" -f dockerfile-dev .
If I use the docker command directly it works fine, but if I use it inside the Jenkins pipeline script I get the error below:
Running in Durability level: MAX_SURVIVABILITY
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
WorkflowScript: 92: expecting '}', found 'ssh_prv_key' @ line 92, column 116.
ev:${GIT_COMMIT} "--build-arg ssh_prv_ke
Below is the step I used in the Jenkins pipeline:
sh "docker build -t ${service_name}-dev:${GIT_COMMIT} --build-arg ssh_prv_key="$(cat ~/.ssh/id_rsa)" -f dockerfile-dev ."
And the Dockerfile looks like this:
ARG ssh_prv_key
# Authorize SSH Host
# Add the keys and set permissions
RUN mkdir -p /root/.ssh
RUN echo "$ssh_prv_key" > /root/.ssh/id_rsa && \
chmod 600 /root/.ssh/id_rsa
I solved a similar issue as follows:
Jenkins pipeline
sh "cp ~/.ssh/id_rsa id_rsa"
sh "docker build -t ${service_name}-dev:${GIT_COMMIT} -f dockerfile-dev ."
sh "rm id_rsa"
Dockerfile
# Some instructions...
ADD id_rsa id_rsa
# Now use the "id_rsa" file inside the image...
I am trying to find a "global" solution for injecting an SSH key into a container. I know that there are several solutions including docker build kit and so on...but I don't want to build an image and inject the SSH key. I want to inject the SSH key by using an existing image with docker compose.
I use the following docker compose file:
version: '3.1'
services:
  server1:
    image: XXXXXXX
    container_name: server1
    command: bash -c "/root/init.sh && python3 /root/my_python.py"
    environment:
      - MANAGED_HOST=mserver
    volumes:
      - ./init.sh:/root/init.sh
    secrets:
      - id_rsa
secrets:
  id_rsa:
    file: /home/user/.ssh/id_rsa
The init.sh is as follows:
#!/bin/bash
eval "$(ssh-agent -s)" > /dev/null
if [ ! -d "/root/.ssh/" ]; then
mkdir /root/.ssh
ssh-keyscan $MANAGED_HOST > /root/.ssh/known_hosts
fi
ssh-add -k /run/secrets/id_rsa
If I run docker compose with the command parameter
bash -c "/root/init.sh && python3 /root/my_python.py", then SSH authentication to the appropriate remote host ($MANAGED_HOST) is not working.
An agent process is running:
root 8 1 0 12:50 ? 00:00:00 ssh-agent -s
known_hosts is OK:
root@c67655d87ced:~# cat /root/.ssh/known_hosts
BLABLABLA ssh-rsa AAAAB3BLABLABLA....
and the agent is running, but the private key is not added:
root@c67655d87ced:~# ssh-add -l
Could not open a connection to your authentication agent.
Now, if I log into the container (docker exec -it server1 /bin/bash) and run the commands from init.sh one by one on the command line, then SSH authentication to the remote host ($MANAGED_HOST) works?!
Any idea how I can get this working with docker compose?
It should be enough to cause the file $HOME/.ssh/id_rsa to exist with appropriate permissions; you don't need an ssh agent running.
#!/bin/sh
if ! [ -d "$HOME/.ssh" ]; then
mkdir "$HOME/.ssh"
fi
chmod 0700 "$HOME/.ssh"
if [ -n "$MANAGED_HOST" ]; then
ssh-keyscan "$MANAGED_HOST" >> "$HOME/.ssh/known_hosts"
fi
if [ -f /run/secrets/id_rsa ]; then
cp /run/secrets/id_rsa "$HOME/.ssh/id_rsa"
chmod 0400 "$HOME/.ssh/id_rsa"
fi
# exec "$#"
A typical pattern is to use the Dockerfile ENTRYPOINT to do first-time setup tasks like this. That will get passed the CMD as arguments, and the commented exec "$#" line at the end of the file runs that as a command. You'd set this up in your image's Dockerfile like:
FROM XXXXXX
...
# Script must be executable on the host, and must start with a
# #!/bin/sh "shebang" line
COPY init.sh /root
# MUST use JSON-array form
ENTRYPOINT ["/root/init.sh"]
# Can use any Dockerfile syntax
CMD ["python3", "/root/my_python.py"]
In your specific example, you're launching init.sh as a subprocess. The ssh-agent setup sets some environment variables, like $SSH_AUTH_SOCK, but when these run as a subprocess they don't get propagated back out to the host process. You can use the standard POSIX shell . builtin (the bash source builtin is equivalent, but non-standard) to cause those environment variables to be set in the context of the parent shell:
command: sh -c ". /root/init.sh && exec python3 /root/my_python.py"
The exec replaces the shell wrapper with the Python script, which you generally want. This will also wind up being the parent process of ssh-agent, which could potentially surprise your process if it happens to exit.
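If you go the ENTRYPOINT route instead (with the exec "$@" line at the end of init.sh uncommented), the compose service no longer needs to call init.sh itself. A sketch of how the service from the question might then look, assuming the image has been rebuilt with the Dockerfile above:
services:
  server1:
    image: XXXXXXX   # rebuilt with init.sh baked in as the ENTRYPOINT
    container_name: server1
    command: python3 /root/my_python.py
    environment:
      - MANAGED_HOST=mserver
    # the ./init.sh bind mount is no longer needed; the script is in the image
    secrets:
      - id_rsa
secrets:
  id_rsa:
    file: /home/user/.ssh/id_rsa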
I have the following script that's run in my Jenkins job:
set +x
SERVICE_ACCOUNT=`cat "$GCLOUD_AUTH_FILE"`
docker login -u _json_key -p "${SERVICE_ACCOUNT}" https://gcr.io
set -x
docker pull gcr.io/$MYPROJECT/automation:master
docker run --rm --attach STDOUT -v "$(pwd)":/workspace -v "$GCLOUD_AUTH_FILE":/gcloud-auth/service_account_key.json -v /var/run/docker.sock:/var/run/docker.sock -e "BRANCH=master" -e "PROJECT=myproject" gcr.io/myproject/automation:master "/building/buildImages.sh" "myapp"
if [ $? -ne 0 ]; then
exit 1
fi
I am now trying to do this in cloudbuild.yaml so that I can run my script using my own automation image (which has a bunch of dependencies like docker, the JDK, pip, etc. installed), and mount my git folders in my workspace directory.
I tried putting cloudbuild.yaml at the top level of my git repo and set it up like this:
steps:
- name: 'gcr.io/myproject/automation:master'
  volumes:
  - name: 'current-working-dir'
    path: /mydirectory
  args: ['bash', '-c', '/building/buildImages.sh', 'myapp']
timeout: 4000s
But this gives me an error saying:
invalid build: Volume "current-working-dir" is only used by one step
Just FYI, my script buildImages.sh copies folders and Dockerfiles, runs pip install, npm, and gradle commands, and then docker build commands (kind of an all-in-one solution).
What's the way to translate my script to cloudbuild.yaml?
try this in your cloudbuild.yaml:
steps:
- name: 'gcr.io/<your-project>/<image>'
  args: ['sh', '<your-script>.sh']
Using this I was able to pull the image from Google Container Registry that has my script, then run the script using sh. It didn't matter where the script was. I'm using Alpine as the base image in my Dockerfile.
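For the volume error from the question specifically: Cloud Build rejects a named volume that is used by only one step, and the checked-out repo is already mounted at /workspace in every step, so the volumes block can simply be dropped. A sketch adapting the original step (note that with bash -c the script and its argument should go in a single string):
steps:
- name: 'gcr.io/myproject/automation:master'
  # /workspace (the cloned repo) is the default working directory of every step
  args: ['bash', '-c', '/building/buildImages.sh myapp']
timeout: 4000s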
I'd like to serve a TensorFlow model using OpenFaaS. Basically, I'd like to invoke the "serve" function in such a way that TensorFlow Serving exposes my model.
OpenFaaS is running correctly on Kubernetes and I am able to invoke functions via curl or from the UI.
I used the incubator-flask example, but I keep receiving 502 Bad Gateway all the time.
The OpenFaaS project looks like the following
serve/
- Dockerfile
stack.yaml
The inner Dockerfile is the following
FROM tensorflow/serving
RUN mkdir -p /home/app
RUN apt-get update \
&& apt-get install curl -yy
RUN echo "Pulling watchdog binary from Github." \
&& curl -sSLf https://github.com/openfaas-incubator/of-watchdog/releases/download/0.4.6/of-watchdog > /usr/bin/fwatchdog \
&& chmod +x /usr/bin/fwatchdog
WORKDIR /root/
# remove unnecessary logs from S3
ENV TF_CPP_MIN_LOG_LEVEL=3
ENV AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
ENV AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
ENV AWS_REGION=${AWS_REGION}
ENV S3_ENDPOINT=${S3_ENDPOINT}
ENV fprocess="tensorflow_model_server --rest_api_port=8501 \
--model_name=${MODEL_NAME} \
--model_base_path=${MODEL_BASE_PATH}"
# Set to true to see request in function logs
ENV write_debug="true"
ENV cgi_headers="true"
ENV mode="http"
ENV upstream_url="http://127.0.0.1:8501"
# gRPC tensorflow serving
# EXPOSE 8500
# REST tensorflow serving
# EXPOSE 8501
RUN touch /tmp/.lock
HEALTHCHECK --interval=5s CMD [ -e /tmp/.lock ] || exit 1
CMD [ "fwatchdog" ]
the stack.yaml file looks like the following
provider:
  name: faas
  gateway: https://gateway-url:8080
functions:
  serve:
    lang: dockerfile
    handler: ./serve
    image: repo/serve-model:latest
    imagePullPolicy: always
I build the image with faas-cli build -f stack.yaml and then I push it to my docker registry with faas-cli push -f stack.yaml.
When I execute faas-cli deploy -f stack.yaml -e AWS_ACCESS_KEY_ID=... I get Accepted 202 and it appears correctly among my functions. Now, I want to invoke the tensorflow serving on the model I specified in my ENV.
The way I try to make it work is to use curl in this way
curl -d '{"inputs": ["1.0", "2.0", "5.0"]}' -X POST https://gateway-url:8080/function/deploy-model/v1/models/mnist:predict
but I always obtain 502 Bad Gateway.
Does anybody have experience with OpenFaaS and TensorFlow Serving? Thanks in advance.
P.S.
If I run TensorFlow Serving without of-watchdog (basically without the OpenFaaS stuff), the model is served correctly.
Elaborating on the link mentioned by @viveksyngh:
tensorflow-serving-openfaas:
Example of packaging TensorFlow Serving with OpenFaaS to be deployed and managed through OpenFaaS with auto-scaling, scale-from-zero and a sane configuration for Kubernetes.
This example was adapted from: https://www.tensorflow.org/serving
Pre-reqs:
OpenFaaS
OpenFaaS CLI
Docker
Instructions:
Clone the repo
$ mkdir -p ~/dev/
$ cd ~/dev/
$ git clone https://github.com/alexellis/tensorflow-serving-openfaas
Clone the sample model and copy it to the function's build context
$ cd ~/dev/tensorflow-serving-openfaas
$ git clone https://github.com/tensorflow/serving
$ cp -r serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_cpu ./ts-serve/saved_model_half_plus_two_cpu
Edit the Docker Hub username
You need to edit the stack.yml file and replace alexellis2 with your Docker Hub account.
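The relevant part of stack.yml looks roughly like this (a sketch; the exact file ships with the cloned repo):
functions:
  ts-serve:
    lang: dockerfile
    handler: ./ts-serve
    image: alexellis2/ts-serve:latest   # replace alexellis2 with your Docker Hub account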
Build the function image
$ faas-cli build
You should now have a Docker image in your local library which you can deploy to a cluster with faas-cli up
Test the function locally
All OpenFaaS images can be run stand-alone without OpenFaaS installed; let's do a quick test, but replace alexellis2 with your own name.
$ docker run -p 8081:8080 -ti alexellis2/ts-serve:latest
Now in another terminal:
$ curl -d '{"instances": [1.0, 2.0, 5.0]}' \
-X POST http://127.0.0.1:8081/v1/models/half_plus_two:predict
{
"predictions": [2.5, 3.0, 4.5
]
}
From here you can run faas-cli up and then invoke your function from the OpenFaaS UI, CLI or REST API.
$ export OPENFAAS_URL=http://127.0.0.1:8080
$ curl -d '{"instances": [1.0, 2.0, 5.0]}' $OPENFAAS_URL/function/ts-serve/v1/models/half_plus_two:predict
{
"predictions": [2.5, 3.0, 4.5
]
}