We use GitHub self-hosted Actions runners on EC2 machines (m5.xlarge) as part of our CI/CD pipeline to support Docker image builds and automated testing. This has worked fine for the last year or so, but yesterday the builds suddenly started failing with the following error message:
time="2023-02-03T12:00:13Z" level=error msg="error waiting for container: unexpected EOF"
My understanding is that this is typically caused by a Docker container hitting its resource limits (CPU or memory), but given that these are m5.xlarge instances (4 vCPU and 16 GB memory) I'm a little surprised. Our builds use npm, which I understand can be quite resource hungry, but monitoring a container during its execution showed that it was nowhere near the limits of the node.
I've tried cycling the nodes, but there is no difference in behaviour. The following user-data script is used with these nodes; it connects each node to our GitHub account and makes it available for jobs. I've also tried using the latest actions-runner package, but again there was no change in behaviour. What other reasons could there be for this error? I'm a bit stumped by it.
#!/bin/sh
set -e
curl https://get.docker.com | bash
apt install -y python3-pip jq
pip3 install awscli
mkdir actions-runner && cd actions-runner
curl -O -L https://github.com/actions/runner/releases/download/v2.286.0/actions-runner-linux-x64-2.286.0.tar.gz
tar xzf ./actions-runner-linux-x64-2.286.0.tar.gz
chown -R ubuntu:ubuntu .
instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
url="https://api.github.com/orgs/<REMOVED>/actions/runners/registration-token"
token=$(curl -s -u "<REMOVED>:<REMOVED>" -X POST "$url" | jq -r .token)
sudo -u ubuntu ./config.sh \
--name "products-stage-ec2-runner-$instance_id" \
--token "$token" \
--url "https://github.com/<REMOVED>" \
--labels "<REMOVED>" \
--unattended
sudo ./svc.sh install
sudo ./svc.sh start
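One way to test the resource theory directly is to ask Docker whether it recorded an OOM kill for the failed container. The helper below is only a sketch: the name container_oom_status is hypothetical, and it assumes the failed container still exists (that is, it was not started with --rm).

```shell
#!/bin/sh
# Hypothetical helper: report whether Docker recorded an OOM kill for a container.
# Assumes the container has not been removed yet.
container_oom_status() {
    # .State.OOMKilled is "true" when the kernel OOM killer ended the container
    if [ "$(docker inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]; then
        echo "oom"
    else
        echo "not-oom"
    fi
}

# Example (the container name is a placeholder):
#   container_oom_status my-build-container
```

If this prints not-oom, the daemon's own log (on a systemd host, journalctl -u docker.service) may show whether dockerd itself restarted while the build was running, which could also cut the client's connection to the container.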
See my comment for details of how I resolved this.
This is the docker image we use to host docker-connect with the plugins
FROM confluentinc/cp-kafka-connect:5.3.1
ENV CONNECT_PLUGIN_PATH=/usr/share/java
# JDBC-MariaDB
RUN wget -nv -P /usr/share/java/kafka-connect-jdbc/ https://downloads.mariadb.com/Connectors/java/connector-java-2.4.4/mariadb-java-client-2.4.4.jar
# SNMP Source
RUN wget -nv -P /tmp/ https://github.com/name/kafka-connect-snmp/releases/download/0.0.1.11/kafka-connect-snmp-0.0.1.11.tar.gz
RUN mkdir /tmp/kafka-connect-snmp && tar -xf /tmp/kafka-connect-snmp-0.0.1.11.tar.gz -C /tmp/kafka-connect-snmp/
RUN mv /tmp/kafka-connect-snmp/usr/share/kafka-connect/kafka-connect-snmp /usr/share/java/
Now, for some reason, when the node is restarted or shut down abruptly, the plugin configurations are lost and no plugins can be found.
Could someone please point out where might be the problem or steps to verify what is going wrong?
PS: I am unable to get the logs of any kind, due to some strict policies
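Without access to logs, one possible check is the Connect worker's REST interface, which lists the plugins the worker actually loaded. This is a sketch under assumptions: the helper name list_connect_plugins is hypothetical, and the address http://localhost:8083 is the Kafka Connect default rather than anything stated above.

```shell
#!/bin/sh
# Hypothetical helper: print the class name of every plugin a Connect worker reports.
# $1 is the worker's REST base URL (assumed default: http://localhost:8083).
list_connect_plugins() {
    base_url=${1:-http://localhost:8083}
    # GET /connector-plugins returns a JSON array of objects with a "class" field;
    # split on commas and keep only the class values.
    curl -s "$base_url/connector-plugins" |
        tr ',' '\n' |
        sed -n 's/.*"class"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example:
#   list_connect_plugins http://localhost:8083
```

Comparing the output before and after a restart would show whether the JDBC and SNMP plugins are really missing from the worker.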
I'm trying to build a docker image that uses nvidia hardware decoding in gstreamer and have encountered a strange problem with making the image.
The build process does not find the nvidia cuda related stuff while running docker build (or nvidia-docker build), but when I spin up the failed image as a container and do those very same steps from within the container everything works. I even saved the container as image which gave me a persistent image that works as intended.
Has anyone experienced a similar problem who can shed some light on it?
Dockerfile:
FROM nvcr.io/nvidia/deepstream:3.0-18.11 AS base
ENV DEBIAN_FRONTEND noninteractive
#install some dependencies. NOTE - not removing apt cache for the MWE
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
libdc1394-22 \
tmux \
vim \
libjpeg-dev \
libpng-dev \
libpng12-dev \
cuda-toolkit-10-0 \
python3-setuptools \
python3-pip ninja-build pkg-config gobject-introspection gnome-devel bison flex libgirepository1.0-dev liborc-0.4-dev
RUN pip3 install meson && ldconfig
FROM base
#pull and make gstreamer:
RUN cd /tmp && mkdir gstreamer
RUN git clone https://github.com/GStreamer/gst-build.git /tmp/gstreamer \
&& cd /tmp/gstreamer \
&& git checkout tags/1.16.0 \
&& ./setup.py -Dgtk_doc=disabled -Dgst-plugins-bad:nvdec=enabled -Dgst-plugins-bad:nvenc=enabled -Dgst-plugins-bad:iqa=disabled -Dgst-plugins-bad:bluez=disabled --reconfigure \
&& ninja -C build \
&& ninja install -C build
Testing:
build and run the container. Inside the container:
$ gst-inspect-1.0 nvdec
No such element or plugin 'nvdec'
$ cd /tmp/gstreamer
$ ./setup.py -Dgtk_doc=disabled -Dgst-plugins-bad:nvdec=enabled -Dgst-plugins-bad:nvenc=enabled -Dgst-plugins-bad:iqa=disabled -Dgst-plugins-bad:bluez=disabled --reconfigure
$ ninja -C build
$ ninja install -C build
$ gst-inspect-1.0 nvdec
Factory Details:
Rank primary (256)
[... all plugin parameters show up]
GObject
+----GInitiallyUnowned
+----GstObject
+----GstElement
+----GstVideoDecoder
+----GstNvDec
EDIT1
The image builds with no errors; it is only when I try to use gstreamer that I find it was built with no acceleration. I noticed that the major difference in the build process is
meson.build:109:2: Exception: Problem encountered: The nvdec plugin was enabled explicitly, but required CUDA dependencies were not found.
which does not happen when building from within the container.
The lack of an error is most likely related to the ninja+meson build system, which looks for compatible packages and reports the exception, but doesn't abort and continues as if nothing had gone wrong.
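Because the build carries on after that exception, it can help to make the image build fail loudly when the plugin is missing. A minimal sketch, assuming a hypothetical helper named require_gst_plugin that is called right after ninja install:

```shell
#!/bin/sh
# Hypothetical helper: return non-zero when a GStreamer element or plugin is absent.
require_gst_plugin() {
    # --exists makes gst-inspect-1.0 report the result through its exit status
    if gst-inspect-1.0 --exists "$1"; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Example:
#   require_gst_plugin nvdec
```

Used as the last command of the RUN step that builds gstreamer, a missing nvdec would then stop docker build instead of producing an image without acceleration.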
EDIT2
Answering comment:
To build it and get the error, just build the attached docker image:
sudo docker build -t gst16:latest . > build.log
This will dump all the output into the build.log file.
I don't have a docker registry that I could use for this, and the docker image gets quite big by docker standards (~8 GB), but reproducing the working build is fairly simple:
sudo docker run --runtime="nvidia" -ti gst16:latest /bin/bash
or
sudo nvidia-docker run -ti gst16:latest /bin/bash
which seems to work the same for me. Notice no --rm flag! From within the container:
#check if nvidia decoder plugin is there:
gst-inspect-1.0 nvdec
#fail!
#now build it from within:
cd /tmp/gstreamer
./setup.py -Dgtk_doc=disabled -Dgst-plugins-bad:nvdec=enabled -Dgst-plugins-bad:nvenc=enabled -Dgst-plugins-bad:iqa=disabled -Dgst-plugins-bad:bluez=disabled --reconfigure
ninja -C build
ninja install -C build
gst-inspect-1.0 nvdec
#success reported
Now to get the image, exit the container (Ctrl+D) and, in the host shell:
Run sudo docker container ls -a to view all containers, including stopped ones.
Find the container created from gst16:latest and copy its CONTAINER_ID.
Run sudo docker commit <CONTAINER_ID> gst16:manual; after a few seconds the container should be saved as an image. Verify with sudo docker images.
Run the new image with sudo docker run --runtime="nvidia" --rm -ti gst16:manual /bin/bash
From within the container, run gst-inspect-1.0 nvdec again to verify it's working.
EDIT3
$ nvidia-docker --version
Docker version 18.09.0, build 4d60db4
I think I found the solution/reason.
Writing it here in case someone finds themselves in a similar situation, plus I hate finding old threads with a similar problem and no answer, or "nevermind, I solved it" as the only follow-up.
Docker build does not have any ties to the nvidia runtime, and gstreamer requires access to the full nvidia toolchain in order to build the plugins that need it. This is to be resolved with gstreamer 1.18, but until then there is no way to build gstreamer with nvidia codecs in docker build.
The workaround:
Build an image with all dependencies.
Run a container from that image using --runtime="nvidia", but don't use the --rm flag.
In the container, build gstreamer and install it as normal.
Verify with gst-inspect-1.0.
Commit the container as new image: docker commit <container_name> <temporary_image_name>
Tag the temporary image properly.
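The workaround steps could be scripted roughly as follows. This is a sketch, not a tested recipe: the function name, the container name gst-build-tmp and the image tags passed in are placeholders, it assumes the gstreamer sources sit in /tmp/gstreamer as in the question's Dockerfile, and it commits straight to the final tag instead of going through a temporary image.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual workaround.
# $1 = image that already contains all dependencies, $2 = tag for the final image.
build_gst_with_nvidia_runtime() {
    deps_image=$1
    final_image=$2
    container=gst-build-tmp   # placeholder container name

    # Build gstreamer inside a container started with the nvidia runtime.
    # No --rm, so the container still exists when it is committed below.
    docker run --runtime=nvidia --name "$container" "$deps_image" /bin/sh -c '
        cd /tmp/gstreamer &&
        ./setup.py -Dgtk_doc=disabled -Dgst-plugins-bad:nvdec=enabled \
            -Dgst-plugins-bad:nvenc=enabled -Dgst-plugins-bad:iqa=disabled \
            -Dgst-plugins-bad:bluez=disabled --reconfigure &&
        ninja -C build &&
        ninja install -C build &&
        gst-inspect-1.0 nvdec >/dev/null
    ' || return 1

    # Save the container's filesystem as the final image, then remove the container.
    docker commit "$container" "$final_image" || return 1
    docker rm "$container"
}

# Example:
#   build_gst_with_nvidia_runtime gst16:latest gst16:manual
```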
I am new to docker, and am attempting to build an image that involves performing an npm install. Some of our dependencies come from private repos we have, and I am hitting an SSH-related issue.
I realised I was not supplying any form of SSH details to my file, and came across various posts online about how to do this using args into the docker build command.
So taken from here, I have added the following to my dockerfile before the npm install command gets run:
ARG ssh_prv_key
ARG ssh_pub_key
RUN apt-get update && \
apt-get install -y \
git \
openssh-server \
libmysqlclient-dev
# Authorize SSH Host
RUN mkdir -p /root/.ssh && \
chmod 0700 /root/.ssh && \
ssh-keyscan github.com > /root/.ssh/known_hosts
# Add the keys and set permissions
RUN echo "$ssh_prv_key" > /root/.ssh/id_rsa && \
echo "$ssh_pub_key" > /root/.ssh/id_rsa.pub && \
chmod 600 /root/.ssh/id_rsa && \
chmod 600 /root/.ssh/id_rsa.pub
Running the docker build command again with the correct args supplied, I do see further activity in the console that suggests my SSH key is being utilised.
But I am getting 'no hostkey alg' messages, and
I am still getting the same 'Host key verification failed' error. I was wondering if I could view the log file it references in the error.
Do I need to get the image running in order to be able to connect to it and browse the 'root' folder?
I hope I have made sense, please be gentle I am a docker noob!
Thanks
The lines that start with ---> in the docker build output are valid Docker image IDs. You can pick any of these and docker run them:
docker run --rm -it 59c45dac474a sh
If a step is actually failing, one useful debugging trick is to launch the image built in the step before it and run the command by hand.
Remember that anyone who has your image can do this; the way you’ve built it, if you ever push your image to any repository, your ssh private key is there for the taking, and you should probably consider it compromised. That’s doubly true since it will also be there in plain text in docker history output.
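To see what that warning means in practice, the image history can be searched for key material. The helper below is a sketch: its name image_history_leaks_key is hypothetical, and it only looks for the PEM "PRIVATE KEY" marker, so a clean result does not prove an image is safe.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an image's build history contains a PEM private-key marker.
image_history_leaks_key() {
    # --no-trunc prints each layer's full creating command, including build-arg values
    docker history --no-trunc --format '{{.CreatedBy}}' "$1" | grep -q 'PRIVATE KEY'
}

# Example (the image name is a placeholder):
#   image_history_leaks_key my-image && echo "key is visible in docker history"
```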
I have a project that is a wrapper for opencv library, written in Rust.
In order to be able to test it I have to build opencv itself. I then cache it, but the cold build time is higher than 50 minutes and the job gets killed.
How could this timeout be increased? For example, I have 50min per job timeout, but I'd like to have 500 minutes per 10 jobs, so I can run my first cold start build for say 90 minutes and then run fast build for 10 minutes each.
I don't know if it's possible so I'm looking for any workaround. Here is my script which takes most of time:
#!/bin/bash
set -eux -o pipefail
OPENCV_VERSION=${OPENCV_VERSION:-3.4.0}
URL=https://github.com/opencv/opencv/archive/${OPENCV_VERSION}.zip
URL_CONTRUB=https://github.com/opencv/opencv_contrib/archive/${OPENCV_VERSION}.zip
INSTALL_DIR="$HOME/usr/installed-${OPENCV_VERSION}"
# Skip the build when the marker file from a previous (cached) install exists
if [[ ! -e "$INSTALL_DIR" ]]; then
    TMP=$(mktemp -d)
    OPENCV_DIR="$(pwd)/opencv-${OPENCV_VERSION}"
    OPENCV_CONTRIB_DIR="$(pwd)/opencv_contrib-${OPENCV_VERSION}"
    if [[ ! -d "${OPENCV_DIR}/build" ]]; then
        curl -sL "${URL}" > "${TMP}/opencv.zip"
        unzip -q "${TMP}/opencv.zip"
        rm "${TMP}/opencv.zip"
        curl -sL "${URL_CONTRUB}" > "${TMP}/opencv_contrib.zip"
        unzip -q "${TMP}/opencv_contrib.zip"
        rm "${TMP}/opencv_contrib.zip"
        mkdir "$OPENCV_DIR/build"
    fi
    pushd "$OPENCV_DIR/build"
    cmake \
        -D WITH_CUDA=ON \
        -D BUILD_EXAMPLES=OFF \
        -D BUILD_TESTS=OFF \
        -D BUILD_PERF_TESTS=OFF \
        -D BUILD_opencv_java=OFF \
        -D BUILD_opencv_python=OFF \
        -D BUILD_opencv_python2=OFF \
        -D BUILD_opencv_python3=OFF \
        -D CMAKE_INSTALL_PREFIX="$HOME/usr" \
        -D CMAKE_BUILD_TYPE=Release \
        -D OPENCV_EXTRA_MODULES_PATH="$OPENCV_CONTRIB_DIR/modules" \
        -D CUDA_ARCH_BIN=5.2 \
        -D CUDA_ARCH_PTX="" \
        ..
    make -j4
    # Create the marker file that the check at the top looks for
    make install && touch "$INSTALL_DIR"
    popd
    touch "$HOME/fresh-cache"
fi
sudo cp -r $HOME/usr/include/* /usr/local/include/
sudo cp -r $HOME/usr/lib/* /usr/local/lib/
How could this timeout be increased?
According to the Travis docs it's not possible: the timeout is fixed at 50 min (travis-ci.org) or 120 min (travis-ci.com).
You could consider upgrading your Travis plan. However, the real problem is not the timeout but the need to build a huge library before each build. Even though caching improves the situation a bit, it's still bad.
There are some ways to reduce the build time (per build); which fits best depends on your situation, of course.
A. PPA
If you are lucky and there's a PPA shipping a suitable version of OpenCV, you can use that one. Travis runs Ubuntu 14.04 Trusty.
B. Pre-build binaries
You can always build OpenCV yourself and upload the pre-built binaries to, e.g., a server or a different Git repo. Travis can then download and install them from there.
C. Docker
Docker is imo the best approach to this. Either create a custom Docker image or use an existing one (there are enough around). Good places to start looking are DockerHub and GitHub. In addition, this approach lets you pack any further dependencies, compilers, and so on: simply everything you need.
D. Contact Travis
You can always drop an issue at Travis and ask for an updated version of OpenCV.
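Option B could look roughly like the sketch below. The function names and the archive name are hypothetical; it assumes OpenCV was installed under $HOME/usr as in the question's script, and it leaves out the upload and download steps because they depend on where the archive is hosted.

```shell
#!/bin/sh
# Hypothetical helpers for shipping a pre-built install tree between machines.

# Pack everything under an install prefix ($1) into a gzip-compressed tarball ($2).
pack_prebuilt() {
    tar -czf "$2" -C "$1" .
}

# Unpack such a tarball ($1) into an install prefix ($2), creating it if needed.
unpack_prebuilt() {
    mkdir -p "$2" && tar -xzf "$1" -C "$2"
}

# Example:
#   pack_prebuilt "$HOME/usr" opencv-prebuilt.tar.gz      # on the build machine
#   unpack_prebuilt opencv-prebuilt.tar.gz "$HOME/usr"    # in the CI job
```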
The question is simple:
How do I start a complete desktop environment (KDE, Xfce, GNOME, it doesn't matter which) in a remote Docker container?
I have been digging around the internet and there are lots of questions on related topics, but none quite the same: they are all about how to run a single GUI application, not a full desktop.
What I found out:
It is necessary to run Xvfb
Somehow run e.g. Xfce in that framebuffer
Allow x11vnc to share that running X environment
But I'm stuck here, always getting errors such as:
... (EE) Invalid screen configuration 1024x768 for -screen 0
... Cannot open /dev/tty0 (No such file or directory)
Could you give some Dockerfile lines to reach this goal?
This is what I was looking for, the simplest form of a desktop in Docker:
FROM ubuntu
RUN apt-get update
RUN apt-get install xfce4 -y
RUN apt-get install xfce4-goodies -y
RUN apt-get purge -y pm-utils xscreensaver*
RUN apt-get install wget -y
EXPOSE 5901
RUN wget -qO- https://dl.bintray.com/tigervnc/stable/tigervnc-1.8.0.x86_64.tar.gz | tar xz --strip 1 -C /
RUN mkdir ~/.vnc
RUN echo "123456" | vncpasswd -f >> ~/.vnc/passwd
RUN chmod 600 ~/.vnc/passwd
CMD ["/usr/bin/vncserver", "-fg"]
Unfortunately I could not get x11vnc and Xvfb working, but TigerVNC turned out to be much better.
This sample produces a container with the Xfce GUI and runs vncserver with the password 123456. There is no need to overwrite ~/.vnc/xstartup manually because TigerVNC starts up an X server by default!
To run the server:
sudo docker run --rm -dti -p 5901:5901 3ab3e0e7cb
To connect there with vncviewer:
vncviewer -AutoSelect 0 -QualityLevel 9 -CompressLevel 0 192.168.1.100:5901
Also, you need not worry about screen resolution, because by default it will resize to fit your screen.
You may also encounter an issue with ipc_channel_posix (Chrome and other browsers will not work properly); to eliminate it, run the container with a larger shared-memory size:
docker run -d --shm-size=2g --privileged -p 5901:5901 image-name
x11docker allows you to run desktop environments as well as single GUI applications in docker.
Could you give some Dockerfile lines to reach this goal?
Example desktop images on docker hub.
x11docker does a lot of setup to keep container isolation and provides some additional options like hardware acceleration or pulseaudio sound. Example:
x11docker --desktop x11docker/lxde
x11docker also supports network setups with SSH, VNC and HTML5
Example for SSH setup with xpra:
read Xenv < <(x11docker --xdummy --display=30 x11docker/lxde pcmanfm)
echo $Xenv && export $Xenv
# replace "start" with "start-desktop" to forward a desktop environment
xpra start :30 --use-display --start-via-proxy=no
From client system, connect with
xpra attach ssh:HOSTNAME:30 # replace HOSTNAME with IP or host name of ssh server
Without x11docker:
A quite short setup using Xephyr as nested X server on host is:
Xephyr :1
docker run -v /tmp/.X11-unix/X1:/tmp/.X11-unix/X1:rw \
-e DISPLAY=:1 \
x11docker/xfce
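The volume path in the command above follows from the display number: a local X display :N listens on the Unix socket /tmp/.X11-unix/XN. A small sketch of that mapping, using a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the Unix socket path used by a local X display such as ":1".
x11_socket_path() {
    display=${1#:}          # strip the leading colon (":1" becomes "1")
    display=${display%%.*}  # drop any screen suffix ("1.0" becomes "1")
    echo "/tmp/.X11-unix/X$display"
}

# Example: share display :1 (the Xephyr instance above) with a container
#   sock=$(x11_socket_path :1)
#   docker run -v "$sock:$sock:rw" -e DISPLAY=:1 x11docker/xfce
```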
A short Dockerfile with Xfce desktop:
FROM debian:stretch
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y --no-install-recommends xfce4 dbus-x11
CMD startxfce4