OpenMapTiles Docker container doesn't start with the previous configuration

I created an OpenMapTiles container:
using a volume for the /data directory
using the image klokantech/openmaptiles-server:1.6.
The container started nicely. I downloaded the planet file, and the service was working fine.
Since I am going to push this to production: if the container dies, my orchestration system (Kubernetes) will restart it automatically, and I want it to pick up the previous configuration (so it doesn't need to download the planet file again or be configured again).
So I killed my container and restarted it using the same volume as before.
Problem: when my container was restarted, OpenMapTiles didn't have the previous configuration, and I got this error in the UI:
OpenMapTiles Server is designed to work with data downloaded from OpenMapTiles.com, the following files are unknown and will not be used:
osm-2018-04-09-v3.8-planet.mbtiles
Also, in the logs, this appeared:
/usr/lib/python2.7/dist-packages/supervisor/options.py:298: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2018-05-09 09:20:18,359 CRIT Supervisor running as root (no user in config file)
2018-05-09 09:20:18,359 INFO Included extra file "/etc/supervisor/conf.d/openmaptiles.conf" during parsing
2018-05-09 09:20:18,382 INFO Creating socket tcp://localhost:8081
2018-05-09 09:20:18,383 INFO Closing socket tcp://localhost:8081
2018-05-09 09:20:18,399 INFO RPC interface 'supervisor' initialized
2018-05-09 09:20:18,399 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2018-05-09 09:20:18,399 INFO supervisord started with pid 1
2018-05-09 09:20:19,402 INFO spawned: 'wizard' with pid 11
2018-05-09 09:20:19,405 INFO spawned: 'xvfb' with pid 12
2018-05-09 09:20:20,407 INFO success: wizard entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
2018-05-09 09:20:20,407 INFO success: xvfb entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Starting OpenMapTiles Map Server (action: run)
Existing configuration found in /data/config.json
Data file "undefined" not found!
Starting installation...
Installation wizard started at http://:::80/
List of available downloads ready.
And I guess it's this undefined in the config that is causing the problem:
Existing configuration found in /data/config.json
Data file "undefined" not found!
This is my config file:
root@maptiles-0:/# cat /data/config.json
{
"styles": {
"standard": [
"dark-matter",
"klokantech-basic",
"osm-bright",
"positron"
],
"custom": [],
"lang": "",
"langLatin": true,
"langAlts": true
},
"settings": {
"serve": {
"vector": true,
"raster": true,
"services": true,
"static": true
},
"raster": {
"format": "PNG_256",
"hidpi": 2,
"maxsize": 2048
},
"server": {
"title": "",
"redirect": "",
"domains": []
},
"memcache": {
"size": 23.5,
"servers": [
"localhost:11211"
]
}
}
}
Should I mount a new volume somewhere else? Should I change my /data/config.json? I have no idea how to make it safe for the container to be killed and restarted.

I fixed this using the image klokantech/tileserver-gl:v2.3.1. With this image, you can download the vector tiles as an MBTiles file from OpenMapTiles Downloads.
You can find the instructions here: https://openmaptiles.org/docs/host/tileserver-gl/
Also, I deployed it to Kubernetes using the following StatefulSet:
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
labels:
name: maptiles
name: maptiles
spec:
replicas: 2
selector:
matchLabels:
name: maptiles
serviceName: maptiles
template:
metadata:
labels:
name: maptiles
spec:
containers:
- name: maptiles
command: ["/bin/sh"]
args:
- -c
- |
echo "[INFO] Startingcontainer"; if [ $(DOWNLOAD_MBTILES) = "true" ]; then
echo "[INFO] Download MBTILES_PLANET_URL";
rm /data/*
cd /data/
wget -q -c $(MBTILES_PLANET_URL)
echo "[INFO] Download finished";
fi; echo "[INFO] Start app in /usr/src/app"; cd /usr/src/app && npm install --production && /usr/src/app/run.sh;
env:
- name: MBTILES_PLANET_URL
value: 'https://openmaptiles.com/download/W...'
- name: DOWNLOAD_MBTILES
value: 'true'
livenessProbe:
failureThreshold: 120
httpGet:
path: /health
port: 80
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 5
ports:
- containerPort: 80
name: http
protocol: TCP
readinessProbe:
failureThreshold: 120
httpGet:
path: /health
port: 80
scheme: HTTP
initialDelaySeconds: 10
periodSeconds: 30
successThreshold: 1
timeoutSeconds: 5
resources:
limits:
cpu: 500m
memory: 4Gi
requests:
cpu: 100m
memory: 2Gi
volumeMounts:
- mountPath: /data
name: maptiles
volumeClaimTemplates:
- metadata:
creationTimestamp: null
name: maptiles
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 60Gi
storageClassName: standard
I first deploy it with DOWNLOAD_MBTILES='true' and afterwards change it to DOWNLOAD_MBTILES='false' (so it doesn't wipe the map the next time it is deployed).
I tested it: with DOWNLOAD_MBTILES='false', you can kill the containers and they come back up in a minute or so.
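A possible refinement, sketched but not fully tested: make the download step idempotent by checking for an existing .mbtiles file, so DOWNLOAD_MBTILES doesn't need to be flipped between deploys. The script below reuses the MBTILES_PLANET_URL environment variable from the StatefulSet above (read as a plain shell variable here instead of the $(...) Kubernetes substitution):

echo "[INFO] Starting container";
if ls /data/*.mbtiles >/dev/null 2>&1; then
  echo "[INFO] Existing MBTiles found, skipping download";
else
  echo "[INFO] No MBTiles found, downloading MBTILES_PLANET_URL";
  cd /data/
  wget -q -c "$MBTILES_PLANET_URL"
  echo "[INFO] Download finished";
fi
echo "[INFO] Start app in /usr/src/app";
cd /usr/src/app && npm install --production && /usr/src/app/run.sh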

Related

Flask app deployment failed (Liveness probe failed) after "gcloud builds submit ..."

I am a newbie in frontend/backend/DevOps, but I need to use Kubernetes to deploy an app on Google Cloud Platform (GCP) to provide a service. I started learning by following this series of tutorials:
https://mickeyabhi1999.medium.com/build-and-deploy-a-web-app-with-react-flask-nginx-postgresql-docker-and-google-kubernetes-e586de159a4d
https://medium.com/swlh/build-and-deploy-a-web-app-with-react-flask-nginx-postgresql-docker-and-google-kubernetes-341f3b4de322
And the code of this tutorial series is here: https://github.com/abhiChakra/Addition-App
Everything was fine until the last step: using "gcloud builds submit ..." to build
nginx+react service
flask+wsgi service
nginx+react deployment
flask+wsgi deployment
on a GCP cluster.
The first three went well and their status was "OK", but the status of the flask+wsgi deployment was "Does not have minimum availability" even after many restarts.
I used "kubectl get pods" and saw the status of the flask pod was "CrashLoopBackOff".
Then I followed the processes of debugging suggested here:
https://containersolutions.github.io/runbooks/posts/kubernetes/crashloopbackoff/
I used "kubectl describe pod flask" to look into the problem of the flask pod. Then I found the "Exit Code" was 139 and there were messages "Liveness probe failed: Get "http://10.24.0.25:8000/health": read tcp 10.24.0.1:55470->10.24.0.25:8000: read: connection reset by peer" and "Readiness probe failed: Get "http://10.24.0.25:8000/ready": read tcp 10.24.0.1:55848->10.24.0.25:8000: read: connection reset by peer".
The complete log:
Name: flask-676d5dd999-cf6kt
Namespace: default
Priority: 0
Node: gke-addition-app-default-pool-89aab4fe-3l1q/10.140.0.3
Start Time: Thu, 11 Nov 2021 19:06:24 +0800
Labels: app.kubernetes.io/managed-by=gcp-cloud-build-deploy
component=flask
pod-template-hash=676d5dd999
Annotations: <none>
Status: Running
IP: 10.24.0.25
IPs:
IP: 10.24.0.25
Controlled By: ReplicaSet/flask-676d5dd999
Containers:
flask:
Container ID: containerd://5459b747e1d44046d283a46ec1eebb625be4df712340ff9cf492d5583a4d41d2
Image: gcr.io/peerless-garage-330917/addition-app-flask:latest
Image ID: gcr.io/peerless-garage-330917/addition-app-flask@sha256:b45d25ffa8a0939825e31dec1a6dfe84f05aaf4a2e9e43d35084783edc76f0de
Port: 8000/TCP
Host Port: 0/TCP
State: Running
Started: Fri, 12 Nov 2021 17:24:14 +0800
Last State: Terminated
Reason: Error
Exit Code: 139
Started: Fri, 12 Nov 2021 17:17:06 +0800
Finished: Fri, 12 Nov 2021 17:19:06 +0800
Ready: False
Restart Count: 222
Limits:
cpu: 1
Requests:
cpu: 400m
Liveness: http-get http://:8000/health delay=120s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8000/ready delay=120s timeout=1s period=5s #success=1 #failure=3
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-s97x5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-s97x5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-s97x5
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 9m7s (x217 over 21h) kubelet (combined from similar events): Liveness probe failed: Get "http://10.24.0.25:8000/health": read tcp 10.24.0.1:48636->10.24.0.25:8000: read: connection reset by peer
Warning BackOff 4m38s (x4404 over 22h) kubelet Back-off restarting failed container
Following the suggestion here:
https://containersolutions.github.io/runbooks/posts/kubernetes/crashloopbackoff/#step-4
I increased the "initialDelaySeconds" to 120, but it still failed.
Since I made sure that everything worked fine on my local laptop, I think there could be some connection or authentication issue.
To be more detailed, the deployment.yaml looks like:
apiVersion: v1
kind: Service
metadata:
name: ui
spec:
type: LoadBalancer
selector:
app: react
tier: ui
ports:
- port: 8080
targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: flask
spec:
type: ClusterIP
selector:
component: flask
ports:
- port: 8000
targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: flask
spec:
replicas: 1
selector:
matchLabels:
component: flask
template:
metadata:
labels:
component: flask
spec:
containers:
- name: flask
image: gcr.io/peerless-garage-330917/addition-app-flask:latest
imagePullPolicy: "Always"
resources:
limits:
cpu: "1000m"
requests:
cpu: "400m"
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
ports:
- containerPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: ui
spec:
replicas: 1
selector:
matchLabels:
app: react
tier: ui
template:
metadata:
labels:
app: react
tier: ui
spec:
containers:
- name: ui
image: gcr.io/peerless-garage-330917/addition-app-nginx:latest
imagePullPolicy: "Always"
resources:
limits:
cpu: "1000m"
requests:
cpu: "400m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 30
periodSeconds: 5
ports:
- containerPort: 8080
docker-compose.yaml:
# we will be creating these services
services:
flask:
# Note that we are building from our current terminal directory where our Dockerfile is located, so we use .
build: .
# naming our resulting container
container_name: flask
# publishing a port so that external services requesting port 8000 on your local machine
# are mapped to port 8000 on our container
ports:
- "8000:8000"
nginx:
# Since our Dockerfile for the web server is located in the react-app folder, our build context is ./react-app
build: ./react-app
container_name: nginx
ports:
- "8080:8080"
Nginx Dockerfile:
# first building react project, using node base image
FROM node:10 as build-stage
# setting working dir inside container
WORKDIR /react-app
# required to install packages
COPY package*.json ./
# installing npm packages
RUN npm install
# copying over react source material
COPY src ./src
# copying over further react material
COPY public ./public
# copying over our nginx config file
COPY addition_container_server.conf ./
# creating production build to serve through nginx
RUN npm run build
# starting second, nginx build-stage
FROM nginx:1.15
# removing default nginx config file
RUN rm /etc/nginx/conf.d/default.conf
# copying our nginx config
COPY --from=build-stage /react-app/addition_container_server.conf /etc/nginx/conf.d/
# copying production build from last stage to serve through nginx
COPY --from=build-stage /react-app/build/ /usr/share/nginx/html
# exposing port 8080 on container
EXPOSE 8080
CMD ["nginx", "-g", "daemon off;"]
Nginx server config:
server {
listen 8080;
# location of react build files
root /usr/share/nginx/html/;
# index html from react build to serve
index index.html;
# ONLY KUBERNETES RELEVANT: endpoint for health checkup
location /health {
return 200 "health ok";
}
# ONLY KUBERNETES RELEVANT: endpoint for readiness checkup
location /ready {
return 200 "ready";
}
# html file to serve with / endpoint
location / {
try_files $uri /index.html;
}
# proxing under /api endpoint
location /api {
client_max_body_size 10m;
add_header 'Access-Control-Allow-Origin' http://<NGINX_SERVICE_ENDPOINT>:8080;
proxy_pass http://flask:8000/;
}
}
There are two important functions in App.js:
...
insertCalculation(event, calculation){
/*
Making a POST request via a fetch call to Flask API with numbers of a
calculation we want to insert into DB. Making fetch call to web server
IP with /api/insert_nums which will be reverse proxied via Nginx to the
Application (Flask) server.
*/
event.preventDefault();
fetch('http://<NGINX_SERVICE_ENDPOINT>:8080/api/insert_nums', {method: 'POST',
mode: 'cors',
headers: {
'Content-Type' : 'application/json'
},
body: JSON.stringify(calculation)}
).then((response) => {
...
getHistory(event){
/*
Making a GET request via a fetch call to Flask API to retrieve calculations history.
*/
event.preventDefault()
fetch('http://<NGINX_SERVICE_ENDPOINT>:8080/api/data', {method: 'GET',
mode: 'cors'
}
).then(response => {
...
Flask Dockerfile:
# using base image
FROM python:3.8
# setting working dir inside container
WORKDIR /addition_app_flask
# adding run.py to workdir
ADD run.py .
# adding config.ini to workdir
ADD config.ini .
# adding requirements.txt to workdir
ADD requirements.txt .
# installing flask requirements
RUN pip install -r requirements.txt
# adding in all contents from flask_app folder into a new flask_app folder
ADD ./flask_app ./flask_app
# exposing port 8000 on container
EXPOSE 8000
# serving the Flask backend through the gevent WSGIServer started in run.py
CMD [ "python", "run.py" ]
run.py:
from gevent.pywsgi import WSGIServer
from flask_app.app import app
# As Flask's built-in server is not suitable for production, we will use
# a WSGIServer instance to serve our Flask application.
if __name__ == '__main__':
WSGIServer(('0.0.0.0', 8000), app).serve_forever()
app.py:
from flask import Flask, request, jsonify
from flask_app.storage import insert_calculation, get_calculations
app = Flask(__name__)
@app.route('/')
def index():
return "My Addition App", 200
@app.route('/health')
def health():
return '', 200
@app.route('/ready')
def ready():
return '', 200
@app.route('/data', methods=['GET'])
def data():
'''
Function used to get calculations history
from Postgres database and return to fetch call in frontend.
:return: Json format of either collected calculations or error message
'''
calculations_history = []
try:
calculations = get_calculations()
for key, value in calculations.items():
calculations_history.append(value)
return jsonify({'calculations': calculations_history}), 200
except:
return jsonify({'error': 'error fetching calculations history'}), 500
@app.route('/insert_nums', methods=['POST'])
def insert_nums():
'''
Function used to insert a calculation into our postgres
DB. Operands of operation received from frontend.
:return: Json format of either success or failure response.
'''
insert_nums = request.get_json()
firstNum, secondNum, answer = insert_nums['firstNum'], insert_nums['secondNum'], insert_nums['answer']
try:
insert_calculation(firstNum, secondNum, answer)
return jsonify({'Response': 'Successfully inserted into DB'}), 200
except:
return jsonify({'Response': 'Unable to insert into DB'}), 500
I can't tell what is going wrong, and I also wonder what a better way to debug such a cloud deployment would be. In normal programs we can set breakpoints and print or log something to pinpoint the code that causes the problem; in a cloud deployment, however, I have lost my sense of direction for debugging.
...Exit Code was 139...
Exit code 139 is 128 + 11, i.e. the container was killed by SIGSEGV, which could mean there's a bug in your Flask app (or in one of its native dependencies). You can start with a minimal spec instead of trying to do everything in one go:
apiVersion: v1
kind: Pod
metadata:
name: flask
labels:
component: flask
spec:
containers:
- name: flask
image: gcr.io/peerless-garage-330917/addition-app-flask:latest
ports:
- containerPort: 8000
See if your pod starts correctly. If it does, try connecting to it with kubectl port-forward <flask pod name> 8000:8000, followed by curl localhost:8000/health. You should watch your application the whole time with kubectl logs -f <flask pod name>.
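Putting those steps together (flask-pod.yaml is just a placeholder name for the minimal spec above saved to a file):

kubectl apply -f flask-pod.yaml        # the minimal Pod spec above, saved to a file
kubectl get pod flask -w               # wait for Running (or catch CrashLoopBackOff)
kubectl logs -f flask                  # stream the application logs
kubectl port-forward flask 8000:8000   # in a second terminal
curl -v localhost:8000/health          # should return HTTP 200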
Thanks for @gohm'c's response! It is a good suggestion to isolate the different parts and start from a smaller component. As suggested, I tried deploying a single Flask pod first. Then I used
kubectl port-forward flask 8000:8000
to map the port to my local machine. After using
curl localhost:8000/health
to access the port, it showed
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
Handling connection for 8000
E1112 18:52:15.874759 300145 portforward.go:400] an error occurred forwarding 8000 -> 8000: error forwarding port 8000 to pod 4870b939f3224f968fd5afa4660a5af7d10e144ee85149d69acff46a772e94b1, uid : failed to execute portforward in network namespace "/var/run/netns/cni-32f718f0-1248-6da4-c726-b2a5bf1918db": read tcp4 127.0.0.1:38662->127.0.0.1:8000: read: connection reset by peer
At this moment, using
kubectl logs -f flask
returned empty response.
So there is indeed some issue in the Flask app.
The health probe is a really simple function in app.py:
@app.route('/health')
def health():
return '', 200
How can I know if the route setting is wrong or not?
Is it because of the WSGIServer in run.py?
from gevent.pywsgi import WSGIServer
from flask_app.app import app
# As Flask's built-in server is not suitable for production, we will use
# a WSGIServer instance to serve our Flask application.
if __name__ == '__main__':
WSGIServer(('0.0.0.0', 8000), app).serve_forever()
If we look at the Dockerfile, it seems to expose the correct port 8000.
If I directly run
python run.py
on my laptop, I can successfully access localhost:8000.
How can I debug with this kind of problem?
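One way I could try to narrow this down (a sketch of the idea only; the image tag and container name below are local placeholders) is to reproduce the crash outside Kubernetes with plain Docker, since exit code 139 is a segfault rather than an HTTP error:

docker build -t addition-app-flask:debug .    # build the Flask image from the Dockerfile above
docker run --name flask-debug -p 8000:8000 addition-app-flask:debug
# in another terminal:
curl -v localhost:8000/health                 # does it crash (exit code 139 / SIGSEGV) here too?
docker inspect flask-debug --format '{{.State.ExitCode}}'   # check the exit code after it stops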

Openshift missing permissions to create a file

The Spring Boot application is deployed on OpenShift 4. This application needs to create a file on the NFS share.
The OpenShift container has a volume mount of type NFS configured.
The container on OpenShift runs in a pod with a random user ID:
sh-4.2$ id
uid=1031290500(1031290500) gid=0(root) groups=0(root),1031290500
The mount point is /nfs/abc
sh-4.2$ ls -la /nfs/
ls: cannot access /nfs/abc: Permission denied
total 0
drwxr-xr-x. 1 root root 29 Nov 25 09:34 .
drwxr-xr-x. 1 root root 50 Nov 25 10:09 ..
d?????????? ? ? ? ? ? abc
In the Docker image I created a user "technical" with uid=gid=48760, as shown below.
FROM quay.repository
MAINTAINER developer
LABEL description="abc image" \
name="abc" \
version="1.0"
ARG APP_HOME=/opt/app
ARG PORT=8080
ENV JAR=app.jar \
SPRING_PROFILES_ACTIVE=default \
JAVA_OPTS=""
RUN mkdir $APP_HOME
ADD $JAR $APP_HOME/
WORKDIR $APP_HOME
EXPOSE $PORT
ENTRYPOINT java $JAVA_OPTS -Dspring.profiles.active=$SPRING_PROFILES_ACTIVE -jar $JAR
My deployment config file is shown below:
spec:
volumes:
- name: bad-import-file
persistentVolumeClaim:
claimName: nfs-test-pvc
containers:
- resources:
limits:
cpu: '1'
memory: 1Gi
requests:
cpu: 500m
memory: 512Mi
terminationMessagePath: /dev/termination-log
name: abc
env:
- name: SPRING_PROFILES_ACTIVE
valueFrom:
configMapKeyRef:
name: abc-configmap
key: spring.profiles.active
- name: DB_URL
valueFrom:
configMapKeyRef:
name: abc-configmap
key: db.url
- name: DB_USERNAME
valueFrom:
configMapKeyRef:
name: abc-configmap
key: db.username
- name: BAD_IMPORT_PATH
valueFrom:
configMapKeyRef:
name: abc-configmap
key: bad.import.path
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: abc-secret
key: db.password
ports:
- containerPort: 8080
protocol: TCP
imagePullPolicy: IfNotPresent
volumeMounts:
- name: bad-import-file
mountPath: /nfs/abc
dnsPolicy: ClusterFirst
securityContext:
runAsGroup: 44337
runAsNonRoot: true
supplementalGroups:
- 44337
The PV definition is as follows:
apiVersion: v1
kind: PersistentVolume
metadata:
name: abc-tuc-pv
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteMany
persistentVolumeReclaimPolicy: Retain
storageClassName: classic-nfs
mountOptions:
- hard
- nfsvers=3
nfs:
path: /tm03v06_vol3014
server: tm03v06cl02.jit.abc.com
readOnly: false
Now the OpenShift user has the following ID:
sh-4.2$ id
uid=1031290500(1031290500) gid=44337(technical) groups=44337(technical),1031290500
RECENT UPDATE
Just to be clear about the problem, below are two commands run in the same pod terminal.
sh-4.2$ cd /nfs/
sh-4.2$ ls -la (The first command I tried immediately after pod creation.)
total 8
drwxr-xr-x. 1 root root 29 Nov 29 08:20 .
drwxr-xr-x. 1 root root 50 Nov 30 08:19 ..
drwxrwx---. 14 technical technical 8192 Nov 28 19:06 abc
sh-4.2$ ls -la (a few seconds later, on the same pod terminal)
ls: cannot access abc: Permission denied
total 0
drwxr-xr-x. 1 root root 29 Nov 29 08:20 .
drwxr-xr-x. 1 root root 50 Nov 30 08:19 ..
d?????????? ? ? ? ? ? abc
So the problem is that I see these question marks (???) on the mount point.
The mount itself works, but I cannot access the /nfs/abc directory, and I see these ????? entries for some reason.
UPDATE
sh-4.2$ ls -la /nfs/abc/
ls: cannot open directory /nfs/abc/: Stale file handle
sh-4.2$ ls -la /nfs/abc/ (a few seconds later, on the same pod terminal)
ls: cannot access /nfs/abc/: Permission denied
Could this STALE FILE HANDLE be the reason for this issue?
TL;DR
You can use the anyuid security context to run the pod to avoid having OpenShift assign an arbitrary UID, and set the permissions on the volume to the known UID of the user.
OpenShift overrides the user ID that the image itself may specify it should run as:
The user ID isn't actually entirely random, but is an assigned user ID which is unique to your project. In fact, your project is assigned a range of user IDs that applications can be run as. The set of user IDs will not overlap with other projects. You can see what range is assigned to a project by running oc describe on the project.
The purpose of assigning each project a distinct range of user IDs is so that in a multitenant environment, applications from different projects never run as the same user ID. When using persistent storage, any files created by applications will also have different ownership in the file system.
... this is a blessing and a curse when using shared persistent volume claims, for example (e.g. PVCs mounted ReadWriteMany with multiple pods that read/write data: files created by one pod won't be accessible by another pod because of the mismatched file ownership and permissions).
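To illustrate the oc describe tip above (the project name is hypothetical; the annotation names are the ones OpenShift sets on a project, and the range shown reuses the UID from the question):

oc describe project my-project | grep scc
# openshift.io/sa.scc.supplemental-groups=1031290500/10000
# openshift.io/sa.scc.uid-range=1031290500/10000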
One way to get around this issue is using the anyuid security context which "provides all features of the restricted SCC, but allows users to run with any UID and any GID".
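A minimal sketch of granting that SCC (the service account name and namespace here are assumptions; use whatever your deployment actually runs as):

# allow pods using the "default" service account in namespace "my-namespace" to run with any UID/GID
oc adm policy add-scc-to-user anyuid -z default -n my-namespace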
When using the anyuid security context, we know the user and group IDs the pod(s) are going to run as, and we can set the permissions on the shared volume in advance (rather than every pod running with the restricted security context and an arbitrary UID by default).
When running the pod with the anyuid security context, OpenShift doesn't assign an arbitrary UID from the range of UIDs allocated for the namespace.
This is just an example, but an image built with a non-root user with a fixed UID and GID (e.g. 1000:1000) would run in OpenShift as that user, files would be created with that user's ownership (e.g. 1000:1000), and permissions can be set on the PVC to the known UID and GID of the user that runs the service. For example, we can create a new PVC:
cat <<EOF |kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: data
namespace: k8s
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 8Gi
storageClassName: portworx-shared-sc
EOF
... then mount it in a pod:
kubectl run -i --rm --tty ansible --image=lazybit/ansible:v4.0.0 --restart=Never -n k8s --overrides='
{
"apiVersion": "v1",
"kind": "Pod",
"spec": {
"serviceAccountName": "default",
"containers": [
{
"name": "nginx",
"imagePullPolicy": "Always",
"image": "lazybit/ansible:v4.0.0",
"command": ["ash"],
"stdin": true,
"stdinOnce": true,
"tty": true,
"env": [
{
"name": "POD_NAME",
"valueFrom": {
"fieldRef": {
"apiVersion": "v1",
"fieldPath": "metadata.name"
}
}
}
],
"volumeMounts": [
{
"mountPath": "/data",
"name": "data"
}
]
}
],
"volumes": [
{
"name": "data",
"persistentVolumeClaim": {
"claimName": "data"
}
}
]
}
}'
... and create files in the PVC as the USER set in the Dockerfile.
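If the image runs as a fixed non-root UID/GID instead, one hedged way to prepare the volume is an initContainer that fixes ownership before the app starts. This is only a sketch: it assumes the init container is allowed to run as root (e.g. under anyuid), and it reuses the data PVC and /data mount path from the example above, with 1000:1000 as the example UID/GID.

      initContainers:
        - name: fix-perms
          image: busybox
          # chown the shared volume to the UID/GID the application image runs as (1000:1000 in the example above)
          command: ["sh", "-c", "chown -R 1000:1000 /data && chmod -R 770 /data"]
          securityContext:
            runAsUser: 0
          volumeMounts:
            - name: data
              mountPath: /data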

Folder deleted/not created inside the common dir mounted with emptyDir{} type on EKS Fargate pod

We are facing a strange issue with EKS Fargate pods. We want to push logs to CloudWatch with a sidecar fluent-bit container, and for that we mount the separately created /logs/boot and /logs/access folders into both containers using an emptyDir: {} volume. But somehow the access folder is getting deleted. When we tested this setup in local Docker it produced the desired results and everything worked fine, but not when deployed to EKS Fargate. Below are our manifest files.
Dockerfile
FROM anapsix/alpine-java:8u201b09_server-jre_nashorn
ARG LOG_DIR=/logs
# Install base packages
RUN apk update
RUN apk upgrade
# RUN apk add ca-certificates && update-ca-certificates
# Dynamically set the JAVA_HOME path
RUN export JAVA_HOME="$(dirname $(dirname $(readlink -f $(which java))))" && echo $JAVA_HOME
# Add Curl
RUN apk --no-cache add curl
RUN mkdir -p $LOG_DIR/boot $LOG_DIR/access
RUN chmod -R 0777 $LOG_DIR/*
# Add metadata to the image to describe which port the container is listening on at runtime.
# Change TimeZone
RUN apk add --update tzdata
ENV TZ="Asia/Kolkata"
# Clean APK cache
RUN rm -rf /var/cache/apk/*
# Setting JAVA HOME
ENV JAVA_HOME=/opt/jdk
# Copy all files and folders
COPY . .
RUN rm -rf /opt/jdk/jre/lib/security/cacerts
COPY cacerts /opt/jdk/jre/lib/security/cacerts
COPY standalone.xml /jboss-eap-6.4-integration/standalone/configuration/
# Set the working directory.
WORKDIR /jboss-eap-6.4-integration/bin
EXPOSE 8177
CMD ["./erctl"]
Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vinintegrator
namespace: eretail
labels:
app: vinintegrator
pod: fargate
spec:
selector:
matchLabels:
app: vinintegrator
pod: fargate
replicas: 2
template:
metadata:
labels:
app: vinintegrator
pod: fargate
spec:
securityContext:
fsGroup: 0
serviceAccount: eretail
containers:
- name: vinintegrator
imagePullPolicy: IfNotPresent
image: 653580443710.dkr.ecr.ap-southeast-1.amazonaws.com/vinintegrator-service:latest
resources:
limits:
memory: "7629Mi"
cpu: "1.5"
requests:
memory: "5435Mi"
cpu: "750m"
ports:
- containerPort: 8177
protocol: TCP
# securityContext:
# runAsUser: 506
# runAsGroup: 506
volumeMounts:
- mountPath: /jboss-eap-6.4-integration/bin
name: bin
- mountPath: /logs
name: logs
- name: fluent-bit
image: 657281243710.dkr.ecr.ap-southeast-1.amazonaws.com/fluent-bit:latest
imagePullPolicy: IfNotPresent
env:
- name: HOST_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
memory: 200Mi
requests:
cpu: 200m
memory: 100Mi
volumeMounts:
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
- name: logs
mountPath: /logs
readOnly: true
volumes:
- name: fluent-bit-config
configMap:
name: fluent-bit-config
- name: logs
emptyDir: {}
- name: bin
persistentVolumeClaim:
claimName: vinintegrator-pvc
Below are the /logs folder ownership and permissions. Please notice the 's' in drwxrwsrwx:
drwxrwsrwx 3 root root 4096 Oct 1 11:50 logs
Below is the content of the logs folder. Notice that the access folder is missing (it was either not created or deleted):
/logs # ls -lrt
total 4
drwxr-sr-x 2 root root 4096 Oct 1 11:50 boot
/logs #
Below is the ConfigMap for Fluent Bit:
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: eretail
labels:
k8s-app: fluent-bit
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
#INCLUDE application-log.conf
application-log.conf: |
[INPUT]
Name tail
Path /logs/boot/*.log
Tag boot
[INPUT]
Name tail
Path /logs/access/*.log
Tag access
[OUTPUT]
Name cloudwatch_logs
Match *boot*
region ap-southeast-1
log_group_name eks-fluent-bit
log_stream_prefix boot-log-
auto_create_group On
[OUTPUT]
Name cloudwatch_logs
Match *access*
region ap-southeast-1
log_group_name eks-fluent-bit
log_stream_prefix access-log-
auto_create_group On
parsers.conf: |
[PARSER]
Name docker
Format json
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%LZ
Below is the error log of the Fluent Bit container:
AWS for Fluent Bit Container Image Version 2.14.0
Fluent Bit v1.7.4
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io
[2021/10/01 06:20:33] [ info] [engine] started (pid=1)
[2021/10/01 06:20:33] [ info] [storage] version=1.1.1, initializing...
[2021/10/01 06:20:33] [ info] [storage] in-memory
[2021/10/01 06:20:33] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/10/01 06:20:33] [error] [input:tail:tail.1] read error, check permissions: /logs/access/*.log
[2021/10/01 06:20:33] [ warn] [input:tail:tail.1] error scanning path: /logs/access/*.log
[2021/10/01 06:20:38] [error] [net] connection #33 timeout after 5 seconds to: 169.254.169.254:80
[2021/10/01 06:20:38] [error] [net] socket #33 could not connect to 169.254.169.254:80
I suggest removing the following from your Dockerfile:
RUN mkdir -p $LOG_DIR/boot $LOG_DIR/access
RUN chmod -R 0777 $LOG_DIR/*
Use the following method to set up the log directories and permissions instead:
apiVersion: v1
kind: Pod # Deployment
metadata:
name: busy
labels:
app: busy
spec:
volumes:
- name: logs # Shared folder with ephemeral storage
emptyDir: {}
initContainers: # Setup your log directory here
- name: setup
image: busybox
command: ["bin/ash", "-c"]
args:
- >
mkdir -p /logs/boot /logs/access;
chmod -R 777 /logs
volumeMounts:
- name: logs
mountPath: /logs
containers:
- name: app # Run your application and logs to the directories
image: busybox
command: ["bin/ash","-c"]
args:
- >
while :; do echo "$(date): $(uname -r)" | tee -a /logs/boot/boot.log /logs/access/access.log; sleep 1; done
volumeMounts:
- name: logs
mountPath: /logs
- name: logger # Any logger that you like
image: busybox
command: ["bin/ash","-c"]
args: # tail the app logs, forward to CW etc...
- >
sleep 5;
tail -f /logs/boot/boot.log /logs/access/access.log
volumeMounts:
- name: logs
mountPath: /logs
The snippet runs on Fargate as well; run kubectl logs -f busy -c logger to see the tailing. In the real world, the "app" is your Java app and the "logger" is whatever log agent you prefer. Note that Fargate has a native logging capability using AWS Fluent Bit, so you do not need to run Fluent Bit as a sidecar.
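For reference, a minimal sketch of that native Fargate logging setup: EKS Fargate reads it from a ConfigMap named aws-logging in the aws-observability namespace. The output section below reuses the cloudwatch_logs settings from the question's ConfigMap; the log group name and stream prefix are only examples.

apiVersion: v1
kind: Namespace
metadata:
  name: aws-observability
  labels:
    aws-observability: enabled
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region ap-southeast-1
        log_group_name eks-fluent-bit
        log_stream_prefix fargate-
        auto_create_group On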

How to configure a RabbitMQ cluster in Kubernetes with a mounted persistent volume that will allow data to persist when the entire cluster restarts?

I am trying to set up a high-availability RabbitMQ cluster of nodes in my Kubernetes cluster as a StatefulSet, so that my data (e.g. queues, messages) persists even after restarting all of the nodes simultaneously. Since I'm deploying the RabbitMQ nodes in Kubernetes, I understand that I need to include an external persistent volume for the nodes to store data in so that the data will persist after a restart. I have mounted an Azure Files share into my containers as a volume at the directory /var/lib/rabbitmq/mnesia.
When starting with a fresh (empty) volume, the nodes start up without any issues and successfully form a cluster. I can open the RabbitMQ management UI and see that any queue I create is mirrored on all of the nodes, as expected, and the queue (plus any messages in it) will persist as long as there is at least 1 active node. Deleting pods with kubectl delete pod rabbitmq-0 -n rabbit will cause the node to stop and then restart, and the logs show that it successfully syncs with any remaining/active node so everything is fine.
The problem I have encountered is that when I simultaneously delete all RabbitMQ nodes in the cluster, the first node to start up will have the persisted data from the volume and tries to re-cluster with the other two nodes which are, of course, not active. What I expected to happen was that the node would start up, load the queue and message data, and then form a new cluster (since it should notice that no other nodes are active).
I suspect that there may be some data in the mounted volume that indicates the presence of other nodes which is why it tries to connect with them and join the supposed cluster, but I haven't found a way to prevent that and am not certain that this is the cause.
There are two different error messages: one in the pod description (kubectl describe pod rabbitmq-0 -n rabbit) when the RabbitMQ node is in a crash loop and another in the pod logs. The pod description error output includes the following:
exited with 137:
20:38:12.331 [error] Cookie file /var/lib/rabbitmq/.erlang.cookie must be accessible by owner only
Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local']
rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local:
* connected to epmd (port 4369) on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* epmd reports: node 'rabbit' not running at all
no other nodes on rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
* suggestion: start the node
Current node details:
* node name: 'rabbitmqcli-345-rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'
* effective user's home directory: /var/lib/rabbitmq
* Erlang cookie hash: xxxxxxxxxxxxxxxxx
and the logs output the following info:
Config file(s): /etc/rabbitmq/rabbitmq.conf
Starting broker...2020-06-12 20:39:08.678 [info] <0.294.0>
node : rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
home dir : /var/lib/rabbitmq
config file(s) : /etc/rabbitmq/rabbitmq.conf
cookie hash : xxxxxxxxxxxxxxxxx
log(s) : <stdout>
database dir : /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local
...
2020-06-12 20:48:39.015 [warning] <0.294.0> Error while waiting for Mnesia tables: {timeout_waiting_for_tables,['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local','rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local'],[rabbit_user,rabbit_user_permission,rabbit_topic_permission,rabbit_vhost,rabbit_durable_route,rabbit_durable_exchange,rabbit_runtime_parameters,rabbit_durable_queue]}
2020-06-12 20:48:39.015 [info] <0.294.0> Waiting for Mnesia tables for 30000 ms, 0 retries left
2020-06-12 20:49:09.341 [info] <0.44.0> Application mnesia exited with reason: stopped
2020-06-12 20:49:09.505 [error] <0.294.0>
2020-06-12 20:49:09.505 [error] <0.294.0> BOOT FAILED
2020-06-12 20:49:09.505 [error] <0.294.0> ===========
2020-06-12 20:49:09.505 [error] <0.294.0> Timeout contacting cluster nodes: ['rabbit@rabbitmq-2.rabbitmq-internal.rabbit.svc.cluster.local',
2020-06-12 20:49:09.505 [error] <0.294.0> 'rabbit@rabbitmq-1.rabbitmq-internal.rabbit.svc.cluster.local'].
...
BACKGROUND
==========
This cluster node was shut down while other nodes were still running.
2020-06-12 20:49:09.506 [error] <0.294.0>
2020-06-12 20:49:09.506 [error] <0.294.0> This cluster node was shut down while other nodes were still running.
2020-06-12 20:49:09.506 [error] <0.294.0> To avoid losing data, you should start the other nodes first, then
2020-06-12 20:49:09.506 [error] <0.294.0> start this one. To force this node to start, first invoke
To avoid losing data, you should start the other nodes first, then
start this one. To force this node to start, first invoke
"rabbitmqctl force_boot". If you do so, any changes made on other
cluster nodes after this one was shut down may be lost.
What I've tried so far is clearing the contents of the /var/lib/rabbitmq/mnesia/rabbit@rabbitmq-0.rabbitmq-internal.rabbit.svc.cluster.local/nodes_running_at_shutdown file, and fiddling with config settings such as the volume mount directory and Erlang cookie permissions.
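For reference, the rabbitmqctl force_boot route suggested by the boot log would look roughly like the commands below in my setup. This is only a sketch: it assumes I can exec into the pod while it is still waiting for the Mnesia tables, and the log's caveat applies (changes made on the other nodes after this one was shut down may be lost).

kubectl exec -it rabbitmq-0 -n rabbit -- rabbitmqctl force_boot   # mark the node so it boots without waiting for its former peers
kubectl delete pod rabbitmq-0 -n rabbit                           # let the StatefulSet recreate the node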
Below are the relevant deployment files and config files:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: rabbitmq
namespace: rabbit
spec:
serviceName: rabbitmq-internal
revisionHistoryLimit: 3
updateStrategy:
type: RollingUpdate
replicas: 3
selector:
matchLabels:
app: rabbitmq
template:
metadata:
name: rabbitmq
labels:
app: rabbitmq
spec:
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
containers:
- name: rabbitmq
image: rabbitmq:0.13
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- >
until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done;
rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
ports:
- containerPort: 4369
- containerPort: 5672
- containerPort: 5671
- containerPort: 25672
- containerPort: 15672
resources:
requests:
memory: "500Mi"
cpu: "0.4"
limits:
memory: "600Mi"
cpu: "0.6"
livenessProbe:
exec:
# Stage 2 check:
command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
initialDelaySeconds: 60
periodSeconds: 60
timeoutSeconds: 15
readinessProbe:
exec:
# Stage 2 check:
command: ["rabbitmq-diagnostics", "status", "--erlang-cookie", "$(RABBITMQ_ERLANG_COOKIE)"]
initialDelaySeconds: 20
periodSeconds: 60
timeoutSeconds: 10
envFrom:
- configMapRef:
name: rabbitmq-cfg
env:
- name: HOSTNAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: RABBITMQ_USE_LONGNAME
value: "true"
- name: RABBITMQ_NODENAME
value: "rabbit#$(HOSTNAME).rabbitmq-internal.$(NAMESPACE).svc.cluster.local"
- name: K8S_SERVICE_NAME
value: "rabbitmq-internal"
- name: RABBITMQ_DEFAULT_USER
value: user
- name: RABBITMQ_DEFAULT_PASS
value: pass
- name: RABBITMQ_ERLANG_COOKIE
value: my-cookie
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumeMounts:
- name: my-volume-mount
mountPath: "/var/lib/rabbitmq/mnesia"
imagePullSecrets:
- name: my-secret
volumes:
- name: my-volume-mount
azureFile:
secretName: azure-rabbitmq-secret
shareName: my-fileshare-name
readOnly: false
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rabbitmq-cfg
namespace: rabbit
data:
RABBITMQ_VM_MEMORY_HIGH_WATERMARK: "0.6"
---
kind: Service
apiVersion: v1
metadata:
namespace: rabbit
name: rabbitmq-internal
labels:
app: rabbitmq
spec:
clusterIP: None
ports:
- name: http
protocol: TCP
port: 15672
- name: amqp
protocol: TCP
port: 5672
- name: amqps
protocol: TCP
port: 5671
selector:
app: rabbitmq
---
kind: Service
apiVersion: v1
metadata:
namespace: rabbit
name: rabbitmq
labels:
app: rabbitmq
type: LoadBalancer
spec:
selector:
app: rabbitmq
ports:
- name: http
protocol: TCP
port: 15672
targetPort: 15672
- name: amqp
protocol: TCP
port: 5672
targetPort: 5672
- name: amqps
protocol: TCP
port: 5671
targetPort: 5671
Dockerfile:
FROM rabbitmq:3.8.4
COPY conf/rabbitmq.conf /etc/rabbitmq
COPY conf/enabled_plugins /etc/rabbitmq
USER root
COPY conf/.erlang.cookie /var/lib/rabbitmq
RUN /bin/bash -c 'ls -ld /var/lib/rabbitmq/.erlang.cookie; chmod 600 /var/lib/rabbitmq/.erlang.cookie; ls -ld /var/lib/rabbitmq/.erlang.cookie'
rabbitmq.conf
## cluster formation settings
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_k8s
cluster_formation.k8s.host = kubernetes.default.svc.cluster.local
cluster_formation.k8s.address_type = hostname
cluster_formation.k8s.service_name = rabbitmq-internal
cluster_formation.k8s.hostname_suffix = .rabbitmq-internal.rabbit.svc.cluster.local
cluster_formation.node_cleanup.interval = 60
cluster_formation.node_cleanup.only_log_warning = true
cluster_partition_handling = autoheal
queue_master_locator=min-masters
## general settings
log.file.level = debug
## Mgmt UI secure/non-secure connection settings (secure not implemented yet)
management.tcp.port = 15672
## RabbitMQ entrypoint settings (will be injected below when image is built)
Thanks in advance!

Liveness probe gets timed out

I'm using Jenkins and Kubernetes to perform these actions.
Since my load balancer needs a healthy pod, I had to add a livenessProbe to my pod.
My configuration for the pod:
apiVersion: v1
kind: Pod
metadata:
labels:
component: ci
spec:
# Use service account that can deploy to all namespaces
serviceAccountName: default
# Use the persistent volume
containers:
- name: gcloud
image: gcr.io/cloud-builders/gcloud
command:
- cat
tty: true
- name: kubectl
image: gcr.io/cloud-builders/kubectl
command:
- cat
tty: true
- name: liveness
image: k8s.gcr.io/busybox
args:
- /bin/sh
- -c
- touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
The issue is that when I want to deploy the code (CD via Jenkins), it reaches the
touch /tmp/healthy;
command and times out.
The error response I get looks like this:
java.io.IOException: Failed to execute shell script inside container [kubectl] of pod [wobbl-mobile-label-qcd6x-13mtj]. Timed out waiting for the container to become ready!
When I type kubectl get events
I get the following response:
Liveness probe failed: cat: can't open '/tmp/healthy': No such file or directory
Any hints on how to solve this?
I have read this documentation for the liveness probe and took the config from there.
As can be seen from the link you are referring to, the example is meant to show how the liveness probe works. In the example below, taken from that link, they purposely remove the /tmp/healthy file after 30 seconds:
apiVersion: v1
kind: Pod
metadata:
labels:
test: liveness
name: liveness-exec
spec:
containers:
- name: liveness
image: k8s.gcr.io/busybox
args:
- /bin/sh
- -c
- touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
livenessProbe:
exec:
command:
- cat
- /tmp/healthy
initialDelaySeconds: 5
periodSeconds: 5
What this does is create the /tmp/healthy file when the container starts. After 5 seconds the liveness probe kicks in and checks for the /tmp/healthy file; at this moment the container does have /tmp/healthy present. After 30 seconds the command deletes the file, the liveness probe no longer finds /tmp/healthy, and the container is restarted. This keeps repeating, so the liveness probe fails the health check roughly every 30 seconds.
If you only add
touch /tmp/healthy
(without removing the file afterwards), the liveness probe should work well.
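For example, a minimal sketch of the adjusted container spec: only the args line changes compared to the pod in the question, with the rm dropped so the file stays in place.

  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5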
