I am running a 3-node RabbitMQ cluster on Kubernetes. The Kubernetes cluster runs on AWS spot instances, and one of the Kubernetes nodes was terminated unexpectedly while one of the RabbitMQ pods was running on it. The pod then got scheduled onto another node, and since then it has been stuck in the pod initialization state.
The Kubernetes event says "FailedPostStartHook".
Logs:
9m46s Warning FailedPostStartHook pod/rabbitmq-0 Exec lifecycle hook ([/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_devops(c96c1a6e-bf9a-450d-828d-ed0e8a0ad949)" failed - error: command '/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local']
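For reference, the first two causes in that list can be checked directly from the stuck pod while the postStart hook is still looping; a minimal sketch, using the names from the manifest below:
# Does the pod's own long node name resolve inside the cluster?
kubectl -n devops exec rabbitmq-0 -c rabbitmq -- nslookup rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
# Is the broker process up, and does the CLI cookie match the server's?
kubectl -n devops exec rabbitmq-0 -c rabbitmq -- sh -c 'rabbitmqctl --erlang-cookie "$RABBITMQ_ERLANG_COOKIE" status'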
Kubernetes statefulset manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: rabbitmq
namespace: devops
spec:
podManagementPolicy: OrderedReady
replicas: 3
revisionHistoryLimit: 3
selector:
matchLabels:
app: rabbitmq
serviceName: rabbitmq-service
template:
metadata:
annotations:
labels:
app: rabbitmq
name: rabbitmq
spec:
containers:
- env:
- name: HOSTNAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
- name: NAMESPACE
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.namespace
- name: RABBITMQ_USE_LONGNAME
value: "true"
- name: RABBITMQ_BASIC_AUTH
valueFrom:
secretKeyRef:
key: password
name: rabbitmq
- name: RABBITMQ_NODENAME
value: rabbit@$(HOSTNAME).rabbitmq-service.$(NAMESPACE).svc.cluster.local
- name: K8S_SERVICE_NAME
value: rabbitmq-service
- name: RABBITMQ_DEFAULT_USER
value: admin
- name: RABBITMQ_DEFAULT_PASS
valueFrom:
secretKeyRef:
key: password
name: rabbitmq
- name: RABBITMQ_ERLANG_COOKIE
value: some-cookie
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: metadata.name
image: rabbitmq:3.8.1-management-alpine
imagePullPolicy: IfNotPresent
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- |
until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
livenessProbe:
exec:
command:
- rabbitmqctl
- status
failureThreshold: 3
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
name: rabbitmq
ports:
- containerPort: 4369
protocol: TCP
- containerPort: 5672
protocol: TCP
- containerPort: 5671
protocol: TCP
- containerPort: 25672
protocol: TCP
- containerPort: 15672
protocol: TCP
readinessProbe:
exec:
command:
- rabbitmqctl
- status
failureThreshold: 3
initialDelaySeconds: 20
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
resources:
limits:
cpu: "2"
memory: 3Gi
requests:
cpu: "1"
memory: 2Gi
volumeMounts:
- mountPath: /var/lib/rabbitmq/
name: rabbitmq-data
- mountPath: /etc/rabbitmq
name: config
dnsPolicy: ClusterFirst
initContainers:
- command:
- /bin/bash
- -euc
- |
rm -f /var/lib/rabbitmq/.erlang.cookie
cp /rabbitmqconfig/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
cp /rabbitmqconfig/enabled_plugins /etc/rabbitmq/enabled_plugins
image: rabbitmq:3.8.1-management-alpine
imagePullPolicy: Always
name: copy-rabbitmq-config
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /rabbitmqconfig
name: rabbitmq-configmap
- mountPath: /etc/rabbitmq
name: config
- mountPath: /var/lib/rabbitmq
name: rabbitmq-data
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: rabbitmq
serviceAccountName: rabbitmq
terminationGracePeriodSeconds: 10
volumes:
- configMap:
defaultMode: 420
items:
- key: rabbitmq.conf
path: rabbitmq.conf
- key: enabled_plugins
path: enabled_plugins
name: rabbitmq-configmap
name: rabbitmq-configmap
- emptyDir: {}
name: config
updateStrategy:
type: RollingUpdate
volumeClaimTemplates:
- apiVersion: v1
kind: PersistentVolumeClaim
metadata:
creationTimestamp: null
name: rabbitmq-data
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi
storageClassName: gp2
volumeMode: Filesystem
Things I have tried:
Logged into the stuck pod and executed (this command just hung without any response):
rabbitmqctl stop_app
Tried deleting the pod forcefully but no luck.
Logged into the stuck pod and executed:
rabbitmqctl reset
Logged into the stuck pod and executed:
rabbitmqctl force_boot
Logged into the stuck pod and executed:
rm /var/log/rabbitmq/*
None of the above things helped.
Please note that the other two RabbitMQ nodes are running fine, serving traffic, and showing the failed node as up:
rabbitmq-2 rabbitmq 2021-07-04 12:19:07.233 [info] <0.490.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
rabbitmq-1 rabbitmq 2021-07-04 12:19:07.208 [info] <0.494.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
Running a rollout restart of the StatefulSet worked for me:
kubectl rollout restart statefulset rabbitmq -n devops
After this command the RabbitMQ cluster is up and running, and all three nodes joined the cluster without any issue.
Once this is done, the applications that connect to this RabbitMQ cluster need to be restarted.
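To confirm that the cluster actually healed after the restart, the rollout can be watched and the membership checked from any pod, for example (a minimal sketch using the names from the manifest above):
kubectl -n devops rollout status statefulset rabbitmq
# All three nodes should be listed as running cluster members
kubectl -n devops exec rabbitmq-0 -c rabbitmq -- rabbitmqctl cluster_status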
Related
I have deployed a service on Knative. I iterated on the service code/Docker image, and I am trying to redeploy it at the same address. I proceeded as follows:
Pushed the new Docker image on our private Docker repo
Updated the service YAML file to point to the new Docker image (see YAML below)
Delete the service with the command: kubectl -n myspacename delete -f myservicename.yaml
Recreate the service with the command: kubectl -n myspacename apply -f myservicename.yaml
During the deployment, the service shows READY = Unknown and REASON = RevisionMissing, and after a while, READY = False and REASON = ProgressDeadlineExceeded. When looking at the logs of the pod with the following command kubectl -n myspacename logs revision.serving.knative.dev/myservicename-00001, I get the message:
no kind "Revision" is registered for version "serving.knative.dev/v1" in scheme "pkg/scheme/scheme.go:28"
Here is the YAML file of the service:
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
name: myservicename
namespace: myspacename
spec:
template:
metadata:
annotations:
autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
autoscaling.knative.dev/metric: concurrency
autoscaling.knative.dev/target: '1'
autoscaling.knative.dev/minScale: '0'
autoscaling.knative.dev/maxScale: '5'
autoscaling.knative.dev/scaleDownDelay: 60s
autoscaling.knative.dev/window: 600s
spec:
tolerations:
- key: nvidia.com/gpu
operator: Exists
effect: NoSchedule
volumes:
- name: nfs-volume
persistentVolumeClaim:
claimName: myspacename-models-pvc
imagePullSecrets:
- name: myrobotaccount-pull-secret
containers:
- name: myservicename
image: quay.company.com/project/myservicename:0.4.0
ports:
- containerPort: 5000
name: user-port
protocol: TCP
resources:
limits:
cpu: "4"
memory: 36Gi
nvidia.com/gpu: 1
requests:
cpu: "2"
memory: 32Gi
volumeMounts:
- name: nfs-volume
mountPath: /tmp/static/
securityContext:
privileged: true
env:
- name: CLOUD_STORAGE_PASSWORD
valueFrom:
secretKeyRef:
name: myservicename-cloud-storage-password
key: key
envFrom:
- configMapRef:
name: myservicename-config
The procedure I followed above is correct; the problem was a bug in the code of the Docker image that Knative was serving. I was able to troubleshoot the issue by looking at the logs of the pods as follows:
First run the following command to get the pod name: kubectl -n myspacename get pods. Example of pod name = myservicename-00001-deployment-56595b764f-dl7x6
Then get the logs of the pod with the following command: kubectl -n myspacename logs myservicename-00001-deployment-56595b764f-dl7x6
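If the generated pod name is not known up front, the pods can also be found via the labels Knative puts on them; a minimal sketch (the label key is standard Knative Serving, and the container name matches the one declared in the YAML above):
kubectl -n myspacename get pods -l serving.knative.dev/service=myservicename
kubectl -n myspacename logs -l serving.knative.dev/service=myservicename -c myservicename --tail=100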
I am using the Cassandra image with the following StatefulSet:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cassandra
labels:
app: cassandra
spec:
serviceName: cassandra
replicas: 3
selector:
matchLabels:
app: cassandra
template:
metadata:
labels:
app: cassandra
spec:
terminationGracePeriodSeconds: 1800
containers:
- name: cassandra
image: gcr.io/google-samples/cassandra:v13
imagePullPolicy: Always
ports:
- containerPort: 7000
name: intra-node
- containerPort: 7001
name: tls-intra-node
- containerPort: 7199
name: jmx
- containerPort: 9042
name: cql
resources:
limits:
cpu: "500m"
memory: 1Gi
requests:
cpu: "500m"
memory: 1Gi
securityContext:
capabilities:
add:
- IPC_LOCK
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- nodetool drain
env:
- name: MAX_HEAP_SIZE
value: 512M
- name: HEAP_NEWSIZE
value: 100M
- name: CASSANDRA_SEEDS
value: "cassandra-0.cassandra.default.svc.cluster.local"
- name: CASSANDRA_CLUSTER_NAME
value: "K8Demo"
- name: CASSANDRA_DC
value: "DC1-K8Demo"
- name: CASSANDRA_RACK
value: "Rack1-K8Demo"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
readinessProbe:
exec:
command:
- /bin/bash
- -c
- /ready-probe.sh
initialDelaySeconds: 15
timeoutSeconds: 5
# These volume mounts are persistent. They are like inline claims,
# but not exactly because the names need to match exactly one of
# the stateful pod volumes.
volumeMounts:
- name: cassandra-data
mountPath: /cassandra_data
# These are converted to volume claims by the controller
# and mounted at the paths mentioned above.
# do not use these in production until ssd GCEPersistentDisk or other ssd pd
volumeClaimTemplates:
- metadata:
name: cassandra-data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast
resources:
requests:
storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
name: fast
provisioner: k8s.io/minikube-hostpath
parameters:
type: pd-ssd
Now I need to add the line below to cassandra-env.sh, either via postStart or in the Cassandra YAML file:
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/cassandra-exporter-agent-<version>.jar"
I was able to achieve this, but after this step Cassandra requires a restart, and since it is already running as a pod I don't know how to restart the process. So is there any way to do this step before the pod starts, rather than after it is up?
I was given the following suggestion:
This won't work. Commands that run in postStart don't impact the running container. You need to change the startup commands passed to Cassandra.
The only way that I know to do this is to create a new container image in the artifactory, based on the existing image, and pull from there.
But I don't know how to achieve this.
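For what it's worth, a minimal sketch of that suggestion (building a derived image that bakes the javaagent line into cassandra-env.sh) might look like the following; the base image tag, the jar location, and the cassandra-env.sh path inside the image are assumptions that need to be checked against the image actually in use:
# Write a Dockerfile that extends the existing Cassandra image and appends the agent line.
# Assumes cassandra-exporter-agent.jar has been downloaded into the build context.
cat > Dockerfile <<'EOF'
FROM gcr.io/google-samples/cassandra:v13
COPY cassandra-exporter-agent.jar /usr/local/apache-cassandra/lib/cassandra-exporter-agent.jar
RUN echo 'JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/cassandra-exporter-agent.jar"' >> /etc/cassandra/cassandra-env.sh
EOF
# Build and push to your own registry, then point the StatefulSet's image: field at the new tag.
docker build -t <your-registry>/cassandra-with-exporter:v13 .
docker push <your-registry>/cassandra-with-exporter:v13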
My application uses the apache2 web server. Due to restrictions in the Kubernetes cluster, I do not have root privileges inside the pod, so I have changed the default apache2 port from 80 to 8080 to be able to run as a non-root user.
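For reference, that port change is typically done at image build time by editing the apache2 config; a minimal sketch of what such a step might look like on a stock Debian/Ubuntu apache2 layout (the paths are assumptions):
# Move apache2 off the privileged port 80 so it can bind as a non-root user
sed -i 's/^Listen 80$/Listen 8080/' /etc/apache2/ports.conf
sed -i 's/<VirtualHost \*:80>/<VirtualHost *:8080>/' /etc/apache2/sites-available/000-default.conf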
My problem is that once I build the Docker image and run it locally it runs fine, but when I deploy it to the cluster using Kubernetes it keeps failing with:
Action '-D FOREGROUND' failed.
resulting in CrashLoopBackOff.
So, basically the apache2 server is not able to run in the pod as a non-root user, but it runs fine locally with docker run.
Any help is appreciated.
I am attaching my deployment and service files for reference:
apiVersion: apps/v1
kind: Deployment
metadata:
name: &DeploymentName app
spec:
replicas: 1
selector:
matchLabels: &appName
app: *DeploymentName
template:
metadata:
name: main
labels:
<<: *appName
spec:
securityContext:
fsGroup: 2000
runAsUser: 1000
runAsGroup: 3000
volumes:
- name: var-lock
emptyDir: {}
containers:
- name: *DeploymentName
image: image:id
ports:
- containerPort: 8080
volumeMounts:
- mountPath: /etc/apache2/conf-available
name: var-lock
- mountPath: /var/lock/apache2
name: var-lock
- mountPath: /var/log/apache2
name: var-lock
- mountPath: /mnt/log/apache2
name: var-lock
readinessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 180
periodSeconds: 60
livenessProbe:
tcpSocket:
port: 8080
initialDelaySeconds: 300
periodSeconds: 180
imagePullPolicy: Always
tty: true
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
envFrom:
- configMapRef:
name: *DeploymentName
resources:
limits:
cpu: 1
memory: 2Gi
requests:
cpu: 1
memory: 2Gi
---
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
name: &hpaName app
spec:
maxReplicas: 1
minReplicas: 1
scaleTargetRef:
apiVersion: extensions/v1beta1
kind: Deployment
name: *hpaName
targetCPUUtilizationPercentage: 60
---
apiVersion: v1
kind: Service
metadata:
labels:
app: app
name: app
spec:
selector:
app: app
ports:
- protocol: TCP
name: http-web-port
port: 80
targetPort: 8080
- protocol: TCP
name: https-web-port
port: 443
targetPort: 443
CrashLoopBackOff is a common error in Kubernetes, indicating a pod constantly crashing in an endless loop.
The CrashLoopBackOff error can be caused by a variety of issues, including:
Insufficient resources - a lack of resources prevents the container from loading
Locked file - a file was already locked by another container
Locked database - the database is being used and locked by other pods
Failed reference - a reference to scripts or binaries that are not present in the container
Setup error - an issue with the init-container setup in Kubernetes
Config loading error - a server cannot load the configuration file
Misconfigurations - a general file system misconfiguration
Connection issues - DNS or kube-dns is not able to connect to a third-party service
Deploying failed services - an attempt to deploy services/applications that have already failed (e.g. due to a lack of access to other services)
To fix the Kubernetes CrashLoopBackOff error, refer to this link and also check out the Stack Overflow post for more information.
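As a starting point for narrowing down which of the causes above applies, the pod events and the logs of the previously crashed container instance are usually the most useful, for example (namespace and pod name are placeholders):
# Events plus the last termination reason (exit code, OOMKilled, failed probe, ...)
kubectl -n <namespace> describe pod <pod-name>
# Logs from the container instance that just crashed, not the one currently restarting
kubectl -n <namespace> logs <pod-name> --previous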
To create a RabbitMQ cluster within my Kubernetes cluster, I use the Kubernetes object definitions from the following source: https://wesmorgan.svbtle.com/rabbitmq-cluster-on-kubernetes-with-statefulsets
which are:
---
apiVersion: v1
kind: Service
metadata:
# Expose the management HTTP port on each node
name: rabbitmq-management
labels:
app: rabbitmq
spec:
ports:
- port: 15672
name: http
selector:
app: rabbitmq
type: NodePort # Or LoadBalancer in production w/ proper security
---
apiVersion: v1
kind: Service
metadata:
# The required headless service for StatefulSets
name: rabbitmq
labels:
app: rabbitmq
spec:
ports:
- port: 5672
name: amqp
- port: 4369
name: epmd
- port: 25672
name: rabbitmq-dist
clusterIP: None
selector:
app: rabbitmq
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
name: rabbitmq
spec:
serviceName: "rabbitmq"
replicas: 5
template:
metadata:
labels:
app: rabbitmq
spec:
containers:
- name: rabbitmq
image: rabbitmq:3.6.6-management-alpine
lifecycle:
postStart:
exec:
command:
- /bin/sh
- -c
- >
if [ -z "$(grep rabbitmq /etc/resolv.conf)" ]; then
sed "s/^search \([^ ]\+\)/search rabbitmq.\1 \1/" /etc/resolv.conf > /etc/resolv.conf.new;
cat /etc/resolv.conf.new > /etc/resolv.conf;
rm /etc/resolv.conf.new;
fi;
until rabbitmqctl node_health_check; do sleep 1; done;
if [[ "$HOSTNAME" != "rabbitmq-0" && -z "$(rabbitmqctl cluster_status | grep rabbitmq-0)" ]]; then
rabbitmqctl stop_app;
rabbitmqctl join_cluster rabbit@rabbitmq-0;
rabbitmqctl start_app;
fi;
rabbitmqctl set_policy ha-all "." '{"ha-mode":"exactly","ha-params":3,"ha-sync-mode":"automatic"}'
env:
- name: RABBITMQ_ERLANG_COOKIE
valueFrom:
secretKeyRef:
name: rabbitmq-config
key: erlang-cookie
ports:
- containerPort: 5672
name: amqp
volumeMounts:
- name: rabbitmq
mountPath: /var/lib/rabbitmq
volumeClaimTemplates:
- metadata:
name: rabbitmq
annotations:
volume.alpha.kubernetes.io/storage-class: anything
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi # make this bigger in production
With this configuration, the pods cannot find each other on the network, and I get the following error:
unable to connect to epmd (port 4369) on rabbitmq-1: nxdomain (non-existing domain)
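For reference, whether those names resolve at all can be checked from one of the pods that did start; a minimal sketch (the namespace is a placeholder, service and pod names match the manifests above):
# The fully qualified per-pod record created by the headless "rabbitmq" service:
kubectl -n <namespace> exec rabbitmq-0 -- nslookup rabbitmq-1.rabbitmq.<namespace>.svc.cluster.local
# The short name used by join_cluster only resolves if the resolv.conf search-domain
# tweak in the postStart hook actually took effect:
kubectl -n <namespace> exec rabbitmq-0 -- nslookup rabbitmq-1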
I am running my kubernetes cluster on AWS EKS which runs kubernetes 1.10.
I am following this guide to deploy Elasticsearch in my cluster:
elasticsearch Kubernetes
The first time I deployed it, everything worked fine. Now, when I redeploy, it gives me the following error.
ERROR: [2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2018-08-24T18:07:28,448][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopping ...
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopped
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closing ...
[2018-08-24T18:07:28,555][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closed
Here is my deployment file.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
name: es-master
labels:
component: elasticsearch
role: master
spec:
replicas: 3
template:
metadata:
labels:
component: elasticsearch
role: master
spec:
initContainers:
- name: init-sysctl
image: busybox:1.27.2
command:
- sysctl
- -w
- vm.max_map_count=262144
securityContext:
privileged: true
containers:
- name: es-master
image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
env:
- name: NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: CLUSTER_NAME
value: myesdb
- name: NUMBER_OF_MASTERS
value: "2"
- name: NODE_MASTER
value: "true"
- name: NODE_INGEST
value: "false"
- name: NODE_DATA
value: "false"
- name: HTTP_ENABLE
value: "false"
- name: ES_JAVA_OPTS
value: -Xms512m -Xmx512m
- name: NETWORK_HOST
value: "0.0.0.0"
- name: PROCESSORS
valueFrom:
resourceFieldRef:
resource: limits.cpu
resources:
requests:
cpu: 0.25
limits:
cpu: 1
ports:
- containerPort: 9300
name: transport
livenessProbe:
tcpSocket:
port: transport
initialDelaySeconds: 20
periodSeconds: 10
volumeMounts:
- name: storage
mountPath: /data
volumes:
- emptyDir:
medium: ""
name: "storage"
I have seen a lot of posts talking about increasing the value but I am not sure how to do it. Any help would be appreciated.
Just want to append to this issue:
If you create the EKS cluster with eksctl, then you can append the following to the NodeGroup creation YAML:
preBootstrapCommands:
- "sed -i -e 's/1024:4096/65536:65536/g' /etc/sysconfig/docker"
- "systemctl restart docker"
This solves the problem for a newly created cluster by fixing the Docker daemon config.
Update the default-ulimits parameter in the file '/etc/docker/daemon.json':
"default-ulimits": {
"nofile": {
"Name": "nofile",
"Soft": 65536,
"Hard": 65536
}
}
and restart the Docker daemon.
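After the daemon restart, the new file-descriptor limit can be verified from inside any container scheduled on that node, for example (the pod name is a placeholder):
# Should print 65536 once the new default ulimits are in effect
kubectl exec <es-pod-name> -- sh -c 'ulimit -n'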
This is the only thing that worked for me when setting up an EFK stack on EKS. Add this to your nodegroup creation YAML file under nodeGroups:, then create your nodegroup and schedule your ES pods on it.
preBootstrapCommands:
- "sysctl -w vm.max_map_count=262144"
- "systemctl restart docker"