I am running my Kubernetes cluster on AWS EKS, which runs Kubernetes 1.10.
I am following this guide to deploy Elasticsearch in my cluster:
elasticsearch Kubernetes
The first time I deployed it, everything worked fine. Now, when I redeploy, it gives me the following error.
ERROR: [2] bootstrap checks failed
[1]: max file descriptors [4096] for elasticsearch process is too low, increase to at least [65536]
[2018-08-24T18:07:28,448][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopping ...
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] stopped
[2018-08-24T18:07:28,534][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closing ...
[2018-08-24T18:07:28,555][INFO ][o.e.n.Node ] [es-master-6987757898-5pzz9] closed
Here is my deployment file.
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: es-master
  labels:
    component: elasticsearch
    role: master
spec:
  replicas: 3
  template:
    metadata:
      labels:
        component: elasticsearch
        role: master
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        securityContext:
          privileged: true
      containers:
      - name: es-master
        image: quay.io/pires/docker-elasticsearch-kubernetes:6.3.2
        env:
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: CLUSTER_NAME
          value: myesdb
        - name: NUMBER_OF_MASTERS
          value: "2"
        - name: NODE_MASTER
          value: "true"
        - name: NODE_INGEST
          value: "false"
        - name: NODE_DATA
          value: "false"
        - name: HTTP_ENABLE
          value: "false"
        - name: ES_JAVA_OPTS
          value: -Xms512m -Xmx512m
        - name: NETWORK_HOST
          value: "0.0.0.0"
        - name: PROCESSORS
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
        resources:
          requests:
            cpu: 0.25
          limits:
            cpu: 1
        ports:
        - containerPort: 9300
          name: transport
        livenessProbe:
          tcpSocket:
            port: transport
          initialDelaySeconds: 20
          periodSeconds: 10
        volumeMounts:
        - name: storage
          mountPath: /data
      volumes:
      - emptyDir:
          medium: ""
        name: "storage"
I have seen many posts about increasing this value, but I am not sure how to do it. Any help would be appreciated.
Just want to add to this issue:
If you create the EKS cluster with eksctl, you can add the following to the node group creation YAML:
preBootstrapCommands:
  - "sed -i -e 's/1024:4096/65536:65536/g' /etc/sysconfig/docker"
  - "systemctl restart docker"
This solves the problem for a newly created cluster by fixing the Docker daemon config.
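For context, here is a minimal sketch of where those commands might sit in an eksctl config file. The cluster name, region, node group name, instance type and capacity are placeholders rather than values from the question; only the two commands come from the answer above. The sketch uses the `preBootstrapCommands` key (plural) from the eksctl config schema.

```yaml
# Sketch of an eksctl ClusterConfig; every name and size below is a placeholder.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster          # placeholder
  region: us-east-1         # placeholder
nodeGroups:
  - name: es-nodes          # placeholder
    instanceType: m5.large  # placeholder
    desiredCapacity: 3      # placeholder
    preBootstrapCommands:
      # Rewrite the nofile limits in the Docker sysconfig, then restart Docker
      # so that containers started on the node pick up the new limits.
      - "sed -i -e 's/1024:4096/65536:65536/g' /etc/sysconfig/docker"
      - "systemctl restart docker"
```

Assuming the file is saved as cluster.yaml, it would be applied with `eksctl create cluster -f cluster.yaml`.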
Update the default-ulimits parameter in the file '/etc/docker/daemon.json':
"default-ulimits": {
  "nofile": {
    "Name": "nofile",
    "Soft": 65536,
    "Hard": 65536
  }
}
and restart the Docker daemon.
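To check whether the new default actually reaches containers, a throwaway pod that prints its open-file limit can help. This is only a sketch: the pod name is made up, and the busybox image is simply the one the question already uses for its init container.

```yaml
# Hypothetical one-shot pod that prints the container's open-file limit.
apiVersion: v1
kind: Pod
metadata:
  name: ulimit-check        # placeholder name
spec:
  restartPolicy: Never      # run once, do not restart
  containers:
  - name: ulimit-check
    image: busybox:1.27.2
    # `ulimit -n` prints the soft limit on open file descriptors.
    command: ["sh", "-c", "ulimit -n"]
```

After the pod completes, `kubectl logs ulimit-check` shows the limit that containers scheduled on that node receive. Elasticsearch expects at least 65536.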
This is the only thing that worked for me when setting up an EFK stack on EKS. Add this to your nodegroup creation YAML file under nodeGroups, then create your nodegroup and apply your ES pods on it.
preBootstrapCommands:
  - "sysctl -w vm.max_map_count=262144"
  - "systemctl restart docker"
I have deployed a service on Knative. I iterated on the service code/Docker image and am trying to redeploy it at the same address. I proceeded as follows:
Pushed the new Docker image to our private Docker repo
Updated the service YAML file to point to the new Docker image (see YAML below)
Deleted the service with the command: kubectl -n myspacename delete -f myservicename.yaml
Recreated the service with the command: kubectl -n myspacename apply -f myservicename.yaml
During the deployment, the service shows READY = Unknown and REASON = RevisionMissing, and after a while, READY = False and REASON = ProgressDeadlineExceeded. When looking at the logs of the pod with the following command kubectl -n myspacename logs revision.serving.knative.dev/myservicename-00001, I get the message:
no kind "Revision" is registered for version "serving.knative.dev/v1" in scheme "pkg/scheme/scheme.go:28"
Here is the YAML file of the service:
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: myservicename
  namespace: myspacename
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: '1'
        autoscaling.knative.dev/minScale: '0'
        autoscaling.knative.dev/maxScale: '5'
        autoscaling.knative.dev/scaleDownDelay: 60s
        autoscaling.knative.dev/window: 600s
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      volumes:
      - name: nfs-volume
        persistentVolumeClaim:
          claimName: myspacename-models-pvc
      imagePullSecrets:
      - name: myrobotaccount-pull-secret
      containers:
      - name: myservicename
        image: quay.company.com/project/myservicename:0.4.0
        ports:
        - containerPort: 5000
          name: user-port
          protocol: TCP
        resources:
          limits:
            cpu: "4"
            memory: 36Gi
            nvidia.com/gpu: 1
          requests:
            cpu: "2"
            memory: 32Gi
        volumeMounts:
        - name: nfs-volume
          mountPath: /tmp/static/
        securityContext:
          privileged: true
        env:
        - name: CLOUD_STORAGE_PASSWORD
          valueFrom:
            secretKeyRef:
              name: myservicename-cloud-storage-password
              key: key
        envFrom:
        - configMapRef:
            name: myservicename-config
The procedure I followed above is correct; the problem was caused by a bug in the code of the Docker image that Knative is serving. I was able to troubleshoot the issue by looking at the logs of the pods as follows:
First run the following command to get the pod name: kubectl -n myspacename get pods. Example of pod name = myservicename-00001-deployment-56595b764f-dl7x6
Then get the logs of the pod with the following command: kubectl -n myspacename logs myservicename-00001-deployment-56595b764f-dl7x6
I have to run a DevOps agent inside a Docker container in order to run my DevOps pipeline tasks.
As you can see, after the pipeline is initialized, my agent has to build and publish an image.
This container should also run inside Rancher as a pod.
On my PC I figured out that I have to use
docker run -v /var/run/docker.sock:/var/run/docker.sock
to get it working, but I don't know how to configure this in Rancher.
Here is the actual YAML configuration of this pod, where '*****' marks sensitive data:
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      workload.user.cattle.io/workloadselector: apps.deployment-**************
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        cattle.io/timestamp: "2022-10-25T11:22:39Z"
      creationTimestamp: null
      labels:
        workload.user.cattle.io/workloadselector: apps.deployment-**************
    spec:
      affinity: {}
      containers:
      - env:
        - name: AZP_URL
          value: ***********************
        - name: AZP_TOKEN
          valueFrom:
            secretKeyRef:
              key: AZP_TOKEN
              name: pat
              optional: false
        - name: AZP_AGENT_NAME
          value: ********************
        - name: AZP_POOL
          value: *******************
        image: ******************************************
        imagePullPolicy: Always
        name: *********************
        resources:
          limits:
            cpu: "3"
            memory: 6Gi
          requests:
            cpu: 500m
            memory: 512Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/run/docker.sock
          name: dockersock
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: azure-registry
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
      - hostPath:
          path: /var/run/docker.sock
          type: ""
        name: dockersock
Here is the error message I was receiving in the pipeline log:
##[error]Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
##[error]The process '/usr/bin/docker' failed with exit code 1
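One thing that might help narrow this down: the volume in the manifest uses hostPath with `type: ""`, which means Kubernetes performs no check on the host path before mounting it. The fragment below is a sketch of the same volume with `type: Socket`, which requires a UNIX socket to exist at that path on the node. It is not a fix in itself. If the node has no Docker socket at /var/run/docker.sock (whether that is the case here is only a guess), the pod would fail to start with a mount error instead of starting and failing later inside the pipeline.

```yaml
# Fragment of the pod spec; only `type` differs from the manifest above.
volumes:
- name: dockersock
  hostPath:
    path: /var/run/docker.sock
    # Socket: a UNIX socket must already exist at this path on the node,
    # otherwise the volume cannot be mounted and the pod does not start.
    type: Socket
```

The pod events (`kubectl describe pod <pod-name>`) would then show whether the socket was found on the node where the pod was scheduled.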
I am using the Cassandra image with the following manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  labels:
    app: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      terminationGracePeriodSeconds: 1800
      containers:
      - name: cassandra
        image: gcr.io/google-samples/cassandra:v13
        imagePullPolicy: Always
        ports:
        - containerPort: 7000
          name: intra-node
        - containerPort: 7001
          name: tls-intra-node
        - containerPort: 7199
          name: jmx
        - containerPort: 9042
          name: cql
        resources:
          limits:
            cpu: "500m"
            memory: 1Gi
          requests:
            cpu: "500m"
            memory: 1Gi
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - nodetool drain
        env:
        - name: MAX_HEAP_SIZE
          value: 512M
        - name: HEAP_NEWSIZE
          value: 100M
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.default.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "K8Demo"
        - name: CASSANDRA_DC
          value: "DC1-K8Demo"
        - name: CASSANDRA_RACK
          value: "Rack1-K8Demo"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - /ready-probe.sh
          initialDelaySeconds: 15
          timeoutSeconds: 5
        # These volume mounts are persistent. They are like inline claims,
        # but not exactly because the names need to match exactly one of
        # the stateful pod volumes.
        volumeMounts:
        - name: cassandra-data
          mountPath: /cassandra_data
  # These are converted to volume claims by the controller
  # and mounted at the paths mentioned above.
  # do not use these in production until ssd GCEPersistentDisk or other ssd pd
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast
      resources:
        requests:
          storage: 1Gi
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: k8s.io/minikube-hostpath
parameters:
  type: pd-ssd
Now I need to add the line below to cassandra-env.sh, either in postStart or in the Cassandra YAML file:
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/cassandra-exporter-agent-<version>.jar"
I was able to achieve this, but after this step Cassandra requires a restart, and since it is already running as a pod, I don't know how to restart the process. Is there any way to do this step before the pod starts running rather than after it is up?
The following solution was suggested to me:
This won’t work. Commands that run in postStart don’t impact the running container. You need to change the startup commands passed to Cassandra.
The only way that I know to do this is to create a new container image in the artifactory based on the existing image, and pull from there.
But I don't know how to achieve this.
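One possible way to get the agent flag in place before Cassandra starts, without a postStart hook, is to pass it through an environment variable. This is only a sketch under two assumptions. First, it assumes the image's cassandra-env.sh appends `$JVM_EXTRA_OPTS` to `JVM_OPTS`, as the upstream Cassandra script does; this is not verified for gcr.io/google-samples/cassandra:v13. Second, it assumes the agent jar is already present in the container at the path shown, for example baked into a derived image or mounted from a volume. The /cassandra-exporter directory is hypothetical. Kubernetes does not expand shell variables such as $CASSANDRA_HOME in env values, so the path is written out in full.

```yaml
# Fragment of the StatefulSet container spec; add next to the existing env entries.
env:
- name: JVM_EXTRA_OPTS
  # Assumption: cassandra-env.sh in this image appends $JVM_EXTRA_OPTS to JVM_OPTS.
  # The directory is hypothetical; <version> is left unfilled, as in the question.
  value: "-javaagent:/cassandra-exporter/cassandra-exporter-agent-<version>.jar"
```

Because the variable is part of the pod template, changing it makes the StatefulSet controller recreate the pods, so Cassandra starts with the flag already set and no in-place restart is needed.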
I am running a 3-node RabbitMQ cluster on Kubernetes. The Kubernetes cluster runs on AWS spot instances, and one of the Kubernetes nodes, on which one of the RabbitMQ pods was running, was terminated unexpectedly. The pod got scheduled on another node, and since then it has been stuck in the pod initialization state.
The Kubernetes event says "FailedPostStartHook".
Logs:
9m46s Warning FailedPostStartHook pod/rabbitmq-0 Exec lifecycle hook ([/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
]) for Container "rabbitmq" in Pod "rabbitmq-0_devops(c96c1a6e-bf9a-450d-828d-ed0e8a0ad949)" failed - error: command '/bin/sh -c until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
' exited with 137: Error: unable to perform an operation on node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local'. Please see diagnostics information and suggestions below.
Most common reasons for this are:
* Target node is unreachable (e.g. due to hostname resolution, TCP connection or firewall issues)
* CLI tool fails to authenticate with the server (e.g. due to CLI tool's Erlang cookie not matching that of the server)
* Target node is not running
In addition to the diagnostics info below:
* See the CLI, clustering and networking guides on https://rabbitmq.com/documentation.html to learn more
* Consult server logs on node rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local
* If target node is configured to use long node names, don't forget to use --longnames with CLI tools
DIAGNOSTICS
===========
attempted to contact: ['rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local']
Kubernetes statefulset manifest:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: rabbitmq
  namespace: devops
spec:
  podManagementPolicy: OrderedReady
  replicas: 3
  revisionHistoryLimit: 3
  selector:
    matchLabels:
      app: rabbitmq
  serviceName: rabbitmq-service
  template:
    metadata:
      annotations:
      labels:
        app: rabbitmq
      name: rabbitmq
    spec:
      containers:
      - env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_BASIC_AUTH
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_NODENAME
          value: rabbit@$(HOSTNAME).rabbitmq-service.$(NAMESPACE).svc.cluster.local
        - name: K8S_SERVICE_NAME
          value: rabbitmq-service
        - name: RABBITMQ_DEFAULT_USER
          value: admin
        - name: RABBITMQ_DEFAULT_PASS
          valueFrom:
            secretKeyRef:
              key: password
              name: rabbitmq
        - name: RABBITMQ_ERLANG_COOKIE
          value: some-cookie
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: IfNotPresent
        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c
              - |
                until rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} node_health_check; do sleep 1; done; rabbitmqctl --erlang-cookie ${RABBITMQ_ERLANG_COOKIE} set_policy ha-all "" '{"ha-mode":"all", "ha-sync-mode": "automatic"}'
        livenessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        name: rabbitmq
        ports:
        - containerPort: 4369
          protocol: TCP
        - containerPort: 5672
          protocol: TCP
        - containerPort: 5671
          protocol: TCP
        - containerPort: 25672
          protocol: TCP
        - containerPort: 15672
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - rabbitmqctl
            - status
          failureThreshold: 3
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 30
        resources:
          limits:
            cpu: "2"
            memory: 3Gi
          requests:
            cpu: "1"
            memory: 2Gi
        volumeMounts:
        - mountPath: /var/lib/rabbitmq/
          name: rabbitmq-data
        - mountPath: /etc/rabbitmq
          name: config
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - /bin/bash
        - -euc
        - |
          rm -f /var/lib/rabbitmq/.erlang.cookie
          cp /rabbitmqconfig/rabbitmq.conf /etc/rabbitmq/rabbitmq.conf
          cp /rabbitmqconfig/enabled_plugins /etc/rabbitmq/enabled_plugins
        image: rabbitmq:3.8.1-management-alpine
        imagePullPolicy: Always
        name: copy-rabbitmq-config
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /rabbitmqconfig
          name: rabbitmq-configmap
        - mountPath: /etc/rabbitmq
          name: config
        - mountPath: /var/lib/rabbitmq
          name: rabbitmq-data
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: rabbitmq
      serviceAccountName: rabbitmq
      terminationGracePeriodSeconds: 10
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: rabbitmq.conf
            path: rabbitmq.conf
          - key: enabled_plugins
            path: enabled_plugins
          name: rabbitmq-configmap
        name: rabbitmq-configmap
      - emptyDir: {}
        name: config
  updateStrategy:
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: rabbitmq-data
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: gp2
      volumeMode: Filesystem
Things I have tried:
Logged into the stuck pod and executed (this command just hung without any response)
rabbitmqctl stop_app
Tried deleting the pod forcefully, but no luck.
Logged into the stuck pod and executed
rabbitmqctl reset
Logged into the stuck pod and executed
rabbitmqctl force_boot
Logged into the stuck pod and executed
rm /var/log/rabbitmq/*
None of the above helped.
Please note that the other 2 RabbitMQ nodes are running fine, serving traffic, and showing the failed node as up:
rabbitmq-2 rabbitmq 2021-07-04 12:19:07.233 [info] <0.490.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
rabbitmq-1 rabbitmq 2021-07-04 12:19:07.208 [info] <0.494.0> node 'rabbit@rabbitmq-0.rabbitmq-service.devops.svc.cluster.local' up
Running a rollout restart of the StatefulSet worked for me:
kubectl rollout restart statefulset rabbitmq -n devops
After this command, the RabbitMQ cluster was up and running, and all three nodes joined the cluster without any issue.
Once this is done, the applications that connect to this RabbitMQ cluster need to be restarted.
I have a local Kubernetes cluster where I added a Fluentd DaemonSet using the preconfigured Elasticsearch image (fluent/fluentd-kubernetes-daemonset:elasticsearch), following step 2 of this article. I also have an Elastic cluster running in the cloud. You can pass some environment variables to the fluentd-elasticsearch image for configuration. It looks pretty straightforward, but when running the Fluentd pod I keep getting the error:
"Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"fa0acce34bf64db9bc9e46f98743c185.westeurope.azure.elastic-cloud.com\", :port=>9243, :scheme=>\"https\", :user=>\"username\", :password=>\"obfuscated\"})!" plugin_id="out_es"
When I try to reach the Elastic cluster from within the pod with
# wget https://fa0acce34bf64db9bc9e46f98743c185.westeurope.azure.elastic-cloud.com:9243/
I get a 401 Unauthorized (because I haven't submitted a user/password here), but it at least shows that the address is reachable.
Why is it failing to connect?
I already set FLUENT_ELASTICSEARCH_SSL_VERSION to 'TLSv1_2', since I saw that this solved some problems for others.
Daemonset configuration:
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
    k8s-app: fluentd-logging
    version: v1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
        k8s-app: fluentd-logging
        version: v1
        kubernetes.io/cluster-service: "true"
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "fa0acce34bf64db9bc9e46f98743c185.westeurope.azure.elastic-cloud.com"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9243"
        - name: FLUENT_ELASTICSEARCH_SCHEME
          value: "https"
        - name: FLUENT_UID
          value: "0"
        - name: FLUENT_ELASTICSEARCH_SSL_VERIFY
          value: "false"
        - name: FLUENT_ELASTICSEARCH_SSL_VERSION
          value: "TLSv1_2"
        - name: FLUENT_ELASTICSEARCH_USER
          value: "<user>"
        - name: FLUENT_ELASTICSEARCH_PASSWORD
          value: "<password>"
        resources:
          limits:
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
For anyone else who runs into this problem:
I was following a tutorial that used the image 'fluent/fluentd-kubernetes-daemonset:elasticsearch'. When you check their DockerHub page (https://hub.docker.com/r/fluent/fluentd-kubernetes-daemonset), you can see that the :elasticsearch tag is a year old and probably outdated.
I changed the image for the DaemonSet to a more recent and stable tag, 'fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch', and it works now.
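In the DaemonSet from the question, that amounts to a one-line change in the container spec. The fragment below is a sketch showing only the affected lines; the env entries, resources and volume mounts stay as posted.

```yaml
# Fragment of the DaemonSet pod template; only the image tag changes.
containers:
- name: fluentd
  # Previously: fluent/fluentd-kubernetes-daemonset:elasticsearch
  image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
```

Re-applying the manifest with `kubectl apply -f` should roll the new image out to the Fluentd pods, since the DaemonSet's default update strategy is RollingUpdate.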