OpenShift Monitoring - cAdvisor + Prometheus - Docker

I tried to implement a monitoring solution for an OpenShift cluster based on Prometheus + node-exporter + Grafana + cAdvisor.
I have a huge problem with the cAdvisor component. I tried a lot of configurations (the changes were always to the volumes), but none of them work well: the container either restarts every ~2 minutes or does not collect all metrics (processes).
Example configuration (with this config the container does not restart every 2 minutes, but it does not collect processes). I know I don't have /rootfs in the volumes, but with that mount the container runs for about 5 seconds and then goes down:
  containers:
    - image: >-
        google/cadvisor@sha256:fce642268068eba88c27c666e92ed4144be6188447a23825015884741cf0e352
      imagePullPolicy: IfNotPresent
      name: cadvisor-new-version
      ports:
        - containerPort: 8080
          protocol: TCP
      resources: {}
      securityContext:
        privileged: true
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
        - mountPath: '/sys/fs/cgroup/cpuacct,cpu'
          name: sys
          readOnly: true
        - mountPath: /var/lib/docker
          name: docker
          readOnly: true
        - mountPath: /var/run/containerd/containerd.sock
          name: docker-socketd
          readOnly: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: cadvisor-sa
  serviceAccountName: cadvisor-sa
  terminationGracePeriodSeconds: 300
  volumes:
    - hostPath:
        path: '/sys/fs/cgroup/cpu,cpuacct'
      name: sys
    - hostPath:
        path: /var/lib/docker
      name: docker
    - hostPath:
        path: /var/run/containerd/containerd.sock
      name: docker-socketd
I use a service account in my OpenShift project with the privileged SCC.
OpenShift version - 3.6
Docker version - 1.12
cAdvisor version - I tried every version from v0.26.3 to the newest
I found a post saying the problem can be an old Docker version; can anyone confirm this?
Has anyone worked out the right configuration and gotten cAdvisor running on OpenShift?
example of logs:
I0409 08:41:46.661453 1 manager.go:231] Version:
{KernelVersion:3.10.0-693.17.1.el7.x86_64 ContainerOsVersion:Alpine Linux v3.4 DockerVersion:1.12.6 DockerAPIVersion:1.24 CadvisorVersion:v0.28.3 CadvisorRevision:1e567c2}
E0409 08:41:50.823560 1 factory.go:340] devicemapper filesystem stats will not be reported: usage of thin_ls is disabled to preserve iops
I0409 08:41:50.825280 1 factory.go:356] Registering Docker factory
I0409 08:41:50.826394 1 factory.go:54] Registering systemd factory
I0409 08:41:50.826949 1 factory.go:86] Registering Raw factory
I0409 08:41:50.827388 1 manager.go:1178] Started watching for new ooms in manager
I0409 08:41:50.838169 1 manager.go:329] Starting recovery of all containers
W0409 08:41:56.853821 1 container.go:393] Failed to create summary reader for "/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc323db44_39a9_11e8_accd_005056800e7b.slice/docker-26db795af0fa28047f04194d8169cf0249edf2c918c583422a1404d35ed5b62c.scope": none of the resources are being tracked.
I0409 08:42:03.953261 1 manager.go:334] Recovery completed
I0409 08:42:37.874062 1 cadvisor.go:162] Starting cAdvisor version: v0.28.3-1e567c2 on port 8080
I0409 08:42:56.353574 1 fsHandler.go:135] du and find on following dirs took 1.20076874s: [ /rootfs/var/lib/docker/containers/2afa2c457a9c1769feb6ab542102521d8ad51bdeeb89581e4b7166c1c93e7522]; will not log again for this container unless duration exceeds 2s
I0409 08:42:56.453602 1 fsHandler.go:135] du and find on following dirs took 1.098795382s: [ /rootfs/var/lib/docker/containers/65e4ad3536788b289e2b9a29e8f19c66772b6f38ec10d34a2922e4ef4d67337f]; will not log again for this container unless duration exceeds 2s
I0409 08:42:56.753070 1 fsHandler.go:135] du and find on following dirs took 1.400184357s: [ /rootfs/var/lib/docker/containers/2b0aa12a43800974298a7d0353c6b142075d70776222196c92881cc7c7c1a804]; will not log again for this container unless duration exceeds 2s
I0409 08:43:00.352908 1 fsHandler.go:135] du and find on following dirs took 1.199079344s: [ /rootfs/var/lib/docker/containers/aa977c2cc6105e633369f48e2341a6363ce836cfbe8e7821af955cb0cf4d5f26]; will not log again for this container unless duration exceeds 2s

There's a cAdvisor process embedded in the OpenShift kubelet. Maybe there's a race condition between the two that makes the pod crash.
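If the embedded cAdvisor covers your needs, one option is to scrape it from the kubelet instead of running a separate cAdvisor pod. Below is a minimal Prometheus scrape-job sketch; it assumes Prometheus runs in-cluster with a service account allowed to read node metrics, and the port/path depend on the kubelet version (newer kubelets serve cAdvisor metrics at /metrics/cadvisor on the secure port, older releases expose a standalone cAdvisor port such as 4194), so verify the values against your cluster:

  # prometheus.yml fragment - scraping the kubelet-embedded cAdvisor (sketch)
  scrape_configs:
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      metrics_path: /metrics/cadvisor       # adjust for older kubelets
      kubernetes_sd_configs:
        - role: node                        # targets each node's kubelet port
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        insecure_skip_verify: true          # replace with proper CA verification where possible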

I'm seeing something similar in a three-node docker swarm where cAdvisor on one node - and only that one - keeps dying after a few minutes. I've watched the process and looked at its resource usage - it's running out of memory.
I've set a 128 MB limit, but I've tried higher limits as well. That just buys it more time; even at 500 MB it soon died because it ran out of memory.
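For context, a memory limit like the one described above would be declared in a swarm stack file roughly as in this sketch (the service name and image tag are assumptions):

  # docker-compose stack fragment - capping cAdvisor's memory in swarm mode
  version: '3.4'
  services:
    cadvisor:                    # assumed service name
      image: google/cadvisor:v0.28.3
      deploy:
        resources:
          limits:
            memory: 128M         # the limit mentioned above; raising it only delays the OOM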
The only thing that seems to be abnormal are those same "du and find on following dirs took" messages:
I0515 15:14:37.109399 1 fsHandler.go:135] du and find on following dirs took 46.19060577s: [/rootfs/var/lib/docker/aufs/diff/69a2bd344a635cde23e6c27a69c165ed001178a9093964d73bebdbb81d90369b /rootfs/var/lib/docker/containers/6fd8113e383f78e20608be807a38e17b14715636b94aa99112dd6d7208764a2e]; will not log again for this container unless duration exceeds 5s
I0515 15:14:35.511417 1 fsHandler.go:135] du and find on following dirs took 58.306835696s: [/rootfs/var/lib/docker/aufs/diff/bed9b7ad307f36ae97659b79912ff081f5b64fb8d57d6a48f143cd3bf9823e64 /rootfs/var/lib/docker/containers/108f4b879f7626023be8790af33ad6b73189a27e7c9bb7d6f219521d43099bbe]; will not log again for this container unless duration exceeds 5s
I0515 15:14:47.513604 1 fsHandler.go:135] du and find on following dirs took 45.911742867s: [/rootfs/var/lib/docker/aufs/diff/c9989697f40789a69be47511c2b931f8949323d144051912206fe719f12e127d /rootfs/var/lib/docker/containers/4cd1baa15522b58f61e9968c1616faa426fb3dfd9ac8515896dcc1ec7a0cb932]; will not log again for this container unless duration exceeds 5s
I0515 15:14:49.210788 1 fsHandler.go:135] du and find on following dirs took 46.406268577s: [/rootfs/var/lib/docker/aufs/diff/7605c354c073800dcbb14df16da4847da3d70107509d27f8f1675aab475eb0df /rootfs/var/lib/docker/containers/00f37c6569bb29c028a90118cf9d12333907553396a95390d925a4c2502ab058]; will not log again for this container unless duration exceeds 5s
I0515 15:14:45.614715 1 fsHandler.go:135] du and find on following dirs took 1m1.573576904s: [/rootfs/var/lib/docker/aufs/diff/62d99773c5d1be97863f90b5be03eb94a4102db4498931863fa3f5c677a06a06 /rootfs/var/lib/docker/containers/bf3e2d8422cda2ad2bcb433e30b6a06f1c67c3a9ce396028cdd41cce3b0ad5d6]; will not log again for this container unless duration exceeds 5s
What's interesting is that it starts out taking only a couple of seconds:
I0515 15:09:48.710609 1 fsHandler.go:135] du and find on following dirs took 1.496309475s: [/rootfs/var/lib/docker/aufs/diff/a11190ca4731bbe6d9cbe1a2480e781490dc4e0e6c91c404bc33d37d7d251564 /rootfs/var/lib/docker/containers/d0b45858ae55b6613c4ecabd8d44e815c898bbb5ac5c613af52d6c1f4804df76]; will not log again for this container unless duration exceeds 2s
I0515 15:09:49.909390 1 fsHandler.go:135] du and find on following dirs took 1.29921035s: [/rootfs/var/lib/docker/aufs/diff/62d99773c5d1be97863f90b5be03eb94a4102db4498931863fa3f5c677a06a06 /rootfs/var/lib/docker/containers/bf3e2d8422cda2ad2bcb433e30b6a06f1c67c3a9ce396028cdd41cce3b0ad5d6]; will not log again for this container unless duration exceeds 2s
I0515 15:09:51.014721 1 fsHandler.go:135] du and find on following dirs took 1.502355544s: [/rootfs/var/lib/docker/aufs/diff/5264e7a8c3bfb2a4ee491d6e42e41b3300acbcf364455698ab232c1fc9e8ab4e /rootfs/var/lib/docker/containers/da355f40535a001c5ba0e16da61b6340028b4e432e0b2f14b8949637559ff001]; will not log again for this container unless duration exceeds 2s
I0515 15:09:53.309486 1 fsHandler.go:135] du and find on following dirs took 2.19038347s: [/rootfs/var/lib/docker/aufs/diff/8b0fd9287d107580b76354851b75c09ce47e114a70092305d42f8c2b5f5e23b2 /rootfs/var/lib/docker/containers/5fd8ac9fd8d98d402851f2642266ca89598a964f50cfabea9bdf50b87f7cff66]; will not log ag
So something seems to be getting progressively worse until the container dies.

Related

GKE problem when running cronjob by pulling image from Artifact Registry

I created a cronjob with the following spec in GKE:
# cronjob.yaml
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: collect-data-cj-111
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Allow
  startingDeadlineSeconds: 100
  suspend: false
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: collect-data-cj-111
            image: collect_data:1.3
          restartPolicy: OnFailure
I create the cronjob with the following command:
kubectl apply -f collect_data.yaml
When I later watch whether it is running or not (I scheduled it to run every 5 minutes for the sake of testing), here is what I see:
$ kubectl get pods --watch
NAME READY STATUS RESTARTS AGE
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 Pending 0 0s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 Pending 0 1s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ContainerCreating 0 1s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ErrImagePull 0 3s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ImagePullBackOff 0 17s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ErrImagePull 0 30s
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX 0/1 ImagePullBackOff 0 44s
It does not seem to be able to pull the image from Artifact Registry. I have both GKE and Artifact Registry created under the same project.
What can be the reason? After spending several hours in the docs I still could not make progress, and I am quite new to the world of GKE.
If you recommend checking anything, I would really appreciate it if you also described where in GCP I should check it.
ADDENDUM:
When I run the following command:
kubectl describe pods
The output is quite large but I guess the following message should indicate the problem.
Failed to pull image "collect_data:1.3": rpc error: code = Unknown
desc = failed to pull and unpack image "docker.io/library/collect_data:1.3":
failed to resolve reference "docker.io/library/collect_data:1.3": pull
access denied, repository does not exist or may require authorization:
server message: insufficient_scope: authorization failed
How do I solve this problem step by step?
From the error shared, I can tell that the image is not being pulled from Artifact Registry: by default, GKE pulls images directly from Docker Hub unless specified otherwise, and since there is no collect_data image there, the pull fails.
The correct way to specify an image stored in Artifact Registry is as follows:
image: <location>-docker.pkg.dev/<project>/<repo-name>/<image-name:tag>
Be aware that the registry format has to be set to "docker" if you are using a docker-containerized image.
Take a look at the Quickstart for Docker guide, where it is specified how to pull and push docker images to Artifact Registry along with the permissions required.
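Applied to the CronJob above, the image reference would look roughly like this sketch (region, project ID and repository name are placeholders to replace with your own):

  # jobTemplate container fragment - pulling from Artifact Registry (illustrative values)
  containers:
  - name: collect-data-cj-111
    image: us-central1-docker.pkg.dev/my-project/my-repo/collect_data:1.3

The identity GKE uses to pull (typically the node service account) also needs read access to the repository, for example the Artifact Registry Reader role, as covered in the quickstart mentioned above.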

Eclipse che - volume mount error while launching workspace by 5 users at same time

Configuration:
Google Kubernetes Engine (GKE) version - 1.18.12-gke.1210
Nodes count - 2
Node configuration - 2-core, 8 GB memory machine with a 30 GB hard disk
Autoscaling is enabled
Eclipse che Advanced configuration:
server:
  CHE_WORKSPACE_POOL_EXACT__SIZE: "60"
  CHE_WORKSPACE_STORAGE_PREFERRED__TYPE: ephemeral
  allowUserDefinedWorkspaceNamespaces: false
  cheDebug: "false"
  cheFlavor: che
  cheHost: che-eclipse-che.domain.com
  cheLogLevel: INFO
  cheServerIngress: {}
  cheServerRoute: {}
  devfileRegistryIngress: {}
  devfileRegistryRoute: {}
  externalDevfileRegistry: false
  externalPluginRegistry: false
  gitSelfSignedCert: false
  pluginRegistryIngress: {}
  pluginRegistryRoute: {}
  selfSignedCert: true
  tlsSupport: true
  useInternalClusterSVCNames: true
  workspaceNamespaceDefault: all-che-workspace
storage:
  preCreateSubPaths: true
  pvcClaimSize: 128Gi
  pvcStrategy: common
  preferred_type: persistent
Bug description:
I logged in as 10 different users at the same time and launched workspaces for all 10 users at once. 3-5 users are able to launch their workspace successfully; the remaining users get a volume-mount timeout error, and some users' workspaces keep loading with nothing initialised in the log window.
Error screenshots:
Error for User 1:
Failed to run the workspace: "Unrecoverable event occurred: 'FailedMount', 'Unable to attach or mount volumes: unmounted volumes=[claim-che-workspace], unattached volumes=[gitconfigvolume remote-endpoint che-workspace-token-dmn68 workspacep72ony0ucs0pqa5c-sshprivatekeys che-ca-certs broker-config-volume5iwa24 ssshkeyconfigvolume claim-che-workspace che-jwtproxy-config-volume]: timed out waiting for the condition', 'workspacep72ony0ucs0pqa5c.maven-d5476444f-6tcgg'"
Error for User 2:
Failed to run the workspace: "Waiting for Kubernetes environment 'default' of the workspace'workspaceo6i4zoqzs1xym88w' reached timeout"
Try to run Che Workspace in debug mode.
See https://www.eclipse.org/che/docs/che-7/end-user-guide/investigating-failures-at-a-workspace-start-using-the-verbose-mode/
It might give you some insight into what is going wrong.
The most probable cause of the issues in the question (which I've managed to reproduce) could be related to the storage configuration used.
Citing the documentation:
When the common PVC strategy is in use, user-defined PVCs are ignored and volumes that refer to these user-defined PVCs are replaced with a volume that refers to the common PVC. In this strategy, all Che workspaces use the same PVC. When the user runs one workspace, it only binds to one node in the cluster at a time.
-- Eclipse.org: Che: Docs: Configuring storage strategies: The common PVC strategy
By default, when you create a PVC with GKE you are in fact creating a Persistent Disk, which can be mounted in RWO access mode to a single node. If the workspace is scheduled onto a node that the PVC is not mounted to, the creation process will fail and you will get the following message:
Unable to attach or mount volumes: unmounted volumes=[claim-che-workspace], unattached volumes=[che-ca-certs che-workspace-token-5tb9b broker-config-volumeavmw4x claim-che-workspace workspacebgnsca7mkryv3s3m-sshprivatekeys ssshkeyconfigvolume gitconfigvolume]: timed out waiting for the condition
Failed to run the workspace: "Plugins installation process failed. Error: Unrecoverable event occurred: 'FailedMount', 'Unable to attach or mount volumes: unmounted volumes=[claim-che-workspace], unattached volumes=[che-ca-certs che-workspace-token-5tb9b broker-config-volumeavmw4x claim-che-workspace workspacebgnsca7mkryv3s3m-sshprivatekeys ssshkeyconfigvolume gitconfigvolume]: timed out waiting for the condition', 'workspacebgnsca7mkryv3s3m.che-plugin-broker'"
To avoid this issue, I reckon you could do one of the following (not tested):
Use an RWX storage solution.
Change the storage strategy to unique (see the sketch after this list).
Use a single node. <-- not a great idea, but it should work
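A minimal sketch of that storage change, assuming the same custom resource as in the question (only the storage section is shown; the unique strategy creates separate PVCs per workspace volume instead of one shared PVC, which avoids the multi-attach problem):

  storage:
    preCreateSubPaths: true
    pvcClaimSize: 128Gi
    pvcStrategy: unique        # was: common
    preferred_type: persistent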
To get more insight into the issue, you can monitor the state of the Pods while a workspace is being created. It could tell you more about the process:
$ kubectl get pods -n kruk-che
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kruk-che workspace4pfd7goxamv8vvs4.maven-5d9c59746f-swfdd 5/5 Running 0 29m 10.4.0.58 gke-name-pool-9530 <none> <none>
kruk-che workspacebgnsca7mkryv3s3m.che-plugin-broker 0/1 ContainerCreating 0 105s <none> gke-name-pool-qhns <none> <none>
kruk-che workspacezo0l50kaa2zvm4lv.maven-8644fbf959-45xjd 5/5 Running 0 12m 10.4.0.62 gke-name-pool-9530 <none> <none>
In the above example you can see a Pod in the ContainerCreating state. You can inspect it for more information about its state, like:
$ kubectl describe pod -n kruk-che workspacebgnsca7mkryv3s3m.che-plugin-broker
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 83s default-scheduler Successfully assigned kruk-che/workspacebgnsca7mkryv3s3m.che-plugin-broker to gke-name-pool-qhns
Warning FailedAttachVolume 83s attachdetach-controller Multi-Attach error for volume "pvc-869bc565-4dd1-4362-8d63-f3b1fde6f246" Volume is already used by pod(s) workspacezo0l50kaa2zvm4lv.maven-8644fbf959-45xjd, workspace4pfd7goxamv8vvs4.maven-5d9c59746f-swfdd
Warning FailedMount 10s kubelet Unable to attach or mount volumes: unmounted volumes=[claim-che-workspace], unattached volumes=[che-ca-certs che-workspace-token-5tb9b broker-config-volumeavmw4x claim-che-workspace workspacebgnsca7mkryv3s3m-sshprivatekeys ssshkeyconfigvolume gitconfigvolume]: timed out waiting for the condition
Additional resources:
Eclipse.org: Che: Installation guide: Advanced configuration options for the che server component
Cloud.google.com: Kubernetes Engine: Docs: Concepts: Persistent Volumes

Kubernetes - Too many open files

I'm trying to evaluate the performance of one of my Go servers running inside a pod. However, I'm receiving an error saying "too many open files". Is there any way to set the ulimit in Kubernetes?
ubuntu@ip-10-0-1-217:~/ppu$ kubectl exec -it go-ppu-7b4b679bf5-44rf7 -- /bin/sh -c 'ulimit -a'
core file size (blocks) (-c) unlimited
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 15473
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
Deployment file.
---
apiVersion: apps/v1
kind: Deployment                  # Type of Kubernetes resource
metadata:
  name: go-ppu                    # Name of the Kubernetes resource
spec:
  replicas: 1                     # Number of pods to run at any given time
  selector:
    matchLabels:
      app: go-ppu                 # This deployment applies to any Pods matching the specified label
  template:                       # This deployment will create a set of pods using the configurations in this template
    metadata:
      labels:                     # The labels that will be applied to all of the pods in this deployment
        app: go-ppu
    spec:                         # Spec for the container which will run in the Pod
      containers:
      - name: go-ppu
        image: ppu_test:latest
        imagePullPolicy: Never
        ports:
        - containerPort: 8081     # Should match the port number that the Go application listens on
        livenessProbe:            # To check the health of the Pod
          httpGet:
            path: /health
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 35
          periodSeconds: 30
          timeoutSeconds: 20
        readinessProbe:           # To check if the Pod is ready to serve traffic or not
          httpGet:
            path: /readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 35
          timeoutSeconds: 20
Pods info:
ubuntu@ip-10-0-1-217:~/ppu$ kubectl get pods
NAME READY STATUS RESTARTS AGE
go-ppu-7b4b679bf5-44rf7 1/1 Running 0 18h
ubuntu@ip-10-0-1-217:~/ppu$ kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 19h
ppu-service LoadBalancer 100.64.171.12 74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com 8081:32623/TCP 18h
When I used locust to test the performance of the server receiving the following error.
# fails Method Name Type
3472 POST /supplyInkHistory ConnectionError(MaxRetryError("HTTPConnectionPool(host='74d35bb2a5f30ca13877-1351038893.us-east-1.elb.amazonaws.com', port=8081): Max retries exceeded with url: /supplyInkHistory (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x....>: Failed to establish a new connection: [Errno 24] Too many open files',))",),)
Have a look at https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
But you need to enable a few features to make it work.
securityContext:
  sysctls:
  - name: fs.file-max
    value: "YOUR VALUE HERE"
There were a few cases regarding setting the --ulimit argument; you can find them here or check this article. This resource limit can be set by Docker during container startup. As you added the google-kubernetes-engine tag, this answer relates to the GKE environment, but on other clouds it could work similarly.
If you would like to set the open-files limit to unlimited, you can modify the configuration file /etc/security/limits.conf on the node. However, please note it will not persist across reboots.
A second option would be to edit /etc/init/docker.conf and restart the docker service. By default it has a few limits like nofile or nproc; you can adjust them here.
Another option could be to use an instance template. The instance template would include a start-up script that sets the required limit.
After that, you would need to use this new instance template for the instance group in GKE. More information here and here.
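A rough sketch of such a start-up script, assuming the goal is to raise the open-files limit for Docker on each node; the file locations and values are assumptions that depend on the node image:

  #!/bin/bash
  # Instance-template start-up script (illustrative values)
  # Raise the per-user open-files limit...
  echo '* soft nofile 1048576' >> /etc/security/limits.conf
  echo '* hard nofile 1048576' >> /etc/security/limits.conf
  # ...and the Docker daemon's limit (upstart-style config mentioned above),
  # then restart Docker so new containers inherit it.
  echo 'limit nofile 1048576 1048576' >> /etc/init/docker.conf
  systemctl restart docker 2>/dev/null || service docker restart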

Relation between LimitRange's default, defaultRequest, max and min limits

I do not understand kubernetes LimitRange configuration.
I created a manifest with the following contents:
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: 4
    defaultRequest:
      cpu: 4
    max:
      cpu: 6
    type: Container
And then I run the following commands:
[root@localhost ~]# kubectl delete pods default-cpu-demo-19 1^C
[root@localhost ~]# kubectl get pods -n=limit
NAME                  READY   STATUS    RESTARTS   AGE
default-cpu-demo-19   0/1     Pending   0          9s
[root@localhost ~]# kubectl describe pods -n=limit
......(omitted unnecessary echo here)......
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x8 over 9m2s) default-scheduler 0/1 nodes are available: 1 Insufficient cpu.
Of course, I know that a container's CPU limit cannot be bigger than what the configuration above allows; otherwise Kubernetes reports insufficient CPU.
And a container's CPU cannot exceed the max in the configuration above either.
I can't distinguish between max and limit, because a value bigger than either the limit or the max is not allowed.
So I want to know what each of them represents and the relation among them.
If I understand your question correctly, what you asked is: what's the difference between the min, max, default and defaultRequest limits for containers set in a LimitRange resource. The answer is very simple:
defaultRequest — how much CPU/memory will be requested for a Container if it doesn't specify its own value
default — the default limit on the amount of CPU/memory for a Container if it doesn't specify its own value
max — the maximum amount of CPU/memory that a Container can ask for, i.e. it can't set its own limit higher than that
min — the minimum amount of CPU/memory that a Container can ask for, i.e. it can't set its own limit lower than that
Here is part of the official documentation with examples and more info on the topic; the sketch below puts all four together.
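A minimal sketch combining all four fields, in the style of the manifest from the question (values are illustrative):

  apiVersion: v1
  kind: LimitRange
  metadata:
    name: cpu-limit-range
  spec:
    limits:
    - type: Container
      min:
        cpu: 200m          # containers may not ask for less CPU than this
      max:
        cpu: 6             # containers may not ask for more CPU than this
      defaultRequest:
        cpu: 1             # request injected into containers that specify none
      default:
        cpu: 4             # limit injected into containers that specify none

Note that the defaults only apply when a container omits its own values, and the node still has to have enough allocatable CPU to satisfy the resulting request; that is most likely why the Pod in the question stays Pending with "Insufficient cpu" when defaultRequest is 4 CPUs.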

Running kubernetes autoscaler

I have a replication controller running with the following spec:
apiVersion: v1
kind: ReplicationController
metadata:
  name: owncloud-controller
spec:
  replicas: 1
  selector:
    app: owncloud
  template:
    metadata:
      labels:
        app: owncloud
    spec:
      containers:
      - name: owncloud
        image: adimania/owncloud9-centos7
        ports:
        - containerPort: 80
        volumeMounts:
        - name: userdata
          mountPath: /var/www/html/owncloud/data
        resources:
          requests:
            cpu: 400m
      volumes:
      - name: userdata
        hostPath:
          path: /opt/data
Now I create an HPA using the autoscale command.
$ kubectl autoscale rc owncloud-controller --max=5 --cpu-percent=10
I have also started Heapster using the kubectl run command.
$ kubectl run heapster --image=gcr.io/google_containers/heapster:v1.0.2 --command -- /heapster --source=kubernetes:http://192.168.0.103:8080?inClusterConfig=false --sink=log
After all this, the autoscaling never kicks in. From logs, it seems that the actual CPU utilization is not getting reported.
$ kubectl describe hpa owncloud-controller
Name: owncloud-controller
Namespace: default
Labels: <none>
Annotations: <none>
CreationTimestamp: Thu, 26 May 2016 14:24:51 +0530
Reference: ReplicationController/owncloud-controller/scale
Target CPU utilization: 10%
Current CPU utilization: <unset>
Min replicas: 1
Max replicas: 5
ReplicationController pods: 1 current / 1 desired
Events:
FirstSeen LastSeen Count From SubobjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
44m 8s 92 {horizontal-pod-autoscaler } Warning FailedGetMetrics failed to get CPU consumption and request: metrics obtained for 0/1 of pods
44m 8s 92 {horizontal-pod-autoscaler } Warning FailedComputeReplicas failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 0/1 of pods
What am I missing here?
Most probably Heapster is running in the wrong namespace ("default"). HPA expects Heapster to be in the "kube-system" namespace. Please add --namespace=kube-system to the kubectl run heapster command.
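Applied to the command from the question, that would look like this (same image and flags, only the namespace added):
$ kubectl run heapster --namespace=kube-system --image=gcr.io/google_containers/heapster:v1.0.2 --command -- /heapster --source=kubernetes:http://192.168.0.103:8080?inClusterConfig=false --sink=log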
I installed Heapster under the namespace "kube-system" and it worked. After deploying Heapster, make sure it's running before you use HPA for your application.
How to run Heapster with Kubernetes cluster
I put all the files here: https://gitlab.com/abushoeb/kubernetes/tree/master/heapster. They are collected from the official Kubernetes repository with minor changes.
How to run Heapster
Go to the heapster directory where you have grafana.yaml, heapster.yaml and influxdb.yaml and run the following command:
$ kubectl create -f .
How to stop Heapster
Go to the same heapster directory and then run the following command:
$ kubectl delete -f .
How to check Heapster is running
You can access the Heapster metric model from the pod where Heapster is running to make sure it is working. It can be accessed via a web browser at http://heapster-pod-ip:heapster-service-port/api/v1/model/metrics/. The same result can be seen by executing the following command:
$ curl -L http://heapster-pod-ip:heapster-service-port/api/v1/model/metrics/
If you see the list of metrics, then Heapster is running correctly. You can also browse the Grafana dashboard to see it (find the IP of the pod where Grafana is running and then access http://grafana-pod-ip:grafana-service-port).
Full documentation of the Heapster Metric Model is available here.
Also, just run $ kubectl cluster-info and see if it shows results like this:
Kubernetes master is running at https://cluster-ip:6443
Heapster is running at https://cluster-ip:6443/api/v1/proxy/namespaces/kube-system/services/heapster
kubernetes-dashboard is running at https://cluster-ip:6443/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
monitoring-grafana is running at https://cluster-ip:6443/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana
monitoring-influxdb is running at https://cluster-ip:6443/api/v1/proxy/namespaces/kube-system/services/monitoring-influxdb
Check influxdb
You can also check whether InfluxDB has data in it. Install the InfluxDB client on your local machine to connect to the InfluxDB database.
$ influx -host <cluster-ip> -port <influxdb-service-port>
Some Sample influxdb queries
show databases
use db-name
show measurements
select value from "cpu/node_capacity"
Reference and Help
https://github.com/kubernetes/heapster/blob/master/docs/influxdb.md
https://github.com/kubernetes/heapster/blob/master/docs/debugging.md
https://blog.kublr.com/how-to-utilize-the-heapster-influxdb-grafana-stack-in-kubernetes-for-monitoring-pods-4a553f4d36c9
http://www.dasblinkenlichten.com/installing-cadvisor-and-heapster-on-bare-metal-kubernetes/
http://blog.arungupta.me/kubernetes-monitoring-heapster-influxdb-grafana/
