Kubernetes minion not completely connecting - docker

I have a dev Kubernetes cluster set up where I have a minion running kube-proxy and the kubelet. Both only start if they can connect to the master's apiserver, which they can. However, I am getting
error updating node status, will retry: error getting node "10.211.55.126": minion "10.211.55.126" not found
repeatedly when I try running the minion's kubelet. Prior to that, I also notice Server rejected event '&api.Event, followed by a large JSON object with mostly empty string values. I have the kubelet pointing to a private IP, and it is reporting that it can't find the public IP. I imagine this is an etcd issue, but I'm not sure; it may also be flanneld?
Update 1
I managed to get past the initial error by registering the minion (node?) with the master. This allowed it to receive pods from the master and run the containers; however, the minion is still not fully connected, which results in the master continuously pushing more pods to it. The kubelet process is reporting: Cannot get host IP: Host IP unknown; known addresses: []. Is there a flag to run kubelet with to give it the host IP?
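For reference, a minimal sketch of what that might look like, assuming an early kubelet release where --api-servers and --hostname-override are the relevant flags (flag names varied across early versions, so verify against kubelet --help):
kubelet --api-servers=http://<master-ip>:8080 --hostname-override=10.211.55.126  # report under a routable address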

Currently, I have to manually register the minion prior to spinning up the minion instance, because there is an open issue right now that prevents the minion from self-registering in certain cases.
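For illustration, manual registration amounts to posting a minimal Node object to the apiserver; a sketch, assuming kubectl is configured against the master (the exact kind and apiVersion depend on your release):
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Node
metadata:
  name: 10.211.55.126
EOF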
UPDATE
Now I'm using kube-register to register each minion/node when the kubelet service starts.

Related

Multi-master OKD 3.11 setup fails if the master-1 node is down

I am trying to install a multi-master OpenShift 3.11 setup on OpenStack VMs, following the inventory file in the official documentation:
https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha
OKD Version
[centos#master1 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://master1.167.254.204.74.nip.io:8443
openshift v3.11.0+ff2bdbd-531
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
Bring up an OKD 3.11 multi-master setup as per the inventory file mentioned here:
https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha
Current Result
The setup is successful, but I am stuck with two issues, as described below:
The load balancer node is not listed when issuing the oc get nodes command.
[centos#master1 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
master1.167.254.204.74.nip.io Ready infra,master 6h v1.11.0+d4cacc0
master2.167.254.204.58.nip.io Ready infra,master 6h v1.11.0+d4cacc0
master3.167.254.204.59.nip.io Ready infra,master 6h v1.11.0+d4cacc0
node1.167.254.204.82.nip.io Ready compute 6h v1.11.0+d4cacc0
The master nodes and the load balancer are totally dependent on the master-1 node: if master-1 is down, the rest of the master nodes and the load balancer are unable to run any oc commands:
[centos#master2 ~]$ oc get nodes
Unable to connect to the server: dial tcp 167.254.204.74:8443: connect: no route to host
The OKD setup works fine if any of the other master nodes (other than master-1) or the load balancer goes down.
Expected Result
The OKD setup should stay up and running even when any one of the master nodes goes down.
Inventory file:
[OSEv3:children]
masters
nodes
etcd
lb
[masters]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io
[etcd]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io
[lb]
lb.167.254.204.111.nip.io
[nodes]
master1.167.254.204.74.nip.io openshift_ip=167.254.204.74 openshift_schedulable=true openshift_node_group_name='node-config-master'
master2.167.254.204.58.nip.io openshift_ip=167.254.204.58 openshift_schedulable=true openshift_node_group_name='node-config-master'
master3.167.254.204.59.nip.io openshift_ip=167.254.204.59 openshift_schedulable=true openshift_node_group_name='node-config-master'
node1.167.254.204.82.nip.io openshift_ip=167.254.204.82 openshift_schedulable=true openshift_node_group_name='node-config-compute'
[OSEv3:vars]
debug_level=4
ansible_ssh_user=centos
ansible_become=true
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
openshift_enable_service_catalog=true
ansible_service_broker_install=true
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]
containerized=false
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability
deployment_type=origin
openshift_deployment_type=origin
openshift_release=v3.11.0
openshift_pkg_version=-3.11.0
openshift_image_tag=v3.11.0
openshift_service_catalog_image_version=v3.11.0
template_service_broker_image_version=v3.11
osm_use_cockpit=true
# put the router on dedicated infra1 node
openshift_master_cluster_method=native
openshift_master_default_subdomain=sub.master1.167.254.204.74.nip.io
openshift_public_hostname=master1.167.254.204.74.nip.io
openshift_master_cluster_hostname=master1.167.254.204.74.nip.io
Please explain why the entire setup depends on master-node-1, and suggest any workaround to fix this.
You should set openshift_master_cluster_hostname and openshift_master_cluster_public_hostname to the LB hostname, not a master's hostname.
With your current configuration, master1 is the entry point for all API traffic, so if master1 stops, the whole API service goes down.
Beforehand, you should configure your LB to load-balance across your master nodes, and register the LB IP (a.k.a. the VIP) in DNS, e.g. as ocp-cluster.example.com.
This hostname will be the entry point for the OCP API; set it using both openshift_master_cluster_hostname and openshift_master_cluster_public_hostname:
openshift_master_cluster_method=native
openshift_master_cluster_hostname=ocp-cluster.example.com
openshift_master_cluster_public_hostname=ocp-cluster.example.com
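For completeness, here is a minimal sketch of the LB side, assuming HAProxy in TCP passthrough mode in front of the masters from the inventory above (ports and balance policy are illustrative):
frontend openshift-api
    mode tcp
    bind *:8443
    default_backend masters
backend masters
    mode tcp
    balance source
    server master1 master1.167.254.204.74.nip.io:8443 check
    server master2 master2.167.254.204.58.nip.io:8443 check
    server master3 master3.167.254.204.59.nip.io:8443 check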

Jenkins slave pod on kubernetes randomly failing

I have set up a Jenkins master (on a VM), and it provisions JNLP slaves as Kubernetes pods.
On very rare occasions, the pipeline fails with this message:
java.io.IOException: Pipe closed
at java.io.PipedInputStream.checkStateForReceive(PipedInputStream.java:260)
at java.io.PipedInputStream.receive(PipedInputStream.java:226)
at java.io.PipedOutputStream.write(PipedOutputStream.java:149)
at java.io.OutputStream.write(OutputStream.java:75)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.setupEnvironmentVariable(ContainerExecDecorator.java:510)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.doLaunch(ContainerExecDecorator.java:474)
at org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1.launch(ContainerExecDecorator.java:333)
at hudson.Launcher$ProcStarter.start(Launcher.java:455)
Viewing the Kubernetes logs in Stackdriver, one can see that the pod does manage to connect to the master, e.g.:
Handshaking
Agent discovery successful
Trying protocol: JNLP4-Connect
Remote Identity confirmed: <some_hash_here>
Connecting to <jenkins-master-url>:49187
started container
loading plugin ...
but after a while it fails; here are the relevant logs:
org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave$SlaveDisconnector call
INFO: Disabled slave engine reconnects.
hudson.remoting.jnlp.Main$CuiListener status
Terminated
hudson.remoting.Request$2 run
Failed to send back a reply to the request hudson.remoting.Request$2#336ec321: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel#29d0e8b2:JNLP4-connect connection to <jenkins-master-url>/<public-ip-of-jenkins-master>:49187": channel is already closed
"Processing signal 'terminated'"
.
.
.
How can I further troubleshoot this random error?
Can you take a look at the Kubernetes pod events in Stackdriver? We had similar behavior with a different CI system (GitLab CI). Our builds were also randomly failing. It turned out that the JVM inside the container exceeded its memory limit and was killed by Kubernetes (OOMKilled), and the CI system recognised this as a build error.
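A quick way to check for that is to inspect a failed agent pod's last state and events; a sketch (the pod name is a placeholder):
kubectl describe pod <jnlp-agent-pod> | grep -A 5 'Last State'   # look for Reason: OOMKilled
kubectl get events --field-selector involvedObject.name=<jnlp-agent-pod>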

Cannot run Kubernetes dashboard on the master node

I installed a Kubernetes cluster (one master and two nodes), and the nodes' status is Ready on the master. When I deploy the dashboard and access the link http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/, I get the error
'dial tcp 10.32.0.2:8443: connect: connection refused' Trying to
reach: 'https://10.32.0.2:8443/'
The dashboard pod's state is Ready, and I tried to ping 10.32.0.2 (the dashboard's IP), unsuccessfully.
I run the dashboard as the Web UI (Dashboard) guide suggests.
How can I fix this?
There are a few options here:
Most of the time, if there is some kind of connection-refused, timeout, or similar error, it is most likely a configuration problem. If you can't get the Dashboard running, try to deploy another application and access it; if that also fails, it is not a Dashboard-specific issue.
Check if you are using root/sudo.
Have you properly installed flannel or another network plugin for the containers?
Have you checked your API server logs? If not, please do so.
Check the description of the dashboard pod (kubectl describe) for anything suspicious; see the sketch after this list.
Similarly, check the description of the service.
What is your cluster version? Check if any updates are required.
Please let me know if any of the above helped.
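For example, a minimal sketch of those describe/log checks, assuming a default dashboard deployment in the kubernetes-dashboard namespace with the usual k8s-app label:
kubectl describe pod -n kubernetes-dashboard -l k8s-app=kubernetes-dashboard
kubectl describe service kubernetes-dashboard -n kubernetes-dashboard
kubectl logs -n kubernetes-dashboard -l k8s-app=kubernetes-dashboard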
Start proxy, if it's not started
kubectl proxy --address='0.0.0.0' --port=8001 --accept-hosts='.*'

kubelet logs flooding even after pods deleted

Kubernetes version : v1.6.7
Network plugin : weave
I recently noticed that my entire cluster of 3 nodes went down. My initial troubleshooting revealed that /var on all nodes was 100% full.
Digging further into the logs revealed that they were being flooded by kubelet stating:
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.636001 1220 kuberuntime_gc.go:138] Failed to stop sandbox "fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "<TROUBLING_POD>-1545236220-ds0v1_kube-system" network: CNI failed to retrieve network namespace path: Error: No such container: fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.637690 1220 docker_sandbox.go:205] Failed to stop sandbox "fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648": Error response from daemon: {"message":"No such container: fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648"}
The <TROUBLING_POD>-1545236220-ds0v1 pod was being initiated by a cronjob, and due to some misconfigurations, errors occurred while those pods ran and more and more pods were spun up.
So I deleted all the jobs and their related pods. I then had a cluster with no jobs or pods related to my cronjob running, yet I still see the same ERROR messages flooding the logs.
I did:
1) Restart docker and kubelet on all nodes.
2) Restart the entire control plane
and also
3) Reboot all nodes.
But the logs are still being flooded with the same error messages, even though no such pods are being spun up anymore.
So I don't know how I can stop kubelet from throwing out these errors.
Is there a way for me to reset the network plugin I am using? Or do something else?
Check if the pod directories still exist under /var/lib/kubelet.
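As a sketch of that check (the path is the standard kubelet data directory; double-check before removing anything):
ls /var/lib/kubelet/pods/
# if a directory for an already-deleted pod is still present, kubelet's GC keeps retrying;
# removing the stale directory stops the flood
sudo rm -rf /var/lib/kubelet/pods/<stale-pod-uid>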
You're on a very old version of Kubernetes; upgrading will fix this issue.

Storm: java.lang.RuntimeException: Returned channel was actually not established

I have a Storm cluster with 1 Nimbus node and 3 supervisor nodes, which run in Docker containers on AWS EC2 instances. I had a topology running with the number of workers set to 3, and it ran perfectly fine. I stopped and removed this container and started a new one. After this, I see the following error in the supervisor logs:
2016-10-03 21:18:22 b.s.m.n.Client [ERROR] connection attempt 129 to Netty-Client-hostname:6702 failed: java.lang.RuntimeException: Returned channel was actually not established
I have edited "/etc/hosts" to include the hostname as follows:
IP-address hostname
Yet the problem seems to persist, although the same topology runs perfectly fine with the number of workers set to 1. Any pointers on solving this issue are appreciated.
The problem was with the hostname. I changed the hostname to match the DNS name by updating "/etc/hostname" as well as "/etc/hosts", and then rebooted the Nimbus instance followed by the supervisor instances. This fixed the problem. Hope this helps anyone who is stuck with the same problem!
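Roughly, the fix looked like this (the hostname and IP are placeholders):
sudo hostnamectl set-hostname supervisor1.example.internal   # or edit /etc/hostname directly
echo "10.0.0.12 supervisor1.example.internal" | sudo tee -a /etc/hosts
sudo reboot   # reboot Nimbus first, then each supervisor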
Please check your supervisor log; sometimes you need to redeploy the topology because the supervisor has not started it yet.
