I am trying to spin up MinIO pods on AKS; the pods keep crashing. Here are the detailed logs:
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 0s)
Waiting for a minimum of 2 disks to come online (elapsed 2s)
Waiting for all other servers to be online to format the disks.
Unable to connect to http://minio-1.minio.default.svc.cluster.local:9000/data: volume not found
Unable to connect to http://minio-2.minio.default.svc.cluster.local:9000/data: volume not found
Unable to connect to http://minio-3.minio.default.svc.cluster.local:9000/data: Post http://minio-3.minio.default.svc.cluster.local:9000/minio/storage/v8/data/getinstanceid?: dial tcp: lookup minio-3.minio.default.svc.cluster.local on 10.0.0.10:53: no such host
I suspected this is due to the volumes I attached. I was using Azure Managed Disks, but since they do not support ReadWriteMany, I am now using Azure Files; still, I am getting the same error.
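For reference, this is roughly the shape of the persistent volume claim I am trying to use with the built-in azurefile storage class (a simplified sketch; the name and size are illustrative, and in a StatefulSet the same spec sits under volumeClaimTemplates):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-data              # illustrative name
spec:
  accessModes:
    - ReadWriteMany             # Azure Files supports RWX; managed disks do not
  storageClassName: azurefile   # built-in AKS storage class backed by Azure Files
  resources:
    requests:
      storage: 10Gi             # illustrative size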
my deployment file:
Any leads on this will be appreciated.
Thanks,
Abhishek
Executive summary
For several weeks we sporadically see the following error on all of our AKS Kubernetes clusters:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1
Obviously there is a missing "/" after "mcr.microsoft.com".
The problem started after upgrading the clusters from 1.17 to 1.20.
Where does this spelling error come from? Is there anything WE can do about it?
Some details
The full error is:
Failed to pull image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": rpc error: code = Unknown desc = failed to pull and unpack image "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to resolve reference "mcr.microsoft.comoss/calico/pod2daemon-flexvol:v3.18.1": failed to do request: Head https://mcr.microsoft.comoss/v2/calico/pod2daemon-flexvol/manifests/v3.18.1: dial tcp: lookup mcr.microsoft.comoss on 168.63.129.16:53: no such host
In 50% of the cases, the following is also logged:
Pod 'calico-system/calico-typha-685d454c58-pdqkh' triggered a Warning-Event: 'FailedMount'. Warning Message: Unable to attach or mount volumes: unmounted volumes=[typha-ca typha-certs calico-typha-token-424k6], unattached volumes=[typha-ca typha-certs calico-typha-token-424k6]: timed out waiting for the condition
There seems to be no measurable effect on cluster health apart from the warnings - I see no correlating errors in any services.
We have not found a trigger that causes the behavior. It does not seem to be correlated with any change made on our side (deployments, scaling, ...).
There also seems to be no pattern to the frequency: sometimes there is no problem for several days, and then the error pops up 10 times per day.
Another observation is that the calico-kube-controller and several other pods were restarted, while the ReplicaSets and Deployments did not change.
Restart time
Since all the pods of the daemonset are eventually running, the problem seems to resolve itself after some time.
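For reference, the image reference that is actually configured on the daemonset can be inspected with a command like the following (the namespace and object name may differ depending on how Calico was installed):

kubectl get daemonset calico-node -n calico-system -o jsonpath='{..image}'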
Are you behind a firewall, and did you use this link to set it up?
https://learn.microsoft.com/en-us/azure/aks/limit-egress-traffic
If so, add HTTP to the mcr.microsoft.com rule; it looks like MS missed the 's' in a recent update.
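Untested sketch of what that could look like with the Azure CLI when the egress rules are managed through Azure Firewall (all resource and collection names below are placeholders):

az network firewall application-rule create \
  --resource-group my-rg \
  --firewall-name my-firewall \
  --collection-name aks-egress \
  --name allow-mcr \
  --priority 200 \
  --action Allow \
  --source-addresses '*' \
  --protocols Http=80 Https=443 \
  --target-fqdns mcr.microsoft.com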
Paul
I am using Azure Kubernetes Service and I'm having a huge problem detecting why pods (of a specific type) aren't starting. The only thing that happens is that when new pods start, the health check times out and AKS silently rolls back to the previously deployed services that worked. I have added a lot of trace output to the service to detect where it fails, e.g. whether external calls are blocked, and I have a global try/catch in Program.cs, but no information comes out. AKS listens on stdout, grabs the logs there, and pushes them to an external tool. I have tried to increase the values for when the health check should start etc., as below, but with no result:
livenessProbe:
  .
  .
  initialDelaySeconds: 60
  periodSeconds: 10
readinessProbe:
  .
  .
  initialDelaySeconds: 50
  periodSeconds: 15
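For completeness, the probe blocks have this shape; the httpGet path and port below are placeholders, not my real values, and the timing fields are the ones I have been increasing:

livenessProbe:
  httpGet:
    path: /health          # placeholder
    port: 80               # placeholder
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5        # how long a single probe may run before it counts as failed
  failureThreshold: 3      # consecutive failures before the container is restarted
readinessProbe:
  httpGet:
    path: /ready           # placeholder
    port: 80               # placeholder
  initialDelaySeconds: 50
  periodSeconds: 15
  timeoutSeconds: 5
  failureThreshold: 3      # consecutive failures before the pod is taken out of service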
When running the service locally, it is up in 10-15 seconds.
Obviously things seem to fail before the service is started, or something is timing out, and I'm wondering:
Can I fetch logs or monitor what is happening and why pods are so slow to start in AKS?
Is it possible to monitor what comes out on stdout on a virtual machine that belongs to the AKS cluster?
It feels like I have tested everything, but I can't find any reason why the health monitoring is refusing requests.
Thanks!
If you enabled Azure Monitor for containers when you created your cluster, the logs of your application will be pushed to a Log Analytics workspace, in the ContainerLog table. If Azure Monitor is not enabled, you can use kubectl to see what is output to stdout and stderr with the following command:
kubectl logs {pod-name} -n {namespace}
You can also check the Kubernetes events; if the probes really are the problem, you'll see events saying that they failed:
kubectl get events -n {namespace}
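In particular, kubectl describe shows the probe failures together with the pod's recent events, which usually tells you whether the probe timed out or was refused:

kubectl describe pod {pod-name} -n {namespace}
# look at the Events section, e.g. "Warning  Unhealthy  Liveness probe failed: ..."
# and at the container State / Last State for the restart reason

If Azure Monitor is enabled, a Log Analytics query along these lines (assuming the default ContainerLog table and its standard columns) returns the same stdout/stderr output:

ContainerLog
| where TimeGenerated > ago(1h)
| where LogEntry contains "error"
| project TimeGenerated, LogEntrySource, LogEntry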
Kubernetes version: v1.6.7
Network plugin: weave
I recently noticed that my entire cluster of 3 nodes went down. My initial troubleshooting revealed that /var on all nodes was 100% full.
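For reference, a quick generic way to see what is actually using the space on a node (nothing specific to my setup):

sudo du -xh --max-depth=2 /var | sort -h | tail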
Digging further into the logs revealed that they were flooded by kubelet messages stating:
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.636001 1220 kuberuntime_gc.go:138] Failed to stop sandbox "fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "<TROUBLING_POD>-1545236220-ds0v1_kube-system" network: CNI failed to retrieve network namespace path: Error: No such container: fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211
Jan 15 19:09:43 test-master kubelet[1220]: E0115 19:09:43.637690 1220 docker_sandbox.go:205] Failed to stop sandbox "fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648": Error response from daemon: {"message":"No such container: fea94c9f46923806c177e4a158ffe3494fe17638198f30498a024c3e8237f648"}
The <TROUBLING_POD>-1545236220-ds0v1 pod was being created by a cronjob; due to some misconfiguration, errors occurred while those pods were running and more and more pods were being spun up.
So I deleted all the jobs and their related pods. I then had a cluster with no jobs or pods related to my cronjob running, and I still see the same ERROR messages flooding the logs.
I did:
1) Restart docker and kubelet on all nodes.
2) Restart the entire control plane.
3) Reboot all nodes.
But the logs are still being flooded with the same error messages, even though no such pods are being spun up anymore.
So I don't know how I can stop kubelet from throwing these errors.
Is there a way for me to reset the network plugin I am using, or something else I can do?
Check if the pod directory exists under /var/lib/kubelet
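A rough sketch of that check (the sandbox ID below is the one from the kubelet log above; paths can differ per setup):

# directories here are named by pod UID; see whether anything is left over for the deleted pods
sudo ls /var/lib/kubelet/pods/
# check whether the sandbox container from the error still exists on the node
sudo docker ps -a --no-trunc | grep fea8c54ca834a339e8fd476e1cfba44ae47188bbbbb7140e550d055a63487211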
You're on a very old version of Kubernetes; upgrading will fix this issue.
I'm having trouble getting Control Center to work. I set up a 3-node Kafka cluster using the docker image confluentinc/cp-enterprise-kafka. On a separate machine I downloaded Confluent Platform v5.0.1 and configured (or tried to configure) Control Center to monitor the docker cluster.
The Kafka broker I'm using for the Control Center configuration is the one from the downloaded Confluent Platform v5.0.1 (I start the whole stack via bin/confluent start).
But I keep getting the rocket launching page when clicking Monitoring > System health.
My setup:
3 node kafka cluster using docker images.
docker image used = confluentinc/cp-enterprise-kafka
kafka running on these hostnames for the 3-node cluster :
os0 / running on tcp/29092
os1 / running on tcp/39092
os2 / running on tcp/49092
Control-center is running on a separate machine whose hostname = sb1
Furthermore, the brokers have the following directives defined:
metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsReporter
confluent.metrics.reporter.bootstrap.servers=sb1:9092
For Control Center I added the 3-node cluster config:
confluent.controlcenter.kafka.osd.bootstrap.servers=os0:29092,os1:39092,os2:49092
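For context, the Control Center side therefore boils down to these two properties in control-center.properties (a sketch; sb1:9092 is the broker from the local Confluent Platform install, and "osd" is just the alias I gave the docker cluster):

# cluster Control Center itself uses for its internal topics
bootstrap.servers=sb1:9092
# the additional docker cluster to monitor, registered under the alias "osd"
confluent.controlcenter.kafka.osd.bootstrap.servers=os0:29092,os1:39092,os2:49092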
I'm expecting the Kafka brokers to write to the _confluent-metrics topic on the Kafka broker at sb1 (the one used by Control Center).
What I've tried/checked/debugged so far:
dumped the topic _confluent-metrics, and I have messages being written there
I don't know if the logs from Control Center (at /tmp/confluent.QJ2C4BmE/control-center/control-center.stdout) show anything useful (at least nothing that I can interpret)
I can see HTTP/200 for the cluster I'm trying to monitor written in the log.
in the logs from the Kafka brokers I also see the following, which made me think the messages were written to the topic:
[2018-12-15 07:57:59,893] ERROR Failed to submit metrics to Kafka topic __confluent.support.metrics (due to exception): java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for __confluent.support.metrics-0: 30083 ms has passed since batch creation plus linger time (io.confluent.support.metrics.submitters.KafkaSubmitter)
[2018-12-15 07:58:01,088] INFO Successfully submitted metrics to Confluent via secure endpoint (io.confluent.support.metrics.submitters.ConfluentSubmitter)
I have run out of viable ways to debug this; any help would be appreciated.
Thanks in advance.
I was accessing Control Center via an SSH tunnel (this was a testing environment I was using to set up CC (Control Center)).
When I accessed the ip:port of CC directly, everything ran smoothly.
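For context, the tunnel was only forwarding the Control Center UI port, something like:

# forward only the CC web UI (default port 9021) from the machine running CC
ssh -L 9021:localhost:9021 user@sb1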
I'm running Hadoop in Docker Swarm with 1 namenode and 3 datanodes (on 3 physical machines).
I'm also using Kafka and Kafka Connect with the HDFS connector to write messages into HDFS in Parquet format.
I'm able to write data to HDFS using HDFS clients (hdfs put).
But when Kafka is writing messages, it works at the very beginning, then it fails with this error:
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.8:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
[2018-05-23 10:30:10,125] INFO Abandoning BP-468254989-172.17.0.2-1527063205150:blk_1073741825_1001 (org.apache.hadoop.hdfs.DFSClient:1265)
[2018-05-23 10:30:10,148] INFO Excluding datanode DatanodeInfoWithStorage[10.0.0.8:50010,DS-cd1c0b17-bebb-4379-a5e8-5de7ff7a7064,DISK] (org.apache.hadoop.hdfs.DFSClient:1269)
[2018-05-23 10:31:10,203] INFO Exception in createBlockOutputStream (org.apache.hadoop.hdfs.DFSClient:1368)
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
at org.apache.hadoop.hdfs.DFSOutputStream.createSocketForPipeline(DFSOutputStream.java:1533)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1309)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1262)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:448)
And then the datanodes are no longer reachable for the process:
[2018-05-23 10:32:10,316] WARN DataStreamer Exception (org.apache.hadoop.hdfs.DFSClient:557)
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /topics/+tmp/test_hdfs/year=2018/month=05/day=23/hour=08/60e75c4c-9129-454f-aa87-6c3461b54445_tmp.parquet could only be replicated to 0 nodes instead of minReplication (=1). There are 3 datanode(s) running and 3 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1733)
at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2496)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:828)
But if I look into the hadoop web admin console, all the nodes seem to be up and OK.
I've checked hdfs-site.xml, and the "dfs.client.use.datanode.hostname" setting is set to true on both the namenode and the datanodes. All IPs in the Hadoop configuration files are defined as 0.0.0.0 addresses.
I've tried to format the namenode too, but the error happened again.
Could the problem be that Kafka is writing to HDFS too fast and overwhelming it? That would be weird, as I've tried the same configuration on a smaller cluster and it worked well even with a high throughput of Kafka messages.
Do you have any other ideas about the origin of this problem?
Thanks
dfs.client.use.datanode.hostname=true has to be configured on the client side as well, and, following your stack trace:
java.nio.channels.SocketChannel[connection-pending remote=/10.0.0.9:50010]
I guess 10.0.0.9 refers to a private network IP; thus, it seems that the property is not set on your client side within hdfs-client.xml.
You can find more detail here.
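A minimal sketch of that client-side entry (the exact file depends on the setup; for the Kafka Connect HDFS connector it is typically the hdfs-site.xml in the directory pointed to by hadoop.conf.dir):

<configuration>
  <!-- make the HDFS client connect to datanodes by hostname instead of their private IPs -->
  <property>
    <name>dfs.client.use.datanode.hostname</name>
    <value>true</value>
  </property>
</configuration>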