I am trying to deploy the Microsoft Fluid Framework on an AWS EKS cluster, but the pods go into CrashLoopBackOff

When I get the logs for one of the pods in the CrashLoopBackOff state with
kubectl logs alfred
it returns the following errors:
error: alfred service exiting due to error {"label":"winston","timestamp":"2021-11-08T07:02:02.324Z"}
at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26) {
errno: 'ENOTFOUND',
code: 'ENOTFOUND',
syscall: 'getaddrinfo',
hostname: 'mongodb'
} {"label":"winston","timestamp":"2021-11-08T07:02:02.326Z"}
error: Client Manager Redis Error: getaddrinfo ENOTFOUND redis {"errno":"ENOTFOUND","code":"ENOTFOUND","syscall":"getaddrinfo","hostname":"redis","stack":"Error: getaddrinfo ENOTFOUND redis\n at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:66:26)","label":"winston","timestamp":"2021-11-08T07:02:02.368Z"}
I am new to Kubernetes and AWS EKS and would appreciate any help. Thanks!

If you look at the error, it is failing in getaddrinfo, the function that resolves a DNS name so the application can connect to an external service. Here it is trying to reach a Redis cluster, and it seems your EKS cluster has no connectivity to it.
However, if you are running Redis as part of your EKS cluster, make sure to provide/update the Kubernetes Service DNS name in the application code, or set it as an environment variable that can be set just before deployment.
This applies to both redis and mongodb: as the error says, you are providing the hostnames redis and mongodb, which will not resolve to IP addresses unless something maps them (for example a Kubernetes Service, or an /etc/hosts entry), which is not the case here.
Provide the correct hostnames and the pods will come up. This is the root cause.
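If the hostnames do come from environment variables, the wiring might look like the sketch below. The variable names MONGO_HOST and REDIS_HOST are illustrative, not the Fluid Framework's actual configuration keys, so substitute whatever your deployment actually reads:

```yaml
# Sketch only: hypothetical env wiring for the alfred container
containers:
  - name: alfred
    image: my-alfred-image        # placeholder image name
    env:
      - name: MONGO_HOST          # illustrative variable name
        value: mongodb.default.svc.cluster.local   # Service "mongodb" in namespace "default"
      - name: REDIS_HOST          # illustrative variable name
        value: redis.default.svc.cluster.local
```

Within the same namespace the short names mongodb and redis would also resolve, as long as Services with those names exist.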

The errors above were being generated because mongo and redis were not exposed by a Service. After I created service.yaml files for those instances, the errors went away. AWS EKS schedules containers in pods that can land on different nodes; for other pods to reach mongodb, you must expose it through a Service, which acts as a stable "front end" for the mongodb deployment.
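For reference, a minimal Service manifest for mongodb might look like the sketch below, assuming the Deployment's pods carry the label app: mongodb and listen on the default port 27017; the redis Service is analogous with port 6379:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: mongodb            # pods in the same namespace can now resolve "mongodb"
spec:
  selector:
    app: mongodb           # assumed pod label; match your Deployment's labels
  ports:
    - port: 27017          # port the Service exposes
      targetPort: 27017    # port the mongodb container listens on
```

With this applied, the hostname mongodb resolves via cluster DNS from any pod in the same namespace.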


Kubernetes Service Requests are sending back responses from Pod IP rather than Service IP, CoreDNS not working for istioctl install

Overview of my setup/problem
I'm running a K3s cluster inside a Docker container on a RHEL 7.9 box. This is all on an air-gapped network, so bear with me if you don't see copy-and-pasted examples below.
I'm trying to install Istio on the cluster, but the install hangs on setting up the ingress gateway deployment: it is unable to resolve the Istiod Kubernetes Service from inside the Ingress Gateway pod.
What I've tried
I tested the image on an Ubuntu Vagrant box and the Istio install works fine there. I've also tested the install on a Windows 10 machine using Rancher Desktop, and it works fine there as well. At one point it worked on the RHEL box, but my team did some security hardening in the meantime, and naturally they have no idea which change broke my cluster. So I'm trying to narrow down the search.
I've determined that the issue is with CoreDNS in my K3s cluster. I used the dnsutils Docker image and ran nslookup kubernetes.default. The CoreDNS pod's logs show the lookup, but the response it sends back to nslookup carries the IP of the CoreDNS pod rather than that of the kube-dns Kubernetes Service. nslookup correctly notices this and says:
nslookup kubernetes.default
;; reply from unexpected source: 10.42.0.4#53, expected 10.43.0.2#53
;; reply from unexpected source: 10.42.0.4#53, expected 10.43.0.2#53
;; reply from unexpected source: 10.42.0.4#53, expected 10.43.0.2#53
;; connection timed out; no servers could be reached
Here 10.42.0.4 is the CoreDNS pod and 10.43.0.2 is the kube-dns Kubernetes Service in front of that pod.
The logs from the failing Istio Ingress Gateway pod say that it is failing to retrieve a certificate from the Istiod pod because the connection to the Istiod Kubernetes Service is timing out. That makes sense, considering I can't resolve kubernetes.default correctly either.
2021-05-27T10:28:07.342344Z warn ca ca request failed, starting attempt 1 in 91.589072ms
2021-05-27T10:28:07.434806Z warn ca ca request failed, starting attempt 2 in 203.792343ms
2021-05-27T10:28:07.639557Z warn ca ca request failed, starting attempt 3 in 364.729652ms
2021-05-27T10:28:08.005300Z warn ca ca request failed, starting attempt 4 in 830.723933ms
And then it states that the request to the Istiod service timed out:
transport: Error while dialing dial tcp: lookup istiod.istio-system.svc on 10.96.0.10:53: read udp 10.244.153.113:41187->10.96.0.10:53: i/o timeout
Again, my setup is on an air-gapped network, so ignore the IP addresses in the examples above; they were copied from other posts related to my issue.
Where to go from here?
I'm trying to figure out what could be causing this problem. DNS resolution should be out-of-the-box functionality for K3s, and it's not working correctly. As I stated before, it's not the Docker image I'm running K3s from, since I've gotten K3s and Istio to work on other machines.
Any suggestions on what to do next or advice on how to troubleshoot this would be greatly appreciated. Let me know if there is any other info I can provide to help. Thanks!
TLDR - bridge-nf-call-iptables and bridge-nf-call-ip6tables were disabled. They need to be enabled.
I found this using docker info, which listed a warning about bridge-nf-call-iptables and bridge-nf-call-ip6tables being disabled. There is plenty of discussion on the CoreDNS and K3s GitHub issue trackers about problems caused by iptables, and our suspicions were correct.
This link was the solution for us.
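For anyone who lands here with the same symptoms, the fix boils down to a sysctl config fragment like this sketch (the br_netfilter kernel module must be loaded first, e.g. with modprobe br_netfilter; the file name is arbitrary):

```
# /etc/sysctl.d/99-kubernetes-bridge.conf (sketch)
# Make iptables rules apply to bridged traffic, which CoreDNS/K3s rely on
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
```

Apply it with sysctl --system (or reboot) and re-run docker info to confirm the warnings are gone.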

Unable to export traces to OpenTelemetry Collector on Kubernetes

I am using the opentelemetry-ruby otlp exporter for auto instrumentation:
https://github.com/open-telemetry/opentelemetry-ruby/tree/main/exporter/otlp
The otel collector was installed as a daemonset:
https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector
I am trying to get the OpenTelemetry collector to collect traces from the Rails application. Both are running in the same cluster, but in different namespaces.
We have enabled auto-instrumentation in the app, but the Rails logs are currently showing these errors:
E, [2022-04-05T22:37:47.838197 #6] ERROR -- : OpenTelemetry error: Unable to export 499 spans
I set the following env variables within the app:
OTEL_LOG_LEVEL=debug
OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318
I can't confirm that the application can communicate with the collector pods on this port.
Curling this address from the Rails/Ruby app returns "Connection refused". However, I am able to curl http://<OTEL_POD_IP>:4318, which returns 404 page not found.
From inside a pod:
# curl http://localhost:4318/
curl: (7) Failed to connect to localhost port 4318: Connection refused
# curl http://10.1.0.66:4318/
404 page not found
This helm chart created a daemonset but there is no service running. Is there some setting I need to enable to get this to work?
I confirmed that otel-collector is running on every node in the cluster and the daemonset has HostPort set to 4318.
The problem is with this setting:
OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318
Think of your pod as a stripped-down host of its own: localhost or 0.0.0.0 refers to the pod itself, and you don't have a collector deployed inside your pod.
You need to use the address of your collector. I've checked the examples available in the shared repo, and for agent-and-standalone and standalone-only you also get a k8s resource of type Service.
With that you can use the full service name (with namespace) to configure your environment variable.
Also, the environment variable is now called OTEL_EXPORTER_OTLP_TRACES_ENDPOINT, so you will need something like this:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=<service-name>.<namespace>.svc.cluster.local:<service-port>
The correct solution is to use the Kubernetes Downward API to fetch the node IP address, which will allow you to export the traces directly to the daemonset pod within the same node:
containers:
  - name: my-app
    image: my-image
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://$(HOST_IP):4318
Note that using the deployment's service as the endpoint (<service-name>.<namespace>.svc.cluster.local) is incorrect, as it effectively bypasses the daemonset and sends the traces directly to the deployment, which makes the daemonset useless.

Spring Cloud Data Flow: Error org.springframework.dao.InvalidDataAccessResourceUsageException

I am trying to run/configure Spring Cloud Data Flow (SCDF) to schedule a task for a Spring Batch job.
I am running it in a Minikube that connects to a local PostgreSQL instance (localhost:5432). The Minikube runs in VirtualBox, where I assigned a vnet through --cidr, so Minikube can reach the local Postgres.
Here is the postgresql service yaml:
https://github.com/msuzuki23/SpringCloudDataFlowDemo/blob/main/postgres-service.yaml
Here is the SCDF config yaml:
https://github.com/msuzuki23/SpringCloudDataFlowDemo/blob/main/server-config.yaml
Here is the SCDF deployment yaml:
https://github.com/msuzuki23/SpringCloudDataFlowDemo/blob/main/server-deployment.yaml
Here is the SCDF server-svc.yaml:
https://github.com/msuzuki23/SpringCloudDataFlowDemo/blob/main/server-svc.yaml
To launch the SCDF server in minikube I do the following kubectl commands:
kubectl apply -f secret.yaml
kubectl apply -f configmap.yaml
kubectl apply -f postgres-service.yaml
kubectl create -f server-roles.yaml
kubectl create -f server-rolebinding.yaml
kubectl create -f service-account.yaml
kubectl apply -f server-config.yaml
kubectl apply -f server-svc.yaml
kubectl apply -f server-deployment.yaml
I am not running Prometheus, Grafana, or Kafka/RabbitMQ, as I only want to test that I can launch the Spring Batch job from SCDF. I also did not run the Skipper deployment (a Spring Cloud Data Flow server running locally, pointing to Skipper in Kubernetes); it is not necessary when only running tasks.
This is the error I am getting when trying to add an application from a private Docker repo:
And this is the full error stack from the pod:
https://github.com/msuzuki23/SpringCloudDataFlowDemo/blob/main/SCDF_Log_Error
Highlights from the error stack:
2021-07-08 13:04:13.753 WARN 1 --- [-nio-80-exec-10] o.s.c.d.s.controller.AboutController : Skipper Server is not accessible
org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://localhost:7577/api/about": Connect to localhost:7577 [localhost/127.0.0.1] failed: Connection refused (Connection refused); nested exception is org.apache.http.conn.HttpHostConnectException: Connect to localhost:7577 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
Postges Hibernate Error:
2021-07-08 13:05:22.142 WARN 1 --- [p-nio-80-exec-5] o.h.engine.jdbc.spi.SqlExceptionHelper : SQL Error: 0, SQLState: 42P01
2021-07-08 13:05:22.200 ERROR 1 --- [p-nio-80-exec-5] o.h.engine.jdbc.spi.SqlExceptionHelper : ERROR: relation "hibernate_sequence" does not exist
Position: 17
2021-07-08 13:05:22.214 ERROR 1 --- [p-nio-80-exec-5] o.s.c.d.s.c.RestControllerAdvice : Caught exception while handling a request
org.springframework.dao.InvalidDataAccessResourceUsageException: could not extract ResultSet; SQL [n/a]; nested exception is org.hibernate.exception.SQLGrammarException: could
The first couple of errors are from SCDF trying to connect to Skipper; since Skipper was not configured, those were expected.
The second error is from Postgres JDBC/Hibernate. How do I solve that?
Is there a configuration I am missing when setting the SCDF to point into the local postgres?
Also, in the jar inside my Docker image I have not added any annotation such as @EnableTask.
Any help is appreciated, thx! Markus.
I did a search on
Caused by: org.postgresql.util.PSQLException: ERROR: relation
"hibernate_sequence" does not exist Position: 17
And found this Stack Overflow answer:
Postgres error in batch insert : relation "hibernate_sequence" does not exist position 17
Went to Postgres and created the missing hibernate_sequence:
CREATE SEQUENCE hibernate_sequence START 1;
Then, add application worked.

How to fix "failed to ensure load balancer" error for nginx ingress

When setting up a new nginx-ingress using Helm and a static IP on Azure, the nginx controller never gets the static IP assigned; it always says <pending>.
I install the Helm chart as follows:
helm install stable/nginx-ingress --name <my-name> --namespace <my-namespace> --set controller.replicaCount=2 --set controller.service.loadBalancerIP="<static-ip-address>"
It says it installs correctly, but there is an error listed as well:
E0411 06:44:17.063913 13264 portforward.go:303] error copying from
remote stream to local connection: readfrom tcp4
127.0.0.1:57881->127.0.0.1:57886: write tcp4 127.0.0.1:57881->127.0.0.1:57886: wsasend: An established connection was aborted by the software in your host machine.
I then do a kubectl get all -n <my-namespace> and everything is listed correctly just with the external IP as <pending> for the controller.
I then do a kubectl describe -n <my-namespace> service/<my-name>-nginx-ingress-controller and this error is listed under Events -
Warning CreatingLoadBalancerFailed 11s (x4 over 47s)
service-controller Error creating load balancer (will retry): failed
to ensure load balancer for service
my-namespace/my-name-nginx-ingress-controller: timed out waiting for the
condition.
Thank you kindly
For your issue, the likely reason is that your public IP is not in the same resource group and region as the AKS cluster. See the steps in Create an ingress controller with a static public IP address in Azure Kubernetes Service (AKS).
You can get the AKS node resource group through a CLI command like this:
az aks show --resource-group myResourceGroup --name myAKSCluster --query nodeResourceGroup -o tsv
When your public IP is in a different group and region, it will give you the timeout error you saw.
Make sure that your public IP is in the node resource group, and also that its SKU is Basic, not Standard.
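If the IP must stay in a different resource group, the Azure cloud provider can also be told where to find it through the documented service.beta.kubernetes.io/azure-load-balancer-resource-group annotation. A sketch of the extra chart values (the layout assumes the stable/nginx-ingress chart; myResourceGroup is a placeholder):

```yaml
# values.yaml fragment for stable/nginx-ingress (sketch)
controller:
  replicaCount: 2
  service:
    loadBalancerIP: "<static-ip-address>"
    annotations:
      # resource group that actually holds the static public IP
      service.beta.kubernetes.io/azure-load-balancer-resource-group: myResourceGroup
```

Note the cluster's service principal or managed identity still needs read access to that resource group.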

Pod creation in ContainerCreating state always

I am trying to create a pod using kubernetes with the following simple command
kubectl run example --image=nginx
It runs and assigns the pod to the minion correctly, but the status always stays in ContainerCreating due to the following error. I have not set up GCR or gcloud on my machine, so I am not sure why it is pulling from there.
1h 29m 14s {kubelet centos-minion1} Warning FailedSync Error syncing pod, skipping:
failed to "StartContainer" for "POD" with ErrImagePull: "image pull failed
for gcr.io/google_containers/pause:2.0, this may be because there are no
credentials on this request. details: (unable to ping registry endpoint
https://gcr.io/v0/\nv2 ping attempt failed with error: Get https://gcr.io/v2/:
http: error connecting to proxy http://87.254.212.120:8080: dial tcp
87.254.212.120:8080: i/o timeout\n v1 ping attempt failed with error:
Get https://gcr.io/v1/_ping: http: error connecting to proxy
http://87.254.212.120:8080: dial tcp 87.254.212.120:8080: i/o timeout)
Kubernetes is trying to create a pause container for your pod; this container is used to create the pod's network namespace. See this question and its answers for more general information on the pause container.
To your specific error: Kubernetes tries to pull the pause container's image (which would be gcr.io/google_containers/pause:2.0, according to your error message) from the Google Container Registry (gcr.io). Apparently, your Docker engine tries to connect to GCR using an HTTP proxy located at 87.254.212.120:8080, to which it cannot connect (i/o timeout).
To correct this error, either make sure that your HTTP proxy server is online and does not block HTTP requests to GCR, or (if you do have public Internet access) disable the proxy connection for your Docker engine. This is typically done using the http_proxy and https_proxy environment variables, which would have been set in /etc/sysconfig/docker or /etc/default/docker, depending on your Linux distribution.
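As an illustration, on a distribution that uses /etc/sysconfig/docker the relevant fragment might look like this sketch; the proxy address is the one from your error message, and the exact variable casing and file location depend on your distribution and Docker version:

```
# /etc/sysconfig/docker (sketch) -- restart the Docker daemon after editing
# Option A: drop the proxy entirely (if you have direct Internet access)
# http_proxy=
# https_proxy=
# Option B: keep the proxy but exempt the registries you need
http_proxy=http://87.254.212.120:8080
https_proxy=http://87.254.212.120:8080
no_proxy=gcr.io,localhost,127.0.0.1
```

Option B only helps if the Docker host can reach gcr.io directly without the proxy.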
