Unable to export traces to OpenTelemetry Collector on Kubernetes - ruby-on-rails

I am using the opentelemetry-ruby otlp exporter for auto instrumentation:
https://github.com/open-telemetry/opentelemetry-ruby/tree/main/exporter/otlp
The otel collector was installed as a daemonset:
https://github.com/open-telemetry/opentelemetry-helm-charts/tree/main/charts/opentelemetry-collector
I am trying to get the OpenTelemetry collector to collect traces from the Rails application. Both are running in the same cluster, but in different namespaces.
We have enabled auto-instrumentation in the app, but the rails logs are currently showing these errors:
E, [2022-04-05T22:37:47.838197 #6] ERROR -- : OpenTelemetry error: Unable to export 499 spans
I set the following env variables within the app:
OTEL_LOG_LEVEL=debug
OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318
I can't confirm that the application can communicate with the collector pods on this port.
Curling this address from the rails/ruby app returns "Connection Refused". However I am able to curl http://<OTEL_POD_IP>:4318 which returns 404 page not found.
From inside a pod:
# curl http://localhost:4318/
curl: (7) Failed to connect to localhost port 4318: Connection refused
# curl http://10.1.0.66:4318/
404 page not found
This helm chart created a daemonset but there is no service running. Is there some setting I need to enable to get this to work?
I confirmed that otel-collector is running on every node in the cluster and the daemonset has HostPort set to 4318.

The problem is with this setting:
OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318
Think of your pod as a stripped-down host of its own: localhost or 0.0.0.0 refers to the pod itself, and you don't have a collector deployed in your pod.
You need to use the address of your collector. I checked the examples in the chart repository you linked, and the agent-and-standalone and standalone-only examples also create a Kubernetes resource of type Service.
With that you can use the full service name (including the namespace) in your environment variable.
Note that there is also a traces-specific variable, OTEL_EXPORTER_OTLP_TRACES_ENDPOINT. Per the OpenTelemetry specification, /v1/traces is appended to OTEL_EXPORTER_OTLP_ENDPOINT, whereas the traces-specific variable is used as-is, so it needs the scheme and the full path. You will need something like this:
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://<service-name>.<namespace>.svc.cluster.local:<service-port>/v1/traces
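As a rough illustration of how the base endpoint and the traces endpoint relate, the sketch below derives the traces URL by appending /v1/traces, as the OTLP/HTTP convention describes. The function name and the sample service address are hypothetical, and this is not the exporter's actual code.

```python
def traces_endpoint(base_endpoint: str) -> str:
    """Derive the OTLP/HTTP traces URL from a base endpoint.

    Assumption: the base endpoint follows the OTLP/HTTP convention, where the
    per-signal path /v1/traces is appended to the base URL.
    """
    if not base_endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint needs an explicit http:// or https:// scheme")
    # Avoid a double slash when the base URL already ends with one.
    return base_endpoint.rstrip("/") + "/v1/traces"


if __name__ == "__main__":
    # Hypothetical service name and namespace, for illustration only.
    print(traces_endpoint("http://otel-collector.monitoring.svc.cluster.local:4318"))
```

A value without a scheme, such as the bare host:port form, is rejected here on purpose, since a missing http:// is an easy mistake to make in this variable.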

The correct solution is to use the Kubernetes Downward API to fetch the node IP address, which will allow you to export the traces directly to the daemonset pod within the same node:
containers:
  - name: my-app
    image: my-image
    env:
      - name: HOST_IP
        valueFrom:
          fieldRef:
            fieldPath: status.hostIP
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://$(HOST_IP):4318
Note that using the deployment's service as the endpoint (<service-name>.<namespace>.svc.cluster.local) is incorrect, as it effectively bypasses the daemonset and sends the traces directly to the deployment, which makes the daemonset useless.
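To confirm from inside the application pod that the node-local collector port is reachable, a small TCP check along these lines may help. It is only a sketch: the function name is made up, and it assumes the HOST_IP variable from the snippet above is set and that the collector listens on 4318.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # HOST_IP is the variable injected through the Downward API above;
    # falling back to 127.0.0.1 is only a convenience for local runs.
    host = os.environ.get("HOST_IP", "127.0.0.1")
    print(f"{host}:4318 reachable: {can_connect(host, 4318)}")
```

A successful connection only shows that something accepts TCP on that port; it does not prove that spans are being accepted.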

Related

How to expose low-numbered ports in the kubernetes mini-cluster that comes with Docker Desktop

I'm using the kubernetes cluster built in to Docker Desktop to develop my application.
I would like to expose services inside the cluster as ports on localhost.
I can do so using kubectl expose deployment foobar --type=NodePort --port=30088, which creates a service like this:
apiVersion: v1
kind: Service
metadata:
  labels:
    role: web
  name: foobar
spec:
  externalTrafficPolicy: Cluster
  ports:
  - nodePort: 30088
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    role: web
  type: NodePort
But it only works for very high numbered ports. If I try something lower I get:
The Service "kafka-external" is invalid: spec.ports[0].nodePort: Invalid value: 9092: provided port is not in the valid range. The range of valid ports is 30000-32767
It seems there is a kubernetes apiserver setting called ServiceNodePortRange which would allow me to override this restriction, but I can't figure out how to set it on Docker's builtin cluster.
So my question is: how do I expose a specific, low-numbered port (like 9092) on Docker's kubernetes cluster? Is there a way to override that setting? Or a better way to expose the service than NodePort?
NodePort is intended to be a building block for load-balancers or other
ingress modes. This means it didn't matter which port you got as long as
you got one. This makes it a little clunky to use directly - you can't
have just any port. You can change the port range, but you run the risk of
conflicts with real things running on your nodes and with any pod HostPorts.
The default range is indeed 30000-32767, but it can be changed with the kube-apiserver flag --service-node-port-range. To do so, edit the file /etc/kubernetes/manifests/kube-apiserver.yaml and add the line --service-node-port-range=xxxxx-yyyyy.
Note that this file lives on the master itself, not inside the kube-apiserver container/pod.
Log in to the Docker VM:
Add the following line to the pod spec:
spec:
  containers:
  - command:
    - kube-apiserver
    ...
    - --service-node-port-range=xxxxx-yyyyy # <-- add this line
    ...
Save and exit. Pod kube-apiserver will be restarted with new parameters.
Exit Docker VM (for screen: Ctrl-a,k , for container: Ctrl-d )
Check the results:
$ kubectl get pod kube-apiserver-docker-desktop -o yaml -n kube-system | less
Take a look: service-pod-range, changing pod range, changing-nodeport-range.
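The range check behind the error in the question can be sketched as follows. This is an illustration only, not the API server's code: the function names are made up, and only the plain low-high form of the range is handled.

```python
def parse_port_range(value: str) -> range:
    """Parse a 'low-high' string such as '30000-32767' into an inclusive range."""
    low_text, high_text = value.split("-", 1)
    low, high = int(low_text), int(high_text)
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {value!r}")
    return range(low, high + 1)


def node_port_allowed(node_port: int, port_range: str = "30000-32767") -> bool:
    """Return True if node_port falls inside the configured range."""
    return node_port in parse_port_range(port_range)


if __name__ == "__main__":
    print(node_port_allowed(30088))               # inside the default range
    print(node_port_allowed(9092))                # outside the default range
    print(node_port_allowed(9092, "9000-32767"))  # inside a widened range
```

Widening the range makes a port like 9092 acceptable, at the cost of the conflict risk described above.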

Docker for Desktop Kubernetes Unable to connect to the server: dial tcp [::1]:6445

I am using Docker for Desktop on Windows 10 Professional with Hyper-V, also I am not using minikube. I have installed Kubernetes cluster via Docker for Desktop, as shown below:
It shows the Kubernetes is successfully installed and running.
When I run the following command:
kubectl config view
I get the following output:
apiVersion: v1
clusters:
- cluster:
    insecure-skip-tls-verify: true
    server: https://localhost:6445
  name: docker-for-desktop-cluster
contexts:
- context:
    cluster: docker-for-desktop-cluster
    user: docker-for-desktop
  name: docker-for-desktop
current-context: docker-for-desktop
kind: Config
preferences: {}
users:
- name: docker-for-desktop
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED
However, when I run
kubectl cluster-info
I am getting the following error:
Unable to connect to the server: dial tcp [::1]:6445: connectex: No connection could be made because the target machine actively refused it.
It seems like there is some network issue, I am not sure how to resolve this.
I know this is an old question but the following helped me to resolve a similar issue. The root cause was that I had minikube installed previously and that was being used as my default context.
I was getting following error:
Unable to connect to the server: dial tcp 192.168.1.8:8443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
In the power-shell run the following command:
> kubectl config get-contexts
CURRENT   NAME                 CLUSTER          AUTHINFO         NAMESPACE
          docker-desktop       docker-desktop   docker-desktop
          docker-for-desktop   docker-desktop   docker-desktop
*         minikube             minikube         minikube
This lists all contexts, so you can see whether there is more than one. If you installed minikube in the past, it will carry the * mark as the currently selected default context. You can switch to the docker-desktop context as follows:
> kubectl config use-context docker-desktop
Run the get-contexts command again to verify the * mark.
Now, the following command should work:
> kubectl get pods
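If you want to check the selected context from a script, the starred row of the get-contexts output can be picked out with a few lines. This is a minimal sketch that assumes the tabular layout shown above; the function name is hypothetical.

```python
from typing import Optional


def current_context(get_contexts_output: str) -> Optional[str]:
    """Return the context name on the row marked with '*', or None if absent."""
    for line in get_contexts_output.splitlines():
        fields = line.split()
        # The selected row starts with '*', followed by the context name.
        if len(fields) > 1 and fields[0] == "*":
            return fields[1]
    return None


if __name__ == "__main__":
    sample = """\
CURRENT   NAME                 CLUSTER          AUTHINFO         NAMESPACE
          docker-desktop       docker-desktop   docker-desktop
          docker-for-desktop   docker-desktop   docker-desktop
*         minikube             minikube         minikube
"""
    print(current_context(sample))
```

In practice kubectl config current-context gives the same answer directly; the parser is mainly useful when you already have the table output in hand.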
Posting a response to this very old question, as I was searching for a solution and later found a different cause for my problem and the solution was simple.
The cause was that the config file was missing from the $HOME/.kube directory.
A simple restart of Docker Desktop restored the file with some defaults, and things were back to normal.
Side note: The issue started after I upgraded my Docker Desktop Installation to latest (when I got the update available popup). I should also mention that the cluster stopped working and I had to manually remove Docker Desktop and Reinstall the latest version (this was the story before the problem occurred).

How to fix "failed to ensure load balancer" error for nginx ingress

When setting a new nginx-ingress using helm and a static ip on Azure the nginx controller never gets the static IP assigned. It always says <pending>.
I install the helm chart as follows -
helm install stable/nginx-ingress --name <my-name> --namespace <my-namespace> --set controller.replicaCount=2 --set controller.service.loadBalancerIP="<static-ip-address>"
It says it installs correctly but there is an error listed as well
E0411 06:44:17.063913 13264 portforward.go:303] error copying from
remote stream to local connection: readfrom tcp4
127.0.0.1:57881->127.0.0.1:57886: write tcp4 127.0.0.1:57881->127.0.0.1:57886: wsasend: An established connection was aborted by the software in your host machine.
I then do a kubectl get all -n <my-namespace> and everything is listed correctly just with the external IP as <pending> for the controller.
I then do a kubectl describe -n <my-namespace> service/<my-name>-nginx-ingress-controller and this error is listed under Events -
Warning CreatingLoadBalancerFailed 11s (x4 over 47s)
service-controller Error creating load balancer (will retry): failed
to ensure load balancer for service
my-namespace/my-name-nginx-ingress-controller: timed out waiting for the
condition.
Thank you kindly
For your issue, a possible reason is that your public IP is not in the same resource group and region as the AKS cluster. See the steps in Create an ingress controller with a static public IP address in Azure Kubernetes Service (AKS).
You can get the AKS group through the CLI command like this:
az aks show --resource-group myResourceGroup --name myAKSCluster --query nodeResourceGroup -o tsv
When your public IP is in a different resource group or region, you get the timeout error you are seeing.
Make sure that your ingress is in the node resource group, and also that the SKU for the ingress is Basic, not Standard.

POST larger than 400 Kilobytes payload to a container in Kubernetes fails

I'm using EKS (Kubernetes) on AWS and have problems posting a payload of around 400 kilobytes to any web server running in a container in that cluster. I seem to hit some kind of limit, but it is not a hard size limit: at around 400 kilobytes the request often works, but sometimes I get (testing with Python requests):
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))
I test this with different containers (python web server on Alpine, Tomcat server on CentOS, nginx, etc).
The more I increase the size over 400 Kilobytes, the more consistent I get: Connection reset by peer.
Any ideas?
Thanks for your answers and comments, helped me get closer to the source of the problem. I did upgrade the AWS cluster from 1.11 to 1.12 and that cleared this error when accessing from service to service within Kubernetes. However, the error still persisted when accessing from outside the Kubernetes cluster using a public dns, thus the load balancer.
So after testing some more I found out that now the problem lies in the ALB or the ALB controller for Kubernetes: https://kubernetes-sigs.github.io/aws-alb-ingress-controller/
So I switched back to a Kubernetes service that generates an older-generation ELB and the problem was fixed. The ELB is not ideal, but it's a good work-around for the moment, until the ALB controller gets fixed or I have the right button to press to fix it.
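To narrow down the payload size at which resets begin, and to compare the in-cluster path with the load balancer path, a probe along these lines may help. It is a sketch using only the standard library rather than requests; the function name and the URL in the usage example are placeholders.

```python
import urllib.error
import urllib.request
from typing import Dict, Iterable


def probe_post_sizes(url: str, sizes_kb: Iterable[int], timeout: float = 10.0) -> Dict[int, str]:
    """POST bodies of the given sizes (in KB) and record the outcome per size.

    The outcome is the HTTP status code as a string, or the exception class
    name when the request fails at the connection level.
    """
    results: Dict[int, str] = {}
    for kb in sizes_kb:
        body = b"x" * (kb * 1024)
        request = urllib.request.Request(
            url,
            data=body,
            method="POST",
            headers={"Content-Type": "application/octet-stream"},
        )
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[kb] = str(response.status)
        except urllib.error.HTTPError as exc:
            # The server answered, but with an error status such as 413.
            results[kb] = str(exc.code)
        except OSError as exc:
            # Connection resets, refusals, and timeouts end up here.
            results[kb] = type(exc).__name__
    return results


if __name__ == "__main__":
    # Placeholder URL: replace with the service or load balancer address.
    print(probe_post_sizes("http://localhost:8080/", [100, 200, 400, 800]))
```

Because the failures in the question are intermittent, running the probe several times per target gives a clearer picture than a single pass.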
As you mentioned in your answer, the issue might be caused by the ALB or the ALB controller for Kubernetes: https://kubernetes-sigs.github.io/aws-alb-ingress-controller/.
Can you check whether the Nginx Ingress controller can be used with the ALB?
Nginx limits the request body size to 1 MB by default. This can be changed with the annotation nginx.ingress.kubernetes.io/proxy-body-size.
Also, are you configuring connection keep-alive or connection timeouts anywhere?
The connection reset by peer, even between services inside the cluster, sounds like it may be the known issue with conntrack. The fix involves running the following:
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
And you can automate this with the following DaemonSet:
# apps/v1 replaces extensions/v1beta1 (removed in Kubernetes 1.16) and requires an explicit selector
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: startup-script
  labels:
    app: startup-script
spec:
  selector:
    matchLabels:
      app: startup-script
  template:
    metadata:
      labels:
        app: startup-script
    spec:
      hostPID: true
      containers:
      - name: startup-script
        image: gcr.io/google-containers/startup-script:v1
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        env:
        - name: STARTUP_SCRIPT
          value: |
            #! /bin/bash
            echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal
            echo done
As this answer suggests, you may try changing your kube-proxy mode of operation. To edit your kube-proxy config:
kubectl -n kube-system edit configmap kube-proxy
Search for mode: "" and try "iptables" , "userspace" or "ipvs". Each time you change your configmap, delete your kube-proxy pod(s) to make sure it is reading the new configmap.
We had a similar issue with Azure, whose firewall prevents sending more than 128 KB in a PATCH request.
After researching and weighing the pros and cons of this approach within the team, we went with a completely different solution.
We put our "bigger" requests into blob storage, then put a message on a queue containing the name of the file just created. The queue consumer receives the message, reads the blob from storage, converts it into whatever object you need, and can then apply any business logic to this big object.
After the message has been processed, the file is deleted.
The biggest advantage is that our API is not blocked by a big request and its long-running job.
Maybe this can be another way to solve your issue within the Kubernetes container.
See ya, Leonhard

Kubernetes certbot standalone not working

I'm trying to generate an SSL certificate with certbot/certbot docker container in kubernetes. I am using Job controller for this purpose which looks as the most suitable option. When I run the standalone option, I get the following error:
Failed authorization procedure. staging.ishankhare.com (http-01):
urn:ietf:params:acme:error:connection :: The server could not connect
to the client to verify the domain :: Fetching
http://staging.ishankhare.com/.well-known/acme-challenge/tpumqbcDWudT7EBsgC7IvtSzZvMAuooQ3PmSPh9yng8:
Timeout during connect (likely firewall problem)
I've made sure that this isn't due to misconfigured DNS entries by running a simple nginx container, and it resolves properly. Following is my Jobs file:
apiVersion: batch/v1
kind: Job
metadata:
  #labels:
  #  app: certbot-generator
  name: certbot
spec:
  template:
    metadata:
      labels:
        app: certbot-generate
    spec:
      volumes:
        - name: certs
      containers:
        - name: certbot
          image: certbot/certbot
          command: ["certbot"]
          #command: ["yes"]
          args: ["certonly", "--noninteractive", "--agree-tos", "--staging", "--standalone", "-d", "staging.ishankhare.com", "-m", "me@ishankhare.com"]
          volumeMounts:
            - name: certs
              mountPath: "/etc/letsencrypt/"
            #- name: certs
              #mountPath: "/opt/"
          ports:
            - containerPort: 80
            - containerPort: 443
      restartPolicy: "OnFailure"
and my service:
apiVersion: v1
kind: Service
metadata:
  name: certbot-lb
  labels:
    app: certbot-lb
spec:
  type: LoadBalancer
  loadBalancerIP: 35.189.170.149
  ports:
    - port: 80
      name: "http"
      protocol: TCP
    - port: 443
      name: "tls"
      protocol: TCP
  selector:
    app: certbot-generator
the full error message is something like this:
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator standalone, Installer None
Obtaining a new certificate
Performing the following challenges:
http-01 challenge for staging.ishankhare.com
Waiting for verification...
Cleaning up challenges
Failed authorization procedure. staging.ishankhare.com (http-01): urn:ietf:params:acme:error:connection :: The server could not connect to the client to verify the domain :: Fetching http://staging.ishankhare.com/.well-known/acme-challenge/tpumqbcDWudT7EBsgC7IvtSzZvMAuooQ3PmSPh9yng8: Timeout during connect (likely firewall problem)
IMPORTANT NOTES:
- The following errors were reported by the server:
Domain: staging.ishankhare.com
Type: connection
Detail: Fetching
http://staging.ishankhare.com/.well-known/acme-challenge/tpumqbcDWudT7EBsgC7IvtSzZvMAuooQ3PmSPh9yng8:
Timeout during connect (likely firewall problem)
To fix these errors, please make sure that your domain name was
entered correctly and the DNS A/AAAA record(s) for that domain
contain(s) the right IP address. Additionally, please check that
your computer has a publicly routable IP address and that no
firewalls are preventing the server from communicating with the
client. If you're using the webroot plugin, you should also verify
that you are serving files from the webroot path you provided.
- Your account credentials have been saved in your Certbot
configuration directory at /etc/letsencrypt. You should make a
secure backup of this folder now. This configuration directory will
also contain certificates and private keys obtained by Certbot so
making regular backups of this folder is ideal.
I've also tried running this as a simple Pod but to no help. Although I still feel running it as a Job to completion is the way to go.
First, be aware your Job definition is valid, but the spec.template.metadata.labels.app: certbot-generate value does not match with your Service definition spec.selector.app: certbot-generator: one is certbot-generate, the second is certbot-generator. So the pod run by the job controller is never added as an endpoint to the service.
Adjust one or the other, but they have to match, and that might just work :)
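The matching rule at play can be sketched in a few lines: a Service with an equality-based selector only picks up pods whose labels contain every key/value pair of the selector. The function name below is made up for illustration; the label values are the ones from the question.

```python
from typing import Mapping


def selector_matches(selector: Mapping[str, str], pod_labels: Mapping[str, str]) -> bool:
    """Return True if every selector key/value pair is present in the pod labels.

    An empty selector is treated as matching nothing, since a Service without
    a selector gets no endpoints created for it automatically.
    """
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


if __name__ == "__main__":
    service_selector = {"app": "certbot-generator"}
    # Labels from the Job's pod template as posted: no match.
    print(selector_matches(service_selector, {"app": "certbot-generate"}))
    # After aligning the label with the selector: match.
    print(selector_matches(service_selector, {"app": "certbot-generator"}))
```

On a live cluster, kubectl get endpoints certbot-lb shows the same thing: an empty endpoints list means no pod currently matches the selector.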
That said, I'm not sure a Service with a selector targeting short-lived pods from a Job controller would work, nor one targeting a simple Pod as you tested. The certbot-randomId pod created by the job (or whatever simple pod you create) takes about 15 seconds in total to run and fail, and the HTTP validation challenge is triggered just a few seconds into the pod's life: it's not clear to me that this leaves enough time for Kubernetes proxying to be working between the service and the pod.
Since you mentioned that DNS resolves properly with a simple nginx container, we can assume the Service itself is working, so you can rule out a timing issue by adding a sleep 10 (or more!) to give the pod more time to be added as an endpoint to the service and proxied appropriately before certbot triggers the HTTP challenge. Just change your Job command and args to:
command: ["/bin/sh"]
args: ["-c", "sleep 10 && certbot certonly --noninteractive --agree-tos --staging --standalone -d staging.ishankhare.com -m me@ishankhare.com"]
And here too, that might just work :)
That being said, I'd warmly recommend you to use cert-manager which you can install easily through its stable Helm chart: the Certificate custom resource that it introduces will store your certificate in a Secret which will make it straightforward to reuse from whatever K8s resource, and it takes care of renewal automatically so you can just forget about it all.
