cert-manager does not issue certificate after upgrading to AKS k8s 1.24.6

I have an automated setup with scripts and Helm to create a Kubernetes cluster on MS Azure and to deploy my application to the cluster.
First of all: everything works fine when I create a cluster with Kubernetes 1.23.12; after a few minutes everything is installed, I can access my website, and there is a certificate issued by Let's Encrypt.
But when I delete this cluster completely, reinstall it, and change only the Kubernetes version from 1.23.12 to 1.24.6, I don't get a certificate any more.
I can see that the ACME challenge is not working. I get the following error:
Waiting for HTTP-01 challenge propagation: failed to perform self check GET request 'http://my.hostname.de/.well-known/acme-challenge/2Y25fxsoeQTIqprKNR4iI4X81jPoLknmRNvj9uhcOLk': Get "http://my.hostname.de/.well-known/acme-challenge/2Y25fxsoeQTIqprKNR4iI4X81jPoLknmRNvj9uhcOLk": dial tcp: lookup my.hostname.de on 10.0.0.10:53: no such host
After some time the error message changes to:
'Error accepting authorization: acme: authorization error for my.hostname.de:
400 urn:ietf:params:acme:error:connection: 20.79.77.156: Fetching http://my.hostname.de/.well-known/acme-challenge/2Y25fxsoeQTIqprKNR4iI4X81jPoLknmRNvj9uhcOLk:
Timeout during connect (likely firewall problem)'
10.0.0.10 is the cluster IP of kube-dns in my Kubernetes cluster. When I look at "Services and Ingresses" in the Azure portal, I can see ports 53/UDP;53/TCP for the cluster IP 10.0.0.10.
I can also see there that 20.79.77.156 is the external IP of the ingress-nginx controller (ports: 80:32284/TCP;443:32380/TCP).
So I do not understand why the ACME challenge cannot be completed successfully.
Here is some information about the version numbers:
Azure Kubernetes 1.24.6
helm 3.11
cert-manager 1.11.0
ingress-nginx helm-chart: 4.4.2 -> controller-v1.5.1
I have tried to find the same error on the internet, but it does not come up often and the solutions do not seem to fit my problem.
Of course I have read a lot about k8s 1.24.
It is not a dockershim problem, because I have tested the cluster with the Detector for Docker Socket (DDS) tool.
I have updated cert-manager and ingress-nginx to new versions (see above)
I have also tried it with Kubernetes 1.25.4 -> same error
I have found this on the cert-manager Website: "cert-manager expects that ServerSideApply is enabled in the cluster for all versions of Kubernetes from 1.24 and above."
I think I understand the difference between Server-Side Apply and Client-Side Apply, but I don't know whether and how I can enable it in my cluster, or whether this could be the solution to my problem.
Any help is appreciated. Thanks in advance!

I've solved this myself recently; try this for your ingress controller:
ingress-nginx:
  rbac:
    create: true
  controller:
    service:
      annotations:
        service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path: /healthz
Kubernetes 1.24+ on AKS uses a different endpoint for the load balancer health probes, so the probe path has to be set explicitly.
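If ingress-nginx is installed as its own Helm release rather than as a subchart, the same annotation can be set with a values override. A minimal sketch, assuming the release and namespace are both called ingress-nginx:
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz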

Related

Error on etcd health check while setting up RKE cluster

I'm trying to set up an RKE cluster. The connection to the nodes goes well, but when it starts to check etcd health it returns:
failed to check etcd health: failed to get /health for host [xx.xxx.x.xxx]: Get "https://xx.xxx.x.xxx:2379/health": remote error: tls: bad certificate
If you are trying to upgrade RKE and facing this issue, it could be due to the kube_config_<file>.yml file missing from the local directory when you perform rke up.
A similar issue was reported and reproduced in this git link. Can you try the workaround by following the steps provided in the link and let me know if this works?
Refer to this latest SO post and the docs for more information.
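A minimal sketch of the workaround, assuming the kubeconfig and state files generated by the original rke up are still available in a backup directory (the paths are assumptions; the file names are RKE's defaults):
cp /backup/kube_config_cluster.yml .     # kubeconfig generated by the original rke up
cp /backup/cluster.rkestate .            # cluster state file generated next to cluster.yml
rke up --config cluster.yml              # re-run the provisioning/upgrade with the state in place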

AKS cluster network problems

I have an AKS cluster. I need to connect to the customer's SFTP server from an AKS node. It worked stably but stopped working about a month ago: I started getting a connection error and the connection times out. I tried connecting locally and connecting from another AKS cluster; the SFTP connection works fine there. I also created a test SFTP server and was able to connect to it without problems from the problematic cluster. I am using Calico. Could you tell me where to look to understand where the connection to the customer's SFTP server is being blocked? Thanks.
By default, Calico permits all traffic. However, once a network policy selects a pod, all traffic for that pod is blocked except what the policies explicitly allow. Please check your network policies; the steps are mentioned below.
Connect to the AKS cluster.
Verify whether any network policy exists that conflicts with the SFTP traffic:
kubectl get networkpolicy -A
Delete the conflicting policy using the command below:
kubectl delete networkpolicy <policy-name> -n <namespace>
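Alternatively, instead of deleting a policy outright, you can add an egress rule that explicitly allows the affected pods to reach the SFTP server. A minimal sketch, where the name, namespace, pod labels and server IP are placeholders to adapt:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-sftp-egress          # hypothetical policy name
  namespace: my-app                # namespace of the pods that need SFTP access
spec:
  podSelector:
    matchLabels:
      app: my-app                  # label of the affected pods
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 203.0.113.10/32  # placeholder for the customer's SFTP server IP
      ports:
        - protocol: TCP
          port: 22                 # SFTP runs over SSH
Keep in mind that once an Egress policy selects a pod, all other egress from that pod (including DNS) is denied unless it is also explicitly allowed.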

Multi-master OKD 3.11 setup fails if the master-1 node is down

I am trying to install a multi-master OpenShift 3.11 setup on OpenStack VMs as per the inventory file in the official documentation.
https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha
OKD Version
[centos#master1 ~]$ oc version
oc v3.11.0+62803d0-1
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://master1.167.254.204.74.nip.io:8443
openshift v3.11.0+ff2bdbd-531
kubernetes v1.11.0+d4cacc0
Steps To Reproduce
Bring up an okd-3.11 multi master setup as per the inventory file mentioned in here,
https://docs.openshift.com/container-platform/3.11/install/example_inventories.html#multi-masters-single-etcd-using-native-ha
Current Result
The setup is successful, but I am stuck with two issues as mentioned below:
The load balancer node is not listed in the output of the oc get nodes command:
[centos#master1 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
master1.167.254.204.74.nip.io Ready infra,master 6h v1.11.0+d4cacc0
master2.167.254.204.58.nip.io Ready infra,master 6h v1.11.0+d4cacc0
master3.167.254.204.59.nip.io Ready infra,master 6h v1.11.0+d4cacc0
node1.167.254.204.82.nip.io Ready compute 6h v1.11.0+d4cacc0
The master nodes and the load balancer are totally dependent on the master-1 node: if master-1 is down, the rest of the master nodes and the load balancer are unable to run any oc commands.
[centos#master2 ~]$ oc get nodes
Unable to connect to the server: dial tcp 167.254.204.74:8443: connect: no route to host
The OKD setup works fine if the other master nodes (other than master-1) or the load balancer are down.
Expected Result
The OKD setup should stay up and running even if any one of the master nodes goes down.
Inventory file:
[OSEv3:children]
masters
nodes
etcd
lb
[masters]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io
[etcd]
master1.167.254.204.74.nip.io
master2.167.254.204.58.nip.io
master3.167.254.204.59.nip.io
[lb]
lb.167.254.204.111.nip.io
[nodes]
master1.167.254.204.74.nip.io openshift_ip=167.254.204.74 openshift_schedulable=true openshift_node_group_name='node-config-master'
master2.167.254.204.58.nip.io openshift_ip=167.254.204.58 openshift_schedulable=true openshift_node_group_name='node-config-master'
master3.167.254.204.59.nip.io openshift_ip=167.254.204.59 openshift_schedulable=true openshift_node_group_name='node-config-master'
node1.167.254.204.82.nip.io openshift_ip=167.254.204.82 openshift_schedulable=true openshift_node_group_name='node-config-compute'
[OSEv3:vars]
debug_level=4
ansible_ssh_user=centos
ansible_become=true
ansible_ssh_common_args='-o StrictHostKeyChecking=no'
openshift_enable_service_catalog=true
ansible_service_broker_install=true
openshift_node_groups=[{'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}]
containerized=false
os_sdn_network_plugin_name='redhat/openshift-ovs-multitenant'
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability
deployment_type=origin
openshift_deployment_type=origin
openshift_release=v3.11.0
openshift_pkg_version=-3.11.0
openshift_image_tag=v3.11.0
openshift_service_catalog_image_version=v3.11.0
template_service_broker_image_version=v3.11
osm_use_cockpit=true
# put the router on dedicated infra1 node
openshift_master_cluster_method=native
openshift_master_default_subdomain=sub.master1.167.254.204.74.nip.io
openshift_public_hostname=master1.167.254.204.74.nip.io
openshift_master_cluster_hostname=master1.167.254.204.74.nip.io
Please let me know why the entire setup depends on master node 1, and also any workaround to fix this.
You should set openshift_master_cluster_hostname and openshift_master_cluster_public_hostname to the load balancer hostname, not a master hostname.
With your current configuration everything points to master1, so master1 is the single API entry point; if master1 stops, the whole API service goes down.
Beforehand, you should configure your load balancer to balance across your master nodes and register the LB IP (a.k.a. the VIP) in DNS, e.g. as ocp-cluster.example.com.
This hostname will be the entry point for the OCP API; you can set it using both openshift_master_cluster_hostname and openshift_master_cluster_public_hostname.
openshift_master_cluster_method=native
openshift_master_cluster_hostname=ocp-cluster.example.com
openshift_master_cluster_public_hostname=ocp-cluster.example.com
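Once the load balancer is in place and DNS points at it, a quick sanity check from any node (the hostname is the example value above; the API server answers on /healthz):
curl -k https://ocp-cluster.example.com:8443/healthz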

Run Ambassador in local dev environment without Kubernetes

I am trying to run the Ambassador API gateway in my local dev environment so I can simulate what I'll end up with in production - the difference being that in production my solution will be running in Kubernetes. To do so, I'm installing Ambassador into Docker Desktop and adding the required configuration to route requests to my microservices. Unfortunately, it did not work for me and I'm getting the error below:
upstream connect error or disconnect/reset before headers. reset reason: connection failure
I assume that's due to an issue in the mapping file, which is as follows:
apiVersion: ambassador/v2
kind: Mapping
name: institutions_mapping
prefix: /ins/
service: localhost:44332
So what I'm basically trying to do is rewrite all requests coming to http://{ambassador_url}/ins to a service running locally in IIS Express (through Visual Studio) on port 44332.
What am I missing?
I think you may be better off using another one of Ambassador Labs' tools, Telepresence.
https://www.telepresence.io/
With Telepresence you can take the service you have running on localhost and project it into your cluster to see how it performs. This way you don't need to spin up a local cluster, and you get real-time feedback on how your service operates with the other services in the cluster.
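A minimal sketch of the workflow, assuming Telepresence 2.x is installed, kubectl points at the remote cluster, and the service is deployed there under a name like institutions (the service name and port mapping are assumptions):
telepresence connect                                     # connect the local machine to the cluster network
telepresence list                                        # list services that can be intercepted
telepresence intercept institutions --port 44332:http    # route the service's cluster traffic to localhost:44332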

Cannot run Kubernetes dashboard on master node

I installed a Kubernetes cluster (one master and two nodes), and the status of the nodes is Ready on the master. When I deploy the dashboard and access it via the link http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/, I get the error
'dial tcp 10.32.0.2:8443: connect: connection refused' Trying to
reach: 'https://10.32.0.2:8443/'
The dashboard pod is in the Ready state, and I tried to ping 10.32.0.2 (the dashboard's IP) without success.
I am running the dashboard as the Web UI (Dashboard) guide suggests.
How can I fix this?
There are a few options here:
Most of the time, if there is some kind of connection refused, timeout or similar error, it is most likely a configuration problem. If you can't get the Dashboard running, try to deploy another application and access it; if that fails too, it is not a Dashboard issue.
Check whether you are using root/sudo.
Have you properly installed flannel or another network plugin for the containers?
Have you checked your API server logs? If not, please do so.
Check the description of the dashboard pod (kubectl describe) for anything suspicious; see the example commands after this list.
Analogously, check the description of the service.
What is your cluster version? Check if any updates are required.
Please let me know if any of the above helped.
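As referenced above, a few example diagnostic commands, assuming the Dashboard was deployed with the recommended manifest into the kubernetes-dashboard namespace:
kubectl get pods -n kubernetes-dashboard -o wide
kubectl describe pod -n kubernetes-dashboard -l k8s-app=kubernetes-dashboard
kubectl describe service kubernetes-dashboard -n kubernetes-dashboard
kubectl get events -n kubernetes-dashboard --sort-by=.metadata.creationTimestamp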
Start the proxy, if it's not already running:
kubectl proxy --address='0.0.0.0' --port=8001 --accept-hosts='.*'
