Why am I getting a "Structure needs cleaning" message on Ceph with Kubernetes?

Sorry to ask this; I am relatively new to Kubernetes and Ceph and only have a basic understanding of both.
I set up Kubernetes and Ceph using this tutorial (http://tutorial.kubernetes.noverit.com/content/ceph.html).
My cluster looks like this:
1 Kube-Master and 2 worker nodes (each worker acts as a Ceph monitor and runs 2 OSDs)
The ceph-deploy tool I used to set up the Ceph cluster runs on the Kube-master.
Everything was working fine. I installed my sample web application (a Deployment) with 5 replicas; it creates a file whenever its REST API is hit, and the file gets copied to every node.
But about 10 minutes later I created one more file through the API, and when I list the files (ls -l) I get the following error:
For node1:
ls: cannot access 'previousFile.txt': Structure needs cleaning
previousFile.txt newFile.txt
For node2:
previousFile.txt
On node2 the new file was not created at all.
What might be the issue? I have tried many times and the same error keeps popping up.
Any help is appreciated.

This totally looks like your filesystem got corrupted. Things to check:
$ kubectl logs <ceph-pod1>
$ kubectl logs <ceph-pod2>
$ kubectl describe deployment <ceph-deployment> # did any of the pods restart?
The error itself ("Structure needs cleaning", errno EUCLEAN) is reported by the local filesystem and indicates on-disk corruption.
Depending on what you have, you might need to start from scratch. Or you can take a look at recovering data in Ceph, but that may not work if you don't have a snapshot.
Running Ceph on Kubernetes can be very tricky, because a restart that lands a Ceph daemon on a different Kubernetes node can corrupt the data. You need to make sure that scheduling is pinned down, for example by using node affinity or by running the Ceph pods on specific, labelled Kubernetes nodes.
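As a rough sketch of that pinning (the node and workload names below are placeholders, and I'm assuming the OSDs run as a Deployment; adjust for whatever controller your tutorial actually creates), you could label each worker and add a matching nodeSelector:
$ kubectl label nodes <worker-node-1> ceph-role=osd-node-1
$ kubectl patch deployment <ceph-osd-deployment> \
    -p '{"spec":{"template":{"spec":{"nodeSelector":{"ceph-role":"osd-node-1"}}}}}'   # pin the pod to the labelled node
A nodeAffinity rule in the pod spec achieves the same thing with more flexibility.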

Related

What is meant by "service level" in Docker?

While going through the documentation for getting started with Kubernetes on Docker Desktop, I came across the term "service level". Can anyone help me understand what a service level is?
PS: I am a beginner in Docker and Kubernetes.
Thanks in advance :)
It is not entirely clear what "service level" refers to in this case.
It says in your link:
Kubernetes makes sure containers are running to keep your app at the service level you requested in the YAML file
And a little further down:
And now remove that container:
Check back on the result app in your browser at http://localhost:5001 and you’ll see it’s still working. Kubernetes saw that the container had been removed and started a replacement straight away.
Judging from the context, it refers to the fact that the kube-controller-manager in the Kubernetes control plane continuously watches the state of the cluster and compares it to the desired state. When it discovers a difference (for example, a pod was removed), it fixes it by adding a new pod to match the number of replicas defined in the Deployment.
For example, if the Deployment was configured to run N replicas and one is removed, N-1 replicas remain. The kube-controller-manager starts a new pod to get back to the desired state of N replicas.
In this case the service level would refer to the number of replicas running, but as mentioned, the term is ambiguous...
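You can watch this self-healing behaviour yourself. A quick, hypothetical demo (the deployment name hello is made up):
$ kubectl create deployment hello --image=nginx --replicas=5   # desired state: 5 replicas
$ kubectl get pods -l app=hello                                # 5 pods are running
$ kubectl delete pod <one-of-the-hello-pods>                   # remove one pod by hand
$ kubectl get pods -l app=hello                                # a replacement is already being created; the count returns to 5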
There are also Services in Kubernetes, which you can use to expose applications (containers) running in pods; a minimal example follows the links below.
You may read through this blog to learn more:
https://medium.com/@naweed.rizvi/kubernetes-setup-local-cluster-with-docker-desktop-7ead3b17bf68
You can also watch this tutorial:
https://www.youtube.com/watch?v=CX8AnwTW2Zs&t=272s
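As a minimal example of a Service (the deployment name and ports are placeholders), you can expose a deployment like this:
$ kubectl expose deployment <your-deployment> --port=80 --target-port=5000 --name=my-service
$ kubectl get service my-service     # a stable ClusterIP in front of the deployment's pods
$ kubectl get endpoints my-service   # the pod IPs currently backing the service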

Airflow with mysql_to_gcs Negsignal.SIGKILL

I'm using Airflow with Composer (GCP) to extract data from Cloud SQL to GCS, and then from GCS to BigQuery. I have some tables between 100 MB and 10 GB. My DAG has two tasks to do what I mentioned before. With the smaller tables the DAG runs smoothly, but with slightly larger tables the Cloud SQL extraction task fails after a few seconds without any logs except "Negsignal.SIGKILL". I have already tried to increase the Composer capacity, among other things, but nothing has worked yet.
I'm using the mysql_to_gcs and gcs_to_bigquery operators
The first thing you should check when you get Negsignal.SIGKILL is your Kubernetes resources. This is almost certainly a problem with resource limits.
I think you should monitor your Kubernetes Cluster Nodes. Inside GCP, go to Kubernetes Engine > Clusters. You should have a cluster containing the environment that Cloud Composer uses.
Now, head to the nodes of your cluster. Each node provides you metrics about CPU, memory & disk usage. You will also see the limit for the resources that each node uses. Also, you will see the pods that each node has.
If you are not very familiar with K8s, let me explain this briefly. Airflow uses Pods inside nodes to run your Airflow tasks. These pods are called airflow-worker-[id]. That way you can identify your worker pods inside the Node.
Check your pod list. If you have evicted airflow-worker pods, then Kubernetes is stopping your workers for some reason. Since Composer uses the CeleryExecutor, an evicted airflow-worker points to a problem. This would not be the case with the KubernetesExecutor, but that one is not available in Composer yet.
If you click on an evicted pod, you will see the reason for the eviction. That should give you the answer.
If you don't see a problem with pod evictions, don't panic, you still have some options. From that point on, your best friend will be logs. Be sure to check your pod logs, node logs and cluster logs, in that order.
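If you prefer the command line over the console, a rough sketch of the same checks (namespace and pod names are placeholders; the Composer namespace depends on your environment):
$ kubectl get pods --all-namespaces | grep airflow-worker                          # locate the worker pods
$ kubectl get pods -n <composer-namespace> --field-selector=status.phase=Failed    # evicted pods show up as Failed
$ kubectl describe pod <evicted-airflow-worker> -n <composer-namespace>            # the Events section shows the eviction reason
$ kubectl logs <airflow-worker-pod> -n <composer-namespace>                        # worker logs for the failing task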

Cassandra on Google Cloud

I am new to GCP and I want to deploy Cassandra nodes on Google Cloud. What are the advantages of using Cassandra containers over deploying Cassandra directly on these nodes?
We tried these scenarios:
Running Cassandra in Kubernetes
Running Cassandra in Docker on VM instances
Running Cassandra on VMs without Docker
Short version:
We decided to run on VMs (with Docker).
Long version:
Building a working Kubernetes setup takes some time. You need to find out how to set IP addresses correctly, how to pick the right disk types, and how to access the machines.
When it comes to installing sidecars like Cassandra Reaper, we found the configuration to be easier on a dedicated VM.
Same story with disaster recovery. We back up the attached disks daily and keep the backups for a certain period. There were cases where we needed to reattach a disk from a backup alongside a running version. That again was easier than in a Kubernetes environment. Remember: when we are talking about disaster recovery, most likely you are under stress because things just got f... up ;)
In the end both solutions work, but a dedicated VM per node is easier to manage.
So:
Docker: yes (or better, docker-compose), because you don't have to worry about the VM setup; see the minimal sketch below.
Kubernetes: rather no (but this is a question of personal taste)
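A minimal sketch of the docker-compose route, assuming one Cassandra node per VM (the cluster name, seed IP, broadcast address and data path are placeholders; cross-VM gossip additionally needs port 7000 reachable between the VMs):
$ cat > docker-compose.yml <<'EOF'
version: "3.8"
services:
  cassandra:
    image: cassandra:4.1
    restart: unless-stopped
    ports:
      - "9042:9042"                          # CQL clients
      - "7000:7000"                          # inter-node gossip
    environment:
      CASSANDRA_CLUSTER_NAME: my-cluster     # placeholder cluster name
      CASSANDRA_SEEDS: 10.0.0.2              # placeholder: IP of a seed VM
      CASSANDRA_BROADCAST_ADDRESS: 10.0.0.3  # placeholder: this VM's address
    volumes:
      - /var/lib/cassandra-data:/var/lib/cassandra   # keep data on the VM disk
EOF
$ docker-compose up -d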

Not able to connect to a container (created via REST API) in Kubernetes

I am creating a Docker container (using docker run) in a Kubernetes environment by invoking a REST API.
I have mounted the host machine's docker.sock, and I am building an image and running that image from the REST API.
Now I need to connect to this container from another container, which was actually started by kubectl from a deployment.yml file.
But when I use kubectl describe pod <pod name>, my container created via the REST API is not there. So where is this container running, and how can I connect to it from another container?
Are you running the container in the same namespace as the one used by your deployment.yml? One option to check that would be to run:
kubectl get pods --all-namespaces
If you are not able to find the container there, then I would suggest the following steps:
docker ps -a   # verify the container's status directly on the Docker host
Ensure that there are no permission errors when mounting docker.sock
If there are permission errors, escalate privileges to the appropriate level
To answer the second question, a connection between two containers should be possible by referencing the cluster DNS name in the following format:
"<servicename>.<namespacename>.svc.cluster.local"
I would also ask you to share the detailed steps, code, and errors (if there are any) so I can better answer the question.
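For reference, a quick way to test connectivity over that cluster DNS name from inside another pod (the names and port are placeholders, and this assumes curl exists in the client image):
$ kubectl get svc -n <namespacename>     # confirm the service exists and note its port
$ kubectl exec -it <client-pod> -- curl http://<servicename>.<namespacename>.svc.cluster.local:<port>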
You probably shouldn't be directly accessing the Docker API from anywhere in Kubernetes. Kubernetes will be totally unaware of anything you manually docker run (or equivalent), and as you note, normal administrative calls like kubectl get pods won't see it; the CPU and memory used by the container won't be accounted for by the node, which could cause a node to become over-utilized. The Kubernetes network environment is also pretty complicated, and unless you know the details of your specific CNI provider, it'll be hard to make your container accessible at all, much less from a pod running on a different node.
A process running in a pod can access the Kubernetes API directly, though, and the documentation on accessing the API from within a pod notes that all of the official client libraries are aware of the conventions this uses. This means that you should be able to directly create a Job that launches your target pod, and a Service that connects to it, and get the normal Kubernetes features around this. (For example, servicename.namespacename.svc.cluster.local is a valid DNS name that reaches any Pod connected to the Service.)
You should also consider whether you actually need this sort of interface. For many applications, it will work just as well to deploy some sort of message-queue system (e.g., RabbitMQ) and then launch a pool of workers that connects to it. You can control the size of the worker queue using a Deployment. This is easier to develop since it avoids a hard dependency on Kubernetes, and easier to manage since it prevents a flood of dynamic jobs from overwhelming your cluster.
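As a hedged sketch of the Job approach (the Job name and image are placeholders), the request you currently send to the Docker API would instead become a Job object submitted to the Kubernetes API:
$ kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: my-dynamic-job                               # hypothetical name
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: registry.example.com/my-image:latest  # the image you were running via docker run
EOF
$ kubectl get jobs                      # Kubernetes now tracks this workload
$ kubectl logs job/my-dynamic-job       # and you get logs through the normal tooling
The same Job can be created programmatically from inside a pod with any of the official client libraries.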

Wrong IP when setting up Redis cluster on Kubernetes: "Waiting for the cluster to join..."

I am trying to build a Redis cluster with Kubernetes on Azure, and I run into exactly the same problem with different samples: sanderp.nl/running-redis-cluster-on-kubernetes or github.com/zuxqoj/kubernetes-redis-cluster
Everything goes well until I try to have the different nodes join the cluster with the redis-trib command.
At that time I face the infamous infinite "Waiting for the cluster to join ...." message.
Trying to see what is happening, I set the log level of the Redis pods to debug. I then noticed that the pods do not seem to announce their correct IP when communicating with each other.
In fact, it seems that the last byte of the IP is replaced by a zero. Say pod1 has IP address 10.1.34.9; I will then see in pod2's logs:
Accepted clusternode 10.1.34.0:someport
So the pods do not seem to be able to communicate back, and the cluster join process never ends.
Now, if before running redis-trib I enforce the cluster-announce-ip by running the following on each pod:
redis-cli -h mypod-ip config set cluster-announce-ip mypod-ip
the redis-trib command then completes successfully and the cluster is up and running.
But this is not a viable solution: if a pod goes down and comes back, it may have a different IP and will face the same problem when it tries to rejoin the cluster.
Note that I do not encounter any problem when running the samples with minikube.
I am using flannel for Kubernetes networking. Could the problem come from an incorrect flannel configuration? Has anyone encountered the same issue?
You can use StatefulSets for deploying your replicas, so each pod will always keep a unique, stable name.
Moreover, you will be able to use Service DNS names as hosts. See the official documentation: DNS for Services and Pods.
The second example you shared has another variant of the Redis cluster that uses StatefulSets. Try that out.
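A rough sketch of that direction, combined with a workaround for the announced IP (the names, image and replica count are placeholders, and this is not the exact manifest from the linked sample): a headless Service gives each StatefulSet pod a stable DNS name, and the downward API can inject the pod's own IP so redis-server announces the right address even after a restart:
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: redis-cluster
spec:
  clusterIP: None                 # headless: stable per-pod DNS names
  selector:
    app: redis-cluster
  ports:
  - name: client
    port: 6379
  - name: gossip
    port: 16379
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
      - name: redis
        image: redis:5.0
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP            # downward API: the pod's current IP
        command: ["redis-server", "--cluster-enabled", "yes",
                  "--cluster-announce-ip", "$(POD_IP)"]   # announce the real pod IP
        ports:
        - name: client
          containerPort: 6379
        - name: gossip
          containerPort: 16379
EOF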
