Spark on Kubernetes: How to handle missing Config folder - docker

I'm trying to run Spark in a Kubernetes cluster as described here: https://spark.apache.org/docs/latest/running-on-kubernetes.html
It works fine for some basic scripts like the provided examples.
I noticed that the config folder, despite being added to the image built by "docker-image-tool.sh", is overwritten by a mount of a ConfigMap volume.
I have two questions:
What sources does Spark use to generate that ConfigMap, and how do you edit it? As far as I understand, the volume gets deleted when the last pod is deleted and regenerated when a new pod is created.
How are you supposed to handle the spark-env.sh script, which can't be added to a simple ConfigMap?

One initially non-obvious thing about Kubernetes is that changing a ConfigMap (a set of configuration values) is not detected as a change to the Deployments (how a Pod, or set of Pods, should be deployed onto the cluster) or Pods that reference it. That expectation can result in unintentionally stale configuration persisting until the Pod spec itself changes. This could include freshly created Pods from an autoscaling event, or even restarts after a crash, resulting in misconfiguration and unexpected behaviour across the cluster.
Note: This doesn’t impact ConfigMaps mounted as volumes, which are periodically synced by the kubelet running on each node.
To update a ConfigMap, execute:
$ kubectl replace -f file.yaml
You must create a ConfigMap before you can use it, so I recommend modifying the ConfigMap first and then redeploying the Pod.
Note that a container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates.
The configMap resource provides a way to inject configuration data into Pods. The data stored in a ConfigMap object can be referenced in a volume of type configMap and then consumed by containerized applications running in a Pod.
When referencing a ConfigMap object, you simply provide its name in the volume. You can also customize the path used for a specific entry in the ConfigMap.
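For example, here is a minimal sketch of a Pod consuming a ConfigMap through a volume, with one entry projected to a customized path (the names app-config and settings.conf are placeholders, not anything from the question):

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config            # placeholder ConfigMap name
data:
  settings.conf: |
    key=value
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "cat /etc/app/settings.conf && sleep 3600"]
    volumeMounts:
    - name: config
      mountPath: /etc/app     # the ConfigMap contents are projected here
  volumes:
  - name: config
    configMap:
      name: app-config        # reference the ConfigMap by name
      items:                  # optional: customize the path for a specific entry
      - key: settings.conf
        path: settings.conf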
When a ConfigMap already being consumed in a volume is updated, projected keys are eventually updated as well. The kubelet checks whether the mounted ConfigMap is fresh on every periodic sync. However, it uses its local TTL-based cache to get the current value of the ConfigMap. As a result, the total delay from the moment the ConfigMap is updated to the moment new keys are projected to the pod can be as long as the kubelet sync period (1 minute by default) plus the TTL of the kubelet's ConfigMap cache (1 minute by default).
What I strongly recommend, though, is to use the Kubernetes Operator for Spark. It supports mounting volumes and ConfigMaps into Spark pods to customize them, a feature that is not available in plain Apache Spark as of version 2.4.
A SparkApplication can specify a Kubernetes ConfigMap storing Spark configuration files such as spark-env.sh or spark-defaults.conf using the optional field .spec.sparkConfigMap, whose value is the name of the ConfigMap. The ConfigMap is assumed to be in the same namespace as the SparkApplication.
Spark on K8S also provides configuration options that allow certain volume types to be mounted into the driver and executor pods. Volumes are "delivered" from the Kubernetes side, but they can also serve as local storage for Spark. If no volume is set as local storage, Spark uses temporary scratch space to spill data to disk during shuffles and other operations. When using Kubernetes as the resource manager, the pods will be created with an emptyDir volume mounted for each directory listed in spark.local.dir or in the environment variable SPARK_LOCAL_DIRS. If no directories are explicitly specified, a default directory is created and configured appropriately.
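As a rough sketch, assuming the Kubernetes Operator for Spark is installed (the image, the example jar path, and the ConfigMap name spark-conf are placeholders; check the API version against your installation), a SparkApplication referencing a ConfigMap of Spark configuration files could look like this:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v2.4.5    # placeholder image; use your own
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar    # placeholder jar
  sparkVersion: "2.4.5"
  sparkConfigMap: spark-conf                   # ConfigMap holding spark-env.sh / spark-defaults.conf
  driver:
    cores: 1
    memory: 512m
  executor:
    instances: 2
    cores: 1
    memory: 512m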
Useful blog: spark-kubernetes-operator.

Related

Making a copy of docker-desktop cluster or making a new cluster using it as template

Docker Desktop for Windows and macOS comes with the docker-desktop cluster. I'm trying to figure out how to either copy it, or make a new cluster based on it as a template. I like to have a cluster for each project I'm working on so that things like PVCs, PVs, and Secrets are isolated, and I can just switch between them with kubectl config use-context project1. I've been looking through documentation and Google search results and haven't identified how to do this, or whether it is possible. Any suggestions?
If there's a set of resources that you want to routinely deploy to new clusters, you can create a source-control repository that contains the YAML files you need. Then when you have a new cluster you can kubectl apply -f your directory of bootstrap artifacts. Using kind, for example:
kind create cluster --name dev2
kubectl apply -f ./bootstrap/
...
kind delete cluster --name dev2
If you need to configure or parameterize this setup in some way, packaging it as a Helm chart can make sense.
This approach also means avoiding the imperative-style kubectl create, kubectl run, and kubectl expose type commands. Create the YAML files you need, check them in, and use kubectl apply to install them.
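As an illustration (the namespace and PVC names here are made up), ./bootstrap/ might contain a manifest like:

# ./bootstrap/project1.yaml -- hypothetical bootstrap manifest
apiVersion: v1
kind: Namespace
metadata:
  name: project1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: project1-data
  namespace: project1
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi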
It can be a little tricky to usefully export a cluster, and this isn't something that's commonly done. For example, if you have a Pod, was it created by a Deployment or directly through a YAML file? Was that PersistentVolume hand-created, or did a provisioner create it, and are its settings specific to a particular Kubernetes environment? Working from a reproducible source-controlled tree avoids these issues.

How to copy files to container in kubernetes yaml

I understand that files / folders can be copied into a container using the command:
kubectl cp /tmp/foo_dir <some-pod>:/tmp/bar_dir
However, I am looking to do this in a YAML file.
How would I go about doing this? (Assuming that I am using a deployment for the container)
You are going in the wrong direction. Kubernetes handles this in several ways.
First, consider a ConfigMap:
https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap
You can easily define the configuration files for your application running in a container.
If you know the files or folders already exist on a worker node, you can use hostPath to mount them into the container, pinning the pod to that node with nodeName: node01 in the Kubernetes YAML (see the sketch after these links).
https://kubernetes.io/docs/concepts/storage/volumes/#hostpath
If the files or folders are generated temporarily, you can use emptyDir:
https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
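Here is a rough sketch combining the two (node01 and the paths are assumptions; the hostPath directory must actually exist on that node):

apiVersion: v1
kind: Pod
metadata:
  name: file-demo
spec:
  nodeName: node01                # pin the pod to the node that has the files
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "ls /data /scratch && sleep 3600"]
    volumeMounts:
    - name: node-files
      mountPath: /data
    - name: scratch
      mountPath: /scratch
  volumes:
  - name: node-files
    hostPath:
      path: /opt/files            # must already exist on node01
      type: Directory
  - name: scratch
    emptyDir: {}                  # temporary, per-pod scratch space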
You cannot; mapping local files from your workstation is not a feature of Kubernetes.

Where do docker images' new Files get saved to in GCP?

I want to create some Docker images that generate text files. However, since the images are pushed to Container Registry in GCP, I am not sure where the files will be generated when I use kubectl run myImage. If I specify a path in the program, like '/usr/bin/myfiles', will they be downloaded to the VM instance where I am typing "kubectl run myImage"? I think this is probably not the case... What is the solution?
Ideally, I would like all the files to be in one place.
Thank you
Container Registry and Kubernetes are mostly irrelevant to the issue of where a container will persist files it creates.
A process running within a container that generates files will persist them to the container instance's file system. Exceptions to this are stdout and stderr, which are both available without further ado.
When you run container images, you can mount volumes into the container instance, and this provides possible solutions to your needs. When running Docker Engine, for example, it's common to mount the host's file system into the container to share files between the container and the host: docker run ... --volume=[host]:[container] yourimage ....
On Kubernetes, there are many types of volumes. A seemingly obvious solution is to use gcePersistentDisk, but this has the limitation that these disks may only be mounted for writing by one pod at a time. A more powerful option may be an NFS-based solution such as the nfs or glusterfs volume types; these should provide a means for you to consolidate files outside of the container instances (a sketch follows below).
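For instance, here is a minimal sketch of a Pod writing into an NFS export (the server address and export path are placeholders; in practice you would usually wrap this in a PersistentVolume and PersistentVolumeClaim rather than mounting it inline):

apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "date > /output/run.txt && sleep 3600"]
    volumeMounts:
    - name: shared
      mountPath: /output          # files written here land on the NFS export
  volumes:
  - name: shared
    nfs:
      server: 10.0.0.5            # placeholder NFS server address
      path: /exports/myfiles      # placeholder export path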
A good solution, though I'm unsure whether it is available to you, would be to write your files as Google Cloud Storage objects.
A tenet of containers is that they should operate without making assumptions about their environment. Your containers should not make assumptions about running on Kubernetes and should not make assumptions about non-default volumes. By this I mean that your containers should simply write files to the container's file system. When you run the container, you apply the configuration that, for example, provides an NFS volume mount or GCS bucket mount that actually persists the files beyond the container.
HTH!

How to use azure disk in AKS environment

I am trying to set up AKS, in which I have used an Azure disk to mount the source code of the application. When I run the kubectl describe pods command it shows the disk as mounted, but I don't know how to copy the code onto it.
I got some recommendations to use the kubectl cp command, but my pod name changes each time I deploy, so please let me know what I should do.
You'd need to copy the files to the disk directly (not to the pod). You can use your pod or a worker node to do that. You can use kubectl cp to copy files into the pod and then move them onto the mounted disk as you normally would, or you can SSH to the worker node, copy the files over SSH to the node, and put them on the mounted disk.
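Since the pod name changes with every deployment, one way to handle it (assuming your Deployment's pods carry a label like app=myapp and the disk is mounted at /mnt/azure; both are assumptions, not values from the question) is to look the current name up first:

# Look up the current pod name by label, then copy files into the mounted path
POD=$(kubectl get pods -l app=myapp -o jsonpath='{.items[0].metadata.name}')
kubectl cp ./src "$POD":/mnt/azure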

Mount configmap file onto hostpath volume

I have mounted a hostpath volume in a Kubernetes container. Now I want to mount a configmap file onto the hostpath volume.
Is that possible?
Not really; the larger question is why you'd want to do that.
The standard way to add configuration in Kubernetes is using ConfigMaps. They are stored in etcd and the size limit is 1MB. When your pod comes up, the configuration is mounted at a mount point you specify in the pod spec.
You may want the opposite, which is to use a hostPath that has some configuration, and that's possible. Say that you want some config that is larger than 1MB (which is unusual) and have your pod use it. The gotcha here is that you need to put this hostPath and the files on all the cluster nodes where your pod may start.
No. The volume mounts are all about pushing data into pods or persisting data that originates in a pod, and aren't usually a bidirectional data transfer mechanism.
If you want to see what's in a ConfigMap, you can always kubectl get configmap NAME -o yaml to dump it out.
(With some exceptions around things like the Docker socket, hostPath volumes aren't that common in non-Minikube Kubernetes installations, especially once you get into multi-host setups, and I'd investigate other paths to do whatever you're using it for now.)
