Dask-CUDA on a multi-node EMR cluster is unable to detect nodes - dask

I have set up an AWS EMR cluster using 10 core nodes of type g4dn.xlarge (each machine/node contains 1 GPU). When I run the following commands in a Zeppelin Notebook, I see only 1 worker allotted in my LocalCUDACluster:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
cluster = LocalCUDACluster()
client = Client(cluster)
I tried passing n_workers=10 explicitly but it resulted in an error.
How do I make sure my LocalCUDACluster utilizes all of my other 9 nodes? What is the right way to set up a multi-node Dask-CUDA cluster?
Any help regarding this is appreciated.

LocalCUDACluster only starts workers for the GPUs on the local machine (one worker per GPU), which is why a single g4dn.xlarge shows one worker; it cannot reach your other nodes. There are a few options to set up a multi-worker cluster (with or without GPU), described here.
The docs don't seem to mention third-party solutions, but right now there are two companies offering these services: Coiled and Saturn Cloud.
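For illustration, here is a minimal sketch of the scheduler/worker pattern (the scheduler hostname and port below are placeholders, not something from your cluster): start a dask-scheduler process on one node, a dask-cuda-worker process on every GPU node, and point the Client at the scheduler instead of creating a LocalCUDACluster.

# On one node (shell):         dask-scheduler --port 8786
# On each GPU node (shell):    dask-cuda-worker tcp://scheduler-host:8786
# Then connect from the notebook:
from dask.distributed import Client

client = Client("tcp://scheduler-host:8786")  # placeholder scheduler address
print(client)  # should now report one worker per GPU node, i.e. 10 here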

Related

Airflow with mysql_to_gcs Negsignal.SIGKILL

I'm using Airflow with Composer (GCP) to extract data from Cloud SQL to GCS, and then from GCS to BigQuery. I have some tables between 100 MB and 10 GB. My DAG has two tasks to do what I mentioned before. With the smaller tables the DAG runs smoothly, but with slightly larger tables the Cloud SQL extraction task fails after a few seconds, and does not produce any logs except "Negsignal.SIGKILL". I have already tried increasing the Composer capacity, among other things, but nothing has worked yet.
I'm using the mysql_to_gcs and gcs_to_bigquery operators
The first thing you should check when you get Negsignal.SIGKILL is your Kubernetes resources. This is almost surely a problem with resource limits.
I think you should monitor your Kubernetes cluster nodes. Inside GCP, go to Kubernetes Engine > Clusters. You should have a cluster containing the environment that Cloud Composer uses.
Now, head to the nodes of your cluster. Each node provides metrics about CPU, memory, and disk usage. You will also see the resource limits for each node, as well as the pods that each node runs.
If you are not very familiar with K8s, let me explain this briefly. Airflow uses pods inside nodes to run your Airflow tasks. These pods are called airflow-worker-[id]. That way you can identify your worker pods inside the node.
Check your pod list. If you have evicted airflow-worker pods, then Kubernetes is stopping your workers for some reason. Since Composer uses the CeleryExecutor, an evicted airflow-worker points to a problem. (This would not be the case with the KubernetesExecutor, but that is not available yet in Composer.)
If you click on an evicted pod, you will see the reason for the eviction. That should give you the answer.
If you don't see a problem with pod evictions, don't panic, you still have some options. From that point on, your best friend will be logs. Be sure to check your pod logs, node logs, and cluster logs, in that order.
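If you'd rather check this programmatically than click through the GCP console, here is a minimal sketch using the official kubernetes Python client (assuming your kubeconfig already points at the Composer cluster; the pod-name prefix matches the airflow-worker-[id] convention above):

from kubernetes import client, config

config.load_kube_config()  # assumes kubectl access to the Composer cluster
v1 = client.CoreV1Api()

# Find airflow-worker pods that Kubernetes has evicted and print why.
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.metadata.name.startswith("airflow-worker") and pod.status.reason == "Evicted":
        print(pod.metadata.name, "->", pod.status.message)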

How does Spark utilize the nodes of Kubernetes when running in a pod?

I'm learning Spark, and I'm confused about running a Docker container that contains Spark code on a Kubernetes cluster.
I read that Spark utilizes multiple nodes (servers) and can run code on different nodes in order to complete jobs faster (and to use the memory of each node when the data is too big).
On the other hand, I read that a Kubernetes pod (which contains one or more containers) runs on a single node.
For example, I'm running the following Spark code from Docker:
from pyspark import SparkContext

sc = SparkContext(appName="double-example")  # 'sc' must exist before use

num = [1, 2, 3, 4, 5]
num_rdd = sc.parallelize(num)              # distribute the list across workers
double_rdd = num_rdd.map(lambda x: x * 2)  # the map runs on the workers
Some notes and reminders (from my understanding):
When using the map command, each value of the num list is processed on a different Spark node (worker).
A k8s pod runs on one node.
So I'm confused: how does Spark utilize multiple nodes when the pod runs on one node?
Do the Spark workers run on different nodes, and is that how the pod which runs the code above can communicate with those nodes in order to utilize the Spark framework?
When you run Spark on Kubernetes, you have a few ways to set things up. The most common way is to configure Spark to run in client mode.
Basically, Spark can run on Kubernetes in a Pod; the application itself, given the endpoint of the k8s API server, is able to spawn its own worker Pods, as long as everything is correctly configured.
What this setup needs is to deploy the Spark application on Kubernetes (usually with a StatefulSet, but it's not a must) along with a headless ClusterIP Service (which is required so that the worker Pods can communicate with the master application that spawned them).
You also need to give the Spark application all the correct configuration, such as the k8s master endpoint, the Pod name, and other parameters.
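As a sketch of those configurations (the image name, namespace, driver Service host, and port below are placeholder values, not a definitive setup), a client-mode session created from inside a pod might look like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc:443")  # k8s API server endpoint
    .appName("client-mode-example")
    .config("spark.kubernetes.container.image", "my-spark:latest")  # placeholder image
    .config("spark.kubernetes.namespace", "default")
    .config("spark.executor.instances", "2")
    # Executors must be able to reach the driver, hence the headless Service:
    .config("spark.driver.host", "spark-driver-svc.default.svc.cluster.local")  # placeholder
    .config("spark.driver.port", "29413")
    .getOrCreate()
)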
There are other ways to set up Spark; there's no obligation to spawn worker Pods. You can run all the stages of your code locally (and the configuration is easy: if you have small jobs with small amounts of data to execute, you don't need workers).
Or you can execute the Spark application from outside the Kubernetes cluster, so not in a pod, while giving it the Kubernetes master endpoint so that it can still spawn workers on the cluster (aka cluster mode).
You can find a lot more info in the Spark documentation, which explains almost everything needed to set things up (https://spark.apache.org/docs/latest/running-on-kubernetes.html#client-mode),
and you can read about StatefulSets and their usage of headless ClusterIP Services here (https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/).

Kubernetes Architecture / Design?

I'm trying to figure out and learn the patterns and best practices for moving a bunch of Docker containers I have for an application into Kubernetes: pod design, services, deployments, etc. For example, I could create a single Pod with both the web and application containers in it, but that wouldn't be a good design.
Searching for things like architecture and design with Kubernetes just seems to yield topics on the product's architecture or how to implement a Kubernetes cluster, not the overlay of designing the pods, services, etc.
What does the community generally call this application-layer design in the Kubernetes world, and can anyone refer me to a 101 on this topic?
Thanks.
Kubernetes is a complex system, and learning step by step is the best way to gain expertise. What I recommend is the Kubernetes documentation, where you can learn about each of the components.
Another good option is to review the 70 best K8s tutorials, which are categorized in many ways.
Designing and running applications with scalability, portability, and robustness in mind can be challenging. Here are great resources about it:
Architecting applications for Kubernetes
Using Kubernetes in production, lessons learned
Kubernetes Design Principles from Google
Well, there's no Kubernetes approach but rather a Cloud Native one: I would suggest Designing Distributed Systems: Patterns and Paradigms by Brendan Burns.
It's really good because it provides several scenarios along with the pattern approaches and related code.
Most of the examples are obviously based on Kubernetes, but I think the implementation is not so important; what matters is understanding why and when to use an Ambassador pattern or FaaS according to the application's needs.
The answer to this can be quite complex and that's why it is important that software/platform architects understand K8s well.
Mostly you will find answers that tell you to "put each application component in its own pod". And basically that's correct, as the main reasons for K8s are high availability, fault tolerance of the infrastructure, and things like that. It follows that if you put every single component in its own pod and run it with more than 2 replicas, you will reach better availability.
But you also need to know why you want to go to K8s. At the moment it is a trending topic, but if you don't want to operate a cluster and don't actually need HA, why not run on something like AWS ECS, DigitalOcean droplets, and the like?
The best answers you will currently find are all about how to design and split microservices, as each microservice can be represented by a pod. A good starting point is Red Hat's Principles of Container-Based Application Design, or InfoQ.
A Kubernetes cluster is composed of:
A master server, called the control plane
Nodes: the servers which run the applications (containers/pods)
By design, a production Kubernetes cluster must have at least one master server and 2 nodes, according to the Kubernetes documentation.
Here is a summary of the components of a Kubernetes cluster:
Master = control plane:
kube-apiserver: exposes the Kubernetes API
etcd: key-value store for the cluster
kube-scheduler: distributes the pods across the nodes
kube-controller-manager: controller for nodes, pods, and cluster components
Nodes = servers that run applications:
kubelet: runs on each node; it makes sure that the containers are running in a pod
kube-proxy: allows the pods to communicate within the cluster and with the outside
Container runtime: runs the containers/pods
Complementary modules = addons:
DNS: DNS server that serves DNS records for Kubernetes services
Web UI: graphical dashboard for the cluster
Container Resource Monitoring: records metrics about containers in a central DB and provides a UI to browse them
Cluster-level Logging: records container logs in a central log store with a search/browse interface
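To make the kube-apiserver's role concrete, here is a small sketch using the kubernetes Python client; every call below goes through the API server, which in turn reads the cluster state kept in etcd:

from kubernetes import client, config

config.load_kube_config()  # credentials/endpoint for the kube-apiserver
v1 = client.CoreV1Api()

# Ask the API server for the nodes and what each node's kubelet reports.
for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)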

Problem running Dask on AWS SageMaker and AWS Fargate

I am trying to set up a cluster on AWS to run distributed sklearn model training with Dask. To get started, I was trying to follow this tutorial, which I hope to tweak: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4
I have managed to push the Docker container to AWS ECR and then launch a CloudFormation template to build a cluster on AWS Fargate. The next step in the tutorial is to launch an AWS SageMaker notebook. I have tried this, but something is not working, because when I run the commands I get errors (see image). What might the problem be? Could it be related to the VPC/subnets? Is it related to AWS SageMaker internet access? (I have tried enabling and disabling this.)
Expected Results: dask to update, scaling up of the Fargate cluster to work.
Actual Results: none of the above.
In my case, when running through the same tutorial, DaskSchedulerService takes too long to complete. The creation was initiated but never finished in CloudFormation.
After 5-6 hours I got the following:
DaskSchedulerService CREATE_FAILED Dask-Scheduler did not stabilize.
The workers did not run and, consequently, it was not possible to connect the Client.

Does it make sense to manage Docker containers on one or a few single hosts with Kubernetes?

I'm using Docker on a bare-metal server. I'm pretty happy with docker-compose to configure and set up applications.
Still, some features are missing, like configuration management and monitoring. Maybe there are other solutions to these issues, but I'm a bit overwhelmed by the feature set of Kubernetes and can't judge whether it would help me here.
I'm also open for recommendations to solve the requirements separately:
Configuration / Secret management
Monitoring of my Docker-hosted applications (e.g. having some kind of dashboard)
Remote container control (SSH is okay with only one server)
Being ready to scale my environment (based on multiple different Dockerized applications) to more than one server in the future - I'm already thinking about networking/service discovery issues with a pure docker-compose setup
I'm sure Kubernetes covers some of these features, but I have the feeling that it's too focused on cloud platforms where machines are created on the fly (since I have at most a few bare-metal servers).
I hope the question's scope is not too broad; otherwise please use the comment section and help me narrow it down.
Thanks.
I think Kubernetes absolutely matches your requirements and is what you need.
Let's start one by one.
I have the feeling that it's too much focused on Cloud platforms where Machines are created on the fly (since I only have at most few bare metal Servers)
No, it is not focused on clouds. Kubernetes can be installed on almost any bare-metal platform (including ARM), and there are many tools and instructions which can help you do it. Also, it is easy to deploy it on your local PC using Minikube, which will prepare a local cluster for you within a VM or right in your OS (Linux only).
Configuration / Secret management
Kubernetes has powerful configuration and secret management based on special objects (ConfigMaps and Secrets) which can be attached to your containers. You can read more about configuration management in that article.
Moreover, tools like Helm can provide you with more automation and a range of preconfigured applications, which you can install with a single command. And you can prepare your own charts for it.
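As a small sketch of the Secret object (the name and value below are placeholders), created here with the kubernetes Python client rather than YAML:

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# A Secret holding a credential; pods can mount it as a file or env var.
secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="app-credentials"),  # placeholder name
    string_data={"DB_PASSWORD": "change-me"},              # placeholder value
)
v1.create_namespaced_secret(namespace="default", body=secret)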
Monitoring of my Docker-hosted applications (e.g. having some kind of dashboard)
Kubernetes has its own dashboard where you can get many kinds of information: current application status, configuration, statistics, and more. Also, Kubernetes has great integration with Heapster, which can be used with Grafana for powerful visualization of almost anything.
Remote container control (SSH is okay with only one server)
Kubernetes' control tool kubectl can get logs and connect to containers in the cluster without any problems. For example, to connect to a container in pod "myapp" you just need to call kubectl exec -it myapp sh, and you will get an sh session in the container. Also, you can reach any application inside your cluster using the kubectl proxy command, which will forward a port you need to your PC.
Being ready to scale my environment (based on multiple different Dockerized applications) to more than one server in the future - I'm already thinking about networking/service discovery issues with a pure docker-compose setup
Kubernetes can be scaled up to thousands of nodes, or it can have only one; it is your choice. Independent of cluster size, you will get production-grade networking, service discovery, and load balancing.
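For instance (the Deployment name and namespace below are placeholders), scaling out is a single API call, and the scheduler spreads the new pods across however many nodes the cluster has:

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Scale a hypothetical "web" Deployment to 3 replicas.
apps.patch_namespaced_deployment_scale(
    name="web",
    namespace="default",
    body={"spec": {"replicas": 3}},
)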
So, don't be afraid; just try it locally with Minikube. It will make many operations tasks simpler, not more complex.
