Apache Hadoop YARN vs. Kubernetes - Docker

Since version 2.6, Apache Hadoop YARN has been able to run Docker containers. Basically it distributes the requested number of containers across a Hadoop cluster, restarts failed containers, and so on.
Kubernetes seems to do the same.
What are the major differences?

Kubernetes was developed almost from a clean slate to extend Docker containers into a platform, and its development has taken a bottom-up approach. It has good support for specifying per-container/pod resource requirements, but it lacks an effective global scheduler that can partition resources into logical groupings. The Kubernetes design allows multiple schedulers to run in a cluster, each managing the resources of its own pods. However, a Kubernetes cluster can suffer from instability when applications demand more resources than the physical systems can handle; it works best when infrastructure capacity exceeds application demand. The Kubernetes scheduler will attempt to fill idle nodes with incoming application requests, and it terminates low-priority and starved containers to improve resource utilization.
Kubernetes containers can integrate with external storage systems such as S3 to provide data resilience. Kubernetes uses etcd to store cluster state. The etcd cluster and the Hadoop NameNode are the critical points of failure in their respective platforms; etcd can run more replicas than a NameNode, so from a reliability point of view the theory seems to favor Kubernetes. However, Kubernetes security is open by default unless RBAC with fine-grained role bindings is defined and a security context is set correctly for pods. If the security context is omitted, the pod's primary group defaults to root, which can be problematic for system administrators trying to secure the infrastructure.
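To make those last two points concrete, here is a minimal sketch using the official Kubernetes Python client; the pod name, image, and numbers are hypothetical and not taken from the answer above. It declares per-container resource requests/limits for the scheduler and sets an explicit security context so the pod does not run as root.

```python
# Minimal sketch (hypothetical names/values): a pod that declares resource
# requests/limits and an explicit security context so it does not run as root.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="demo-app"),
    spec=client.V1PodSpec(
        # Without this, the container may run as root (UID/GID 0) by default.
        security_context=client.V1PodSecurityContext(
            run_as_user=1000, run_as_group=1000, run_as_non_root=True
        ),
        containers=[
            client.V1Container(
                name="app",
                image="example/app:1.0",
                # Per-container resource requirements used by the scheduler.
                resources=client.V1ResourceRequirements(
                    requests={"cpu": "250m", "memory": "256Mi"},
                    limits={"cpu": "500m", "memory": "512Mi"},
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```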
Apache Hadoop YARN was originally developed to run isolated Java processes for big data workloads and was later improved to support Docker containers. YARN provides global resource management, such as capacity queues for partitioning physical resources into logical units. Each business unit can be assigned a percentage of the cluster's resources. The capacity-sharing system is designed to guarantee resource availability for enterprise priorities rather than to squeeze out every available physical resource.
YARN scores more points on security: it offers Kerberos integration, access control for privileged/non-privileged containers, trusted Docker images, and placement policy constraints. Most Docker-related security options default to closed, and a system administrator needs to manually turn on flags to grant more power to containers. Large enterprises tend to run Hadoop more than Kubernetes because securing the system costs less. There are also more distributed SQL engines built on top of YARN, including Hive, Impala, SparkSQL and IBM BigSQL. These database options make YARN attractive because of the ability to run online transaction processing in containers and online analytical processing as batch workloads. On the other hand, Hadoop developer toolchains can be overwhelming: MapReduce, Hive, Pig, Spark, etc. each have their own style of development, the user experience is inconsistent, and it takes a while to learn them all. Kubernetes feels less obtrusive by comparison because it only deploys Docker containers. With the introduction of YARN services to run Docker container workloads, YARN can feel less verbose than Kubernetes.
If your plan is to outsource IT operations to a public cloud, pick Kubernetes. If your plan is to build a private/hybrid/multi-cloud, pick Apache YARN.

While this question and answer isn't exactly what you are asking, it does touch on a number of the same points.
Last I saw, YARN was just a resource-sharing mechanism, whereas Kubernetes is an entire platform, encompassing ConfigMaps, declarative environment management, Secret management, Volume mounts, a very well designed API for interacting with all of those things, and Role-Based Access Control. Kubernetes is also in widespread use, meaning one can easily find both candidates to hire and tools to buy.
A blog post I found cited a master's thesis that describes some of the fascinating trade-offs between the different schedulers' views of the world. It's a lot of words, so if you're looking for a tl;dr answer, that link may not be it, but if you're looking for actual research on the topic, it seems sound.

Related

Is there a reason to run CI builds on a Kubernetes cluster?

I don't know much about Kubernetes, but as far as I know, it is a system that enables you to control and manage containerized applications. So, generally speaking, the essence of the benefit we get from Kubernetes is the ability to "tell" Kubernetes what containers we want running, how many of them, on which machines, among other details, and Kubernetes will take care of doing that for us. Is that correct?
If so, I just can't see the benefit of running a CI pipeline using a Kubernetes pod, as I understand some people do. Let's say you have your build tools in Docker containers instead of having them installed on a specific machine; that's great - you can just use those containers in the build process, but why Kubernetes? Is there any performance gain or something like that?
Appreciate some insights.
It is highly recommended to get a good understanding of what Kubernetes is and what it can and cannot do.
Generally, containers combined with an orchestration tool can provide better management of your machines and services. They can significantly improve the reliability of your application and reduce the time and resources spent on DevOps.
Some of the features worth noting are:
Horizontal infrastructure scaling: New servers can be added or removed easily.
Auto-scaling: Automatically change the number of running containers, based on CPU utilization or other application-provided metrics.
Manual scaling: Manually scale the number of running containers through a command or the interface.
Replication controller: The replication controller makes sure your cluster always has the desired number of pods running. If there are too many pods, it terminates the extra ones; if there are too few, it starts more (a minimal code sketch of this follows the list).
Health checks and self-healing: Kubernetes can check the health of nodes and containers, helping ensure your application doesn't run into failures. Kubernetes also offers self-healing and auto-replacement, so you don't need to worry if a container or pod fails.
Traffic routing and load balancing: Traffic routing sends requests to the appropriate containers. Kubernetes also comes with built-in load balancers so you can balance resources in order to respond to outages or periods of high traffic.
Automated rollouts and rollbacks: Kubernetes handles rollouts for new versions or updates without downtime while monitoring the containers’ health. In case the rollout doesn’t go well, it automatically rolls back.
Canary Deployments: Canary deployments enable you to test the new deployment in production in parallel with the previous version.
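To make the scaling and replication items above concrete, here is a minimal sketch using the official Kubernetes Python client; the deployment name, image, and replica counts are hypothetical and not taken from any source cited here.

```python
# Minimal sketch (hypothetical names): a Deployment with a desired replica
# count. Kubernetes keeps that many pods running and replaces failed ones;
# changing `replicas` (or the image) triggers a rolling update.
from kubernetes import client, config

config.load_kube_config()

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="web"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "web"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "web"}),
            spec=client.V1PodSpec(
                containers=[client.V1Container(name="web", image="example/web:1.0")]
            ),
        ),
    ),
)

apps = client.AppsV1Api()
apps.create_namespaced_deployment(namespace="default", body=deployment)

# Manual scaling: patch the replica count (the CLI equivalent is `kubectl scale`).
apps.patch_namespaced_deployment_scale(
    name="web", namespace="default", body={"spec": {"replicas": 5}}
)
```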
However you should also know what Kubernetes is not:
Kubernetes is not a traditional, all-inclusive PaaS (Platform as a Service) system. Since Kubernetes operates at the container level rather than at the hardware level, it provides some generally applicable features common to PaaS offerings, such as deployment, scaling, load balancing, and lets users integrate their logging, monitoring, and alerting solutions. However, Kubernetes is not monolithic, and these default solutions are optional and pluggable. Kubernetes provides the building blocks for building developer platforms, but preserves user choice and flexibility where it is important.
Especially in your use case note that Kubernetes:
Does not deploy source code and does not build your application. Continuous Integration, Delivery, and Deployment (CI/CD) workflows are determined by organization cultures and preferences as well as technical requirements.
The decision is yours, but keeping the main concepts above in mind will help you make it.
An important detail is that you do not tell Kubernetes which nodes a given pod should run on; it decides that itself, and if the cluster is low on resources, in many cases it can actually allocate more nodes on its own (via the cluster autoscaler).
So if your CI system is fairly busy and uses containers for everything, it can make more sense to run each individual build as a Kubernetes Job. If you have 100 builds that all start at the same time, the cluster may be able to give itself more hardware, and the build queue will clear out faster. Particularly if you're using Kubernetes for other tasks, this can save you some administrative effort over maintaining a dedicated pool of CI-system workers that need to be separately updated and will sit mostly idle until that big batch of builds arrives.
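As a minimal sketch with the Kubernetes Python client (the namespace, image, and build command are hypothetical), a single build could be submitted as a Job like this:

```python
# Minimal sketch (hypothetical image/command): running one CI build as a
# Kubernetes Job. The cluster queues and schedules it wherever capacity exists,
# and the cluster autoscaler (if configured) can add nodes for a burst of jobs.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="build-1234"),
    spec=client.V1JobSpec(
        backoff_limit=1,  # retry a failed build at most once
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="build",
                        image="example/build-tools:latest",
                        command=["make", "ci"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ci", body=job)
```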
Kubernetes' security settings are also substantially better than Docker's. Say your CI system needs to launch containers as part of a build. In Kubernetes, it can run under a service account and be given permission to create and delete deployments in a specific namespace, and nothing else. In Docker the standard approach is to give your CI system access to the host's Docker socket, but that can easily be exploited to take over the host.
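For illustration, and assuming a hypothetical "ci" namespace and a dedicated CI service account (neither is from the answer above), such a narrowly scoped permission could look like this with the Python client:

```python
# Minimal sketch (hypothetical names): a namespaced Role that only allows
# managing Deployments in the "ci" namespace. Bound to the CI system's
# ServiceAccount via a RoleBinding (created separately), it grants nothing
# else - unlike handing out the host's Docker socket.
from kubernetes import client, config

config.load_kube_config()

role = client.V1Role(
    metadata=client.V1ObjectMeta(name="ci-deployer", namespace="ci"),
    rules=[
        client.V1PolicyRule(
            api_groups=["apps"],
            resources=["deployments"],
            verbs=["get", "list", "create", "delete"],
        )
    ],
)

client.RbacAuthorizationV1Api().create_namespaced_role(namespace="ci", body=role)
```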

Kubernetes Architecture / Design?

I'm trying to figure out and learn the patterns and best practices for moving a bunch of Docker containers I have for an application into Kubernetes: things like pod design, services, deployments, etc. For example, I could create a single Pod containing both the web and application containers, but that wouldn't be a good design.
Searching for things like architecture and design with Kubernetes just seems to yield topics on the product's own architecture or how to set up a Kubernetes cluster, not the higher-level design of the pods, services, etc.
What does the community generally call this application-layer design in the Kubernetes world, and can anyone point me to a 101 on this topic please?
Thanks.
Kubernetes is a complex system, and learning step by step is the best way to gain expertise. What I recommend is the Kubernetes documentation, where you can learn about each of the components.
Another good option is to review a list of the 70 best K8s tutorials, which are categorized in many ways.
Designing and running applications with scalability, portability, and robustness in mind can be challenging. Here are great resources about it:
Architecting applications for Kubernetes
Using Kubernetes in production, lessons learned
Kubernetes Design Principles from Google
Well, there's no Kubernetes approach as such but rather a Cloud Native one: I would suggest Designing Distributed Systems: Patterns and Paradigms by Brendan Burns.
It's really good because it provides several scenarios along with the pattern approaches and related code.
Most of the examples are obviously based on Kubernetes but I think that the implementation is not so important, since you have to understand why and when to use an Ambassador pattern or a FaaS according to the application needs.
The answer to this can be quite complex and that's why it is important that software/platform architects understand K8s well.
Mostly you will find answers that tell you to "put each application component in its own pod". And basically that's correct, since the main reasons for K8s are high availability, fault tolerance of the infrastructure, and the like. It follows that if you put every component in its own pod and run it with two or more replicas, you will reach better availability.
But you also need to know why you want to go to K8s. At the moment it is a trending topic, but if you don't want to operate a cluster and don't actually need HA or the like, why not run on something like AWS ECS, DigitalOcean droplets, and so on?
The best answers you will currently find are all about how to design and cut microservices, as each microservice can be represented as a pod. A good starting point is Red Hat's Principles of Container-Based Application Design,
or InfoQ.
A Kubernetes cluster is composed of:
A master server, called the control plane
Nodes: servers which run the applications / containers or pods
By design, a production Kubernetes cluster must have at least one master server and two nodes, according to the Kubernetes documentation.
Here is a summary of the components of a Kubernetes cluster:
Master = control plane:
kube-apiserver: exposes the Kubernetes API
etcd: key-value store for the cluster
kube-scheduler: distributes the pods across the nodes
kube-controller-manager: runs the controllers for nodes, pods, and other cluster components
Nodes = servers that run the applications
kubelet: runs on each node; it makes sure that the containers of a pod are running
kube-proxy: allows the pods to communicate within the cluster and with the outside
Container runtime: runs the containers / pods
Complementary modules = addons
DNS: DNS server that serves DNS records for Kubernetes services.
Webui: Graphical dashboard for the cluster
Container Resource Monitoring: Records metrics on containers in a central DB, provides UI to browse them
Cluster-level Logging: Records container logs in a central log with a search / browse interface.
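All of these components are driven through the kube-apiserver. As a small hedged illustration (not part of the summary above), the official Python client can query it for the cluster's nodes and pods:

```python
# Minimal sketch: the components above are all coordinated through the
# kube-apiserver. Here the Python client asks it for the cluster's nodes
# and for the pods running in the kube-system namespace.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    print("node:", node.metadata.name)

for pod in core.list_namespaced_pod(namespace="kube-system").items:
    print("pod:", pod.metadata.name, pod.status.phase)
```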

Advantages of dockerizing a Java Spring Boot application?

We are working with a dockerized Kafka environment. I would like to know the best practices for deploying Kafka connectors and Kafka Streams applications in such a scenario. Currently we deploy each connector and stream as a Spring Boot application started as a systemd service. I do not see a significant advantage in dockerizing each Kafka connector and stream. Please provide me with insights on this.
To me the Docker vs non-Docker thing comes down to "what does your operations team or organization support?"
Dockerized applications have an advantage in that they all look and act the same: you docker run a Java app the same way you docker run a Ruby app. Whereas with an approach of running programs with systemd, there's not usually a common abstraction layer around "how do I run this thing?"
Dockerized applications may also abstract away some small operational details, like port management - i.e. making sure all your apps' management ports don't clash with each other. An application in a Docker container will listen on one port inside the container, and you can expose that port as some other number outside (either random, or one of your choosing).
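For instance, a hedged sketch with the Docker SDK for Python (the image name and port numbers are hypothetical):

```python
# Minimal sketch (hypothetical image/ports): the app listens on 8080 inside
# the container, but is published on a different host port, so several
# instances never clash on their management ports.
import docker

client = docker.from_env()
container = client.containers.run(
    "example/spring-app:1.0",
    detach=True,
    ports={"8080/tcp": 9090},  # container port 8080 -> host port 9090
    name="spring-app-1",
)
print(container.id)
```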
Depending on the infrastructure support, a typical Docker scheduler may auto-scale a service when that service reaches some capacity threshold. However, in Kafka Streams applications the concurrency is limited by the number of partitions in the Kafka topics, so scaling up will just mean some consumers in your consumer groups go idle (if there are more consumers than partitions).
But it also adds complications: if you use RocksDB as your local store, you'll likely want to persist it outside the (disposable, and maybe read-only!) container. So you'll need to figure out how to do volume persistence, operationally / organizationally. With plain old JARs under systemd... well, you always have the hard drive, and if the server crashes it will either restart (physical machine) or hopefully be restored from some instance block storage.
By this I mean to say that Kafka Streams apps are not stateless web apps serving HTTP traffic, where auto-scaling will always give you more capacity. The people making these decisions at an organization or operations level may not fully know this. Then again, if everyone writes Docker stuff then the organization / operations team "just" has some Docker scheduler clusters (like a Kubernetes cluster or an Amazon ECS cluster) to manage, and doesn't have to manage VMs as directly anymore.
Dockerizing + clustering with Kubernetes provides many benefits, such as auto-healing and automatic horizontal scaling.
Auto-healing: if the Spring application crashes, Kubernetes will automatically start another instance and ensure that the required number of containers is always up.
Automatic horizontal scaling: if you get a burst of messages, you can tune the Spring applications to scale up or down automatically using an HPA (Horizontal Pod Autoscaler), which can also use custom metrics.
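As a hedged sketch with the Kubernetes Python client (the deployment name and thresholds are hypothetical), the autoscaling/v1 API below scales on CPU utilization; custom metrics would need the autoscaling/v2 API instead:

```python
# Minimal sketch (hypothetical names/thresholds): an HPA that scales the
# "spring-app" Deployment between 2 and 10 replicas based on CPU utilization.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="spring-app"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="spring-app"
        ),
        min_replicas=2,
        max_replicas=10,
        target_cpu_utilization_percentage=70,
    ),
)

client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa
)
```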

Are Docker containers safe enough to run third-party untrusted containers side by side with a production system?

We plan to allow execution of third-party microservice code on our infrastructure, interacting with our API.
Is dockerizing safe enough? Are there solutions for tracking the resources (network, RAM, CPU) a container consumes?
You can install portainer.io (see its demo, password tryportainer)
But to truly isolate those third-party microservices, you could run them in their own VM defined on your infrastructure. That VM would run its own Docker daemon and services. As long as the VM has access to the API, those microservice containers will work fine and won't have direct access to anything else on the infrastructure.
You need to define/size your VM correctly to allocate enough resources for the containers to run, with each one ensuring its own resource isolation.
Docker (17.03) is a great tool for securely isolating processes. It uses kernel namespaces, control groups (cgroups), and kernel capabilities to isolate processes that run in different containers.
But those processes are not 100% isolated from each other, because they use the same kernel resources. Every dockerized process that makes an I/O call leaves its isolated environment for that period of time and enters a shared environment, the kernel. Although you can set limits per container, such as how much CPU or RAM it may use, you cannot set limits on all kernel resources.
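As a hedged illustration of those per-container limits with the Docker SDK for Python (the image name and values are hypothetical):

```python
# Minimal sketch (hypothetical image/values): per-container limits on memory,
# CPU, and process count, plus dropped capabilities. These rely on cgroups and
# namespaces under the shared kernel, so they constrain but do not fully
# isolate kernel resource usage.
import docker

client = docker.from_env()
container = client.containers.run(
    "example/untrusted-service:1.0",
    detach=True,
    mem_limit="256m",          # cap RAM
    nano_cpus=500_000_000,     # 0.5 CPU
    pids_limit=100,            # cap number of processes
    cap_drop=["ALL"],          # drop Linux capabilities
    read_only=True,            # read-only root filesystem
)
print(container.id)
```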
You can read this article for more information.

Cluster management and service discovery

I want to introduce a service discovery / cluster management solution into my deployments. As far as I can see, Mesos is one solution, but I'm worried about how much RAM it consumes once the Marathon, Chronos, Mesos, etc. agents are installed; my boxes have at most 512 MB of RAM.
Is it feasible to install Mesos on boxes with low resources?
Is Consul a replacement for Mesos?
Your question is really a number of questions:
Mesos is a great solution for cluster management. It is tested in production at high scale at Twitter.
Mesos doesn't provide a service discovery mechanism.
Mesos requires other components in order to provide a full solution; there is no single solution for all environments / topologies. The leading complements are provided by Mesosphere and include Marathon (at a minimum).
Memory requirements will vary with the number of slaves. The starting requirement is 3 MB each for the master and the slave, making it feasible to install on nodes with low resources.
Consul is a service discovery component and does not replace Mesos; they are complementary. In fact, Keen Labs has modified Marathon to integrate Mesos with Consul. See: https://github.com/keenlabs/marathon/commit/290036e34337dcd6483550b7ab7d723bc4378d5f
