I'm starting to use PySpark. I'm wondering how to use containerization with PySpark.
I would like to isolate my python application and dependencies in a container.
Can I place my python application within a container and give the image directly to a Spark cluster? Will the cluster be able to do its work, distributing the image to the workers and then distributing the work across the multiple "containers"?
For developing Spark applications in a container you could use:
jupyter/pyspark-notebook. There's also the Spark UI, which you can access on port 4040. More information here: Jupyter Apache Spark
Or AWS Glue docker image: amazon/aws-glue-libs:glue_libs_1.0.0_image_01. More information on how to set it up: Developing AWS Glue ETL jobs locally using a container
When your application is ready, you can just submit it to the cluster; I'm not sure why Docker is needed at that point.
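For example, a minimal sketch (the master URL and file name below are just placeholders) would be to develop inside the Jupyter image and then submit the finished script to the cluster from outside the container:
# start a local PySpark development environment (Jupyter on 8888, Spark UI on 4040)
docker run -it --rm -p 8888:8888 -p 4040:4040 -v "$PWD":/home/jovyan/work jupyter/pyspark-notebook
# once the application works, submit it to the cluster as usual
spark-submit --master spark://my-master-host:7077 my_app.py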
Related
I am trying to deploy this app: https://github.com/taigaio/taiga-docker . The project is a collection of various images and uses docker-compose to bring them up together. It is my understanding that this cannot be run as a single docker image from a GCP Artifact Registry. Does it need a VM, perhaps?
My question is if there is a way to deploy this container as an Image in a serverless fashion in GCP or any other cloud platform. Any pointers/help is much appreciated.
You can either use Cloud Run (the most serverless way) or run it on a VM.
On Cloud Run you can deploy a single image as a Service (Cloud Run terminology); if you have more than one image, you can deploy multiple Services and make them talk to each other.
On a VM, it would be much like deploying on your personal laptop.
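As a rough sketch of the Cloud Run route (project, region and service names below are placeholders), you would push each image from the compose file to a registry and deploy it as its own Service:
# push one of the images to Container Registry / Artifact Registry
docker push gcr.io/my-project/taiga-back:latest
# deploy it as a Cloud Run Service
gcloud run deploy taiga-back --image gcr.io/my-project/taiga-back:latest --region us-central1 --allow-unauthenticated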
I know that Docker and Kubernetes aren't direct competitors. Docker is the container platform, and Kubernetes is the tool that coordinates and schedules those containers.
What does that really mean, and how can I deploy my app on Docker for Azure?
Short answer:
Docker (and containers in general) solve the problem of packaging an application and its dependencies. This makes it easy to ship and run everywhere.
Kubernetes is one layer of abstraction above containers. It is a distributed system that controls/manages containers.
My advice: because the landscape is huge, start learning and putting the pieces of the puzzle together by following a course. Below I have added some information from Introduction to Kubernetes, a free online course from The Linux Foundation.
Why do we need Kubernetes (and other orchestrators) above containers?
In quality assurance (QA) environments, we can get away with running containers on a single host to develop and test applications. However, when we go to production, we do not have the same liberty, as we need to ensure that our applications:
Are fault-tolerant
Can scale, and do this on-demand
Use resources optimally
Can discover other applications automatically, and communicate with each other
Are accessible from the external world
Can be updated/rolled back without any downtime.
Container orchestrators are the tools which group hosts together to form a cluster, and help us fulfill the requirements mentioned above.
Nowadays, there are many container orchestrators available, such as:
Docker Swarm: Docker Swarm is a container orchestrator provided by Docker, Inc. It is part of Docker Engine.
Kubernetes: Kubernetes was started by Google, but now, it is a part of the Cloud Native Computing Foundation project.
Mesos Marathon: Marathon is one of the frameworks to run containers at scale on Apache Mesos.
Amazon ECS: Amazon EC2 Container Service (ECS) is a hosted service provided by AWS to run Docker containers at scale on its infrastructure.
Hashicorp Nomad: Nomad is the container orchestrator provided by HashiCorp.
Kubernetes is built on top of Docker technology. It is an orchestration tool for Docker containers, whereas Docker is a technology for creating and deploying containers.
Docker started out at a platform-as-a-service (PaaS) provider named dotCloud.
All in all, Kubernetes works with Docker containers, allowing you to achieve application portability and extensibility through container orchestration.
Docker
Easy and fast to install and configure
Functionality is provided and limited by the Docker API
Quick container deployment and scaling even in very large clusters
Automated internal load balancing through any node in the cluster
Simple shared local volumes
Kubernetes
Requires some work to get up and running
Client, API and YAML definitions are unique to Kubernetes
Provides strong guarantees to cluster states at the expense of speed
Enabling load balancing requires manual Service configuration
Volumes shared within pods
This is just a basic idea which at least explains the difference. If you want to go in depth, see my posts:
http://www.thecreativedev.com/an-introduction-to-kubernetes/
http://www.thecreativedev.com/learn-docker-works/
Docker and Kubernetes are complementary. Docker provides an open standard for packaging and distributing containerized applications, while Kubernetes provides for the orchestration and management of distributed, containerized applications created with Docker. In other words, Kubernetes provides the infrastructure needed to deploy and run applications built with Docker.
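To make that split concrete, here is a minimal sketch (image and app names are just examples): Docker builds and runs a single container, while Kubernetes is given a declarative description of how many copies should run in the cluster.
# Docker: package and run one container
docker build -t my-app:1.0 .
docker run -p 8080:8080 my-app:1.0
# Kubernetes: declare the desired state in deployment.yaml and let the cluster maintain it
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:1.0
        ports:
        - containerPort: 8080
# apply it with: kubectl apply -f deployment.yaml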
According to the DataDog Docker Integration Docs:
There are two ways to run the [DataDog] Agent: directly on each host, or within a docker-dd-agent container. We recommend the latter.
Why is a Docker-based agent installation preferred over just installing the DataDog agent directly as a service on the box that's running the Docker containers?
One of Docker's main features is portability, and it makes sense to bring Datadog into that environment. That way they are packaged and deployed together, and you don't have the overhead of installing Datadog manually everywhere you choose to deploy.
What they are also implying is that you should use docker-compose and turn your application / docker container into a multi-container Docker application, running your image(s) alongside the Datadog agent. Thus you will not need to write/build/run/manage a container via a Dockerfile, but rather add the agent image to your docker-compose.yml along with its configuration. Starting your multi-container application will still be easy via:
docker-compose up
It's really convenient and gives you additional features like their Autodiscovery service.
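For example, a minimal docker-compose.yml sketch (the web service and its image are illustrative, and the agent image/variable names may differ in the current Datadog docs) could look like this:
version: "3"
services:
  web:
    image: my-web-app:latest                  # your own application image
    ports:
      - "8000:8000"
  datadog-agent:
    image: datadog/docker-dd-agent:latest     # the agent image the docs refer to
    environment:
      - API_KEY=${DD_API_KEY}                 # your Datadog API key
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro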
My application consists of two separate docker containers: a Grails-based web application and a RESTful Python Flask application. Both docker containers are sitting on my local computer. They are not hosted on Docker Hub; they are proprietary and I don't want to host them publicly.
I would like to try Cloud Foundry to deploy these docker containers and see how it works. However, from the documentation I get a sense that Cloud Foundry doesn't support deploying docker containers sitting on a local machine.
Question
Is there a way to deploy docker containers sitting on a local computer to Cloud Foundry? If not, what is a way to securely host the containers somewhere from which CF can fetch them?
Is Cloud Foundry capable of running a docker container that is a Python Flask application?
One option you have is to not use Docker images at all and just push your code directly, which is one of the nice features of CF. PCF comes with a Python buildpack which should automatically detect your Flask app.
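A sketch of that route (the app name and the -m memory flag are illustrative): with a requirements.txt in the project root, the Python buildpack is detected automatically and a plain push is enough:
cd my-flask-app/        # contains your code, requirements.txt and a Procfile
cf push my-flask-app -m 256M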
Another option would be to run your own trusted Docker registry, push your images there, and then when you push your app, tell it to grab the images from your registry. If you google "cloud foundry docker registry" you get the following useful results you should check out:
https://github.com/cloudfoundry-community/docker-registry-boshrelease
http://docs.pivotal.io/pivotalcf/1-8/adminguide/docker.html#caveats
https://docs.pivotal.io/pivotalcf/1-7/opsguide/docker-registry.html
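Once your images are in a registry CF can reach, the push itself is a one-liner per app; a sketch (registry host and image names are placeholders, and Docker support may need to be enabled by your platform operator, as the caveats link above explains):
cf push grails-web --docker-image my-registry.example.com/grails-web:latest
cf push flask-api --docker-image my-registry.example.com/flask-api:latest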
Is it possible to create a docker container in a Google Container Engine cluster using a Dockerfile, so that the image is built on the fly and deployed to the cluster,
rather than creating an image first, uploading it to Google Container Registry, and then using it from there?
I feel like that is cumbersome, and there should be a way to create containers in the cluster directly using a Dockerfile.
It is not possible to do this in Google Container Engine. Google Container Engine is designed to help orchestrate container deployment and does not itself provide a source -> deployment workflow.
You may want to look at Google App Engine or Openshift 3 (which is built on Kubernetes) as a more fully featured platform-as-a-service offering.
You can also build this type of tooling on top of a Google Container Engine cluster yourself as all of the building blocks are available.
One service to take a look at when constructing a workflow is Google Container Builder, which can simplify the process of building a container from source and pushing it to Google Container Registry.
It is currently a fairly low-level service, but it offers some advantages in environments where it may be impractical to run docker build locally.
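For instance, a rough sketch of that flow (project and image names are placeholders, and the exact gcloud command group has changed over time) builds the image remotely from your Dockerfile, pushes it to Google Container Registry, and then runs it on the cluster:
# build from the Dockerfile in the current directory and push the result
gcloud builds submit --tag gcr.io/my-project/my-app:v1 .
# run the pushed image on the Container Engine cluster
kubectl run my-app --image=gcr.io/my-project/my-app:v1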