As a disclaimer let me just say that I am a beginner with Docker and hence the question might sound a bit "dummy".
I am exploring parallelization options to speed-up some computations. I'm working with Python, so I followed the official guidelines to create my first image and then run it as a container.
For the time being, I use a dummy program that generates a very large np random matrix (let's say 4000 x 4000) and then finds how many elements in each row fall into a predefined range [min, max].
I then launched a second container of the same image obviously with a different port and name. I didn't get any speed-ups in the computations which I was expecting somehow, since:
a) I haven't developed any mechanism for the 2 containers to somehow "talk to each other" and share intermediate results
b) I am not sure if the program itself is suitable for speedups in such a way.
So my questions corresponding to a, b above are:
Is parallelism a "feature" supported by docker deployments and in what sense? Is it just load sharing? And if I implement a load balancer, how does docker know how to transfer intermediate results from one container to the other?
If the previous question is not "correct", do I then need to write "parallel" versions of my programs to assign to each container? Isn't this equivalent to writing MPI versions of my program and assign them to different cores in my system? What would be the benefit of a docker architecture then?
thanks in advance
Docker is just a way of deploying your application - it does not in itself allow you to 'support' parallelism just by using Docker. Your application itself needs to support parallelism. Docker (and Kubernetes, etc) can help you scale out in parallel easily but your applications need to be able to support that scaling out.
So if you can run multiple instances of your application in parallel now (however you might do that) and it would not deliver any performance improvement then Docker will not help you. If running multiple instances now does improve performance then Docker will help scale out.
Related
I currently own only one computer, and I won't have another.
I run Spark on its CPU cores : master=local[5], using it directly : I set spark-core and spark-sql for dependencies, do quite no other configuration, and my programs start immediately. It's confortable, of course.
But should I attempt to create an architecture with a master and some workers by the mean of Docker containers or minikube (Kubernetes) on my computer ?
Will solution #2 - with all the settings it requires - reward me with better performances, because Spark is truly designed to work that way, even on a single computer,
or will I loose some time, because the mode I'm currently running it, without network usage, without need of data locality will always give me better performances, and solution #1 will always be the best on a single computer ?
My hypothesis is that #1 is fine. But I have no true measurement for that. No source of comparison. Who have experienced the two manners of doing things on a sigle computer ?
It really depends on your goals - if you always will run your Spark code on the single node with local master, then just use it. But if you intend to run your resulting code in the distributed mode on multiple machines, then emulating cluster with Docker could be useful, as you'll get your code running in truly distributed manner, and you'll able to find problems that not always are found when you run your code with the local master.
Instead of direct Docker usage (that could be tricky to setup, although it's still possible), maybe you can consider to use Spark on Kubernetes, for example, via minikube - there is a plenty of articles found by Google on this topic.
Having done testing on this with executor size, the cutover from when it makes sense to use more multiple executors is # CPUs > 32. AWS EMR spark runtime defaults to at least 4 CPUs per executor and Databricks always uses fat executors which means > 32CPUS on the 8xl instances. Your greatest limitation tends to be the JVMs garbage collection which caps the size of the heap. Local mode has a couple performance advantages compared to cluster mode.
full stage code gen has to be run on both the drive and every single executor. For short queries this can add several 100MS per stage.
driver <-> executor communication has latency.
shared memory between driver and executors. This reduces the chance of OOM and reduces the amount of spilling to disk.
People end up choosing to go with multiple executors/instances not because it would be faster than a single instance but because it is the only way to scale up in terms of data volume and parallization. (also for failure recovery)
If you're feeling ambitious there's a performance testing tool called TPC-DS that runs a set of dataprocessing queries against a standardized dataset
https://github.com/databricks/spark-sql-perf
https://github.com/maropu/spark-tpcds-datagen
Also if you're feeling adventurous the spark code has a script to fire up a mini cluster on minikube if you want a quick and easy way to test this.
I can understand how it is helpful when scaling over multiple different machines.
But here we have just one single machine (or a node). However docker still supports scaling the service to run multiple tasks (each served by one container) like this:
docker service scale serviceName=num_of_replicas
Let's take an example of running a Web API. Really I don't see how scaling in this case can help. One machine hosting a web API can serve with its max power. Using multiple containers in it cannot help increase that maximum power. With the request handling pipeline of Web API, one server can handle multiple requests at the same time and independently as long as the server has enough resources (CPU, RAM). So we don't need multiple (unnecessary) tasks in this case with docker service scaling.
The only benefit I can see here is docker service scaling may provide a better isolation between tasks (containers) compared with serving all the requests by one same server (container).
Could you please let me know some other benefit of scaling docker service this way? Is there anything wrong with my assumption above?
Using multiple containers in it cannot help increase that maximum power.
That really depends on the implementation. Some non efficient implementations may use only single process/thread/cpu and scaling helps with their performance.
Another benefit: scaling on the single node will help also with high availability. There is always small nonzero chance for non recoverable error, out of memory issue, ... which may stop single container. So there will be downtime, until orchestration scheduler restarts container.
I'm experiencing an issue where an image I'm running as part of a Kubernetes deployment is behaving differently from the expected and consistent behavior of the same image run with docker run <...>. My understanding of the main purpose of containerizing a project is that it will always run the same way, regardless of the host environment (ignoring the influence of the user and of outside data. Is this wrong?
Without going into too much detail about my specific problem (since I feel the solution may likely be far too specific to be of help to anyone else on SO, and because I've already detailed it here), I'm curious if someone can detail possible reasons to look into as to why an image might run differently in a Kubernetes environment than locally through Docker.
The general answer of why they're different is resources, but the real answer is that they should both be identical given identical resources.
Kubernetes uses docker for its container runtime, at least in most cases I've seen. There are some other runtimes (cri-o and rkt) that are less widely adopted, so using those may also contribute to variance in how things work.
On your local docker it's pretty easy to mount things like directories (volumes) into the image, and you can populate the directory with some content. Doing the same thing on k8s is more difficult, and probably involves more complicated mappings, persistent volumes or an init container.
Running docker on your laptop and k8s on a server somewhere may give you different hardware resources:
different amounts of RAM
different size of hard disk
different processor features
different core counts
The last one is most likely what you're seeing, flask is probably looking up the core count for both systems and seeing two different values, and so it runs two different thread / worker counts.
I'm trying to understand the benefits of Docker better and I am not really understanding how it would work in production.
Let's say I have a web frontend, a rest api backend and a db. That makes 3 containers.
Let's say that I want 3 of the front end, 5 of the backend and 7 of the db. (Minor question: Does it ever make sense to have less dbs than backend servers?)
Now, given the above scenario, if I package them all on the same host then I gain the benefit of efficiently using the resources of the host, but then I am DOA when that machine fails or has a network partition.
If I separate them into 1 full application (ie 1 FE, 1 BE & 1 DB) per host, and put extra containers on their own host, I get some advantages of using resources efficiently, but it seems to me that I still lose significantly when I have a network partition since it will take down multiple services.
Hence I'm almost leaning to the conclusion that I should be putting in 1 container per host, but then that means I am using my resources pretty inefficiently and then what are the benefits of containers in production? I mean, an OS might be an extra couple gigs per machine in storage size, but most cloud providers give you a minimum of 10 gigs storage. And let's face it, a rest api backend or a web front end is not gonna even come close to the 10 gigs...even including the OS.
So, after all that, I'm trying to figure out if I'm missing the point of containers? Are the benefits of keeping all containers of an application on 1 host, mostly tied to testing and development benefits?
I know there are benefits from moving containers amongst different providers/machines easily, but for the most part, I don't see that as a huge gain personally since that was doable with images...
Are there any other benefits for containers in production that I am missing? Or are the main benefits for testing and development? (Am I thinking about containers in production wrong)?
Note: The question is very broad and could fill an entire book but I'll shed some light.
Benefits of containers
The exciting part about containers is not about their use on a single host, but their use across hosts connected on a large cluster. Do not look at your machines as independent docker hosts, but as a pool of resource to host your containers.
Containers alone are not ground-breaking (ie. Docker's CTO stating at the last DockerCon that "nobody cares about containers"), but coupled to state of the art schedulers and container orchestration frameworks, they become a very powerful abstraction to handle production-grade software.
As to the argument that it also applies to Virtual Machines, yes it does, but containers have some technical advantage (See: How is Docker different from a normal virtual machine) over VMs that makes them convenient to use.
On a Single host
On a single host, the benefits you can get from containers are (amongst many others):
Use as a development environment mimicking the behavior on a real production cluster.
Reproducible builds independent of the host (convenient for sharing)
Testing new software without bloating your machine with packages you won't use daily.
Extending from a single host to a pool of machines (cluster)
When time comes to manage a production cluster, there are two approaches:
Create a couple of docker hosts and run/connect containers together "manually" through scripts or using solutions like docker-compose. Monitoring the lifetime of your services/containers is at your charge, and you should be prepared to handle service downtime.
Let a container orchestrator deal with everything and monitor the lifetime of your services to better cope with failures.
There are plenty of container orchestrators: Kubernetes, Swarm, Mesos, Nomad, Cloud Foundry, and probably many others. They power many large-scale companies and infrastructures, like Ebay, so they sure found a benefit in using these.
Pick the right replication strategy
A container is better used as a disposable resource meaning you can stop and restart the DB independently and it shouldn't impact the backend (other than throwing an error because the DB is down). As such you should be able to handle any kind of network partition as long as your services are properly replicated across several hosts.
You need to pick a proper replication strategy, to make sure your service stays up and running. You can for example replicate your DB across Cloud provider Availability Zones so that when an entire zone goes down, your data remains available.
Using Kubernetes for example, you can put each of your containers (1 FE, 1 BE & 1 DB) in a pod. Kubernetes will deal with replicating this pod on many hosts and monitor that these pods are always up and running, if not a new pod will be created to cope with the failure.
If you want to mitigate the effect of network partitions, specify node affinities, hinting the scheduler to place containers on the same subset of machines and replicate on an appropriate number of hosts.
How many containers per host?
It really depends on the number of machines you use and the resources they have.
The rule is that you shouldn't bloat a host with too many containers if you don't specify any resource constraint (in terms of CPU or Memory). Otherwise, you risk compromising the host and exhaust its resources, which in turn will impact all the other services on the machine. A good replication strategy is not only important at a single service level, but also to ensure good health for the pool of services that are sharing a host.
Resource constraint should be dealt with depending on the type of your workload: a DB will probably use more resources than your Front-end container so you should size accordingly.
As an example, using Swarm, you can explicitely specify the number of CPUs or Memory you need for a given service (See docker service documentation). Although there are many possibilities and you can also give an upper bound/lower bound in terms of CPU or Memory usage. Depending on the values chosen, the scheduler will pin the service to the right machine with available resources.
Kubernetes works pretty much the same way and you can specify limits for your pods (See documentation).
Mesos has more fine grained resource management policies with frameworks (for specific workloads like Hadoop, Spark, and many more) and with over-commiting capabilities. Mesos is especially convenient for Big Data kind of workloads.
How should services be split?
It really depends on the orchestration solution:
In Docker Swarm, you would create a service for each component (FE, BE, DB) and set the desired replication number for each service.
In Kubernetes, you can either create a pod encompassing the entire application (FE, BE, DB and the volume attached to the DB) or create separate pods for the FE, BE, DB+volume.
Generally: use one service per type of container. Regarding groups of containers, evaluate if it is more convenient to scale the entire group of container (as an atomic unit, ie. a pod) than to manage them separately.
Sum up
Containers are better used with an orchestration framework/platform. There are plenty of available solutions to deal with container scheduling and resource management. Pick one that might fit your use case, and learn how to use it. Always pick an appropriate replication strategy, keeping in mind possible failure modes. Specify resource constraints for your containers/services when possible to avoid resource exhaustion which could potentially lead to bringing a host down.
This depends on the type of application you run in your containers. From the top of my head I can think of a couple different ways to look at this:
is your application diskspace heavy?
do you need the application fail save on multiple machines?
can you run multiple different instance of different applications on the same host without decreasing performance of them?
do you use software like kubernetes or swarm to handle your machines?
I think most of the question are interesting to answer even without containers. Containers might free you of thinking about single hosts, but you still have to decide and measure the load of your host machines yourself.
Minor question: Does it ever make sense to have less dbs than backend servers?
Yes.
Consider cases where you hit normal(without many joins) SQL select statements to get data from the database but your Business Logic demands too much computation. In those cases you might consider keeping your Back-End Service count high and Database Service count low.
It all depends on the use case which is getting solved.
The number of containers per host depends on the design ratio of the host and the workload ratio of the containers. Both ratios are
Throughput/Capacity ratios. In the old days, this was called E/B for execution/bandwidth. Execution was cpu and banwidth was I/o. Solutions were said to be cpu or I/o bound.
Today memories are very large the critical factor is usually cpu/nest
capacity. We describe workloads as cpu intense or nest intense. A useful proxy for nest capacity is the size of highest level cache. A useful design ratio estimator is (clock x cores)/cache. Fir the same core count the machine with a lower design ratio will hold more containers. In part this is because the machine with more cache will scale better and see less saturation at higher utilization. By
I am just one day old to docker , so it is relatively very new to me .
I read the docker.io but could not get the answers to few basic questions . Here is what it is:
Docker is basically a tool which allows you to make use of the images and spin up your own customised images by installing softwares so that you can use to create the VMs using that .
Is this what docker is all about from a 10000 ft bird's eye piont of view?
2 . What exactly is the meaning of a container ? Is it synonymn for image?
3 . I remember reading somewhere that it allows you to deploy applications. Is this correct ? In other words will it behave like IIS for deploying the .net applications?
Please answer my questions above , so that I can understand it better and take it forward.
1) What docker is all about from a 10000 ft bird's eye point of view?
From the website: Docker is an open-source engine that automates the deployment of any application as a lightweight, portable, self-sufficient container that will run virtually anywhere.
Drill down a little bit more and a thorough explanation of the what/why docker addresses:
https://www.docker.io/the_whole_story/
https://www.docker.io/the_whole_story/#Why-Should-I-Care-(For-Developers)
https://www.docker.io/the_whole_story/#Why-Should-I-Care-(For-Devops)
Further depth can be found in the technology documentation:
http://docs.docker.io/introduction/technology/
2) What exactly is the meaning of a container ? Is it synonymn for image?
An image is the set of layers that are built up and can be moved around. Images are read-only.
http://docs.docker.io/en/latest/terms/image/
http://docs.docker.io/en/latest/terms/layer/
A container is an active (or inactive if exited) stateful instantiation of an image.
http://docs.docker.io/en/latest/terms/container/
See also: In Docker, what's the difference between a container and an image?
3)I remember reading somewhere that it allows you to deploy applications. Is this correct ? In other words will it behave like IIS for deploying the .net applications?
Yes, Docker can be used to deploy applications. You can deploy single components of the application stack or multiple components within a container. It depends on the use case. See the First steps with Docker page here: http://docs.docker.io/use/basics/
See also:
http://docs.docker.io/examples/nodejs_web_app/
http://docs.docker.io/examples/python_web_app/
http://docs.docker.io/examples/running_redis_service/
http://docs.docker.io/examples/using_supervisord/
So.
It's about providing the separation of processes that you get with virtualisation without the overhead. Of course this doesn't come without a cost - which in this case the largest one is that your docked containers will all be running under the same kernel.
A container is roughly a chroot (with better process encapsulation) and some ethernet virtualisation. The image is the filesystem (plus a few bits) that is mounted to provide the root filesystem^1
deploy is just the term docker uses for spinning up a container instance.
Effectively, each running instance of a container thinks that it is the only thing^2 running on that machine (much like a cloud appliance is typically designed). It provides more separation of processes than running on the host OS would provide, and allows for easily spinning up multiple separate copies of the container as needed; while providing much, much lower overheads than using full virtualisation would need.
^1: Actually there may be several layers of file-system sandwiched together to form the root file system.
^2: Docker does support multiple processes running within a single instance, but that is generally considered to be somewhat advanced usage.