A representative example of a container technology is LXC (nowadays it may be Docker).
LXC is called container-virtualization.
Xen is called para-virtualization.
But LXC is not called paravirtualization, and Xen is not called containers.
What is the difference between the two techniques?
Is there a technical difference in how the two kinds of virtualization are implemented?
Although containers and virtual machines serve similar purposes, they take different approaches.
Virtualization: resources such as CPU, RAM and networking are virtualized down to the hardware level, so one physical machine presents itself as multiple separate machines.
Containers: they are a layer above the OS level. With Docker, all of an application's dependencies are packaged inside one container. That suits microservices, where you split Node.js, Python or PostgreSQL components into separate containers that can communicate with each other. The main advantage of containers is speed, because they are lightweight (especially with multi-stage builds). There are also security risks: containers share the host kernel, and many images can be compromised, so an exploit in one container can put the others at risk.
We are moving some of our internal services to rely on Docker instead of direct installation on the host OS (good thing, right :).
We use docker stack command with compose file (as it felt to us it is the modern approach). But we are not sure about how to properly make our stacks modular, while allowing composition:
Let's imagine we have two stacks: stackA and stackB. Those two can perfectly be used in isolation, so for the moment we decided to host them in two separate repositories, each containing the docker-compose.yml of the corresponding stack.
Yet, there is also a mode where stackB can communicate with stackA to provide additional features. On some nodes, we might want to deploy both, and have them communicate.
By default, when we start both stacks on the same node with:
docker stack deploy -c stackA/docker-compose.yml A-stack
docker stack deploy -c stackB/docker-compose.yml B-stack
Both end up on different overlay networks, and cannot easily communicate.
It seems we are faced with a choice, for which we could only find 3 options at the moment:
We have seen ways to add external networks to stackB in its compose file, but that means now the stackB can only be deployed if stackA already runs (because it wants to join an external network)
We could define another compose file, manually merging both. But that leads us to maintain another repo, and duplicate changes.
We could have the stack communicate over the host network through exposed ports, but it might feel a bit weird.
Is there a best/recommended approach to keep different stacks modular, while allowing to easily compose them?
Or is it an implicit assumption that as soon as two containers are supposed to communicate, they have to be deployed from the same compose file?
I usually run more than one stack when I want to manage them separately. A common situation is horizontal scaling of the same web service image for different customer installations with different configurations, e.g. databases.
Separate stacks let me easily shut one down without any impact on the other installations.
I also like the standard naming convention in multi-stack installations: the same services have the same names apart from the stack prefix.
To let stacks communicate across their boundaries, they only have to share the same network.
In my setups the first stack implicitly defines a network, and the other stack joins that network through its compose file configuration.
...
networks:
  default:
    external:
      name: FIRST_STACK_NAME_default
...
I'm trying to understand the benefits of Docker better and I am not really understanding how it would work in production.
Let's say I have a web frontend, a rest api backend and a db. That makes 3 containers.
Let's say that I want 3 of the front end, 5 of the backend and 7 of the db. (Minor question: Does it ever make sense to have less dbs than backend servers?)
Now, given the above scenario, if I package them all on the same host then I gain the benefit of efficiently using the resources of the host, but then I am DOA when that machine fails or has a network partition.
If I separate them into 1 full application (ie 1 FE, 1 BE & 1 DB) per host, and put extra containers on their own host, I get some advantages of using resources efficiently, but it seems to me that I still lose significantly when I have a network partition since it will take down multiple services.
Hence I'm almost leaning to the conclusion that I should be putting in 1 container per host, but then that means I am using my resources pretty inefficiently and then what are the benefits of containers in production? I mean, an OS might be an extra couple gigs per machine in storage size, but most cloud providers give you a minimum of 10 gigs storage. And let's face it, a rest api backend or a web front end is not gonna even come close to the 10 gigs...even including the OS.
So, after all that, I'm trying to figure out if I'm missing the point of containers? Are the benefits of keeping all containers of an application on 1 host, mostly tied to testing and development benefits?
I know there are benefits from moving containers amongst different providers/machines easily, but for the most part, I don't see that as a huge gain personally since that was doable with images...
Are there any other benefits for containers in production that I am missing? Or are the main benefits for testing and development? (Am I thinking about containers in production wrong)?
Note: The question is very broad and could fill an entire book but I'll shed some light.
Benefits of containers
The exciting part about containers is not their use on a single host, but their use across hosts connected in a large cluster. Do not look at your machines as independent docker hosts, but as a pool of resources to host your containers.
Containers alone are not ground-breaking (ie. Docker's CTO stating at the last DockerCon that "nobody cares about containers"), but coupled to state of the art schedulers and container orchestration frameworks, they become a very powerful abstraction to handle production-grade software.
As to the argument that it also applies to Virtual Machines, yes it does, but containers have some technical advantage (See: How is Docker different from a normal virtual machine) over VMs that makes them convenient to use.
On a Single host
On a single host, the benefits you can get from containers are (amongst many others):
Use as a development environment mimicking the behavior on a real production cluster.
Reproducible builds independent of the host (convenient for sharing)
Testing new software without bloating your machine with packages you won't use daily.
Extending from a single host to a pool of machines (cluster)
When time comes to manage a production cluster, there are two approaches:
Create a couple of docker hosts and run/connect containers together "manually" through scripts or using solutions like docker-compose. Monitoring the lifetime of your services/containers is your responsibility, and you should be prepared to handle service downtime.
Let a container orchestrator deal with everything and monitor the lifetime of your services to better cope with failures.
There are plenty of container orchestrators: Kubernetes, Swarm, Mesos, Nomad, Cloud Foundry, and probably many others. They power many large-scale companies and infrastructures, like Ebay, so they sure found a benefit in using these.
Pick the right replication strategy
A container is better used as a disposable resource meaning you can stop and restart the DB independently and it shouldn't impact the backend (other than throwing an error because the DB is down). As such you should be able to handle any kind of network partition as long as your services are properly replicated across several hosts.
You need to pick a proper replication strategy, to make sure your service stays up and running. You can for example replicate your DB across Cloud provider Availability Zones so that when an entire zone goes down, your data remains available.
Using Kubernetes, for example, you can put each of your containers (1 FE, 1 BE & 1 DB) in a pod. Kubernetes will deal with replicating this pod on many hosts and monitor that these pods are always up and running; if one is not, a new pod will be created to cope with the failure.
If you want to mitigate the effect of network partitions, specify node affinities, hinting the scheduler to place containers on the same subset of machines and replicate on an appropriate number of hosts.
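To make the replication argument concrete, here is a minimal Python sketch. It is not how Kubernetes or Swarm actually schedule; it simply spreads replicas round-robin over hosts and then checks what survives when one host is lost. The replica counts (3 FE, 5 BE, 7 DB) come from the question; the host names and both function names are made up for illustration.

```python
from collections import defaultdict

def spread_replicas(services, hosts):
    """Place each service's replicas round-robin across hosts.

    services: dict of service name -> replica count
    hosts: list of host names (hypothetical)
    Returns dict of host -> list of service names placed there.
    """
    placement = defaultdict(list)
    slot = 0
    for name, replicas in services.items():
        for _ in range(replicas):
            placement[hosts[slot % len(hosts)]].append(name)
            slot += 1
    return dict(placement)

def surviving_replicas(placement, failed_host):
    """Count the replicas of each service left after one host is lost."""
    counts = defaultdict(int)
    for host, names in placement.items():
        if host != failed_host:
            for name in names:
                counts[name] += 1
    return dict(counts)

services = {"frontend": 3, "backend": 5, "db": 7}
hosts = ["host-a", "host-b", "host-c"]
placement = spread_replicas(services, hosts)
for failed in hosts:
    print(failed, "down ->", surviving_replicas(placement, failed))
```

With three hosts, every service keeps at least one replica whichever single host fails, which is the property the replication strategy is meant to give you. A real orchestrator additionally restarts the lost replicas elsewhere.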
How many containers per host?
It really depends on the number of machines you use and the resources they have.
The rule is that you shouldn't bloat a host with too many containers if you don't specify any resource constraints (in terms of CPU or memory). Otherwise, you risk compromising the host and exhausting its resources, which in turn will impact all the other services on the machine. A good replication strategy is not only important at the level of a single service, but also to ensure good health for the pool of services that share a host.
Resource constraint should be dealt with depending on the type of your workload: a DB will probably use more resources than your Front-end container so you should size accordingly.
As an example, using Swarm, you can explicitly specify the number of CPUs or the amount of memory you need for a given service (see the docker service documentation). There are many possibilities, and you can also give upper and lower bounds for CPU or memory usage. Depending on the values chosen, the scheduler will place the service on a machine with enough available resources.
Kubernetes works pretty much the same way and you can specify limits for your pods (See documentation).
Mesos has more fine-grained resource management policies with frameworks (for specific workloads like Hadoop, Spark, and many more) and with over-committing capabilities. Mesos is especially convenient for Big Data kinds of workloads.
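The idea of a scheduler placing services according to declared resources can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the algorithm used by Swarm, Kubernetes or Mesos: each service declares a CPU and memory reservation, and it goes to the fitting node with the most free CPU. All node names, service names and numbers are hypothetical.

```python
def place(services, nodes):
    """Assign each service to the fitting node with the most free CPU.

    services: list of (name, cpus, mem_mb) reservations
    nodes: dict of node name -> {"cpus": float, "mem_mb": int} free capacity
    Returns dict of service name -> node name, or None if nothing fits.
    """
    free = {n: dict(cap) for n, cap in nodes.items()}  # do not mutate the input
    result = {}
    for name, cpus, mem_mb in services:
        candidates = [n for n, cap in free.items()
                      if cap["cpus"] >= cpus and cap["mem_mb"] >= mem_mb]
        if not candidates:
            result[name] = None  # a real scheduler would leave this pending
            continue
        # max() returns the first maximal element, so ties go to the earlier node
        chosen = max(candidates, key=lambda n: free[n]["cpus"])
        free[chosen]["cpus"] -= cpus
        free[chosen]["mem_mb"] -= mem_mb
        result[name] = chosen
    return result

nodes = {"node-1": {"cpus": 4.0, "mem_mb": 8192},
         "node-2": {"cpus": 2.0, "mem_mb": 4096}}
services = [("db", 2.0, 4096), ("backend", 1.0, 1024),
            ("frontend", 0.5, 256), ("batch", 4.0, 2048)]
assignment = place(services, nodes)
print(assignment)
```

The "batch" service asks for more CPU than any node has left, so it is not placed at all. That is the point of declaring constraints: the overload is refused up front instead of exhausting a host.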
How should services be split?
It really depends on the orchestration solution:
In Docker Swarm, you would create a service for each component (FE, BE, DB) and set the desired replication number for each service.
In Kubernetes, you can either create a pod encompassing the entire application (FE, BE, DB and the volume attached to the DB) or create separate pods for the FE, BE, DB+volume.
Generally: use one service per type of container. Regarding groups of containers, evaluate whether it is more convenient to scale the entire group of containers (as an atomic unit, i.e. a pod) than to manage them separately.
Sum up
Containers are better used with an orchestration framework/platform. There are plenty of available solutions to deal with container scheduling and resource management. Pick one that might fit your use case, and learn how to use it. Always pick an appropriate replication strategy, keeping in mind possible failure modes. Specify resource constraints for your containers/services when possible to avoid resource exhaustion which could potentially lead to bringing a host down.
This depends on the type of application you run in your containers. Off the top of my head I can think of a couple of different ways to look at this:
is your application disk-space heavy?
do you need the application to be fail-safe across multiple machines?
can you run multiple instances of different applications on the same host without decreasing their performance?
do you use software like kubernetes or swarm to handle your machines?
I think most of these questions are interesting to answer even without containers. Containers might free you from thinking about single hosts, but you still have to decide and measure the load of your host machines yourself.
Minor question: Does it ever make sense to have less dbs than backend servers?
Yes.
Consider cases where you run simple SQL SELECT statements (without many joins) to fetch data from the database, but your business logic demands a lot of computation. In those cases you might keep your backend service count high and your database service count low.
It all depends on the use case being solved.
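A back-of-the-envelope calculation shows why the counts can differ. The Python sketch below assumes each instance handles one request at a time and should stay below 70% utilization; the request rate and the per-request busy times are invented numbers, chosen only to illustrate a compute-heavy backend in front of a lightly used database.

```python
import math

def instances_needed(requests_per_s, busy_ms_per_request, utilization=0.7):
    """Instances required so each stays below the target utilization.

    An instance that is busy `busy_ms_per_request` per request can serve
    1000 / busy_ms_per_request requests per second at 100% utilization.
    """
    capacity_per_instance = 1000.0 / busy_ms_per_request * utilization
    return math.ceil(requests_per_s / capacity_per_instance)

load = 200  # requests per second (hypothetical)
backends = instances_needed(load, busy_ms_per_request=20)  # heavy business logic
databases = instances_needed(load, busy_ms_per_request=2)  # cheap SELECTs
print("backends:", backends, "databases:", databases)
```

Under these assumptions the backend needs several instances while a single database keeps up, before any replicas you add purely for availability.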
The number of containers per host depends on the design ratio of the host and the workload ratio of the containers. Both ratios are throughput/capacity ratios. In the old days, this was called E/B for execution/bandwidth. Execution was CPU and bandwidth was I/O. Solutions were said to be CPU bound or I/O bound.
Today memories are very large, so the critical factor is usually CPU/nest capacity. We describe workloads as CPU intense or nest intense. A useful proxy for nest capacity is the size of the highest-level cache. A useful design ratio estimator is (clock x cores)/cache. For the same core count, the machine with a lower design ratio will hold more containers. In part this is because the machine with more cache will scale better and see less saturation at higher utilization.
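As a worked example of the (clock x cores)/cache estimator, here is a short Python sketch. The two machines and their specifications are hypothetical, and the choice of units (GHz and MB) is an assumption; only the comparison between machines matters, not the absolute value.

```python
def design_ratio(clock_ghz, cores, cache_mb):
    """Estimator from the text: (clock x cores) / cache."""
    return clock_ghz * cores / cache_mb

# Hypothetical machines with the same core count but different cache sizes.
machines = {
    "small-cache": {"clock_ghz": 3.0, "cores": 16, "cache_mb": 24},
    "large-cache": {"clock_ghz": 3.0, "cores": 16, "cache_mb": 48},
}
ratios = {name: design_ratio(**spec) for name, spec in machines.items()}
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: design ratio {ratio:.2f}")
best = min(ratios, key=ratios.get)
```

Here the machine with twice the cache has half the design ratio, so by the rule above it is the better candidate for packing more containers.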
I've been making some tests with docker and so far I'm wondering why it's considered a good practice to separate the DB and the app in two containers.
Having two containers seems to be cumbersome to manage and I don't really see the value in it.
Whereas I like the idea of having a self sustainable container per app.
One reason is the separation of data storage and application. If you put each in its own container, you can update them independently. In my experience this is a common process, because usually the application will evolve faster than the underlying database.
It also frees you to run the containers in different places, which might be a constraint in your operations. Or to run multiple containers from the same database image with different applications.
Often it is also a good thing to be able to scale the UI from one instance to multiple instances, all connected to the same database (or cache instance or HTTP backend). This is mentioned briefly in the docker best practices.
I also understand the urge to run multiple processes in one container. That's why so many minimalist init systems/supervisors like s6 came up lately. I prefer this for demos of applications which require a couple things, like an nginx for frontend, a database and maybe a redis instance. But you could also write a basic docker-compose file and run the demo with multiple containers.
It depends on what you consider your "DB", is it the database application or the content.
The latter is easy: the content needs to be persisted outside the lifetime of the application. The convention used to be to have a "data" container, which simplified linking it with the application (e.g. using the Docker Engine create command --volumes-from parameter). With Docker 1.9 there is a new volume API which has superseded the concept of "data" containers. But you should never store your data in the overlay filesystem (not only for persistence, but also for performance).
If you are referring to a database application, you really enter a semi-religious debate with the microservices crowd. Docker is built to run a single process. It is built for 12-factor apps. It is built for microservices. It is definitely possible to run more than one process in a container, but then you have to consider the additional complexity of managing/monitoring these processes (e.g. using an init process like supervisord), dealing with logging, etc.
I've delivered both. If you are managing the container deployment (e.g. you are hosting the app), it is actually less work to use multiple containers. This allows you to use Docker's abstraction layers for networking and persistent storage. It also provides maximum portability as you scale the application (perhaps you may consider using convoy or flocker volume drivers or an overlay network for hosting containers across multiple servers). If you are developing a product for distribution, it is more convenient to deliver a single Docker Repository (with one Image). This minimizes the support costs as you guide customers through deployment.
I'm trying to get a good understanding of container technologies but am somewhat confused. It seems like certain technologies overlap different portions of the stack and different pieces of different technologies can be used as the DevOps team sees fit (e.g., can use Docker containers but don't have to use the Docker engine, could use engine from cloud provider instead). My confusion lies in understanding what each layer of the "Container Stack" provides and who the key providers are of each solution.
Here's my layman's understanding; would appreciate any corrections and feedback on holes in my understanding
1. Containers: self-contained package including application, runtime environment, system libraries, etc.; like a mini-OS with an application
   It seems like Docker is the de-facto standard. Any others that are notable and widely used?
2. Container Clusters: groups of containers that share resources
3. Container Engine: groups containers into clusters, manages resources
4. Orchestrator: is this any different from a container engine? How?
Where do Docker Engine, rkt, Kubernetes, Google Container Engine, AWS Container Service, etc. fall between #s 2-4?
This may be a bit long and present some oversimplification but should be sufficient to get the idea across.
Physical machines
Some time ago, the best way to deploy simple applications was to simply buy a new webserver, install your favorite operating system on it, and run your applications there.
The cons of this model are:
The processes may interfere with each other (because they share CPU and file system resources), and one may affect the other's performance.
Scaling this system up/down is difficult as well, taking a lot of effort and time in setting up a new physical machine.
There may be differences in the hardware specifications, OS/kernel versions and software package versions of the physical machines, which make it difficult to manage these application instances in a hardware-agnostic manner.
Applications, being directly affected by the physical machine specifications, may need specific tweaking, recompilation, etc, which means that the cluster administrator needs to think of them as instances at an individual machine level. Hence, this approach does not scale. These properties make it undesirable for deploying modern production applications.
Virtual Machines
Virtual machines solve some of the problems of the above:
They provide isolation even while running on the same machine.
They provide a standard execution environment (the guest OS) irrespective of the underlying hardware.
They can be brought up on a different machine (replicated) quite quickly when scaling (order of minutes).
Applications typically do not need to be rearchitected for moving from physical hardware to virtual machines.
But they introduce some problems of their own:
They consume large amounts of resources in running an entire instance of an operating system.
They may not start/go down as fast as we want them to (order of seconds).
Even with hardware assisted virtualization, application instances may see significant performance degradation over an application running directly on the host.
(This may be an issue only for certain kinds of applications)
Packaging and distributing VM images is not as simple as it could be.
(This is not as much a drawback of the approach, as it is of the existing tooling for virtualization.)
Containers
Then, somewhere along the line, cgroups (control groups) and namespaces were added to the Linux kernel. Together these features let us isolate processes in groups, decide what other processes and file systems they can see (namespaces), and perform resource accounting and limiting at the group level (cgroups).
Various container runtimes and engines came along that make it very easy to create a "container": an environment within the OS with limited visibility, resources, etc. Common examples include docker, rkt, runC and LXC.
Docker, for example, includes a daemon which provides interactions like creating an "image", a reusable entity that can be launched into a container instantly. It also lets one manage individual containers in an intuitive way.
The advantages of containers:
They are light-weight and run with very little overhead, as they do not have their own instance of the kernel/OS and are running on top of a single host OS.
They offer some degree of isolation between the various containers and the ability to impose limits on various resources consumed by them (using the cgroup mechanism).
The tooling around them has evolved rapidly to allow easy building of reusable units (images), repositories for storing image revisions (container registries) and so on, largely due to docker.
It is encouraged that a single container run a single application process, in order to maintain and distribute it independently. The light-weight nature of a container makes this preferable, and leads to faster development due to decoupling.
There are some cons as well:
The level of isolation provided is less than in the case of VMs.
They are easiest to use with stateless 12-factor applications being built afresh and a slight struggle if one tries to deploy legacy applications, clustered distributed databases and so on.
They need orchestration and higher level primitives to be used effectively and at scale.
Container Orchestration
When running applications in production, as complexity grows, an application tends to have many different components, some of which need to scale up or down as necessary. Containers themselves do not solve all our problems. We need a system that solves the problems associated with real large-scale applications, such as:
Networking between containers
Load balancing
Managing storage attached to these containers
Updating containers, scaling them, spreading them across nodes in a multi-node cluster and so on.
When we want to manage a cluster of containers, we use a container orchestration engine. Examples of these are Kubernetes, Mesos, Docker Swarm etc. They provide a host of functionality in addition to those listed above and the goal is to reduce the effort involved in dev-ops.
GKE (Google Container Engine) is hosted Kubernetes on Google Cloud Platform. It lets a user simply specify that they need an n-node kubernetes cluster and exposes the cluster itself as a managed instance. Kubernetes is open source and if one wanted to, one could also set it up on Google Compute Engine, a different cloud provider, or their own machines in their own data-center.
ECS is a proprietary container management/orchestration system built and operated by Amazon and available as part of the AWS suite.
To answer your questions specifically:
1. Docker engine: A tool to manage the lifecycle of docker containers and docker images. Create, restart, delete docker containers. Create, rename, delete docker images.
2. rkt: Analogous to docker engine, but a different implementation.
3. Kubernetes: A collection of tools to manage the lifecycle of a distributed application that uses containers. Contains tooling to manage containers, groups of containers and configuration for containers, to orchestrate containers and schedule them on actual instances, and to help developers write and maintain other services/tools that deal with containers.
4. Google Container Engine: Instead of getting VMs, installing "docker-engine" on them, installing kubernetes on them and getting it all to work with things like the right permissions to your infrastructure, imagine that it all came together, so that you choose the types of machines and the size of your cluster and all of this just works. Things like pulling images from your project-specific docker repository (google container registry), claiming persistent volumes, or provisioning load-balancers just work without worrying about service accounts, permissions and whatnot.
5. ECS: Analogous to GKE (4) but without Kubernetes.
To address the points in your understanding: you are loosely right about things (except container engine I think). It's important to understand that the only important thing to understand is what a container is. The rest of it is just marketing/product names. It's also important to understand that today's understanding of containers is very warped by what Docker containers are and a lot of the opinions enforced by Docker and tooling around Docker. Containers have been around for a long time.
So once you understand what a (docker) container is, a container engine is just a tool to manage them, a container cluster is just a group of containers, and an orchestrator is just a tool to manage where containers run based on some parameters. IMHO, you really don't need to worry too much about what the rest of the tooling is once you understand and build a solid mental model around containers. The rest will just fit in automatically.
The best way to understand all of this? Build & deploy a decently complex application with Docker (persist data/use a database in your app) and everything will make sense.
We have a little farm of docker containers, spread over several Amazon instances.
Would it make sense to have fewer big host images (in terms of ram and size) to host multiple smaller containers at once, or to have one host instance per container, sized according to container needs?
EDIT #1
The issue here is that we need to decide up-front. I understand that we can decide later using various monitoring stats, but we need to make some architecture and infrastructure decisions before it is going to be used. Moreover, we do not have control over what content is going to be deployed.
You should read
An Updated Performance Comparison of Virtual Machines and Linux Containers
http://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
and
Resource management in Docker
https://goldmann.pl/blog/2014/09/11/resource-management-in-docker/
You need to check how much memory, CPU, I/O, etc. your containers consume, and then draw your conclusions.
At the very least, you can easily check a few things with docker stats and docker top my_container
The associated docs:
https://docs.docker.com/engine/reference/commandline/stats/
https://docs.docker.com/engine/reference/commandline/top/
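To turn those measurements into a sizing decision, you can export the numbers and add them up. The Python sketch below assumes the stats were captured with docker stats --no-stream --format "{{.Name}},{{.CPUPerc}},{{.MemPerc}}", which prints one comma-separated line per container. The sample text is invented for illustration and parse_stats is a hypothetical helper, not part of Docker.

```python
def parse_stats(text):
    """Parse lines of 'name,cpu%,mem%' into a list of dicts."""
    rows = []
    for line in text.strip().splitlines():
        name, cpu, mem = line.split(",")
        rows.append({"name": name,
                     "cpu_pct": float(cpu.rstrip("%")),
                     "mem_pct": float(mem.rstrip("%"))})
    return rows

# Hypothetical output captured from the docker stats command above.
sample = """\
frontend,1.50%,2.00%
backend,12.25%,6.50%
db,30.00%,41.00%
"""
rows = parse_stats(sample)
total_cpu = sum(r["cpu_pct"] for r in rows)
total_mem = sum(r["mem_pct"] for r in rows)
heaviest = max(rows, key=lambda r: r["mem_pct"])["name"]
print(f"CPU {total_cpu}%  memory {total_mem}%  heaviest by memory: {heaviest}")
```

A single snapshot like this is only a starting point; sampling repeatedly under realistic load gives a much better picture of how many such containers one host can carry.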