I have created a number of Spring Boot applications, which all work like magic in isolation or when started one after the other manually.
My challenge is that I want to deploy a stack with all the services in a Docker Swarm.
Initially I didn't understand what was going on, as it seemed like all my containers were hanging.
It turns out that running a single Spring Boot application spikes my CPU utilization to its maximum for a good number of seconds (20s+ to start up).
Now the issue is that Docker Swarm launches 10 of these containers simultaneously, my load average goes above 80, and the system grinds to a halt. The container HEALTHCHECKs start timing out and eventually Docker restarts them. This becomes an endless cycle that may or may not stabilize, and if it does stabilize, it takes a minimum of 30 minutes. So much for microservices vs big fat Java EE applications :(
Is there any way to convince Docker to roll out the containers one by one? I'm sure this would help a lot.
There is a rolling update parameter - https://docs.docker.com/engine/swarm/swarm-tutorial/rolling-update/ - but it does not seem applicable to the initial startup deployment.
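What I have in mind is something like the following untested sketch, assuming each service is defined with replicas: 0 under deploy in the compose file so that nothing starts until it is explicitly scaled up (the stack and service names are placeholders for my real ones):

#!/bin/bash
# Deploy the stack; with replicas: 0 no containers are started yet
docker stack deploy --compose-file docker-compose.yml mystack

# Bring the services up one at a time, waiting for each to report 1/1
for svc in service-a service-b service-c; do
    docker service scale "mystack_${svc}=1"
    until [ "$(docker service ls --filter "name=mystack_${svc}" --format '{{.Replicas}}')" = "1/1" ]; do
        sleep 5
    done
done

If there is a built-in way to get the same effect, that would obviously be preferable.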
Your help will be greatly appreciated.
I've also tried systemd (which isn't ideal for distributed microservices). It worked slightly better than Docker, but had the same issue when deploying all the applications at once.
Initially I wanted to try Kubernetes, but I've got enough on my plate and if I can get away with Docker Swarm, that would be awesome.
Thanks!
Related
Let me give a brief overview of my project: I'm building containerized security training environments (future plan: randomized security environments), aimed at helping local students and organizations with their information security training needs.
Current setup - I have an instance group which auto-scales according to load, running a script to add and remove nodes from the swarm. I use pub/sub topics to handle deployment requests, which are deployed through the docker stack deploy command. It was tested by 4-5 people and was thought to be working perfectly until we started trials with my own college's students.
It ran into issues such as port numbers not being assigned to new deployments after 20-25 people had deployed onto the swarm. I don't understand why: resource usage is optimal, but Swarm isn't assigning ports. After restarting the whole instance, Swarm assigned the ports again.
I knew there was a swarm option, task-history-limit, which defaults to 5; maybe that was the issue and the reason it wasn't able to deploy concurrently. But the same thing happened later (after about 40 deployments) even after setting it to a higher number (upgraded infrastructure, low utilization in the logs). Even now I'm getting nightmares from not knowing the real reason why this is happening.
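For reference, the per-user deployment and the history-limit change look roughly like this (the stack name, compose file and limit value here are placeholders for my real ones):

# Deploy one isolated environment per user
docker stack deploy --compose-file stack.yml "env-${USER_ID}"

# Raise the task history limit on a manager node (default is 5)
docker swarm update --task-history-limit 50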
Sample deployment stack - https://gist.github.com/Mre11i0t/d16ed39e543094b50019d58d7e4bff99. The aim is to deploy this environment on an on-demand basis, with each environment isolated to its respective user.
I am using a Docker Setup that consists of 14 different containers. Every container gets a cpu_limit of 2 and a mem_limit of 2g.
To create and run these containers, I've written a Python script that uses the docker-py library. As of now, the containers are created sequentially, which takes approximately 2 minutes.
Now I'm thinking about parallelizing the process. So instead of doing (pseudocode):
for container in containers_to_start:
    create_container(container)
I do
from multiprocessing.dummy import Pool as ThreadPool  # thread-based pool

pool = ThreadPool(4)  # create up to 4 containers concurrently
pool.map(create_container, containers_to_start)
As a result, the 14 containers are created about 2x faster. BUT: the applications within the containers take significantly longer to boot. At the end of the day I don't really gain much; the time until every application is reachable is more or less the same, with or without multithreading.
I don't really understand why, because every container gets the same CPU and memory limits, so I would expect the same boot time no matter how many containers are starting at the same time. Clearly this is not the case. Maybe I'm missing some knowledge here; any explanation would be greatly appreciated.
System Specs
CPU: Intel i7 @ 2.90 GHz
32GB RAM
I am using Windows 10 with Docker running on the WSL2 backend.
I am trying to understand the difference between a Docker container process and an IIS process. From the container perspective, it is not advisable to have more than one process in a single container, and in IIS you cannot do that either: each application is executed in its own process. So if IIS provides me the same process isolation, why should I use containers?
Although IIS gives you process isolation per app, Docker provides another layer of isolation, where memory usage and kernel access are also isolated. Bear in mind that containers contain everything that is needed to run something, including the OS; only the physical memory and the host kernel are shared, as is the case with VMs, for example. So in a way containers give you even higher isolation than just having a separate process for an application.
But that is not the main selling point of containers. The main selling point is that they are a scalable solution that is basically infrastructure as code, and thus easier to manage and deploy to any environment. Being infrastructure as code also means a container will work the same wherever you deploy it, since you include everything that is needed in it. And if your apps get a lot of traffic, you can deploy multiple replicas of the same container in a cluster behind a load balancer and avoid those bottlenecks.
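For example, with Docker Swarm you can run several replicas of the same container behind its built-in load balancer; a minimal illustration (the image and service name are placeholders):

# Run 3 replicas of the same app; swarm spreads requests on port 80 across them
docker service create --name web --replicas 3 --publish 80:80 myregistry/my-web-app:latest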
A second point is that during development you have a history of what was added to and removed from the container in order to reach a stable environment. That, plus the ability to deploy dev instances alongside prod instances and simply switch over, keeps the downtime from an unforeseen error minimal, since you can just redirect back to the old prod container until the fix is out.
A bit of a rant there and yet there is more.
Our application consists of circa 20 modules. Each module contains a (Helm) chart with several deployments, services and jobs. Some of those jobs are defined as Helm pre-install and pre-upgrade hooks. Altogether there are probably about 120 yaml files, which eventually result in about 50 running pods.
During development we are running Docker for Windows version 2.0.0.0-beta-1-win75 with Docker 18.09.0-ce-beta1 and Kubernetes 1.10.3. To simplify management of our Kubernetes yaml files we use Helm 2.11.0. Docker for Windows is configured to use 2 CPU cores (of 4) and 8GB RAM (of 24GB).
When creating the application environment for the first time, it takes more than 20 minutes to become available. This seems far too slow; we are probably making an important mistake somewhere. We have tried to improve the (re)start time, but to no avail. Any help or insights to improve the situation would be greatly appreciated.
A simplified version of our startup script:
#!/bin/bash
# Start some infrastructure
helm upgrade --force --install infrastructure modules/infrastructure/chart
# Start ~20 modules in parallel
helm upgrade --force --install module01 modules/module01/chart &
[...]
helm upgrade --force --install module20 modules/module20/chart &
# Helper that waits until all modules are up
await_modules
Executing the same startup script again later to 'restart' the application still takes about 5 minutes. As far as I know, unchanged objects are not modified at all by Kubernetes. Only the circa 40 hooks are run by Helm.
Running a single hook manually with docker run is fast (~3 seconds). Running that same hook through Helm and Kubernetes regularly takes 15 seconds or more.
Some things we have discovered and tried are listed below.
Linux staging environment
Our staging environment consists of Ubuntu with native Docker. Kubernetes is installed through minikube with --vm-driver none.
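In other words, the cluster there is started with something along these lines:

# Run the Kubernetes components directly on the host's Docker, without a VM
sudo minikube start --vm-driver=none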
Contrary to our local development environment, the staging environment retrieves the application code through a (deprecated) gitRepo volume for almost every deployment and job. Understandably, this only seems to worsen the problem. Starting the environment for the first time takes over 25 minutes; restarting it takes about 20 minutes.
We tried replacing the gitRepo volume with a sidecar container that retrieves the application code as a TAR. Although we have not modified the whole application, initial tests indicate this is not particularly faster than the gitRepo volume.
This situation could probably be improved with an alternative type of volume that enables sharing of code between deployments and jobs. We would rather not introduce more complexity, though, so we have not explored this avenue any further.
Docker run time
Executing a single empty alpine container through docker run alpine echo "test" takes roughly 2 seconds. This seems to be overhead of the setup on Windows. That same command takes less than 0.5 seconds on our Linux staging environment.
Docker volume sharing
Most of the containers - including the hooks - share code with the host through a hostPath. The command docker run -v <host path>:<container path> alpine echo "test" takes 3 seconds to run. Using volumes seems to increase runtime by approximately 1 second.
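For completeness, these are roughly the commands we timed (the host path is just an example):

# Plain container start, ~2 seconds on Windows
time docker run --rm alpine echo "test"

# Same container with a host directory mounted, ~3 seconds
time docker run --rm -v /c/Users/dev/project:/code alpine echo "test"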
Parallel or sequential
Sequential execution of the commands in the startup script does not improve startup time, nor does it drastically worsen it.
IO bound?
The Windows Task Manager indicates that IO is at 100% when executing the startup script. Our hooks and application code are not IO intensive at all, so the IO load seems to originate from Docker, Kubernetes or Helm. We have tried to find the bottleneck, but were unable to pinpoint the cause.
Reducing IO through ramdisk
To test the premise of being IO bound further, we exchanged /var/lib/docker with a ramdisk in our Linux staging environment. Starting the application with this configuration was not significantly faster.
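Simplified, the ramdisk swap on the staging machine looked roughly like this (the tmpfs size is arbitrary):

# Stop Docker, back /var/lib/docker with a tmpfs ramdisk, start Docker again
sudo systemctl stop docker
sudo mount -t tmpfs -o size=8g tmpfs /var/lib/docker
sudo systemctl start docker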
To compare Kubernetes with plain Docker, you need to consider that Kubernetes runs more or less the same Docker command as a final step. Before that happens, many other things take place:
the authentication and authorization process, creating objects in etcd, locating the correct nodes for the pods and scheduling them, provisioning storage, and much more.
Helm itself also adds overhead to the process, depending on the size of the chart.
I recommend reading One year using Kubernetes in production: Lessons learned. The author explains what they achieved by switching to Kubernetes, as well as the differences in overhead:
Cost calculation
Looking at costs, there are two sides to the story. To run Kubernetes, an etcd cluster is required, as well as a master node. While these are not necessarily expensive components to run, this overhead can be relatively expensive when it comes to very small deployments. For these types of deployments, it’s probably best to use a hosted solution such as Google's Container Service.
For larger deployments, it’s easy to save a lot on server costs. The overhead of running etcd and a master node aren’t significant in these deployments. Kubernetes makes it very easy to run many containers on the same hosts, making maximum use of the available resources. This reduces the number of required servers, which directly saves you money. When running Kubernetes sounds great, but the ops side of running such a cluster seems less attractive, there are a number of hosted services to look at, including Cloud RTI, which is what my team is working on.
I need to make a decision on container orchestration and need help identifying real-world limitations of using Docker Swarm over Kubernetes. If anyone has faced any such limitation, please share.
The cluster may reach approximately 50-100 containers.
Docker Swarm is young and a lot of features have been introduced relatively quickly. This, however, also causes more issues and open "serious" bugs. For a production system that should be up 100% of the time, that might be a problem. I personally experienced a bug that made it impossible to start new containers because they were assigned an IP address that was already taken. This forced me to shut down my swarm (it's a dev system, so I didn't mind too much).
I suggest having a look at the most commented Swarm bugs/issues on GitHub.