I have the following question. I recently designed a Java application on Spring that works with a database, and I decided to perform some stress testing. Both the application and the database reside on a virtual Debian machine. I tested it with Gatling, and here is what I got:
request count 600 (OK=600 KO=0 )
min response time 12 (OK=12 KO=- )
max response time 159 (OK=159 KO=- )
mean response time 21 (OK=21 KO=- )
std deviation 13 (OK=13 KO=- )
response time 50th percentile 17 (OK=17 KO=- )
response time 75th percentile 22 (OK=22 KO=- )
mean requests/sec 10.01 (OK=10.01 KO=- )
t < 800 ms 600 (100%)
800 ms < t < 5000 ms 0 ( 0%)
t > 5000 ms 0 ( 0%)
failed 0 ( 0%)
So far, so good. After that, I decided to put the database and the jar into two containers. Here is a docker-compose.yml sample for that:
prototype-db:
  build: prototype-db
  volumes:
    - ./prototype-db/data:/var/lib/mysql:rw
    - ./prototype-db/scripts:/docker-entrypoint-initdb.d:ro
  ports:
    - "3306"

prototype:
  image: openjdk:8
  command: bash -c "cd /deploy && java -jar application.jar"
  volumes:
    - ./application/target:/deploy
  depends_on:
    - prototype-db
  ports:
    - "8080:8080"
  dns:
    - 172.16.10.1
    - 172.16.10.2
The Dockerfile looks like this:
FROM mysql:5.7.15
ENV MYSQL_DATABASE=document \
MYSQL_ROOT_PASSWORD=root \
MYSQL_USER=testuser \
MYSQL_PASSWORD=12345
EXPOSE 3306
Now, after testing that with Gatling, I got the following results:
---- Global Information --------------------------------------------------------
request count 6000 (OK=3946 KO=2054 )
min response time 0 (OK=124 KO=0 )
max response time 18336 (OK=18336 KO=77 )
mean response time 5021 (OK=7630 KO=10 )
std deviation 4136 (OK=2478 KO=9 )
response time 50th percentile 6516 (OK=8694 KO=9 )
response time 75th percentile 8732 (OK=8905 KO=14 )
mean requests/sec 87.433 (OK=57.502 KO=29.931)
---- Response Time Distribution ------------------------------------------------
t < 800 ms 65 ( 1%)
800 ms < t < 5000 ms 532 ( 9%)
t > 5000 ms 3349 ( 56%)
failed 2054 ( 34%)
---- Errors --------------------------------------------------------------------
java.io.IOException: Remotely closed 1494 (72.74%)
status.find.is(200), but actually found 500 560 (27.26%)
This is astonishing: the mean response time increased drastically and there are a lot of errors, yet this Docker Compose setup runs on the very same virtual Debian machine. What exactly could cause such an overhead? I thought Docker containers were a lot like native processes; they should not be running that slowly.
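While investigating, a minimal thing I can do is watch both containers during the Gatling run (the container names below are assumptions; the real ones come from docker ps):

docker ps                                   # list the real container names first
docker stats prototype_1 prototype-db_1     # live CPU / memory / I/O for both containers (names are assumptions)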
So I have a production-ready GKE private cluster environment where I host my Django REST Framework microservices.
It all works fine, but after 64 API requests the server times out and the pod becomes unreachable.
I am not sure why this is happening.
I use the following stack:
Django 3.2.13
GKE v1.22.8-gke.201
Postgresql Cloud SQL
Docker
My Django application is a simple one: no authentication on POST, just a small JSON body that is sent and saved to the PostgreSQL database.
The server connects to the database via cloud_sql_proxy, but I also tried using the IP and the PySQL library. That works, but I get the same error/timeout.
The workloads that have the issue are any workloads that make a DB call; it does not matter whether it is a SELECT * or an INSERT.
However, when I run a load test (Locust, Python) against the home page of any microservice within the cluster (readiness), I do not experience any API call timeouts or server restarts.
Type     Name                       # reqs    # fails    |    Avg    Min    Max    Med |   req/s  failures/s
---------|--------------------------|---------|-----------|-------|------|------|------|---------|-----------
POST     /api/logserver/logs/           64     0(0.00%)  |     96     29    140    110 |   10.00        0.00
---------|--------------------------|---------|-----------|-------|------|------|------|---------|-----------
         Aggregated                     64     0(0.00%)  |     96     29    140    110 |   10.00        0.00

Type     Name                       # reqs    # fails    |    Avg    Min    Max    Med |   req/s  failures/s
---------|--------------------------|---------|-----------|-------|------|------|------|---------|-----------
POST     /api/logserver/logs/           77    13(19.48%) |     92     17    140    100 |    0.90        0.00
---------|--------------------------|---------|-----------|-------|------|------|------|---------|-----------
         Aggregated                     77    13(19.48%) |     92     17    140    100 |    0.90        0.00
So it looks like it has something to do with the way the DB is connected to the pods?
I use cloud_sql_proxy to connect to the DB, and this also results in a timeout and a restart of the pod.
I have tried updating the gunicorn command in the Docker environment for Django to:
CMD gunicorn -b :8080 --log-level debug --timeout 90 --workers 6 --access-logfile '-' --error-logfile '-' --graceful-timeout 30 helloadapta.wsgi
And I have tried replacing gunicorn with uwsgi.
I also tried using plain python manage.py runserver 0.0.0.0:8080
They all serve the backend and I can connect to it, but the timeout issue persists.
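One thing I considered checking is whether each pod opens a new database connection per request. Here is a minimal sketch of persistent connections in the Django settings; CONN_MAX_AGE and all values below are placeholders and assumptions, not my actual config:

# settings.py (sketch) -- placeholder values only
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "logserver",      # placeholder database name
        "USER": "app",            # placeholder user
        "PASSWORD": "secret",     # placeholder password
        "HOST": "127.0.0.1",      # cloud_sql_proxy sidecar listens locally
        "PORT": "5432",
        "CONN_MAX_AGE": 60,       # reuse connections for up to 60 s instead of opening one per request
    }
}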
This is the infrastructure:
Private GKE cluster which uses subnetwork in GCP.
Cloud NAT on the network for an outbound external static IP (needed to whitelist the microservices with third-party servers)
The cluster has more than enough memory and CPU:
nodes: 3
total vCPUs: 24
total memory: 96GB
Each node has:
CPU allocatable: 7.91 CPU
Memory allocatable: 29.79 GB
The config in the yaml file states that the pod gets:
resources:
  limits:
    cpu: "1"
    memory: "2Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
Only when I make a readiness call to the server is there no timeout.
So it really points in the direction that Cloud SQL breaks after 64 API calls.
The Cloud SQL Database stack is:
1 sql instance
1 database within the instance
4 vCPUs
15 GB memory
max_connections 85000
The CPU utilisation never goes above 5%
I've run a mongodb service via docker-compose like this:
version: '2'
services:
  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    mem_limit: 4GB
If I run docker stats I can see 4 GB allocated:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
cf3ccbd17464 michal_mongo_1 0.64% 165.2MiB / 4GiB 4.03% 10.9kB / 4.35kB 0B / 483kB 35
But when I run this command, I get the RAM of my laptop, which is 32 GB:
~$ docker exec michal_mongo_1 free -g
total used free shared buff/cache available
Mem: 31 4 17 0 9 24
Swap: 1 0 1
How does mem_limit affect the memory size then?
free (and other utilities like top) will not report correct numbers inside a memory-constrained container, because they gather their information from /proc/meminfo, which is not namespaced.
If you want the actual limit, you must use the entries populated by the cgroup pseudo-filesystem under /sys/fs/cgroup.
For example:
docker run --rm -i --memory=128m busybox cat /sys/fs/cgroup/memory/memory.limit_in_bytes
The real-time usage information is available under /sys/fs/cgroup/memory/memory.stat.
You will probably need the resident-set-size (rss), for example (inside the container):
grep -E -e '^rss\s+' /sys/fs/cgroup/memory/memory.stat
For a more in-depth explanation, see also this article
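As a rough illustration (the paths assume cgroup v1, matching the examples above), the limit and the rss can be printed in MiB from inside the container:

# Print the configured limit and the current rss in MiB (cgroup v1 paths)
awk '{printf "limit: %d MiB\n", $1 / 1048576}' /sys/fs/cgroup/memory/memory.limit_in_bytes
awk '/^rss / {printf "rss:   %d MiB\n", $2 / 1048576}' /sys/fs/cgroup/memory/memory.stat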
I cannot get past 1200 RPS no matter if I use 4 or 5 workers.
I tried to start locust in 3 variations -- one, four, and five worker processes (docker-compose up --scale worker_locust=num_of_workers). I use 3000 clients with a hatch rate of 100. The service that I am loading is a dummy that just always returns yo and HTTP 200, i.e., it's not doing anything, but returning a constant string. When I have one worker I get up to 600 RPS (and start to see some HTTP errors), when I have 4 workers I can get up to the ~1200 RPS (without a single HTTP error):
When I have 5 workers I get the same ~1200 RPS, but with a lower CPU usage:
I suppose that if the CPU usage went down in the 5-worker case (with respect to the 4-worker case), then it's not the CPU that is limiting the RPS.
I am running this on a 6-core MacBook.
The locustfile.py I use posts essentially empty requests (just a few parameters):
from locust import HttpUser, task, constant

class QuickstartUser(HttpUser):
    wait_time = constant(1)  # seconds

    @task
    def add_empty_model(self):
        self.client.post(
            "/models",
            json={
                "grouping": {
                    "grouping": "a/b"
                },
                "container_image": "myrepo.com",
                "container_tag": "0.3.0",
                "prediction_type": "prediction_type",
                "model_state_base64": "bXkgc3RhdGU=",
                "model_config": {},
                "meta": {}
            }
        )
My docker-compose.yml:
services:
  myservice:
    build:
      context: ../
    ports:
      - "8000:8000"
  master_locust:
    image: locustio/locust
    ports:
      - "8089:8089"
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  worker_locust:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master_locust
Can someone suggest a direction for getting towards 2000 RPS?
You should check out the FAQ.
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
It's probably your server not being able to handle more requests, at least not from your one machine. There are things you can do to be more sure that's the case: you can try FastHttpUser, running on multiple machines, or just upping the number of users. But if you can, check how the server is handling the load and see what you can optimize there.
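For example, here is a minimal sketch of what a FastHttpUser variant of your task could look like (assuming a recent Locust version where FastHttpUser is importable from the top-level package and its client accepts a json argument):

from locust import FastHttpUser, task, constant

class QuickstartFastUser(FastHttpUser):
    wait_time = constant(1)  # same pacing as the HttpUser class above

    @task
    def add_empty_model(self):
        # Same kind of small JSON payload as in the original locustfile, trimmed here
        self.client.post("/models", json={"grouping": {"grouping": "a/b"}, "model_config": {}, "meta": {}})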
You will need more workers to generate more RPS, since one worker has a limited local port range when creating TCP connections to the destination.
You can check this value on your Linux worker:
net.ipv4.ip_local_port_range
Try tweaking that number on each of your Linux workers, or simply create hundreds of new workers on another, more powerful machine (your 6-core MacBook is too small).
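For example, a quick way to inspect and widen the range on a worker host (the values below are just an example):

# Show the current ephemeral port range
sysctl net.ipv4.ip_local_port_range
# Widen it; requires root and only lasts until reboot unless persisted in sysctl.conf
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"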
To create many workers, you could run Locust on Kubernetes with horizontal pod autoscaling for the worker deployment.
Here is a Helm chart to start playing around with a Locust k8s deployment:
https://github.com/deliveryhero/helm-charts/tree/master/stable/locust
You may need to check these args for it:
worker.hpa.enabled
worker.hpa.maxReplicas
worker.hpa.minReplicas
worker.hpa.targetCPUUtilizationPercentage
Simply set the maxReplicas value to get more workers when the load test starts, or scale the worker pods manually to your desired number with the kubectl command.
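For example (the deployment name depends on your Helm release and is an assumption here):

# Scale the Locust worker deployment to 50 pods manually
kubectl scale deployment locust-worker --replicas=50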
I've managed to generate at least 8K RPS (a stable value for my app; it can't serve more) with 1000 worker pods, using Locust parameters of around 200K users and a spawn rate of 2000 per second.
You may have to scale out your server as you reach higher throughput, but with 1000 worker pods I think you can easily reach 15K-20K RPS.
Server background: 3-node Elasticsearch cluster + Kibana + Logstash running in a Docker environment. The host server runs RHEL 7.7 (2 CPUs, 8 GB RAM, 200 GB file share).
Versions :
elasticsearch 7.5.1
kibana 7.5.1
logstash 7.5.1
filebeat 7.5.1 (runs on separate server)
## Cluster health status
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 116,
  "active_shards" : 232,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
## Node status
172.20.1.3 60 91 13 0.98 1.30 1.45 dilm - elasticsearch2
172.20.1.4 57 91 13 0.98 1.30 1.45 dilm - elasticsearch3
172.20.1.2 61 91 14 0.98 1.30 1.45 dilm * elasticsearch
## Host server TOP output
top - 11:37:10 up 11 days, 22:30, 3 users, load average: 0.74, 1.29, 1.47
Tasks: 210 total, 1 running, 209 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.4 us, 0.8 sy, 0.0 ni, 94.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7999840 total, 712736 free, 5842300 used, 1444804 buff/cache
KiB Swap: 3071996 total, 2794496 free, 277500 used. 1669472 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
48491 vtha837 20 0 4003724 713564 23880 S 5.0 8.9 1:57.53 java
59023 vtha837 20 0 6796456 1.5g 172556 S 2.7 20.3 28:18.07 java
59006 vtha837 20 0 6827100 1.5g 176840 S 1.7 19.9 30:03.59 java
760 vtha837 20 0 6877220 1.5g 180752 S 0.7 19.9 24:37.88 java
59610 vtha837 20 0 1663436 258152 7336 S 0.3 3.2 16:51.84 node
## Kibana environment variables I used for kibana docker image
environment:
  SERVER_NAME: "kibana"
  SERVER_PORT: 9548
  ELASTICSEARCH_PASSWORD: ${ES_PASSWORD}
  ELASTICSEARCH_HOSTS: "http://elasticsearch:9550"
  KIBANA_DEFAULTAPPID: "dashboard/Default"
  LOGGING_QUIET: "true"
  XPACK_SECURITY_ENCRYPTIONKEY: ${KIBANA_XPACK_SEC_KEY}
  XPACK_SECURITY_SESSIONTIMEOUT: 600000
Issue :
A. When I run Elasticsearch queries via the Kibana console, it takes at least 20000 ms to return output to the console. But if I run the same query directly against Elasticsearch via curl, Postman, or Chrome, it takes less than 200 ms to get the output.
B. This also happens when I load the Kibana dashboard (not all the time): I get the following error message and some graphs do not load, but I can't see any exceptions or errors in the console logs.
Error in visualization
[esaggs] > Request to Elasticsearch failed: {"error":{}}
If I refresh the page, I can see all the graphs.
Chrome performance profile hitting the Elasticsearch query URL directly: http://testnode.mycompany.com.nz:9550/_cat/indices
Chrome performance profile via the Kibana dev console, query: GET /_cat/indices
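A rough way to compare the two paths outside the browser is to time the direct query with curl (same URL as above):

# Time the direct Elasticsearch request; the Kibana console sends the same query through the Kibana server
curl -s -o /dev/null -w "direct: %{time_total}s\n" "http://testnode.mycompany.com.nz:9550/_cat/indices"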
What I don't understand is that if I run the same docker-compose file on my laptop (Windows 10, 16 GB, i7 2-core CPU, Docker Desktop running), I don't face any slowness, either for Kibana dev console queries or for querying Elasticsearch directly.
Has anyone had this issue? I would appreciate it if you could let me know how to fix it.
Thanks in advance.
The issue is Docker service discovery. For some reason, Docker service discovery was not happening; as soon as I changed the Elasticsearch host to the IP, I got the real performance.
Previous Kibana configuration in docker-compose:
ELASTICSEARCH_HOSTS: "http://elasticsearch:9550"
New configuration
ELASTICSEARCH_HOSTS: "http://172.20.1.2:9550"
For more details, refer to the Elasticsearch discuss page.
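A hedged sketch of how the elasticsearch service could instead be pinned to a fixed address in docker-compose, so the hard-coded IP stays stable across restarts (the network name and subnet below are assumptions):

# docker-compose sketch: give the elasticsearch service a static address on a user-defined network
services:
  elasticsearch:
    networks:
      esnet:
        ipv4_address: 172.20.1.2
networks:
  esnet:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.1.0/24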
I'm using docker 1.12 and docker-compose 1.12, on OSX.
I created a docker-compose.yml file which runs two containers:
the first, named spark, builds and runs a sparkjava application
the second, named behave, runs some functional tests on the API exposed by the first container.
version: "2"
services:
behave:
build:
context: ./src/test
container_name: "behave"
links:
- spark
depends_on:
- spark
entrypoint: ./runtests.sh spark:9000
spark:
build:
context: ./
container_name: "spark"
ports:
- "9000:9000"
As recommended by the Docker Compose documentation, I use a simple shell script to test whether the spark server is ready. This script is named runtests.sh and runs in the container named "behave". It is launched by docker-compose (see above):
#!/bin/bash
# This script waits for the API server to be ready before running functional tests with Behave
# the parameter should be the hostname for the spark server
set -e
host="$1"
echo "runtests host is $host"
until curl -L "http://$host"; do
>&2 echo "Spark server is not ready - sleeping"
sleep 5
done
>&2 echo "Spark server is up - starting tests"
behave
The DNS resolution does not seem to work. curl makes a request to spark.com instead of a request to my container named "spark".
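A minimal check of what the container actually resolves, assuming getent is available in the behave image:

# Resolve the linked service name from inside the running behave container
docker exec behave getent hosts spark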
UPDATE:
By setting an alias for my link (links: - spark:myserver), I saw that the DNS resolution is not done by Docker: I received an error message from a corporate network appliance (I'm running this from behind a corporate proxy, with Docker for Mac). Here is an extract of the output:
Recreating spark
Recreating behave
Attaching to spark, behave
behave | runtests host is myserver:9000
behave | % Total % Received % Xferd Average Speed Time Time Time Current
behave | Dload Upload Total Spent Left Speed
100 672 100 672 0 0 348 0 0:00:01 0:00:01 --:--:-- 348
behave | <HTML><HEAD>
behave | <TITLE>Network Error</TITLE>
behave | </HEAD>
behave | <BODY>
behave | ...
behave | <big>Network Error (dns_unresolved_hostname)</big>
behave | Your requested host "myserver" could not be resolved by DNS.
behave | ...
behave | </BODY></HTML>
behave | Spark server is up - starting tests
To solve this, I added a no_proxy environment variable with the name of the container I wanted to reach.
In the Dockerfile for the behave container, I have:
ENV http_proxy=http://proxy.mycompany.com:8080
ENV https_proxy=http://proxy.mycompany.com:8080
ENV no_proxy=127.0.0.1,localhost,spark
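An equivalent sketch would be to set the same variables from docker-compose instead of baking them into the image (values copied from the Dockerfile above):

# docker-compose sketch: pass the proxy settings and exclusions as container environment variables
behave:
  environment:
    - http_proxy=http://proxy.mycompany.com:8080
    - https_proxy=http://proxy.mycompany.com:8080
    - no_proxy=127.0.0.1,localhost,spark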