Elasticsearch query slow response via Kibana console - Docker

Server background: 3-node Elasticsearch cluster + Kibana + Logstash running in a Docker environment. The host server runs RHEL 7.7 (2 CPUs, 8 GB RAM, 200 GB fileshare).
Versions:
elasticsearch 7.5.1
kibana 7.5.1
logstash 7.5.1
filebeat 7.5.1 (runs on a separate server)
## Cluster health status
{
  "cluster_name" : "es-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 116,
  "active_shards" : 232,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
## Node status
172.20.1.3 60 91 13 0.98 1.30 1.45 dilm - elasticsearch2
172.20.1.4 57 91 13 0.98 1.30 1.45 dilm - elasticsearch3
172.20.1.2 61 91 14 0.98 1.30 1.45 dilm * elasticsearch
## Host server TOP output
top - 11:37:10 up 11 days, 22:30, 3 users, load average: 0.74, 1.29, 1.47
Tasks: 210 total, 1 running, 209 sleeping, 0 stopped, 0 zombie
%Cpu(s): 4.4 us, 0.8 sy, 0.0 ni, 94.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7999840 total, 712736 free, 5842300 used, 1444804 buff/cache
KiB Swap: 3071996 total, 2794496 free, 277500 used. 1669472 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
48491 vtha837 20 0 4003724 713564 23880 S 5.0 8.9 1:57.53 java
59023 vtha837 20 0 6796456 1.5g 172556 S 2.7 20.3 28:18.07 java
59006 vtha837 20 0 6827100 1.5g 176840 S 1.7 19.9 30:03.59 java
760 vtha837 20 0 6877220 1.5g 180752 S 0.7 19.9 24:37.88 java
59610 vtha837 20 0 1663436 258152 7336 S 0.3 3.2 16:51.84 node
## Kibana environment variables I used for kibana docker image
environment:
  SERVER_NAME: "kibana"
  SERVER_PORT: 9548
  ELASTICSEARCH_PASSWORD: ${ES_PASSWORD}
  ELASTICSEARCH_HOSTS: "http://elasticsearch:9550"
  KIBANA_DEFAULTAPPID: "dashboard/Default"
  LOGGING_QUIET: "true"
  XPACK_SECURITY_ENCRYPTIONKEY: ${KIBANA_XPACK_SEC_KEY}
  XPACK_SECURITY_SESSIONTIMEOUT: 600000
Issue:
A. When I run Elasticsearch queries via the Kibana console, it takes at least 20,000 ms to return output to the console. But if I run the same query directly against Elasticsearch via curl, Postman, or Chrome, it takes less than 200 ms.
B. This even happens (though not every time) when I load a Kibana dashboard: I get the following error message and some graphs don't load, but I can't see any exceptions or errors in the console logs.
Error in visualization
[esaggs] > Request to Elasticsearch failed: {"error":{}}
If I refresh the page, I can see all the graphs.
Chrome performance profile when hitting the Elasticsearch query URL directly: http://testnode.mycompany.com.nz:9550/_cat/indices
Chrome performance profile via the Kibana dev console query: GET /_cat/indices
What I don't understand is that if I run the same docker-compose file on my laptop (Windows 10, 16 GB RAM, 2-core i7, Docker Desktop), I don't face any slowness, whether querying via the Kibana dev console or querying Elasticsearch directly.
Has anyone had this issue? I'd appreciate any pointers on how to fix it.
Thanks in advance.

The issue was Docker service discovery. For some reason service discovery was not happening; as soon as I changed the Elasticsearch host to an IP address, I got the real performance.
Previous Kibana configuration in docker-compose:
ELASTICSEARCH_HOSTS: "http://elasticsearch:9550"
New configuration:
ELASTICSEARCH_HOSTS: "http://172.20.1.2:9550"
For more details, refer to the Elasticsearch discuss page.
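One way to make the hard-coded IP less fragile is to pin the address yourself in a user-defined network, so it is predictable rather than assigned by Docker. This is only a sketch; the service names, subnet, and addresses below are illustrative, not taken from the original compose file:

```yaml
# Hypothetical docker-compose fragment; adjust names and subnet to your setup.
services:
  elasticsearch:
    networks:
      esnet:
        ipv4_address: 172.20.1.2
  kibana:
    environment:
      ELASTICSEARCH_HOSTS: "http://172.20.1.2:9550"
    networks:
      - esnet
networks:
  esnet:
    ipam:
      config:
        - subnet: 172.20.1.0/24
```

With a static `ipv4_address`, Kibana no longer depends on Docker's embedded DNS resolving the `elasticsearch` service name on every request.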

Related

GKE private cluster Django server timeout 502 after 64 API requests

So I have a production-ready GKE private cluster environment where I host my Django REST Framework microservices.
It all works fine, but after 64 API requests the server times out and the pod becomes unreachable.
I am not sure why this is happening.
I use the following stack:
Django 3.2.13
GKE v1.22.8-hke.201
Postgresql Cloud SQL
Docker
My Django application is a simple one: no authentication on POST, just a small JSON body that is sent and saved by the server to the PostgreSQL database.
The server connects to the database via cloud_sql_proxy, but I also tried using the IP and the PySQL library. It works, but gives the same error/timeout.
The workloads having the issue are any that make a DB call; it does not matter whether it is a SELECT * or an INSERT.
However, when I run a load test (Locust, Python) against the home page of any microservice within the cluster (readiness), I do not experience any API call timeouts or server restarts.
Type   Name                     # reqs       # fails  |  Avg  Min  Max  Med  |  req/s  failures/s
-------|------------------------|--------|------------|------|-----|-----|-----|-------|-----------
POST   /api/logserver/logs/         64    0 (0.00%)   |   96   29  140  110  |  10.00        0.00
       Aggregated                   64    0 (0.00%)   |   96   29  140  110  |  10.00        0.00

Type   Name                     # reqs       # fails  |  Avg  Min  Max  Med  |  req/s  failures/s
-------|------------------------|--------|------------|------|-----|-----|-----|-------|-----------
POST   /api/logserver/logs/         77   13 (19.48%)  |   92   17  140  100  |   0.90        0.00
       Aggregated                   77   13 (19.48%)  |   92   17  140  100  |   0.90        0.00
So it looks like it has something to do with the way the DB is connected to the pods?
I use cloud_sql_proxy to connect to the DB, and this also results in a timeout and a restart of the pod.
I have tried updating gunicorn in the docker environment for Django to:
CMD gunicorn -b :8080 --log-level debug --timeout 90 --workers 6 --access-logfile '-' --error-logfile '-' --graceful-timeout 30 helloadapta.wsgi
And I have tried replacing gunicorn with uWSGI.
I also tried plain python manage.py runserver 0.0.0.0:8080.
They all serve the backend and I can connect to it, but the timeout issue persists.
This is the infrastructure:
Private GKE cluster which uses subnetwork in GCP.
Cloud NAT on network for outbound external static IP (needed to whitelist microservers in third party servers)
The cluster has more than enough memory and CPU:
nodes: 3
total vCPUs: 24
total memory: 96GB
Each node has:
CPU allocatable: 7.91 CPU
Memory allocatable: 29.79 GB
The config in the yaml file states that the pod gets:
resources:
  limits:
    cpu: "1"
    memory: "2Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
Only when I make a readiness call to the server is there no timeout.
So it really points in the direction that Cloud SQL breaks after 64 API calls.
The Cloud SQL Database stack is:
1 sql instance
1 database within the instance
4 vCPUs
15 GB memory
max_connections 85000
The CPU utilisation never goes above 5%

What is the real memory available in a Docker container?

I've run a MongoDB service via docker-compose like this:
version: '2'
services:
  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    mem_limit: 4GB
If I run docker stats I can see 4 GB allocated:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
cf3ccbd17464 michal_mongo_1 0.64% 165.2MiB / 4GiB 4.03% 10.9kB / 4.35kB 0B / 483kB 35
But if I run this command, I get the RAM of my laptop, which is 32 GB:
~$ docker exec michal_mongo_1 free -g
total used free shared buff/cache available
Mem: 31 4 17 0 9 24
Swap: 1 0 1
How does mem_limit affect the memory size then?
free (and other utilities like top) will not report correct numbers inside a memory-constrained container, because they gather their information from /proc/meminfo, which is not namespaced.
If you want the actual limit, you must use the entries populated by the cgroup pseudo-filesystem under /sys/fs/cgroup.
For example:
docker run --rm -i --memory=128m busybox cat /sys/fs/cgroup/memory/memory.limit_in_bytes
The real-time usage information is available under /sys/fs/cgroup/memory/memory.stat.
You will probably need the resident-set-size (rss), for example (inside the container):
grep -E -e '^rss\s+' /sys/fs/cgroup/memory/memory.stat
For a more in-depth explanation, see also this article
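As a quick sketch (cgroup v1 paths; the helper name and sample values are mine, not from the question), pulling the anonymous (rss) figure out of a memory.stat dump is a one-liner:

```shell
# Hypothetical helper: print the "rss" value (bytes) from a memory.stat file.
# Note: cgroup v2 hosts expose memory.current and an "anon" line instead.
stat_rss() {
  awk '$1 == "rss" { print $2 }' "$1"
}

# Demo against a saved sample (values illustrative); inside a real container
# you would point it at /sys/fs/cgroup/memory/memory.stat instead.
cat > /tmp/memory.stat.sample <<'EOF'
cache 1048576
rss 2097152
mapped_file 0
EOF
stat_rss /tmp/memory.stat.sample   # → 2097152
```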

Why does my Elasticsearch Docker container always have status "restarting"?

Ubuntu 16.04, 1 GB RAM, on an AWS instance.
I had to run an old instance of Elasticsearch, so I wanted to use the Docker image of Elasticsearch version 5.3.3. After going through multiple Stack Overflow links with the same title, I modified my installation of the Docker-based Elasticsearch as below:
sudo docker run -p 9200:9200 -p 9300:9300 -d -e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" -e "xpack.security.enabled=false" --restart=unless-stopped --name careerassistant-elastic docker.elastic.co/elasticsearch/elasticsearch:5.3.3
The installation finished, but I have a problem accessing Elasticsearch; even with the multiple modifications in the command above, I couldn't resolve the issue. When I run
sudo docker ps
the status is still --> restarting(1) 48 seconds ago
When I checked the Docker logs, I couldn't understand anything, as I am new to Docker and its use:
> docker logs --tail 50 --follow --timestamps careerassistant-elastic
I got the following output:
2020-05-04T09:36:00.552415247Z CmaTotal: 0 kB
2020-05-04T09:36:00.552418314Z CmaFree: 0 kB
2020-05-04T09:36:00.552421364Z HugePages_Total: 0
2020-05-04T09:36:00.552424343Z HugePages_Free: 0
2020-05-04T09:36:00.552427401Z HugePages_Rsvd: 0
2020-05-04T09:36:00.552430358Z HugePages_Surp: 0
2020-05-04T09:36:00.552433336Z Hugepagesize: 2048 kB
2020-05-04T09:36:00.552436334Z DirectMap4k: 67584 kB
2020-05-04T09:36:00.552439415Z DirectMap2M: 980992 kB
2020-05-04T09:36:00.552442390Z
2020-05-04T09:36:00.552445460Z
2020-05-04T09:36:00.552448777Z CPU:total 1 (initial active 1) (1 cores per cpu, 1 threads per core) family 6 model 63 stepping 2, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, lzcnt, tsc, bmi1, bmi2
2020-05-04T09:36:00.552452312Z
2020-05-04T09:36:00.552455227Z /proc/cpuinfo:
2020-05-04T09:36:00.552458471Z processor : 0
2020-05-04T09:36:00.552461695Z vendor_id : GenuineIntel
2020-05-04T09:36:00.552464872Z cpu family : 6
2020-05-04T09:36:00.552467992Z model : 63
2020-05-04T09:36:00.552471311Z model name : Intel(R) Xeon(R) CPU E5-2676 v3 # 2.40GHz
2020-05-04T09:36:00.552474616Z stepping : 2
2020-05-04T09:36:00.552477715Z microcode : 0x43
2020-05-04T09:36:00.552480781Z cpu MHz : 2400.040
2020-05-04T09:36:00.552483934Z cache size : 30720 KB
2020-05-04T09:36:00.552486978Z physical id : 0
2020-05-04T09:36:00.552490023Z siblings : 1
2020-05-04T09:36:00.552493103Z core id : 0
2020-05-04T09:36:00.552496146Z cpu cores : 1
2020-05-04T09:36:00.552511390Z apicid : 0
2020-05-04T09:36:00.552515457Z initial apicid : 0
2020-05-04T09:36:00.552518523Z fpu : yes
2020-05-04T09:36:00.552521677Z fpu_exception : yes
2020-05-04T09:36:00.552524702Z cpuid level : 13
2020-05-04T09:36:00.552527802Z wp : yes
2020-05-04T09:36:00.552531691Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
2020-05-04T09:36:00.552535638Z bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
2020-05-04T09:36:00.552538954Z bogomips : 4800.08
2020-05-04T09:36:00.552545171Z clflush size : 64
2020-05-04T09:36:00.552548419Z cache_alignment : 64
2020-05-04T09:36:00.552551514Z address sizes : 46 bits physical, 48 bits virtual
2020-05-04T09:36:00.552554916Z power management:
2020-05-04T09:36:00.552558030Z
2020-05-04T09:36:00.552561090Z
2020-05-04T09:36:00.552564141Z
2020-05-04T09:36:00.552567135Z Memory: 4k page, physical 1014424k(76792k free), swap 0k(0k free)
2020-05-04T09:36:00.552570458Z
2020-05-04T09:36:00.552573441Z vm_info: OpenJDK 64-Bit Server VM (25.131-b11) for linux-amd64 JRE (1.8.0_131-b11), built on Jun 16 2017 13:51:29 by "buildozer" with gcc 6.3.0
2020-05-04T09:36:00.552576947Z
2020-05-04T09:36:00.552579894Z time: Mon May 4 09:36:00 2020
2020-05-04T09:36:00.552582956Z elapsed time: 0 seconds (0d 0h 0m 0s)
2020-05-04T09:36:00.552586052Z
Can someone help me figure out what could be causing the Docker status to be "restarting"?
I run my Docker container on an AWS EC2 t2.small, which has 2 GB RAM, as a t2.micro's memory (1 GB) isn't enough for running the Elasticsearch container; so it should be fine for you as well unless you have configured a lot more. I looked into your logs but don't see any error, hence it's difficult to debug without your Dockerfile.
Below is my docker-compose file for running Elasticsearch 7.6 in a Docker container on an AWS t2.small instance; let me know if it doesn't work for you and I'm happy to help further.
version: '2.2'
services:
  #Elasticsearch Docker Images: https://www.docker.elastic.co/
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
volumes:
  elasticsearch-data:
    driver: local
You can run it with docker-compose up -d, adding discovery.type=single-node to the environment section to run in single-node mode (docker-compose up itself doesn't take a -e flag). And refer to my answer "Elasticsearch docker container in non-prod mode to eliminate vm.max_map_count=262144 requirement" if you face any memory-related issue like the vm.max_map_count=262144 requirement.
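Separately, a host-side step that commonly stops an endlessly restarting Elasticsearch container is raising the mmap limit that Elasticsearch's bootstrap checks require (run this on the EC2 host, not inside the container):

```shell
# Raise the mmap count limit required by Elasticsearch's bootstrap checks.
sudo sysctl -w vm.max_map_count=262144
# Persist the setting across reboots:
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf
```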

Why is there such an overhead when using Docker containers?

I have the following question. I recently designed a Java application on Spring that works with a database, and I decided to perform a stress test. Both the application and the database reside on a virtual Debian machine. I tested it with Gatling, and here is what I got:
request count 600 (OK=600 KO=0 )
min response time 12 (OK=12 KO=- )
max response time 159 (OK=159 KO=- )
mean response time 21 (OK=21 KO=- )
std deviation 13 (OK=13 KO=- )
response time 50th percentile 17 (OK=17 KO=- )
response time 75th percentile 22 (OK=22 KO=- )
mean requests/sec 10.01 (OK=10.01 KO=- )
t < 800 ms 600 (100%)
800 ms < t < 5000 ms 0 ( 0%)
t > 5000 ms 0 ( 0%)
failed 0 ( 0%)
So far, so good. After that, I decided to put the database and the jar into two containers. Here is a docker-compose.yml sample for that:
prototype-db:
  build: prototype-db
  volumes:
    - ./prototype-db/data:/var/lib/mysql:rw
    - ./prototype-db/scripts:/docker-entrypoint-initdb.d:ro
  ports:
    - "3306"
prototype:
  image: openjdk:8
  command: bash -c "cd /deploy && java -jar application.jar"
  volumes:
    - ./application/target:/deploy
  depends_on:
    - prototype-db
  ports:
    - "8080:8080"
  dns:
    - 172.16.10.1
    - 172.16.10.2
The Dockerfile looks like this:
FROM mysql:5.7.15
ENV MYSQL_DATABASE=document \
    MYSQL_ROOT_PASSWORD=root \
    MYSQL_USER=testuser \
    MYSQL_PASSWORD=12345
EXPOSE 3306
Now, after testing that with Gatling, I got the following results:
---- Global Information --------------------------------------------------------
request count 6000 (OK=3946 KO=2054 )
min response time 0 (OK=124 KO=0 )
max response time 18336 (OK=18336 KO=77 )
mean response time 5021 (OK=7630 KO=10 )
std deviation 4136 (OK=2478 KO=9 )
response time 50th percentile 6516 (OK=8694 KO=9 )
response time 75th percentile 8732 (OK=8905 KO=14 )
mean requests/sec 87.433 (OK=57.502 KO=29.931)
---- Response Time Distribution ------------------------------------------------
t < 800 ms 65 ( 1%)
800 ms < t < 5000 ms 532 ( 9%)
t > 5000 ms 3349 ( 56%)
failed 2054 ( 34%)
---- Errors --------------------------------------------------------------------
java.io.IOException: Remotely closed 1494 (72.74%)
status.find.is(200), but actually found 500 560 (27.26%)
This is amazing: the mean response time increased drastically, and there are a lot of errors, yet this docker-compose system runs on the very same virtual Debian machine. What exactly could cause such an overhead? I thought Docker containers were a lot like native processes; they should not run that slowly.

Rethinkdb container: rethinkdb process takes less RAM than the whole container

I'm running my rethinkdb container in a Kubernetes cluster. Below is what I notice:
Running top on the host, which is CoreOS, the rethinkdb process takes about 3 GB:
$ top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
981 root 20 0 53.9m 34.5m 20.9m S 15.6 0.4 1153:34 hyperkube
51139 root 20 0 4109.3m 3.179g 22.5m S 15.0 41.8 217:43.56 rethinkdb
579 root 20 0 707.5m 76.1m 19.3m S 2.3 1.0 268:33.55 kubelet
But running docker stats to check the rethinkdb container, it reports about 7 GB!
$ docker ps | grep rethinkdb
eb9e6b83d6b8 rethinkdb:2.1.5 "rethinkdb --bind al 3 days ago Up 3 days k8s_rethinkdb-3.746aa_rethinkdb-rc-3-eiyt7_default_560121bb-82af-11e5-9c05-00155d070266_661dfae4
$ docker stats eb9e6b83d6b8
CONTAINER CPU % MEM USAGE/LIMIT MEM % NET I/O
eb9e6b83d6b8 4.96% 6.992 GB/8.169 GB 85.59% 0 B/0 B
$ free -m
total used free shared buffers cached
Mem: 7790 7709 81 0 71 3505
-/+ buffers/cache: 4132 3657
Swap: 0 0 0
Can someone explain why the container is taking a lot more memory than the rethinkdb process itself?
I'm running docker v1.7.1, CoreOS v773.1.0, kernel 4.1.5
With the top command you are looking at the amount of physical memory. The stats command also includes disk-cached RAM, so it is always bigger than the physical amount of RAM. When the application really needs more RAM, the disk cache will be released for it to use.
Indeed, the memory usage is pulled via the cgroup memory.usage_in_bytes; you can access it at /sys/fs/cgroup/memory/docker/long_container_id/memory.usage_in_bytes. And according to the Linux documentation https://www.kernel.org/doc/Documentation/cgroups/memory.txt, section 5.5:
5.5 usage_in_bytes
For efficiency, as other kernel components, memory cgroup uses some
optimization to avoid unnecessary cacheline false sharing.
usage_in_bytes is affected by the method and doesn't show 'exact'
value of memory (and swap) usage, it's a fuzz value for efficient
access. (Of course, when necessary, it's synchronized.) If you want to
know more exact memory usage, you should use RSS+CACHE(+SWAP) value in
memory.stat(see 5.2).