How to use the Kafka exporter Docker image? - docker

I'm trying to use the Kafka Exporter packaged by Bitnami, https://github.com/bitnami/bitnami-docker-kafka-exporter, together with the Bitnami image for Kafka, https://github.com/bitnami/bitnami-docker-kafka. I'm trying to run the following docker-compose.yml:
version: '2'
networks:
  app-tier:
    driver: bridge
services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - 'ALLOW_ANONYMOUS_LOGIN=yes'
    networks:
      - app-tier
  kafka:
    image: 'bitnami/kafka:latest'
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    networks:
      - app-tier
  kafka-exporter:
    image: bitnami/kafka-exporter:latest
    ports:
      - "9308:9308"
    command:
      - --kafka.server=kafka:9092
However, if I run this with docker-compose up, I get the following error:
bitnami-docker-kafka-kafka-exporter-1 | F0103 17:44:12.545739 1 kafka_exporter.go:865] Error Init Kafka Client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
I've tried to use the answer to "How to pass arguments to entrypoint in docker-compose.yml" to specify a command for the kafka-exporter service which, assuming the entrypoint is defined in exec form, should append additional flags to the invocation of the Kafka Exporter binary. However, it seems that either kafka:9092 is not the right value for the kafka.server flag, or the flag is not getting picked up, or there is some kind of race condition where the exporter fails and exits before Kafka is up and running. Any ideas on how to get this example to work?

It would appear that this is just caused by a race condition, with the Kafka Exporter trying to connect to Kafka before it has started up. If I just run docker-compose up and allow the Kafka Exporter to fail, and then separately run the danielqsj/kafka-exporter container, it works:
> docker run -it -p 9308:9308 --network bitnami-docker-kafka_app-tier danielqsj/kafka-exporter
I0103 18:49:04.694898 1 kafka_exporter.go:774] Starting kafka_exporter (version=1.4.2, branch=HEAD, revision=15e4ad6a9ea8203135d4b974e825f22e31c750e5)
I0103 18:49:04.703058 1 kafka_exporter.go:934] Listening on HTTP :9308
and on http://localhost:9308 I can see metrics:
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 8.82e-05
go_gc_duration_seconds{quantile="0.25"} 8.82e-05
go_gc_duration_seconds{quantile="0.5"} 8.82e-05
go_gc_duration_seconds{quantile="0.75"} 8.82e-05
go_gc_duration_seconds{quantile="1"} 8.82e-05
go_gc_duration_seconds_sum 8.82e-05
go_gc_duration_seconds_count 1
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 20
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.17.3"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 3.546384e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 4.492048e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.448119e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 6512
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 1.6141668631896834e-06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 4.835688e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 3.546384e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.686976e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 5.07904e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 4687
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 1.736704e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.766016e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6412358823440428e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 11199
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 71536
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 81920
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 5.283728e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.499625e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 622592
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 622592
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.6270344e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 8
# HELP kafka_brokers Number of Brokers in the Kafka Cluster.
# TYPE kafka_brokers gauge
kafka_brokers 1
# HELP kafka_exporter_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which kafka_exporter was built.
# TYPE kafka_exporter_build_info gauge
kafka_exporter_build_info{branch="HEAD",goversion="go1.17.3",revision="15e4ad6a9ea8203135d4b974e825f22e31c750e5",version="1.4.2"} 1
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.11
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.703936e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.64123574422e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.35379456e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 3
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
Update
A more reliable way to do this is to use a wrapper script in order to perform an application-specific health check as described in https://docs.docker.com/compose/startup-order/. With the following directory structure,
.
├── docker-compose.yml
└── kafka-exporter
    ├── Dockerfile
    └── run.sh
the following Dockerfile,
FROM bitnami/kafka-exporter:latest
COPY run.sh /opt/bitnami/kafka-exporter/bin
ENTRYPOINT ["run.sh"]
and the following run.sh,
#!/bin/sh
while ! bin/kafka_exporter; do
    echo "Waiting for the Kafka cluster to come up..."
    sleep 1
done
and the following docker-compose.yml,
version: '2'
networks:
  app-tier:
    driver: bridge
services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - 'ALLOW_ANONYMOUS_LOGIN=yes'
    networks:
      - app-tier
  kafka:
    image: 'bitnami/kafka:latest'
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    networks:
      - app-tier
  kafka-exporter:
    build: kafka-exporter
    ports:
      - "9308:9308"
    networks:
      - app-tier
    entrypoint: ["run.sh"]
  myapp:
    image: 'bitnami/kafka:latest'
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    networks:
      - app-tier
Upon running docker-compose build && docker-compose up, I can see from the logs that after ~2 seconds (on the third attempt) the Kafka Exporter starts successfully:
> docker logs bitnami-docker-kafka-kafka-exporter-1 -f
I0104 16:05:39.765921 8 kafka_exporter.go:769] Starting kafka_exporter (version=1.4.2, branch=non-git, revision=non-git)
F0104 16:05:40.525065 8 kafka_exporter.go:865] Error Init Kafka Client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Waiting for the Kafka cluster to come up...
I0104 16:05:41.538482 16 kafka_exporter.go:769] Starting kafka_exporter (version=1.4.2, branch=non-git, revision=non-git)
F0104 16:05:42.295872 16 kafka_exporter.go:865] Error Init Kafka Client: kafka: client has run out of available brokers to talk to (Is your cluster reachable?)
Waiting for the Kafka cluster to come up...
I0104 16:05:43.307293 24 kafka_exporter.go:769] Starting kafka_exporter (version=1.4.2, branch=non-git, revision=non-git)
I0104 16:05:43.686798 24 kafka_exporter.go:929] Listening on HTTP :9308
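The run.sh loop restarts the exporter until it manages to connect. A related approach, sketched here in Python under assumptions, is to wait until the broker's TCP port accepts connections before launching the exporter at all. The function name wait_for_port and its default timeout values are hypothetical and not part of any Bitnami image. An open port only shows that something is listening, so the broker may still need a moment before it serves requests.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once host:port accepts a TCP connection, False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example call (hypothetical usage; "kafka" and 9092 are the service name and
# port from the compose file above):
#     wait_for_port("kafka", 9092)
```

A wrapper could call this first and start the exporter only when it returns True, which avoids the repeated fatal log lines from failed start attempts.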

It's a bit difficult to run all the required dependencies for kafka-exporter.
I followed a few simple steps, as below.
Step 1
docker pull danielqsj/kafka-exporter:latest
Step 2
./bin/zookeeper-server-start.sh config/zookeeper.properties
Step 3
./bin/kafka-server-start.sh config/server.properties
Step 4
docker run -ti --rm -p 9308:9308 danielqsj/kafka-exporter --kafka.server=host.docker.internal:9092 --log.enable-sarama
Note that in the command above I used "host.docker.internal", which lets kafka-exporter reach the Kafka broker listening on my machine's localhost.

Related

PySpark doesn't find Kafka source

I am trying to deploy a Docker container with Kafka and Spark and would like to read from a Kafka topic in a PySpark application. Kafka is working and I can write to a topic, and Spark is working too. But when I try to read the Kafka stream, I get the error message:
pyspark.sql.utils.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
My Docker Compose yaml looks like this:
---
version: '3.7'
services:
  zookeeper:
    image: bitnami/zookeeper:3
    ports:
      - 2181:2181
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
  kafka:
    image: bitnami/kafka:2
    ports:
      - 9092:9092
    environment:
      KAFKA_CFG_ZOOKEEPER_CONNECT: zookeeper:2181
      ALLOW_PLAINTEXT_LISTENER: "yes"
      KAFKA_LISTENERS: >-
        INTERNAL://:29092,EXTERNAL://:9092
      KAFKA_ADVERTISED_LISTENERS: >-
        INTERNAL://kafka:29092,EXTERNAL://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: >-
        INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: "INTERNAL"
    depends_on:
      - zookeeper
  spark:
    image: docker.io/bitnami/spark:3-debian-10
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
    volumes:
      - ./:/home/workspace/
      - ./spark/jars:/opt/bitnami/spark/.ivy2
  spark-worker-1:
    image: docker.io/bitnami/spark:3-debian-10
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ./:/home/workspace/
      - ./spark/jars:/opt/bitnami/spark/.ivy2
  kafdrop:
    image: obsidiandynamics/kafdrop:latest
    ports:
      - 9000:9000
    environment:
      KAFKA_BROKERCONNECT: kafka:29092
    depends_on:
      - kafka
and the pyspark app:
from pyspark.sql import SparkSession
import os

#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0,org.apache.kafka:kafka-clients:2.8.1'

# the source for this data pipeline is a kafka topic, defined below
spark = SparkSession.builder.appName("fuel-level").master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel('WARN')

kafkaRawStreamingDF = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "SimLab-KUKA") \
    .option("startingOffsets", "earliest") \
    .load()

# this is necessary for Kafka Data Frame to be readable, into a single column value
kafkaStreamingDF = kafkaRawStreamingDF.selectExpr("cast(key as string) key", "cast(value as string) value")
kafkaStreamingDF.writeStream.outputMode("append").format("console").start().awaitTermination()
I am new to Spark and Docker, so maybe it's an obvious mistake. I hope you can help me.
EDIT
When I uncomment os.env I get the following error:
Error: Missing application resource.
Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]
Options:
--master MASTER_URL spark://host:port, mesos://host:port, yarn,
k8s://https://host:port, or local (Default: local[*]).
--deploy-mode DEPLOY_MODE Whether to launch the driver program locally ("client") or
on one of the worker machines inside the cluster ("cluster")
(Default: client).
--class CLASS_NAME Your application's main class (for Java / Scala apps).
--name NAME A name of your application.
--jars JARS Comma-separated list of jars to include on the driver
and executor classpaths.
--packages Comma-separated list of maven coordinates of jars to include
on the driver and executor classpaths. Will search the local
maven repo, then maven central and any additional remote
repositories given by --repositories. The format for the
coordinates should be groupId:artifactId:version.
--exclude-packages Comma-separated list of groupId:artifactId, to exclude while
resolving the dependencies provided in --packages to avoid
dependency conflicts.
--repositories Comma-separated list of additional remote repositories to
search for the maven coordinates given with --packages.
--py-files PY_FILES Comma-separated list of .zip, .egg, or .py files to place
on the PYTHONPATH for Python apps.
--files FILES Comma-separated list of files to be placed in the working
directory of each executor. File paths of these files
in executors can be accessed via SparkFiles.get(fileName).
--archives ARCHIVES Comma-separated list of archives to be extracted into the
working directory of each executor.
--conf, -c PROP=VALUE Arbitrary Spark configuration property.
--properties-file FILE Path to a file from which to load extra properties. If not
specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
--driver-java-options Extra Java options to pass to the driver.
--driver-library-path Extra library path entries to pass to the driver.
--driver-class-path Extra class path entries to pass to the driver. Note that
jars added with --jars are automatically included in the
classpath.
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
--proxy-user NAME User to impersonate when submitting the application.
This argument does not work with --principal / --keytab.
--help, -h Show this help message and exit.
--verbose, -v Print additional debug output.
--version, Print the version of current Spark.
Cluster deploy mode only:
--driver-cores NUM Number of cores used by the driver, only in cluster mode
(Default: 1).
Spark standalone or Mesos with cluster deploy mode only:
--supervise If given, restarts the driver on failure.
Spark standalone, Mesos or K8s with cluster deploy mode only:
--kill SUBMISSION_ID If given, kills the driver specified.
--status SUBMISSION_ID If given, requests the status of the driver specified.
Spark standalone, Mesos and Kubernetes only:
--total-executor-cores NUM Total cores for all executors.
Spark standalone, YARN and Kubernetes only:
--executor-cores NUM Number of cores used by each executor. (Default: 1 in
YARN and K8S modes, or all available cores on the worker
in standalone mode).
Spark on YARN and Kubernetes only:
--num-executors NUM Number of executors to launch (Default: 2).
If dynamic allocation is enabled, the initial number of
executors will be at least NUM.
--principal PRINCIPAL Principal to be used to login to KDC.
--keytab KEYTAB The full path to the file that contains the keytab for the
principal specified above.
Spark on YARN only:
--queue QUEUE_NAME The YARN queue to submit to (Default: "default").
Traceback (most recent call last):
File "/Users/janikbischoff/Documents/Uni/PuL/BA/Code/Tests/spark-test.py", line 6, in <module>
spark = SparkSession.builder.appName("fuel-level").master("local[*]").getOrCreate()
File "/Users/janikbischoff/Library/Python/3.8/lib/python/site-packages/pyspark/sql/session.py", line 228, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "/Users/janikbischoff/Library/Python/3.8/lib/python/site-packages/pyspark/context.py", line 392, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "/Users/janikbischoff/Library/Python/3.8/lib/python/site-packages/pyspark/context.py", line 144, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "/Users/janikbischoff/Library/Python/3.8/lib/python/site-packages/pyspark/context.py", line 339, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "/Users/janikbischoff/Library/Python/3.8/lib/python/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
raise RuntimeError("Java gateway process exited before sending its port number")
RuntimeError: Java gateway process exited before sending its port number
Missing application resource
This implies you're running the code using python rather than spark-submit.
I was able to reproduce the error by copying your environment. Even when using findspark, PYSPARK_SUBMIT_ARGS doesn't seem to take effect in that container, even though the variable does get loaded.
The workaround is to pass the argument at execution time:
spark-submit \
--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 \
script.py
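A possible cause of the "Missing application resource" message, offered as an assumption rather than something verified in this container: when a script is started with plain python, PySpark hands the contents of PYSPARK_SUBMIT_ARGS to spark-submit, and the value is expected to end with pyspark-shell so that spark-submit has an application resource. The sketch below builds such a value; build_submit_args is a hypothetical helper name, and whether this resolves the problem in the Bitnami image is untested.

```python
import os
from typing import Sequence


def build_submit_args(packages: Sequence[str]) -> str:
    """Build a PYSPARK_SUBMIT_ARGS value that ends with the pyspark-shell resource."""
    # The trailing "pyspark-shell" is what gives spark-submit an application resource.
    return "--packages {} pyspark-shell".format(",".join(packages))


# This must be set before the first SparkSession or SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0"]
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The package version should match the installed Spark and Scala versions, as it does in the spark-submit command above.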

Valid docker-compose file not deploying as stack when using yaml anchors

I have been refactoring some docker-compose files to try and take advantage of tip #82 and hit a problem I haven't been able to find a solution to; I'm hoping someone can assist.
Using the following stripped example test-compose.yml file:
version: '3'
x-test: &test
  deploy:
    mode: replicated
services:
  hello-world:
    <<: *test
    image: alpine
    command: ["ping", "www.google.com"]
    deploy:
      replicas: 2
Running under docker-compose works as expected:
root@docker01:~# docker-compose -f test-compose.yml up
Recreating root_hello-world_1 ... done
Recreating root_hello-world_2 ... done
Attaching to root_hello-world_2, root_hello-world_1
hello-world_1 | PING www.google.com (172.217.16.228): 56 data bytes
hello-world_1 | 64 bytes from 172.217.16.228: seq=0 ttl=114 time=6.704 ms
hello-world_2 | PING www.google.com (172.217.16.228): 56 data bytes
hello-world_2 | 64 bytes from 172.217.16.228: seq=0 ttl=114 time=6.595 ms
However, launching the same file as a stack fails:
root@docker01:~# docker stack deploy --compose-file test-compose.yml hello-world
(root) Additional property x-test is not allowed
Is there a way to get the same extensions ("x-*" properties) working for both docker-compose and stack?
So, two things are going to bite you here:
First, docker stack deploy is fussy about the version you specify, so you need to strictly specify a valid compose version equal to or higher than the feature you are trying to use. Not sure when anchor support was added, but it definitely works when the version is specified as "3.9".
Your next problem is that merging is shallow. In your example case this isn't a problem because x-test contains only one setting which is already on its default value, but more generally, to handle complex cases, something like this is needed:
version: "3.9"
x-defaults:
  service: &service-defaults
    deploy: &deploy-defaults
      placement:
        constraints:
          - node.role==worker
services:
  hello-world:
    <<: *service-defaults
    image: alpine
    deploy:
      <<: *deploy-defaults
      replicas: 2
Because adding "deploy" to the hello-world map completely overrides any entry set by service-defaults, it needs its own anchor reference to import the sub-settings.
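The shallow-merge behaviour can be illustrated with plain Python dictionaries, which treat top-level keys the same way the YAML merge key does. This is only an analogy, not how Docker parses compose files; shallow_merge is a hypothetical helper.

```python
def shallow_merge(defaults: dict, overrides: dict) -> dict:
    """Mimic the YAML merge key: top-level keys in overrides replace defaults wholesale."""
    return {**defaults, **overrides}


service_defaults = {"deploy": {"placement": {"constraints": ["node.role==worker"]}}}

# Overriding "deploy" without re-importing the defaults drops the placement block.
naive = shallow_merge(service_defaults, {"image": "alpine", "deploy": {"replicas": 2}})

# Merging the nested "deploy" map explicitly, as the second anchor does, keeps both.
fixed = shallow_merge(
    service_defaults,
    {"image": "alpine", "deploy": shallow_merge(service_defaults["deploy"], {"replicas": 2})},
)

print(naive["deploy"])  # {'replicas': 2}
print(fixed["deploy"])  # {'placement': {'constraints': ['node.role==worker']}, 'replicas': 2}
```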

How to increase RPS in distributed locust load test

I cannot get past 1200 RPS no matter if I use 4 or 5 workers.
I tried to start locust in 3 variations -- one, four, and five worker processes (docker-compose up --scale worker_locust=num_of_workers). I use 3000 clients with a hatch rate of 100. The service that I am loading is a dummy that just always returns yo and HTTP 200, i.e., it's not doing anything, but returning a constant string. When I have one worker I get up to 600 RPS (and start to see some HTTP errors), when I have 4 workers I can get up to the ~1200 RPS (without a single HTTP error):
When I have 5 workers I get the same ~1200 RPS, but with a lower CPU usage:
I suppose that if the CPU usage went down in the 5-worker case (relative to the 4-worker case), then it's not the CPU that is bounding the RPS.
I am running this on a 6-core MacBook.
The locustfile.py I use posts essentially almost empty requests (just a few parameters):
from locust import HttpUser, task, between, constant


class QuickstartUser(HttpUser):
    wait_time = constant(1)  # seconds

    @task
    def add_empty_model(self):
        self.client.post(
            "/models",
            json={
                "grouping": {
                    "grouping": "a/b"
                },
                "container_image": "myrepo.com",
                "container_tag": "0.3.0",
                "prediction_type": "prediction_type",
                "model_state_base64": "bXkgc3RhdGU=",
                "model_config": {},
                "meta": {}
            }
        )
My docker-compose.yml:
services:
  myservice:
    build:
      context: ../
    ports:
      - "8000:8000"
  master_locust:
    image: locustio/locust
    ports:
      - "8089:8089"
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --master
  worker_locust:
    image: locustio/locust
    volumes:
      - ./:/mnt/locust
    command: -f /mnt/locust/locustfile.py --worker --master-host master_locust
Can someone suggest the direction of getting towards the 2000 RPS?
You should check out the FAQ.
https://github.com/locustio/locust/wiki/FAQ#increase-my-request-raterps
It's probably your server not being able to handle more requests, at least from your one machine. There are other things you can do to make more sure that's the case. You can try FastHttpUser, running on multiple machines, or just upping the number of users. But if you can, check to see how the server is handling the load and see what you can optimize there.
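A back-of-the-envelope check helps explain why adding users can raise the request rate. Assuming every simulated user loops through one request followed by the constant(1) wait, and that all users are already spawned, the rate per user is 1 / (wait time + response time). The sketch below applies that to the numbers in the question; both function names are hypothetical, and the result is an estimate under those assumptions rather than a measurement.

```python
def max_rps(users: int, wait_s: float, response_s: float) -> float:
    """Upper bound on requests per second for users that each wait wait_s between requests."""
    return users / (wait_s + response_s)


def implied_response_time(users: int, wait_s: float, observed_rps: float) -> float:
    """Average response time (seconds) that would explain the observed request rate."""
    return users / observed_rps - wait_s


print(max_rps(3000, 1.0, 0.0))                   # 3000.0
print(implied_response_time(3000, 1.0, 1200.0))  # 1.5
```

Under these assumptions, 3000 users at ~1200 RPS corresponds to an average response time of about 1.5 seconds, so it is worth checking whether the server or the load generators are the part that is slow.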
You will need more workers to generate more RPS. I suspect each worker is limited by its local port range when opening TCP connections to the target.
You can check this value on a Linux worker:
net.ipv4.ip_local_port_range
Try tuning that value on each Linux worker, or simply create hundreds of new workers on a more powerful machine (your 6-core MacBook is too small).
To create many workers, you could run Locust in Kubernetes with horizontal pod autoscaling for the worker deployment.
Here is a Helm chart to start playing around with a Locust deployment on Kubernetes:
https://github.com/deliveryhero/helm-charts/tree/master/stable/locust
You may need to check these values for it:
worker.hpa.enabled
worker.hpa.maxReplicas
worker.hpa.minReplicas
worker.hpa.targetCPUUtilizationPercentage
Simply set maxReplicas to get more workers once the load test starts. Or you can scale manually with kubectl to bring the worker pods to your desired number.
I have generated at least 8K RPS this way (a stable value for my app; it can't serve more) with 1000 worker pods, using Locust parameters like 200K users with a spawn rate of 2000 per second.
You may have to scale out your server when you reach higher throughput, but with 1000 worker pods I think you can easily reach 15K-20K RPS.

Flink TaskManager Docker Swarm doesn't recover

I'm running Flink v1.10 with 1 JobManager and 3 TaskManagers in Docker Swarm, without ZooKeeper. I have a job running that takes 12 slots, and I have 3 TMs with 20 slots each (60 total).
All of my tests went well except one.
In the failing test, I cancel the job manually and a sidecar retries it; the TaskManager count shown in the web console doesn't recover and keeps decreasing.
A more practical example: I have a job running, consuming 12 of the 60 slots.
The web console shows 48 slots free and 3 TMs.
I cancel the job manually, the sidecar retriggers the job, and the web console shows 36 slots free and 2 TMs.
The job enters a failed state and the slots keep decreasing until the console shows 0 slots free and 1 TM.
The workaround is to scale all 3 TMs down and up again, after which everything gets back to normal.
Everything else works fine with this configuration: the JobManager recovers if I remove it, or if I scale the TMs up or down. But if I cancel the job, the TMs seem to lose their connection to the JM.
Any suggestions as to what I'm doing wrong?
Here is my flink-conf.yaml:
env.java.home: /usr/local/openjdk-8
env.log.dir: /opt/flink/
env.log.file: /var/log/flink.log
jobmanager.rpc.address: jobmanager1
jobmanager.rpc.port: 6123
jobmanager.heap.size: 2048m
#taskmanager.memory.process.size: 2048m
#env.java.opts.taskmanager: 2048m
taskmanager.memory.flink.size: 2048m
taskmanager.numberOfTaskSlots: 20
parallelism.default: 2
#==============================================================================
# High Availability
#==============================================================================
# The high-availability mode. Possible options are 'NONE' or 'zookeeper'.
#
high-availability: NONE
#high-availability.storageDir: file:///tmp/storageDir/flink_tmp/
#high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#high-availability.zookeeper.quorum:
# ACL options are based on https://zookeeper.apache.org/doc/r3.1.2/zookeeperProgrammers.html#sc_BuiltinACLSchemes
# high-availability.zookeeper.client.acl: open
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
# state.checkpoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.savepoints.dir: hdfs://namenode-host:port/flink-checkpoints
# state.backend.incremental: false
jobmanager.execution.failover-strategy: region
#==============================================================================
# Rest & web frontend
#==============================================================================
rest.port: 8080
rest.address: jobmanager1
# rest.bind-port: 8081
rest.bind-address: 0.0.0.0
#web.submit.enable: false
#==============================================================================
# Advanced
#==============================================================================
# io.tmp.dirs: /tmp
# classloader.resolve-order: child-first
# taskmanager.memory.network.fraction: 0.1
# taskmanager.memory.network.min: 64mb
# taskmanager.memory.network.max: 1gb
#==============================================================================
# Flink Cluster Security Configuration
#==============================================================================
# security.kerberos.login.use-ticket-cache: false
# security.kerberos.login.keytab: /mobi.me/flink/conf/smart3.keytab
# security.kerberos.login.principal: smart_user
# security.kerberos.login.contexts: Client,KafkaClient
#==============================================================================
# ZK Security Configuration
#==============================================================================
# zookeeper.sasl.login-context-name: Client
#==============================================================================
# HistoryServer
#==============================================================================
#jobmanager.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.web.address: 0.0.0.0
#historyserver.web.port: 8082
#historyserver.archive.fs.dir: hdfs:///completed-jobs/
#historyserver.archive.fs.refresh-interval: 10000
blob.server.port: 6124
query.server.port: 6125
taskmanager.rpc.port: 6122
high-availability.jobmanager.port: 50010
zookeeper.sasl.disable: true
#recovery.mode: zookeeper
#recovery.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
#recovery.zookeeper.path.root: /
#recovery.zookeeper.path.namespace: /cluster_one
The solution was to increase the metaspace size in the flink-conf.yaml.
Br,
André.

Elastic in docker stack/swarm

I have swarm of two nodes
[ra@speechanalytics-test ~]$ docker node ls
ID                            HOSTNAME                  STATUS   AVAILABILITY   MANAGER STATUS   ENGINE VERSION
mlwwmkdlzbv0zlapqe1veq3uq     speechanalytics-preprod   Ready    Active                          18.09.3
se717p88485s22s715rdir9x2 *   speechanalytics-test      Ready    Active         Leader           18.09.3
I am trying to run container with elastic in stack. Here is my docker-compose.yml file
version: '3.4'
services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.7.0
    environment:
      - cluster.name=single-node
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - esdata:/usr/share/elasticsearch/data
    deploy:
      placement:
        constraints:
          - node.hostname==speechanalytics-preprod
volumes:
  esdata:
    driver: local
after start with docker stack
docker stack deploy preprod -c docker-compose.yml
container crashes in 20 seconds
docker service logs preprod_elastic
...
| OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
| OpenJDK 64-Bit Server VM warning: UseAVX=2 is not supported on this CPU, setting it to UseAVX=0
| [2019-04-03T16:41:30,044][WARN ][o.e.b.JNANatives ] [unknown] Unable to lock JVM Memory: error=12, reason=Cannot allocate memory
| [2019-04-03T16:41:30,049][WARN ][o.e.b.JNANatives ] [unknown] This can result in part of the JVM being swapped out.
| [2019-04-03T16:41:30,049][WARN ][o.e.b.JNANatives ] [unknown] Increase RLIMIT_MEMLOCK, soft limit: 16777216, hard limit: 16777216
| [2019-04-03T16:41:30,050][WARN ][o.e.b.JNANatives ] [unknown] These can be adjusted by modifying /etc/security/limits.conf, for example:
| OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
| # allow user 'elasticsearch' mlockall
| OpenJDK 64-Bit Server VM warning: UseAVX=2 is not supported on this CPU, setting it to UseAVX=0
| elasticsearch soft memlock unlimited
| [2019-04-03T16:41:02,949][WARN ][o.e.b.JNANatives ] [unknown] Unable to lock JVM Memory: error=12, reason=Cannot allocate memory
| elasticsearch hard memlock unlimited
| [2019-04-03T16:41:02,954][WARN ][o.e.b.JNANatives ] [unknown] This can result in part of the JVM being swapped out.
| [2019-04-03T16:41:30,050][WARN ][o.e.b.JNANatives ] [unknown] If you are logged in interactively, you will have to re-login for the new limits to take effect.
| [2019-04-03T16:41:02,954][WARN ][o.e.b.JNANatives ] [unknown] Increase RLIMIT_MEMLOCK, soft limit: 16777216, hard limit: 16777216
preprod
on both nodes I have
ra@speechanalytics-preprod:~$ sysctl vm.max_map_count
vm.max_map_count = 262144
Any ideas how to fix ?
The memlock errors you're seeing from Elasticsearch are a common issue that is not unique to Docker; they occur when Elasticsearch is told to lock its memory but is unable to do so. You can circumvent the error by removing the following environment variable from the docker-compose.yml file:
- bootstrap.memory_lock=true
Memlock may be used with Docker Swarm Mode, but with some caveats.
Not all options that work with docker-compose (Docker Compose) work with docker stack deploy (Docker Swarm Mode), and vice versa, despite both sharing the docker-compose YAML syntax. One such option is ulimits:, which when used with docker stack deploy, will be ignored with a warning message, like so:
Ignoring unsupported options: ulimits
My guess is that with your docker-compose.yml file, Elasticsearch runs fine with docker-compose up, but not with docker stack deploy.
With Docker Swarm Mode, by default, the Elasticsearch instance as you have defined will have trouble with memlock. Currently, setting of ulimits for docker swarm services is not yet officially supported. There are ways to get around the issue, though.
If the host is Ubuntu, unlimited memlock can be enabled across the docker service (see here and here). This can be achieved via the commands:
echo -e "[Service]\nLimitMEMLOCK=infinity" | SYSTEMD_EDITOR=tee systemctl edit docker.service
systemctl daemon-reload
systemctl restart docker
However, setting memlock to infinity is not without its drawbacks, as spelt out by Elastic themselves here.
Based on my testing, the solution works on Docker 18.06, but not on 18.09. Given the inconsistency and the possibility of Elasticsearch failing to start, the better option would be to not use memlock with Elasticsearch when deploying on Swarm. Instead, you can opt for any of the other methods mentioned in Elasticsearch Docs to achieve similar results.
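To see which memlock limit a process actually receives, the limit can be read from inside the environment in question. The sketch below uses Python's standard resource module (Unix only) and assumes a Python interpreter is available where it runs, which may not be true of the Elasticsearch image itself; describe and memlock_limits are hypothetical helper names.

```python
import resource


def memlock_limits():
    """Return the (soft, hard) RLIMIT_MEMLOCK values for the current process."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe(limit: int) -> str:
    """Render a limit the way ulimit would: a byte count or 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


soft, hard = memlock_limits()
print(f"memlock soft={describe(soft)} hard={describe(hard)}")
```

A value of 16777216 bytes would match the soft and hard limits reported in the Elasticsearch warning above, while "unlimited" would indicate that the LimitMEMLOCK override took effect.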

Resources