KSQL query number of threads - ksqldb

Is there a way to specify the number of threads that a KSQL query running on a KSQL server should consume? In other words, the parallelism of the query.
Is there any limit to the number of applications that can be run on a KSQL server? When or how should I decide to scale out?

Yes, you can set the ksql.streams.num.stream.threads property. You can read more about it here.
Now, this is the number of Kafka Streams threads where stream processing occurs for that particular KSQL instance. It's important for vertical scaling, because your machine might have enough computational resources to handle more threads, and therefore you can do more stream-processing work on that specific machine.
If you have the capacity (i.e. spare CPU cores), then you should configure more threads so that more stream tasks can be scheduled on that instance, therefore adding parallelization capacity to your KSQL instance or cluster (if you have more than one instance).
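As a minimal sketch, the property goes in the KSQL server properties file (the value of 4 is just an illustration; pick a number suited to your cores and partition count):

```properties
# ksql-server.properties
# Number of Kafka Streams processing threads for this KSQL server instance
ksql.streams.num.stream.threads=4
```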
What you must understand with Kafka, Kafka Streams and KSQL is that horizontal scaling occurs with two main concepts:
1. Kafka Streams applications (such as KSQL) can parallelize work based on the number of Kafka topic partitions. If you have 3 partitions and you launch 4 KSQL instances (i.e. on different servers), one of them will not be doing any work on a stream you create on top of that topic. If you have the same topic with 3 partitions and only 1 KSQL server, it will be doing all of the work for all 3 partitions.
2. When you add a new instance of your Kafka Streams application (in your case KSQL) and it joins your cluster to process your KSQL streams and tables, that instance joins the consumer groups consuming from those topics and immediately starts sharing the load with the other instances, as long as there are partitions that other instances can offload (triggering a consumer group rebalance). The same happens if you take an instance down: the other instances pick up the slack and start processing the partition(s) the retired instance was processing.
Compared to vertical scaling (i.e. adding more capacity and threads to a KSQL instance), horizontal scaling does the same by adding the same computational resources on a different instance of the application, on a different machine. You can see the Kafka Streams application threading model (with one or more application instances, on one or more machines) here:
I tried to simplify it, but you can read more about it on the KSQL Capacity Planning page and the Confluent Kafka Streams elastic scaling blog post.
The important aspects of the scale-out / scale-in lifecycle of Kafka Streams (and KSQL) applications can be better understood like this:
1. A single instance working on 4 different partitions
2. Three instances working on 4 different partitions (one of them is working on 2 different partitions)
3. An instance just left the group; now two instances are working on 4 different partitions, perfectly balanced (2 partitions each)
(Images from confluent blog)
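The scale-out/scale-in lifecycle above can be sketched with a small, framework-free simulation (plain Python; the real Kafka Streams assignor is more sophisticated, this just illustrates the partition math):

```python
def assign(partitions, instances):
    """Distribute partitions round-robin across instances, as a consumer group rebalance would."""
    assignment = {i: [] for i in instances}
    for idx, p in enumerate(partitions):
        assignment[instances[idx % len(instances)]].append(p)
    return assignment

partitions = ["p0", "p1", "p2", "p3"]

print(assign(partitions, ["A"]))            # one instance handles all 4 partitions
print(assign(partitions, ["A", "B", "C"]))  # one instance ends up with 2 partitions
print(assign(partitions, ["A", "B"]))       # after one instance leaves: 2 partitions each
```

Adding a fifth instance to this 4-partition topic would leave one instance idle, which is the limit described in point 1 above.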

Related

How does Flink scale for hot partitions?

If I have a use case where I need to join two streams or aggregate some kind of metrics from a single stream, and I use keyed streams to partition the events, how does Flink handle the operations for hot partitions, where the data might not fit into memory and needs to be split across partitions?
Flink doesn't do anything automatic regarding hot partitions.
If you have a consistently hot partition, you can manually split it and pre-aggregate the splits.
If your concern is about avoiding out-of-memory errors due to unexpected load spikes for one partition, you can use a state backend that spills to disk.
If you want more dynamic data routing / partitioning, look at the Stateful Functions API or the Dynamic Data Routing section of this blog post.
If you want auto-scaling, see Autoscaling Apache Flink with Ververica Platform Autopilot.
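The manual split-and-pre-aggregate idea can be illustrated outside of Flink in plain Python (hypothetical data; in Flink the two stages would be two keyBy/aggregate steps, with the first stage keyed on the salted key):

```python
import random
from collections import defaultdict

NUM_SPLITS = 4  # how many sub-keys each hot key is split into

def salted_key(key):
    # Stage 1 key: spread one hot key across NUM_SPLITS random sub-keys
    return (key, random.randrange(NUM_SPLITS))

events = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Stage 1: pre-aggregate per (key, salt) -- each split holds only a fraction of the state
partials = defaultdict(int)
for key, value in events:
    partials[salted_key(key)] += value

# Stage 2: combine the partial aggregates per original key
totals = defaultdict(int)
for (key, _salt), partial in partials.items():
    totals[key] += partial

print(dict(totals))  # {'hot': 1000, 'cold': 10}
```

The trade-off is an extra shuffle and slightly stale partials, in exchange for no single task holding all the hot key's state.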

Share storage/volume between worker nodes in Kubernetes?

Is it possible to have a centralized storage/volume that can be shared between two pods/instances of an application that exist in different worker nodes in Kubernetes?
So to explain my case:
I have a Kubernetes cluster with 2 worker nodes. On each of them I run 1 instance of app X, which means I have 2 instances of app X running in total at the same time.
Both instances subscribe on the topic topicX, that has 2 partitions, and are part of a consumer group in Apache Kafka called groupX.
As I understand it the message load will be split among the partitions, but also among the consumers in the consumer group. So far so good, right?
So to my problem:
In my solution the data is divided hierarchically, with a unique constraint on country and ID. Each combination of country and ID has a pickled model (a Python machine-learning model), which is stored in a directory accessed by the application. For each combination of country and ID I receive one message per minute.
At the moment I have 2 countries, so to be able to scale properly I wanted to split the load between two instances of app X, each one handling its own country.
The problem is that with Kafka the messages can be balanced between the different instances, and to access the pickle files in each instance without knowing what country a message belongs to, I have to store the pickle files in both instances.
Is there a way to solve this? I would rather keep the setup as simple as possible so it is easy to scale and add a third, fourth and fifth country later.
Keep in mind that this is an overly simplified way of explaining the problem. The number of instances is much higher in reality etc.
Yes, it's possible. If you look at this table, any PV (PersistentVolume) type that supports ReadWriteMany will let you share the same data store between your Kafka workers. In summary, these:
AzureFile
CephFS
Glusterfs
Quobyte
NFS
VsphereVolume - (works when pods are collocated)
PortworxVolume
In my opinion, NFS is the easiest to implement. Note that AzureFile, Quobyte, and Portworx are paid solutions.
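As a sketch, an NFS-backed PersistentVolume and claim with ReadWriteMany could look like this (the server address, export path, names, and sizes are placeholders):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: models-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.10        # placeholder NFS server address
    path: /exports/models    # placeholder export holding the pickle files
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: models-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 5Gi
```

Both instances of app X, on either worker node, can then mount models-pvc and read the same model files.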

KafkaStreams applications

I run two Kafka Streams applications separately, each in a different JVM instance, and they work fine. When I run both applications in the same JVM instance, the second application does not work (it neither consumes nor produces data). Are there any limitations on running two separate apps in the same JVM instance? Does this also happen for Kafka consumers?
It should work to put multiple KafkaStreams instances into the same JVM. But it might be easier to just increase the number of threads within one instance.
Note: the number of threads across all instances that you can utilize is limited by the number of input partitions of your input topics (this is not quite exact, but a good rule of thumb). Do you have enough partitions in your input topic? Do you see that some partitions are not assigned to an instance and not processed?
Also note that Kafka Streams parallelizes per thread. This means that if you have 2 instances with 2 threads each, and only 2 input topic partitions, it can happen that both partitions are assigned to the two threads of one instance while the two threads of the other instance are idle. At runtime, we only "see" threads and don't know whether they run within the same or different KafkaStreams instances.
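A plain-Python sketch of that thread-level view (not the real assignor; it only shows that assignment works over threads, so one instance's threads can receive all partitions):

```python
def assign_to_threads(partitions, threads):
    """Assign partitions round-robin over all threads, ignoring which instance owns each thread."""
    assignment = {t: [] for t in threads}
    for i, p in enumerate(partitions):
        assignment[threads[i % len(threads)]].append(p)
    return assignment

# 2 instances (A and B) with 2 threads each, but only 2 input partitions
threads = ["A-thread1", "A-thread2", "B-thread1", "B-thread2"]
result = assign_to_threads(["p0", "p1"], threads)
print(result)
# With this ordering, both partitions land on instance A's threads
# and instance B's threads stay idle.
```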

Can multiple sinks read from same channel or how to load balance flume sinks?

According to multiple sources, such as Hadoop Application Architecture, multiple sinks can read from the same channel to increase throughput:
A sink can only fetch data from a single channel, but many sinks can fetch data from that same channel. A sink runs in a single thread, which has huge limitations on a single sink—for example, throughput to disk. Assume with HDFS you get 30 MBps to a single disk; if you only have one sink writing to HDFS then all you’re going to get is 30 MBps throughput with that sink. More sinks consuming from the same channel will resolve this bottleneck. The limitation with more sinks should be the network or the CPU. Unless you have a really small cluster, HDFS should never be your bottleneck.
But besides this, there is a concept of sink groups with load balancing sink processor. According to the article one does not need to create sink group to faster consume events:
It is important to understand that all sinks within a sink group are not active at the same time; only one of them is sending data at any point in time. Therefore, sink groups should not be used to clear off the channel faster—in this case, multiple sinks should simply be set to operate by themselves with no sink group, and they should be configured to read from the same channel
So, I really do not understand when I should use sink groups with a load-balancing processor, and when I should just add more sinks reading from one specific channel.
Multiple sinks can read from the same channel, but it is important to remember that Flume can only guarantee that each event will be pushed into at least one sink, not into every connected sink. The processing speeds of those sinks differ, and it is unpredictable which sink an event will be pushed to.
If you require multiple sinks to read from the same channel, always use the Failover or Load Balancing Sink Processors.
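The two setups differ in configuration roughly like this (agent, sink, and channel names are illustrative):

```properties
# Option 1: independent sinks draining the same channel in parallel
a1.sinks = k1 k2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

# Option 2: a sink group -- only one sink in the group is active at a time
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
```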

Bosun HA and scalability

I have a small Bosun setup collecting metrics from numerous services, and we are planning to scale these services in the cloud.
This will mean more data coming into Bosun, and hence the load/efficiency/scale of Bosun is affected.
I am afraid of losing data due to network overhead and in case of failures.
I am looking for any performance benchmark reports for bosun, or any inputs on benchmarking/testing bosun for scale and HA.
Also, any inputs on good practices to be followed to scale bosun will be helpful.
My current thinking is to run numerous bosun binaries as a cluster, backed by a distributed opentsdb setup.
Also, I am wondering whether it is worthwhile to run some Bosun executors as plain 'collectors' of scollector data (with the bosun -n command), and some to just calculate the alerts.
The problem with this approach is that the same alerts might be triggered from multiple Bosun instances (those running without the -n option). Is there a better way to de-duplicate the alerts?
The current best practices are:
Use https://godoc.org/bosun.org/cmd/tsdbrelay to forward metrics to opentsdb. This gets the bosun binary out of the "critical path". It should also forward the metrics to bosun for indexing, and can duplicate the metric stream to multiple data centers for DR/Backups.
Make sure your hadoop/opentsdb cluster has at least 5 nodes. You can't do live maintenance on a 3 node cluster, and hadoop usually runs on a dozen or more nodes. We use Cloudera Manager to manage the hadoop cluster, and others have recommended Apache Ambari.
Use a load balancer like HAProxy to split the /api/put write traffic across multiple instances of tsdbrelay in an active/passive mode. We run one instance on each node (with tsdbrelay forwarding to the local opentsdb instance) and direct all write traffic at a primary write node (with multiple secondary/backup nodes).
Split the /api/query traffic across the remaining nodes pointed directly at opentsdb (no need to go thru the relay) in an active/active mode (aka round robin or hash based routing). This improves query performance by balancing them across the non-write nodes.
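A minimal HAProxy sketch of that write/read split (ports, addresses, and server names are placeholders):

```
frontend opentsdb_in
    bind *:4242
    mode http
    acl is_write path_beg /api/put
    use_backend tsdbrelay_write if is_write
    default_backend opentsdb_read

backend tsdbrelay_write
    mode http
    # active/passive: all writes go to the primary relay; the backup takes over on failure
    server relay1 10.0.0.11:4242 check
    server relay2 10.0.0.12:4242 check backup

backend opentsdb_read
    mode http
    balance roundrobin
    # active/active: spread queries across the non-write opentsdb nodes
    server tsd1 10.0.0.13:4242 check
    server tsd2 10.0.0.14:4242 check
```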
We only run a single bosun instance in each datacenter, with the DR site using the read only flag (any failover would be manual). It really isn't designed for HA yet, but in the future may allow two nodes to share a redis instance and allow active/active or active/passive HA.
By using tsdbrelay to duplicate the metric streams you don't have to deal with opentsdb/hbase replication and instead can setup multiple isolated monitoring systems in each datacenter and duplicate the metrics to whichever sites are appropriate. We have a primary and a DR site, and choose to duplicate all metrics to both data centers. I actually use the DR site daily for Grafana queries since it is closer to where I live.
You can find more details about production setups at http://bosun.org/resources including copies of all of the haproxy/tsdbrelay/etc configuration files we use at Stack Overflow.
