Can multiple sinks read from the same channel, or how to load balance Flume sinks?

According to multiple sources, such as Hadoop Application Architectures, multiple sinks can read from the same channel to increase throughput:
A sink can only fetch data from a single channel, but many sinks can fetch data from that same channel. A sink runs in a single thread, which has huge limitations on a single sink—for example, throughput to disk. Assume with HDFS you get 30 MBps to a single disk; if you only have one sink writing to HDFS then all you’re going to get is 30 MBps throughput with that sink. More sinks consuming from the same channel will resolve this bottleneck. The limitation with more sinks should be the network or the CPU. Unless you have a really small cluster, HDFS should never be your bottleneck.
But besides this, there is the concept of sink groups with a load-balancing sink processor. According to the same source, one does not need to create a sink group to consume events faster:
It is important to understand that all sinks within a sink group are not active at the same time; only one of them is sending data at any point in time. Therefore, sink groups should not be used to clear off the channel faster—in this case, multiple sinks should simply be set to operate by themselves with no sink group, and they should be configured to read from the same channel.
So I really do not understand when I should use sink groups with a load-balancing processor, and when to simply add more sinks that read from one specific channel.

Multiple sinks can read from the same channel, but it is important to remember that Flume only guarantees each event will be delivered by at least one sink, not by every connected sink. The sinks' processing speeds differ, and it is unpredictable which sink will pick up a given event.
If you require multiple sinks to read from the same channel, always use the failover or load-balancing sink processors.
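To make the distinction concrete, here is a minimal agent-configuration sketch in the standard Flume properties format (the agent, channel, and sink names and the HDFS path are made-up placeholders; the source definition is omitted). The first block attaches two independent sinks to one channel so they drain it in parallel; the second block wraps the same sinks in a load-balancing sink group, where only one sink is active at a time:

    # Two independent HDFS sinks draining the same channel in parallel
    agent.channels = ch1
    agent.channels.ch1.type = file
    agent.sinks = sink1 sink2
    agent.sinks.sink1.type = hdfs
    agent.sinks.sink1.channel = ch1
    agent.sinks.sink1.hdfs.path = /flume/events
    agent.sinks.sink2.type = hdfs
    agent.sinks.sink2.channel = ch1
    agent.sinks.sink2.hdfs.path = /flume/events

    # Alternative: the same two sinks inside a load-balancing sink group,
    # where only one sink is sending data at any point in time
    agent.sinkgroups = g1
    agent.sinkgroups.g1.sinks = sink1 sink2
    agent.sinkgroups.g1.processor.type = load_balance
    agent.sinkgroups.g1.processor.selector = round_robin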

Related

How does Flink scale for hot partitions?

If I have a use case where I need to join two streams or aggregate some kind of metrics from a single stream, and I use keyed streams to partition the events, how does Flink handle the operations for hot partitions where the data might not fit into memory and needs to be split across partitions?
Flink doesn't do anything automatic regarding hot partitions.
If you have a consistently hot partition, you can manually split it and pre-aggregate the splits (see the sketch after this list).
If your concern is about avoiding out-of-memory errors due to unexpected load spikes for one partition, you can use a state backend that spills to disk.
If you want more dynamic data routing / partitioning, look at the Stateful Functions API or the Dynamic Data Routing section of this blog post.
If you want auto-scaling, see Autoscaling Apache Flink with Ververica Platform Autopilot.
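A minimal sketch of the split-and-pre-aggregate idea in Flink's Java API; the input type Tuple2<String, Long> of (key, count), the window sizes, and the SALT_BUCKETS constant are assumptions for illustration, not part of the original answer. The hot key is salted into buckets, each (key, bucket) slice is partially aggregated in parallel, and the partials are then merged back under the original key:

    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import java.util.concurrent.ThreadLocalRandom;

    public class HotKeySplit {
        static final int SALT_BUCKETS = 8; // hypothetical split factor for the hot key

        // events: (key, count) pairs; returns per-key totals aggregated in two stages
        static DataStream<Tuple2<String, Long>> splitAndPreAggregate(
                DataStream<Tuple2<String, Long>> events) {
            // Stage 1: salt the key so records of one hot key spread over
            // SALT_BUCKETS subtasks, then pre-aggregate each slice independently.
            DataStream<Tuple3<String, Integer, Long>> partials = events
                .map(e -> Tuple3.of(e.f0, ThreadLocalRandom.current().nextInt(SALT_BUCKETS), e.f1))
                .returns(Types.TUPLE(Types.STRING, Types.INT, Types.LONG))
                .keyBy(t -> t.f0 + "#" + t.f1) // composite key: original key + salt bucket
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> Tuple3.of(a.f0, a.f1, a.f2 + b.f2));

            // Stage 2: merge the per-bucket partial sums back under the original key.
            return partials
                .map(t -> Tuple2.of(t.f0, t.f2))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
        }
    }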

How does apache beam access bigtable data?

If BigtableIO.Read is run in Dataflow, is the data being accessed via a Bigtable node, or does it go directly to the Bigtable tablets?
The Bigtable architecture documentation says:
client requests go through a front-end server before they are sent to a Cloud Bigtable node
and goes on to say:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets to help balance the workload of queries... Tablets are stored on Colossus, Google's file system, in SSTable format
(The concern is that if a Dataflow job is running at the same time as users are making individual requests that definitely go through the nodes, there may be a small or large amount of contention from the Dataflow job. I would guess that if the Dataflow job went through the nodes there would be significantly more contention, as opposed to hitting the tablets directly.)
The Beam Bigtable connector uses Cloud Bigtable's public API, hence requests will go through the Bigtable front-end server nodes.
See here for a bit more detail regarding the Beam connector's usage of the Bigtable client API.
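For reference, a minimal sketch of reading through the connector from Java (the project, instance, and table IDs are placeholders); however the pipeline runs, these reads are served through Bigtable's front-end nodes rather than by reading tablets directly:

    import com.google.bigtable.v2.Row;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class BigtableReadDemo {
        public static void main(String[] args) {
            Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
            // Reads rows via Cloud Bigtable's public API (front-end nodes).
            PCollection<Row> rows = pipeline.apply("ReadFromBigtable",
                BigtableIO.read()
                    .withProjectId("my-project")    // placeholder IDs
                    .withInstanceId("my-instance")
                    .withTableId("my-table"));
            pipeline.run().waitUntilFinish();
        }
    }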

KSQL Query number of Thread

Is there a way to specify the number of threads that a KSQL query running on a KSQL server should consume? In other words, the parallelism of the query.
Is there any limit to the number of applications that can run on a KSQL server? When or how should one decide to scale out?
Yes, you can set the ksql.streams.num.stream.threads property. You can read more about it here.
Now, this is the number of Kafka Streams threads where stream processing occurs for that particular KSQL instance. It matters for vertical scaling, because your machine may have enough computational resources to handle more threads, letting that specific machine do more stream-processing work.
If you have the capacity (i.e., CPU cores), then you should add more threads so more stream tasks can be scheduled on that instance, giving additional parallelization capacity to your KSQL instance or cluster (if you have more than one instance). For example:
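A minimal server-configuration sketch (the value 4 is an arbitrary placeholder; match it to your available cores):

    # ksql-server.properties (sketch)
    ksql.streams.num.stream.threads=4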
What you must understand with Kafka, Kafka Streams, and KSQL is that horizontal scaling rests on two main concepts (a runnable sketch follows at the end of this answer):
Kafka Streams applications (such as KSQL) can parallelize work based on the number of Kafka topic partitions. If you have 3 partitions and you launch 4 KSQL instances (i.e., on different servers), then one of them will not be doing work on a stream you create on top of that topic. If you have the same topic with 3 partitions and only 1 KSQL server, it will be doing all of the work for the 3 partitions.
When you add a new instance of your Kafka Streams application (in your case KSQL) and it joins your cluster processing your KSQL streams and tables, that instance joins the consumer groups consuming from those topics and immediately starts sharing the load with the other instances, as long as there are available partitions that other instances can offload (triggering a consumer-group rebalance). The same happens if you take an instance down: the other instances pick up the slack and start processing the partition(s) the retired instance was processing.
Compared to vertical scaling (i.e., adding more capacity and threads to a KSQL instance), horizontal scaling achieves the same by adding the same computational resources through another instance of the application on a different machine. You can study the Kafka Streams application threading model (with one or more application instances, on one or more machines) here:
I tried to simplify it, but you can read more on the KSQL Capacity Planning page and the Confluent blog post on elastic scaling in Kafka Streams.
The important aspects of the scale-out / scale-in lifecycle of Kafka Streams (and KSQL) applications can be better understood like this:
1. A single instance working on 4 different partitions
2. Three instances working on 4 different partitions (one of them is working on 2 different partitions)
3. An instance just left the group; now two instances are working on 4 different partitions, perfectly balanced (2 partitions each)
(Images from the Confluent blog)
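As a concrete, hypothetical illustration of that lifecycle, the Kafka Streams sketch below can be started on several machines with the same application.id; each new copy joins the consumer group and takes over a share of the topic's partitions. The application id, topic name, and thread count are made-up values:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import java.util.Properties;

    public class ScalingDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "scaling-demo"); // same id on every instance
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);          // vertical scaling knob
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            // Each partition of "orders" is processed by exactly one task across all instances.
            builder.<String, String>stream("orders")
                   .foreach((key, value) -> System.out.println(key + " -> " + value));

            // Starting this same program on another machine triggers a consumer-group
            // rebalance, and the partitions are re-split among the running instances.
            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }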

Difference between stream processing and message processing

What is the basic difference between stream processing and traditional message processing? People say that Kafka is a good choice for stream processing, but essentially Kafka is a messaging framework similar to ActiveMQ, RabbitMQ, etc.
Why do we generally not say that ActiveMQ is good for stream processing as well?
Is it the speed at which messages are consumed by the consumer that determines whether it is a stream?
In traditional message processing, you apply simple computations on the messages -- in most cases individually per message.
In stream processing, you apply complex operations on multiple input streams and multiple records (ie, messages) at the same time (like aggregations and joins).
Furthermore, traditional messaging systems cannot go "back in time" -- ie, they automatically delete messages after they have been delivered to all subscribed consumers. In contrast, Kafka keeps messages for a configurable amount of time, as it uses a pull-based model (ie, consumers pull data out of Kafka). This allows consumers to "rewind" and consume messages multiple times -- or, if you add a new consumer, it can read the complete history. This makes stream processing possible, because it allows for more complex applications. Furthermore, stream processing is not necessarily about real-time processing -- it's about processing infinite input streams (in contrast to batch processing, which is applied to finite inputs).
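As an illustration of that "rewind" ability, here is a minimal plain-consumer sketch (the topic name, group id, and broker address are placeholder assumptions) that seeks back to the beginning of a partition and re-reads its retained history:

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-demo");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        TopicPartition tp = new TopicPartition("orders", 0);
        consumer.assign(List.of(tp));
        consumer.seekToBeginning(List.of(tp)); // "rewind" to the oldest retained offset
        consumer.poll(Duration.ofSeconds(2)).forEach(record ->
            System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));
    }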
And Kafka offers Kafka Connect and Streams API -- so it is a stream-processing platform and not just a messaging/pub-sub system (even if it uses this in its core).
If you like splitting hairs:
Messaging is communication between two or more processes or components, whereas streaming is the passing of an event log as events occur. Messages carry raw data, whereas events contain information about the occurrence of an activity, such as an order.
So Kafka does both, messaging and streaming. A topic in Kafka can hold raw messages or an event log that is normally retained for hours or days. Events can further be aggregated into more complex events.
Although Rabbit supports streaming, it was actually not built for it (see Rabbit's web site).
Rabbit is a message broker and Kafka is an event streaming platform.
Kafka can handle a huge number of 'messages' compared to Rabbit.
Kafka is a log while Rabbit is a queue, which means that once consumed, Rabbit's messages are no longer there in case you need them.
However, Rabbit can specify message priorities while Kafka doesn't.
It depends on your needs.
Message processing implies operations on and/or using individual messages. Stream processing encompasses operations on and/or using individual messages as well as operations on collections of messages as they flow into the system. For example, say transactions are coming in for a payment instrument: stream processing can be used to continuously compute the hourly average spend. In this case, a sliding window can be imposed on the stream which picks up messages within the hour and computes an average over the amounts. Such figures can then be used as inputs to fraud detection systems.
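A hedged Kafka Streams sketch of that hourly-average idea (the topic name, serdes, and the "sum,count" state encoding are illustrative choices, not from the original answer); a one-hour window advancing every five minutes approximates the sliding window described above:

    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.TimeWindows;
    import java.time.Duration;

    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("transactions", Consumed.with(Serdes.String(), Serdes.Double()))
        .groupByKey()
        .windowedBy(TimeWindows.of(Duration.ofHours(1)).advanceBy(Duration.ofMinutes(5)))
        // State is a "sum,count" string so the built-in String serde suffices.
        .aggregate(
            () -> "0.0,0",
            (instrument, amount, agg) -> {
                String[] parts = agg.split(",");
                double sum = Double.parseDouble(parts[0]) + amount;
                long count = Long.parseLong(parts[1]) + 1;
                return sum + "," + count;
            },
            Materialized.with(Serdes.String(), Serdes.String()))
        .toStream()
        // Average spend per instrument over the last hour; this could feed,
        // e.g., a fraud-detection topic downstream.
        .mapValues(agg -> {
            String[] parts = agg.split(",");
            return Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
        });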
Apologies for the long answer, but I think a short answer would not do justice to the question.
Consider a queue system, like MQ, for:
Exactly-once delivery, and participating in two-phase commit transactions.
Asynchronous request / reply communication: the semantics of the communication are for one component to ask a second component to do something on its data. This is a command pattern with a delayed response.
Recall that messages in a queue are kept until the consumer(s) get them.
Consider a streaming system, like Kafka, as a pub/sub and persistence system for:
Publishing events as immutable facts of what happened in an application.
Getting continuous visibility of the data streams.
Keeping data once consumed, for future consumers and for replayability.
Scaling message consumption horizontally.
What are events and messages?
There is a long history of messaging in IT systems, and you can easily view an event-driven solution and events in the context of messaging systems and messages. However, there are different characteristics that are worth considering:
Messaging: Messages transport a payload and messages are persisted until consumed. Message consumers are typically directly targeted and related to the producer who cares that the message has been delivered and processed.
Events: Events are persisted as a replayable stream history. Event consumers are not tied to the producer. An event is a record of something that has happened and so can't be changed. (You can't change history.)
Now, messaging versus event streaming.
Messaging is there to support:
Transient data: data is only stored until a consumer has processed the message, or it expires.
Request / reply, most of the time.
Targeted reliable delivery: targeted to the entity that will process the request or receive the response; reliable, with transaction support.
Time-coupled producers and consumers: consumers can subscribe to the queue, but messages can be removed after a certain time or once all subscribers have received them. The coupling is still loose at the data-model and interface-definition level.
Events are there to support:
Stream history: consumers are interested in historic events, not just the most recent ones.
Scalable consumption: a single event is consumed by many consumers with limited impact as the number of consumers grows.
Immutable data.
Loosely coupled / decoupled producers and consumers: strong time decoupling, as consumers may come at any time. There is some coupling at the message-definition level, but schema-management best practices and a schema registry reduce friction.
Hope this answer helps!
Basically, Kafka is a messaging framework similar to ActiveMQ or RabbitMQ. There have been some efforts to take Kafka towards streaming:
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Then why does Kafka come into the picture when talking about stream processing?
Stream processing frameworks differ in their data input. In batch processing, you have files stored in a file system that you process continuously and store in some database. In stream processing, frameworks like Spark, Storm, etc. get continuous input from sensor devices or API feeds, and Kafka is used there to feed the streaming engine.
Recently, I came across a very good document that describes the usage of "stream processing" and "message processing":
https://developer.ibm.com/articles/difference-between-events-and-messages/
Taking asynchronous processing in context:
Messaging:
Consider it when there is a "request for processing", i.e., a client makes a request for the server to process.
Event streaming:
Consider it when "accessing enterprise data", i.e., components within the enterprise can emit data that describes their current state. This data does not normally contain a direct instruction for another system to complete an action. Instead, components allow other systems to gain insight into their data and status.
To facilitate this evaluation, here are the key selection criteria to consider when selecting the right technology for your solution:
Event history - Kafka
Fine-grained subscriptions - MQ
Scalable consumption - Kafka
Transactional behavior - MQ

Erlang messages when there are lots of nodes or binary data

Would native Erlang messages provide reasonable performance when there are lots of nodes or binary data?
Case 1: There's a dynamic pool of about 50-200 machines (erlang nodes). It's constantly changing, about 5-50 machines added or removed every 10min.
Case 2: Let's say we are using this cluster to build youtube-clone and planning to stream video data via messages.
By reasonable performance I mean: it's OK to be 2-3 times slower than the top performance achievable by complex Erlang code, but 10 times slower is not OK.
There is no significant difference between sending a message and sending binary data. A message is simply transformed into a binary packet using term_to_binary and sent via TCP, and the same applies to binary data. (Well, it is a little smarter than that, because the textual form of the same atoms is not sent again and again, as a plain term_to_binary would do.) So the difference is negligible.
There are important details:
1) In clusters of over 100 nodes, ping noise in a fully connected cluster can become a significant part of the network traffic. Even bigger deployments require deep changes in the Erlang VM and the OS.
2) If you want to stream video or audio, you need to plan the capacity of a single node: clients per node, TCP/UDP packet rate, network bandwidth.
3) There is a performance limit of roughly 150-200K messages/s between two processes on different nodes.
