Control rate of individual topic consumption in Kafka Streams 0.9.1.0-cp1? - stream

I am trying to backprocess data in Kafka topics using a Kafka Streams application that involves a join. One of the streams to be joined has much larger volume of data per unit of time in the corresponding topic. I would like to control the consumption from the individual topics so that I get roughly the same event timestamps from each topic in a single consumer.poll(). However, there doesn't appear to be any way to control the behavior of the KafkaConsumer backing the source stream. Is there any way around this? Any insight would be appreciated.

Currently Kafka cannot control the rate limit on both Producers and Consumers.
Refer:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
But if you are using Apache Spark as the stream processing platform, you can limit the input rate for the Kafka receivers.

in the consumer side you can use consume([num_messages=1][, timeout=-1])
function instead of poll.
consume([num_messages=1][, timeout=-1]):
Consumes a list of messages (possibly empty on timeout). Callbacks may be executed as a side effect of calling this method.
The application must check the returned Message object’s Message.error() method to distinguish between proper messages (error() returns None) and errors for each Message in the list (see error().code() for specifics). If the enable.partition.eof configuration property is set to True, partition EOF events will also be exposed as Messages with error().code() set to _PARTITION_EOF.
num_messages (int) – The maximum number of messages to return (default: 1).
timeout (float) – The maximum time to block waiting for message, event or callback (default: infinite (-1)). (Seconds)

Related

Streaming Beam pipeline with "lookbehind"

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when the send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where the offset was the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? That this pipeline would need to maintain and scan an unbounded-ish history of count records is the part that seems maybe unsuited to Beam.
One possible approach would be to model this as two timeseries (left , right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
In order to achieve this unbounded, you will need to be working within a GlobalWindow. Important note in the Global Window there is no expiry of the state, so you will need to make sure to do Garbage Collection on your left and right streams. Also data will arrive in the onprocess unordered, so you will need to make use of Event Time timers to do the actual work.
Very roughly:
onProcess(){
Store data in BagState.
Setup Event time timer to go off
}
OnTimer(){
Do your buiss logic.
}
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam summit also has some pointers (but does not make use of OrderedListState, which was not available at the time);
State and Timer API and Timeseries

Consume latest value from a topic for each keys

I have a Kafka producer which is producing messages at high rate (message key is let us say a username and value is his current score in a game). The Kafka consumer is relatively slow in processing the consumed messages. Here my requirement is to show most up-to-date score and avoid showing stale data, with the tradeoff that some scores may never be shown.
Essentially for each of the username, I may have hundreds of messages in the same partition, but I always want to read the latest one.
A crude solution which has been implemented was like this: The producer sends just a key as each message and actual value is written to a database, which is shared with the consumer. The consumer reads each key from the queue and value from the database. Here the goal to read always the latest value is achieved by producer overwriting the value in the database -- so consumer which is in fact reading a given key will actually consume the latest value. But this solution has some drawbacks due to high number of reads and updates (slow, race conditions etc.)
I am looking for a more natural way of solving this in kafka or kafka streams where I can somehow define get latest value for a key from the stream of data for each key. Thanks!
Below code helped
KStreamBuilder builder = new KStreamBuilder();
KTable<String, String> dataTable = builder.table("input-topic");
dataTable.toStream().foreach((key, message) -> client.post(message));
KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
What makes this possible in practice is in-memory compaction of incoming stream (details explained here). We could control the pressure using the parameters cache.max.bytes.buffering and commit.interval.ms

Plot event values in Graphite

We would like to use Graphite to plot values related to events such as "a packet of N messages has been published". When no packet is published, no code is run at all and so we cannot send zero to Graphite.
Essentially, we would like to compute some kind of publication rate per second.
Here are some sample data that we send to Graphite (with added timestamps):
2016-11-28 14:46:33.6338Z api.message.publication.count:100
2016-11-28 15:01:36.0780Z api.message.publication.count:12
2016-11-28 15:01:36.9911Z api.message.publication.count:1
2016-11-28 15:01:37.0679Z api.message.publication.count:100
Between 14:46:33 and 15:01:36, no messages were sent. However, between 15:01:36 and 15:01:37, 13 messages were sent (reported as two values, 12 and 1).
I've tried the summarize() function but it does not give results that make sense to me, i.e. I cannot correlate what I'm sending to Graphite and what is displayed by Graphite. Moreover, it seems that summarize() does not support 1-second intervals (I've tried "1second" and "1s" for the interval parameter).
The perSecond() function computes a rate of change (i.e. a derivative) but what we're sending is already a kind of derivative (maybe it's closer to a Dirac delta?) so it doesn't make sense in our context.
Are we completely off, or is there a way to make this work with Graphite?
Edit: I guess we need to add an aggregation stage to our data. Would Carbon aggregation fit the bill here?
It turns out that we were already sending our metrics to statsd, which supports aggregation via the c metric type, and a few other nifty things: https://github.com/etsy/statsd/blob/master/docs/metric_types.md

Apache Storm - use multiple spouts?

So I'm trying to configure my spout(s) to read from an Amazon SQS queue. Now, I want a situation wherein I can share the load across multiple spouts.
I understand it's possible to have multiple threads, but can I have two or more different spout instances/applications which are reading from the same queue and emitting to the same topology? For eg. Spout A and Spout B read from the SQS and then both emit to bolt C?
Of course, you can have multiple spouts, but you have to define them accordingly to prevent double submit of the same element (or your topology does accept that by design). Multiple processes of the same element imply bad counters for instance.
Check Storm concurrency as a start with executors (threads) and tasks (instances) per spout / bolt and choose the number you want.
In your code, you have to be sure that you don't manage the same tuples twice or more, either you do it before storm (a queue which doesn't accept the same element twice which is processed / emptied by many spouts for instance, or multiple queues - one for each spout, beware of transactions) or you do it in storm (process messages only with x param in one spout, with y in another and a message cannot be x and y at the same time).
SQS Queue -----> Spout (N Number of Executors).
This model will perfectly fine. as soon as, any of executor instance will pick up message, message will become invisible from SQS.
Keep Message Invisibility time Much higher than Message Processing time with in Storm Topology.
You can keep delete SQS message logic inside ack method.

Questions about the nextTuple method in the Spout of Storm stream processing

I am developing some data analysis algorithms on top of Storm and have some questions about the internal design of Storm. I want to simulate a sensor data yielding and processing in Storm, and therefore I use Spout to push sensor data into the succeeding bolts at a constant time interval via setting a sleep method in nextTuple method of Spout. But from the experiment results, it appeared that spout didn't push data at the specified rate. In the experiment, there was no bottleneck bolt in the system.
Then I checked some material about the ack and nextTuple methods of Storm. Now my doubt is if the nextTuple method is called only when the previous tuples are fully processed and acked in the ack method?
If this is true, does it means that I cannot set a fixed time interval to emit data?
Thx a lot!
My experience has been that you should not expect Storm to make any real-time guarantees, including in your case the rate of tuple processing. You can certainly write a spout that only emits tuples on some time schedule, but Storm can't really guarantee that it will always call on the spout as often as you would like.
Note that nextTuple should be called whenever there is room available for more pending tuples in the topology. If the topology has free capacity, I would expect Storm to try to fill it up if it can with whatever it can get.
I had a similar use-case, and the way I accomplished it is by using TICK_TUPLE
Config tickConfig = new Config();
tickConfig.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 15);
...
...
builder.setBolt("storage_bolt", new S3Bolt(), 4).fieldsGrouping("shuffle_bolt", new Fields("hash")).addConfigurations(tickConfig);
Then in my storage_bolt (note it's written in python, but you will get an idea) i check if message is tick_tuple if it is then execute my code:
def process(self, tup):
if tup.stream == '__tick':
# Your logic that need to be executed every 15 seconds,
# or what ever you specified in tickConfig.
# NOTE: the maximum time is 600 s.
storm.ack(tup)
return

Resources