I have a topology that's receiving data from an MQTT broker, and I want a spout to behave like this:
Emit a batch of tuples (or a list of strings in a single tuple) every x seconds. How do I achieve this? I read a bit about Storm Trident but its IBatchSpout doesn't seem to allow me to emit tuples in batch with a specific time interval.
What should the spout do if there's no new data coming in? It can't block the thread since it's Storm's main thread, right?
You could implement your own MQTT spout. For an example have a look at the MongoSpout.
The important part is the nextTuple method.
When this method is called, Storm is requesting that the Spout emit
tuples to the output collector. This method should be non-blocking, so
if the Spout has no tuples to emit, this method should return.
nextTuple, ack, and fail are all called in a tight loop in a single
thread in the spout task. When there are no tuples to emit, it is
courteous to have nextTuple sleep for a short amount of time (like a
single millisecond) so as not to waste too much CPU.
You must not sleep for the whole interval in a single call, but you can implement nextTuple so that it only emits a tuple once the interval has elapsed and returns quickly otherwise.
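private static final long EMISSION_PERIOD_MS = 2000; // 2 seconds
private long lastEmission = 0; // 0 means nothing has been emitted yet

@Override
public void nextTuple() {
    long now = System.currentTimeMillis();
    // Only poll and emit once the emission period has elapsed
    if (now - lastEmission >= EMISSION_PERIOD_MS) {
        List<Object> tuple = pollMQTT(); // your own non-blocking poll; null if there is no data
        if (tuple != null) {
            this.collector.emit(tuple);
            lastEmission = now;
            return;
        }
    }
    Utils.sleep(50); // nothing to emit yet: back off briefly instead of spinning
}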
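To emit a whole batch every x seconds rather than a single tuple, one option is to buffer incoming messages and hand them over only when the period has elapsed. The following is a minimal sketch in plain Java, not part of Storm: the class name TimedBatcher, its methods, and the injectable clock are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Storm class): buffers messages and releases one batch per period. */
class TimedBatcher {
    private final long periodMs;
    private final LongSupplier clock; // injectable time source, e.g. System::currentTimeMillis
    private final List<String> buffer = new ArrayList<>();
    private long lastFlush;

    TimedBatcher(long periodMs, LongSupplier clock) {
        this.periodMs = periodMs;
        this.clock = clock;
        this.lastFlush = clock.getAsLong();
    }

    /** Called whenever a message arrives, for example from the MQTT client's callback thread. */
    synchronized void add(String message) {
        buffer.add(message);
    }

    /** Non-blocking: returns the buffered batch if the period has elapsed and data exists, else null. */
    synchronized List<String> pollBatch() {
        long now = clock.getAsLong();
        if (now - lastFlush < periodMs || buffer.isEmpty()) {
            return null;
        }
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        lastFlush = now;
        return batch;
    }
}
```

Inside nextTuple you would then call pollBatch(), emit the returned list as a single tuple value when it is not null, and otherwise sleep briefly as shown above. Because pollBatch() never waits, the spout thread is not blocked when no new data has arrived.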
Note that I've found an open source MQTT spout. It doesn't look production ready, but you could use it as a starting point.
In addition to Christian's answer, I found this implementation of an MQTT client for Storm. The previously mentioned project has still not been developed further.
In my reactive application I have a hot Publisher with a slow Subscriber. To handle the lack of demand I am using onBackpressureBuffer, but possible overflow errors are kinda scary.
How can I monitor number of elements present in the queue created by Flux.onBackpressureBuffer(maxSize)? Preferably with built-in reactor metrics() method. I am using Spring Boot + Micrometer if it makes any difference.
We didn't find an easy way to do this in Reactor, but we found a somewhat "hacky" one. Here it is: https://github.com/allegro/envoy-control/blob/master/envoy-control-core/src/main/kotlin/pl/allegro/tech/servicemesh/envoycontrol/utils/ReactorUtils.kt#L34
This function measures buffer size of various Flux operators. It is not guaranteed to work on every operator, but it was tested on onBackpressureBuffer with positive results.
It is written in Kotlin, but it should be very easy to port it to Java.
The essence of this code in case of onBackpressureBuffer is to cast Subscription to Scannable, and then use BUFFERED attribute:
flux
    .onBackpressureBuffer(maxSize)
    .doOnSubscribe { subscription ->
        // ...
        val queueSize = Scannable.from(subscription).scan(Scannable.Attr.BUFFERED)
        // ...
    }
What is the difference between Flux.create and Flux.generate? I am looking--ideally with an example use case--to understand when I should use one or the other.
In short:
Flux::create doesn't react to changes in the state of the app while Flux::generate does.
The long version
Flux::create
You will use it when you want to calculate multiple (0...infinity) values which are not influenced by the state of your app and the state of your pipeline (your pipeline == the chain of operations which comes after Flux::create == downstream).
Why? Because the method you pass to Flux::create keeps calculating elements (or none) regardless of demand. The downstream determines how many elements (elements == next signals) it wants, and if it can't keep up, the elements already emitted are dropped or buffered according to an overflow strategy (by default they are buffered until the downstream asks for more).
The first and easiest use case is emitting values which you could, in theory, gather into a collection and only then process one by one:
Flux<String> articlesFlux = Flux.create((FluxSink<String> sink) -> {
    /* Get all the latest articles from a server and emit them one by one to downstream. */
    List<String> articles = getArticlesFromServer();
    articles.forEach(sink::next);
});
As you can see, Flux.create here bridges a blocking method (getArticlesFromServer) and asynchronous code.
I'm sure there are other use cases for Flux.create.
Flux::generate
Flux.generate((SynchronousSink<Integer> synchronousSink) -> {
synchronousSink.next(1);
})
.doOnNext(number -> System.out.println(number))
.doOnNext(number -> System.out.println(number + 4))
.subscribe();
The output will be 1 5 1 5 1 5 ... forever.
In each invocation of the method you pass to Flux::generate, synchronousSink can emit at most one onNext, optionally followed by a terminal signal: onSubscribe onNext? (onError | onComplete)?.
This means that Flux::generate calculates and emits values on demand. When should you use it? In cases where it's too expensive to calculate elements which may not be used downstream, or where the events you emit are influenced by the state of the app or by the state of your pipeline (your pipeline == the chain of operations which comes after Flux::generate == downstream).
For example, if you are building a torrent application, you receive blocks of data in real time. You could use Flux::generate to hand out tasks (blocks to download) to multiple threads, calculating the next block to download inside Flux::generate only when some thread asks for one. That way you emit only blocks you don't have yet. The same algorithm with Flux::create will fail, because Flux::create will emit all the blocks we don't have up front, and if some blocks fail to download we have a problem: Flux::create doesn't react to changes in the state of the app, while Flux::generate does.
Create:
Accepts a Consumer<FluxSink<T>>
Consumer is invoked only once per subscriber
Consumer can emit 0..N elements immediately
Publisher is not aware of downstream state. So we need to provide Overflow strategy as an additional parameter
We can get the reference of FluxSink using which we could keep on emitting elements as and when required using multiple threads.
Generate:
Accepts a Consumer<SynchronousSink<T>>
Consumer is invoked again and again based on the downstream demand
Consumer can emit at most one element per invocation, with an optional complete/error signal.
Publisher produces elements based on the downstream demand
We can get the reference of SynchronousSink. But it might not be really useful as we could emit only one element
Check this blog for more details.
The setup is similar to this.
One agent, (dataSource) is generating data, and a single agent (dataProcessor) is processing the data. There is a lot more data being generated than dataProcessor can process, and I am not interested in processing all messages, just processing the latest piece of data.
One possible solution, proposed there by Jon Harrop, "is to greedily eat all messages in the inbox when one arrives and discard all but the most recent".
Another approach is not to listen for all messages, but rather for dataProcessor to PostAndReply dataSource to get the latest piece of data.
What are the pros and cons of these approaches?
This is an intriguing question and there are quite likely several possible perspectives. I think the most notable aspect is that the choice will affect how you design the API at the interface between the two components:
1. In the "Consume all" approach, the producer has a very simple API where it triggers some event whenever a value is produced and your consumer subscribes to it. This means that you could have other subscribers listening to updates from the producer and doing something other than what your consumer from this question does.
2. In the "Call to get latest" approach, the producer will presumably need to be written so that it keeps the current state and discards old values. It will then provide a blocking async API to get the latest value. It could still expose an event for other consumers though. The consumer will need to actively poll for changes (in a busy loop of some sort).
3. You could also have a producer with an event as in "Consume all", but then create another component that listens to any given event, keeps the latest value and makes it available via a blocking async call to any other client.
Here are some advantages/disadvantages I can think of:
In (1) the producer is very simple; the consumer is harder to write
In (2) the producer needs to do a bit more work, but the consumer is simple
In (3), you are adding another layer, but in a fairly reusable way.
I would probably go with either (2) (if I only need this for one data source) or with (3) after checking that it does not affect the performance.
As for (3), the sketch of what I was thinking would look something like this:
open System

type KeepLastMessage<'T> =
  | Update of 'T
  | Get of AsyncReplyChannel<'T>

type KeepLast<'T>(initial:'T, event:IObservable<'T>) =
  let agent = MailboxProcessor.Start(fun inbox ->
    let rec loop last = async {
      let! msg = inbox.Receive()
      match msg with
      | Update last -> return! loop last
      | Get ch ->
          ch.Reply(last)
          return! loop last }
    loop initial)
  // Forward every value from the event to the agent so that it always holds the latest one
  do event |> Observable.add (fun value -> agent.Post(Update value))
  member x.AsyncGet() = agent.PostAndAsyncReply(Get)
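For comparison, approach (1), "consume all and keep only the most recent", can be sketched in a few lines as well. This is an illustration in Java rather than F#, with a BlockingQueue standing in for the agent's inbox; the class and method names (LatestOnly, takeLatest) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

/** Hypothetical helper: the consumer side of the "consume all" approach. */
class LatestOnly {
    /** Blocks until at least one message is available, then drains the inbox and returns the newest one. */
    static <T> T takeLatest(BlockingQueue<T> inbox) throws InterruptedException {
        T latest = inbox.take();      // wait for the first message
        List<T> rest = new ArrayList<>();
        inbox.drainTo(rest);          // greedily take everything else that is already queued
        // Everything but the last element is discarded
        return rest.isEmpty() ? latest : rest.get(rest.size() - 1);
    }
}
```

The processing loop would call takeLatest(inbox) once per iteration, so the producer stays a plain "post to the queue" API while the consumer carries the extra logic, which matches the trade-off described for (1).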
I am developing some data analysis algorithms on top of Storm and have some questions about the internal design of Storm. I want to simulate the generation and processing of sensor data in Storm, so I use a spout that pushes sensor data to the downstream bolts at a constant time interval by calling sleep in its nextTuple method. But the experimental results showed that the spout didn't push data at the specified rate. In the experiment, there was no bottleneck bolt in the system.
Then I checked some material about the ack and nextTuple methods of Storm. My question now is whether the nextTuple method is called only after the previous tuples have been fully processed and acked in the ack method.
If this is true, does it mean that I cannot set a fixed time interval to emit data?
Thx a lot!
My experience has been that you should not expect Storm to make any real-time guarantees, including in your case the rate of tuple processing. You can certainly write a spout that only emits tuples on some time schedule, but Storm can't really guarantee that it will always call on the spout as often as you would like.
Note that nextTuple should be called whenever there is room available for more pending tuples in the topology. If the topology has free capacity, I would expect Storm to try to fill it up if it can with whatever it can get.
I had a similar use case, and I accomplished it using tick tuples:
Config tickConfig = new Config();
tickConfig.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 15);
...
...
builder.setBolt("storage_bolt", new S3Bolt(), 4).fieldsGrouping("shuffle_bolt", new Fields("hash")).addConfigurations(tickConfig);
Then in my storage_bolt (note that it's written in Python, but you will get the idea) I check whether the message is a tick tuple, and if it is, I execute my code:
def process(self, tup):
    if tup.stream == '__tick':
        # Your logic that needs to be executed every 15 seconds,
        # or whatever you specified in tickConfig.
        # NOTE: the maximum time is 600 s.
        storm.ack(tup)
        return
I have a bunch of threads that are doing lots of communication with each other.
I would prefer this be lock free.
For each thread, I want to have a mailbox where other threads can send it messages (but only the owner can remove messages). This is a multiple-producer single-consumer situation. Is it possible for me to do this in a lock-free / high-performance manner? (This is in the inner loop of a gigantic simulation.)
Lock-free Multiple Producer Single Consumer (MPSC) Queue is one of the easiest lock-free algorithms to implement.
The most basic implementation requires a simple lock-free singly-linked list (SList) with only push() and flush(). The functions are available in the Windows API as InterlockedFlushSList() and InterlockedPushEntrySList() but these are very easy to roll on your own.
Multiple producers push() items onto the SList using a CAS (interlocked compare-and-swap).
The single consumer does a flush(), which swaps the head of the SList with NULL using an XCHG (interlocked exchange). The consumer then has a list of the items in reverse order.
To process the items in order, you must simply reverse the list returned from flush() before processing it. If you do not care about order, you can simply walk the list immediately to process it.
Two notes if you roll your own functions:
1) If you are on a system with weak memory ordering (e.g. PowerPC), you need to put a "release memory barrier" at the beginning of the push() function and an "acquire memory barrier" at the end of the flush() function.
2) You can simplify and optimize the functions considerably, because the ABA issue with SLists occurs in the pop() function. You cannot have ABA issues with an SList if you use only push() and flush(). This means you can implement it as a single pointer, very similar to the non-lock-free code, and there is no need for an ABA-prevention sequence counter.
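The push()/flush() scheme described above can be sketched as follows. This is an illustrative Java version, not the Windows SList API: the class name MpscMailbox is hypothetical, AtomicReference.compareAndSet plays the role of the CAS, and getAndSet plays the role of the XCHG. In Java the atomic operations already provide the required memory ordering, so no explicit barriers appear.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical lock-free mailbox: many producers push, one consumer flushes. */
class MpscMailbox<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    // Head of a singly-linked list; the newest message is always first
    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    /** Any thread: link the new node in front of the current head, retrying if the CAS loses a race. */
    void push(T value) {
        Node<T> node = new Node<>(value);
        Node<T> current;
        do {
            current = head.get();
            node.next = current;
        } while (!head.compareAndSet(current, node));
    }

    /** Owner thread only: atomically detach the whole list and return the messages oldest first. */
    List<T> flush() {
        Node<T> node = head.getAndSet(null); // the exchange step: take everything at once
        List<T> items = new ArrayList<>();
        for (; node != null; node = node.next) {
            items.add(node.value);           // collected newest first
        }
        Collections.reverse(items);          // restore the order in which messages were pushed
        return items;
    }
}
```

Each thread would own one MpscMailbox; senders call push() and the owner calls flush() once per simulation step. If ordering does not matter, the reverse step can be dropped.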
Sure, if you have an atomic CompareAndSwap instruction:
for (i = 0; ; i = (i + 1) % MAILBOX_SIZE)
{
    if ((mailbox[i].owned == false) &&
        (CompareAndSwap(&mailbox[i].owned, true, false) == false))
        break;
}
mailbox[i].message = message;
mailbox[i].ready = true;
After reading a message, the consuming thread just sets mailbox[i].ready = false; mailbox[i].owned = false; (in that order).
Here's a paper from the University of Rochester illustrating a non-blocking concurrent queue. The algorithm described in the paper shows one technique for making a lockless queue.
You may want to look at Intel Threading Building Blocks. I recall attending a lecture by an Intel developer who mentioned something along those lines.