Questions about the nextTuple method in the Spout of Storm stream processing - stream

I am developing some data analysis algorithms on top of Storm and have some questions about the internal design of Storm. I want to simulate a sensor data yielding and processing in Storm, and therefore I use Spout to push sensor data into the succeeding bolts at a constant time interval via setting a sleep method in nextTuple method of Spout. But from the experiment results, it appeared that spout didn't push data at the specified rate. In the experiment, there was no bottleneck bolt in the system.
Then I checked some material about the ack and nextTuple methods of Storm. Now my doubt is if the nextTuple method is called only when the previous tuples are fully processed and acked in the ack method?
If this is true, does it means that I cannot set a fixed time interval to emit data?
Thx a lot!

My experience has been that you should not expect Storm to make any real-time guarantees, including in your case the rate of tuple processing. You can certainly write a spout that only emits tuples on some time schedule, but Storm can't really guarantee that it will always call on the spout as often as you would like.
Note that nextTuple should be called whenever there is room available for more pending tuples in the topology. If the topology has free capacity, I would expect Storm to try to fill it up if it can with whatever it can get.

I had a similar use-case, and the way I accomplished it is by using TICK_TUPLE
Config tickConfig = new Config();
tickConfig.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 15);
...
...
builder.setBolt("storage_bolt", new S3Bolt(), 4).fieldsGrouping("shuffle_bolt", new Fields("hash")).addConfigurations(tickConfig);
Then in my storage_bolt (note it's written in python, but you will get an idea) i check if message is tick_tuple if it is then execute my code:
def process(self, tup):
if tup.stream == '__tick':
# Your logic that need to be executed every 15 seconds,
# or what ever you specified in tickConfig.
# NOTE: the maximum time is 600 s.
storm.ack(tup)
return

Related

Streaming Beam pipeline with "lookbehind"

I am new to Beam/Dataflow and am trying to figure out if it is suited to this problem. I am trying to keep a running sum of which types of messages are currently backlogged in a queueing system. The system uses a monotonically increasing offset number to order messages: producers learn the number when the send a message, and consumers track the watermark offset as they process each message in FIFO order. This pipeline would have two inputs: counts from the producers and watermarks from the consumers.
The queue producer would regularly flush a batch of count metrics to Beam:
(type1, offset, count)
(type2, offset, count)
...
where the offset was the last offset the producer wrote for typeN, and count is how many typeN messages it enqueued in the current batch period.
The queue consumer will regularly send its latest consumed watermark offset. The effect this should have is to invalidate any counts that have an offset lower than this consumer watermark.
The output of the pipeline is the sum of all counts with a higher offset than the largest consumer watermark yet seen, grouped by message type. (snapshotted every 5 minutes or so.)
(Of course there would be 100k message "types", hundreds of producer servers, occasional 2-hour periods where the consumer doesn't report an advancing watermark, etc.)
Is this doable? That this pipeline would need to maintain and scan an unbounded-ish history of count records is the part that seems maybe unsuited to Beam.
One possible approach would be to model this as two timeseries (left , right) where you want to match left.timestamp <= right.timestamp. You can do this using the State and Timer API.
In order to achieve this unbounded, you will need to be working within a GlobalWindow. Important note in the Global Window there is no expiry of the state, so you will need to make sure to do Garbage Collection on your left and right streams. Also data will arrive in the onprocess unordered, so you will need to make use of Event Time timers to do the actual work.
Very roughly:
onProcess(){
Store data in BagState.
Setup Event time timer to go off
}
OnTimer(){
Do your buiss logic.
}
This is a lot easier with Apache Beam > 2.24.0 as OrderedListState has been added.
Although the timeseries use case is different from the one in this question, this talk from the 2019 Beam summit also has some pointers (but does not make use of OrderedListState, which was not available at the time);
State and Timer API and Timeseries

Apache Beam: DoFn.Setup equivalent in Python SDK

What is the recommended way to do expensive one-off initialization in a Beam Python DoFn? The Java SDK has DoFn.Setup, but there doesn't appear to be an equivalent in Beam Python.
Is the best way currently to attach objects to threading.local() in the DoFn initializer?
Dataflow Python is not particularly transparent about the optimal method for initializing expensive objects. There are a few mechanisms by which objects can be instantiated infrequently (it is currently not ideal to perform exactly once initialization). Below are outlined some of the experiments I have run and conclusions I have come to. Hopefully someone from the Beam community can help correct me wherever I have strayed.
__init__
Although the __init__ method can be used to initialize an expensive object exactly once, this initialization does not happen on the Worker machines. The object will need to be serialized in order to be sent off to the Worker which, for large objects, as well as Tensorflow models, can be quite unwieldy or not work at all. Furthermore, since this object will be serialized and sent over a wire, it is not secure to perform initializations here, as payloads can be intercepted. The recommendation is against using this method.
start_bundle()
Dataflow processes data in discrete groups that it calls bundles. These are fairly well defined in batch processes, but in streaming they are dependent on the throughput. There are no mechanisms for configuring how Dataflow creates its bundles, and in fact the size of a bundle is entirely dictated by Dataflow. The start_bundle() method will be called on the Worker and can be used to initialize state, however experiments find that in a streaming context, this method is called more frequently than desired, and expensive re-initializations would happen quite often.
Lazy initialization
This methodology was suggested by the Beam docs and is somewhat surprisingly the most performant. Lazy initialization means that you create some stateful parameter that you initialize to None, then execute code such as the following:
if self.expensive_object is None:
self.expensive_object = self.__expensive_initialization()
You can execute this code directly in your process() method. You can also put together some helper functions easily enough that rely on global state so that you can have functions such as (an example of what this might look like is at the bottom of this post):
self.expensive_object = get_or_initialize_global(‘expensive_object’, self.__expensive_initialization)
Experiments
The following experiments were run on a job that was configured using both start_bundle and the lazy initialization method described above, with appropriate logging to indicate invocation. Various throughput was published to the appropriate queue and the results were recorded accordingly.
At a rate of 1 msg/sec over 100s:
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 100
LAZY INITIALIZATION 25
TOTAL MESSAGES 100
At a rate of 10 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 942
LAZY INITIALIZATION 3
TOTAL MESSAGES 1000
At a rate of 100 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2447
LAZY INITIALIZATION 30
TOTAL MESSAGES 10000
At a rate of 1000 msg/sec over 100s
Context Number of Invocations
------------------------------------------------------------
NEW BUNDLE 2293
LAZY INITIALIZATION 36
TOTAL MESSAGES 100000
Takeaways
Although start_bundle works well for high throughput, lazy initialization is nonetheless the most performant by a wide margin regardless of throughput. It is the recommended way of performing expensive initializations on Python Beam. This result is perhaps not too surprising given this quote from the official docs:
Setup - called once per DoFn instance before anything else; this has not been implemented in the Python SDK so the user can work around just with lazy initialization
The fact that is is called a "work around" is not particularly encouraging though, and maybe we can expect something more robust in the near future.
Code Samples
Courtesy of Andreas Jansson:
def get_or_initialize_global(object_key, initialize_expensive_object):
if object_key in globals():
expensive_object = globals()[object_key]
else:
expensive_object = initialize_expensive_object()
globals()[object_key] = expensive_object
Setup and teardown have now been added to the Python SDK and are the recommended way to do expensive one-off initialization in a Beam Python DoFn.
This sounds like it could be it https://beam.apache.org/releases/pydoc/2.8.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.start_bundle

How to calculate CPU time in elixir when multiple actors/processes are involved?

Let's say I have a function which does some work by spawning multiple processes. I want to compare CPU time vs real time taken by this function.
def test do
prev_real = System.monotonic_time(:millisecond)
# Code to complete some task
# Spawn different processes & give each process some task
# Receive result
# Finish task
current_real = System.monotonic_time(:millisecond)
diff_real = current_real - prev_real
IO.puts "Real time " <> to_string(diff_real)
IO.puts "CPU time ?????"
end
How to calculate CPU time required by the given function? I am interested in calculating CPU time/Real time ratio.
If you are just trying to profile your code rather than implement your own profiling framework I would recommend using already existing tools like:
fprof which will give you information about time spent in functions (real and own)
percept which will provide you information about which processes in your system ware working at any given time and on what
xprof which is design to help you find which calls to your function will cause it to take more time (trigger inefficient branch of code).
They take advantage of both erlang:trace to figure out which function is being executed and for how long and erlang:system_profile with runnable_procs to determine which processes are currently running. You might start a function, hit a receive or be preemptive rescheduled and wait without doing any actual work. Combining those two might be complicated, and I would recommend using already existing tools before trying glue together your own.
You could also look into tools like erlgrind and eflame if you are looking for more visual representations of your calls.

A "Temporal Pass" Module

I am trying to create a general module that collects data at an irregular interval. Data arrives from the left end as soon as new data has arrived. This may be something like 100 times a second.
On the right end I want to be able to "plug in" n listeners, each with its own regular interval. For the purpose of simplification, let's say all with an interval of once per second.
Every listener registers a callback function that may or may not be asynchronous.
My problem is that if the callback function is synchronous, my "temporal pass" may hang. What is the best way to approach this? Should I spawn a process whose pure purpose is to pass along the data and pay the price if the callback hangs?
+-------------+ Data Out 1
=======> |Temporal Pass| ==========>
Data In +-------------+ \\ Data Out 2
++=======>
\\ Data Out n
++=======>
Spawn a new process for the message, otherwise the process will wait until synchronous calls are done. This is exactly the sort of problem the process model is meant to solve and I do not see any other way to do it.
Spawning processes are not expensive, but not entirely free either. You may get a small performance boost by only spawning new processes for the synchronous calls. That will require some way of flagging each callback as either synchronous or asynchronous.

Is it possible implement Pregel in Erlang without supersteps?

Let's say we implement Pregel with Erlang. Why do we actually need supersteps? Isn't it better to just send messages from one supervisor to processes that represent nodes? They could just apply the calculation function to themselves, send messages to each other and then send a 'done' message to the supervisor.
What is the whole purpose of supersteps in concurrent Erlang implementation of Pregel?
The SuperStep concept as espoused by the Pregel model could be viewed as sort of a Barrier for parallel-y executing entities. At the end of each superstep, each worker, flushes it state to the persistent store.
The algorithm is check-pointed at the end of each SuperStep so that in case of failure, when a new node has to take over the function of a failed peer, it has a point to start from. Pregel guarantees that since the data of the node has been flushed to disk before the SuperStep started, it can reliably start from exactly that point.
It also in a way signifies "progress" of the algorithm. A pregel algorithm/job can be provided with a "max number of supersteps" after which the algorithm should terminate.
What you specified in your question (about superisors sending worker a calculation function and waiting for a "done") can definitely be implemented (although I dont think the current supervisor packaged with OTP can do stuff like that out of the box) but I guess the concept of a SuperStep is just a requirement of a Pregel model. If on the other hand, you were implementing something like a parallel mapper (like what Joe implements in his book) you wont need supersteps/

Resources