How to map tuples with persistent state in Trident?

I'm learning the Trident framework. There are several methods on Trident Streams for aggregating tuples within a batch, including this one, which allows performing a stateful mapping of the tuples using the Aggregator interface. Unfortunately, there is no built-in counterpart that additionally persists the map state, i.e. something like the nine overloads of persistentAggregate(), only taking an Aggregator as an argument.
So how can I implement the desired functionality by combining lower-level Trident and Storm abstractions and tools? It is pretty hard to explore the API because there is almost no Javadoc documentation.
In other words, the persistentAggregate() methods allow me to end stream processing by updating some persistent state:
stream of tuples ---> persistent state
I want to update persistent state and also emit different tuples along the way:
stream of tuples --(with persistent state)--> stream of different tuples
Stream.aggregate(Fields, Aggregator, Fields) doesn't provide fault tolerance:
stream of tuples --(with simple in-memory state)--> stream of different tuples

You can create a new stream from a state using the method TridentState#newValuesStream().
This will allow you to retrieve a stream of the aggregated values.
For illustrative purposes, we can extend the example from the Trident documentation by adding this method and a Debug filter:
FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
        new Values("the cow jumped over the moon"),
        new Values("the man went to the store and bought some candy"),
        new Values("four score and seven years ago"),
        new Values("how many apples can you eat"));
spout.setCycle(true);

TridentTopology topology = new TridentTopology();
topology.newStream("spout1", spout)
        .each(new Fields("sentence"), new Split(), new Fields("word"))
        .groupBy(new Fields("word"))
        // update the persistent per-word counts...
        .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"))
        // ...and re-emit the updated counts as a new stream, printed by Debug
        .newValuesStream().each(new Fields("count"), new Debug());
Running this topology will print the aggregated counts to the console.
Hope this helps.
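If you need to run your own stateful update (rather than one of the built-in combiners) and still emit tuples, the lower-level building block is Stream.partitionPersist(StateFactory, Fields, StateUpdater, Fields) followed by TridentState#newValuesStream(). Below is only a rough sketch: ScoreDB, its factory, its addAndGet() method, and the spout emitting ("user", "points") tuples are hypothetical stand-ins for your own code, not Storm classes.
// Hypothetical custom state updater: it updates ScoreDB (your own State implementation)
// and emits one tuple per input tuple, which becomes the new values stream.
public class ScoreUpdater extends BaseStateUpdater<ScoreDB> {
    @Override
    public void updateState(ScoreDB state, List<TridentTuple> tuples, TridentCollector collector) {
        for (TridentTuple t : tuples) {
            long newScore = state.addAndGet(t.getString(0), t.getLong(1)); // hypothetical method
            collector.emit(new Values(t.getString(0), newScore));          // emitted downstream
        }
    }
}

// Wiring: persist the state, then keep processing the emitted tuples as a stream.
topology.newStream("spout1", spout)
        .partitionPersist(new ScoreDB.Factory(), new Fields("user", "points"),
                new ScoreUpdater(), new Fields("user", "score"))
        .newValuesStream()
        .each(new Fields("user", "score"), new Debug());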

Related

Can Google dataflow GroupByKey handle hot keys?

The input is a PCollection<KV<String, String>>.
I have to write files per key, with each value of the KV group written as a line.
In order to group based on the key, I have 2 options:
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey().withHotKeyFanout() --> PCollection<KV<String, CustomStringObJ>>,
where the value accumulates the Strings from all pairs
(Combine.CombineFn<String, List<String>, CustomStringObJ>).
I can have a million records per key. The collection of keyed data is optimised using windows and triggers, but there can still be thousands of entries per key.
I worry that the maximum size of a String will cause issues if Combine.perKey().withHotKeyFanout() is used to create a CustomStringObJ, which has a List<String> member, to be written to the file.
If we use GroupByKey, how do we handle hot keys?
You should use the approach with GroupByKey, not Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key, and in the output KV<K, Iterable<V>> the iterable of values is a lazy/streamed view over the elements shuffled to that key. No actual iterable is constructed, so this is just as good as routing each element to the worker that owns each file and writing it directly.
Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but most useful for aggregations where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.
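As a rough illustration of the GroupByKey approach (a minimal sketch using Beam's Java SDK; writeLine(...) is a hypothetical helper standing in for your own file-writing logic):
PCollection<KV<String, String>> input = ...; // your keyed input

input
    .apply(GroupByKey.<String, String>create())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String key = c.element().getKey();
        // The Iterable is a lazy view over the values shuffled to this key;
        // it is streamed, not materialized as one big in-memory collection.
        for (String value : c.element().getValue()) {
          writeLine(key, value); // hypothetical: append the value to this key's file
        }
      }
    }));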

Consume latest value from a topic for each key

I have a Kafka producer which is producing messages at a high rate (the message key is, let's say, a username and the value is his current score in a game). The Kafka consumer is relatively slow in processing the consumed messages. My requirement here is to show the most up-to-date score and avoid showing stale data, with the tradeoff that some scores may never be shown.
Essentially, for each username I may have hundreds of messages in the same partition, but I always want to read only the latest one.
A crude solution that has been implemented is this: the producer sends just the key as each message, and the actual value is written to a database that is shared with the consumer. The consumer reads each key from the queue and the value from the database. The goal of always reading the latest value is achieved by the producer overwriting the value in the database, so the consumer reading a given key will actually consume the latest value. But this solution has some drawbacks due to the high number of reads and updates (it is slow, has race conditions, etc.).
I am looking for a more natural way of solving this in Kafka or Kafka Streams, where I can somehow define "get the latest value for a key" over the stream of data. Thanks!
The code below helped:
// This uses the older (pre-1.0) Kafka Streams API; client.post(...) is the application's own code.
KStreamBuilder builder = new KStreamBuilder();

// Reading the topic as a KTable keeps only the latest value per key.
KTable<String, String> dataTable = builder.table("input-topic");
dataTable.toStream().foreach((key, message) -> client.post(message));

KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
What makes this possible in practice is the in-memory caching and compaction of the incoming stream that Kafka Streams performs for KTables (record caches). You can control the amount of buffering with the parameters cache.max.bytes.buffering and commit.interval.ms.
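For reference, here is a roughly equivalent sketch with the newer StreamsBuilder API (Kafka Streams 1.0+); the topic name, props, and client.post(...) are placeholders:
// Same idea with the newer API: a KTable holds only the latest value per key.
StreamsBuilder builder = new StreamsBuilder();
KTable<String, String> scores =
        builder.table("input-topic", Consumed.with(Serdes.String(), Serdes.String()));
scores.toStream().foreach((user, score) -> client.post(score));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();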

General principle to implement node-based workflow as seen in Unreal, Blender, Alteryx and the like?

This topic is difficult to Google, because of "node" (not node.js), and "graph" (no, I'm not trying to make charts).
Despite being a pretty well rounded and experienced developer, I can't piece together a mental model of how these sorts of editors get data in a sensible way, in a sensible order, from node to node. Especially in the Alteryx example, because a Sort module, for example, needs its entire upstream dataset before proceeding. And some nodes can send a single output to multiple downstream consumers.
I was able to understand trees and what not in my old data structures course back in the day, and successfully understand and adapt the basic graph concepts from https://www.python.org/doc/essays/graphs/ in a real project. But that was a static structure and data weren't being passed from node to node.
Where should I be starting, and/or what concept am I missing that I could use to implement something like this? Something to let users chain together some boxes to slice and dice text files or data records with basic operations like sort and join? I'm using C#, but the answer ought to be language independent.
This paradigm is called dataflow programming. It works with streams of data that are passed from node to node (instruction to instruction) to be processed.
Dataflow programs can be written in textual or visual form, and besides the software you have mentioned there are many programs that include some sort of dataflow language.
To create your own dataflow language you have to:
Create program modules or objects that represent your processing nodes and realize different kinds of data processing. Processing nodes usually have one or more data inputs and one or more data outputs, and implement some data-processing algorithm internally. Nodes may also have control inputs that determine how a given node processes data. A typical dataflow algorithm calculates an output sample from one or many input stream values, as FIR filters do, for example. However, a processing algorithm can also feed output values back into the input (as in IIR filters), or accumulate values in some way before calculating an output value.
Create a standard API for passing data between processing nodes. It can differ for different kinds of data and control signals, but it must be standard, because processing nodes should 'understand' each other. Data is usually passed as plain values. Control signals can be plain values, events, or a more advanced control language, depending on your needs.
Create a mechanism to link your nodes and pass data between them. You can build your own machinery or use standard facilities like pipes, message queues, etc. For example, this functionality can be implemented as a tree-like structure whose nodes are your processing nodes, each holding references to the next nodes and to the specific inputs that receive data from the current node's output.
Create some kind of node iterator that starts at the beginning of the dataflow graph and, for each processing node:
provides the next input data values
invokes the node's data-processing method
updates the node's output data values
passes the updated output values to the inputs of downstream processing nodes
Create a tool for configuring node parameters and the links between nodes. It can be a simple text file edited with a text editor, or a sophisticated visual editor with a GUI for drawing the dataflow graph.
Regarding your note about the Sort module in Alteryx: perhaps the data values are simply accumulated inside that module and then sorted once the whole input has arrived (see the sketch below).
An even more detailed description of dataflow programming languages can be found in the literature on the topic.
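To make the steps above concrete, here is a minimal push-based sketch in Java (the Node interface and the SortNode/PrintNode classes are made up for illustration; they are not from any of the tools mentioned):
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Each node receives values from upstream, processes them, and pushes results downstream.
interface Node {
    void accept(String value);   // receive one value from an upstream node
    void finish();               // upstream signals that no more values will come
}

// A node that buffers its entire input, sorts it on finish(), then pushes downstream
// (this mirrors how a Sort module needs its whole upstream dataset before proceeding).
class SortNode implements Node {
    private final List<String> buffer = new ArrayList<>();
    private final List<Node> downstream = new ArrayList<>();

    void connect(Node next) { downstream.add(next); }

    public void accept(String value) { buffer.add(value); }

    public void finish() {
        Collections.sort(buffer);
        for (String v : buffer) {
            for (Node next : downstream) next.accept(v);   // one output can feed many consumers
        }
        for (Node next : downstream) next.finish();
    }
}

// A sink node that just prints whatever reaches it.
class PrintNode implements Node {
    public void accept(String value) { System.out.println(value); }
    public void finish() { }
}

public class DataflowSketch {
    public static void main(String[] args) {
        SortNode sort = new SortNode();
        sort.connect(new PrintNode());

        // The "source" is just a loop pushing values into the first node.
        for (String s : new String[] {"banana", "apple", "cherry"}) {
            sort.accept(s);
        }
        sort.finish();  // triggers the sort and the downstream printing
    }
}
A real editor would add a scheduler that walks the graph and would handle cycles, types, and multiple inputs, but the accept/finish push pattern is the core idea.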

Splitting a futures::Stream into multiple streams based on a property of the stream item

I have a Stream of items (u32, Bytes) where the integer is an index in the range 0..n. I would like to split this stream into n streams, basically filtering by the integer.
I considered several possibilities, including
creating n streams each of which peeks at the underlying stream to determine if the next item is for it
pushing the items to one of n sinks as they arrive, and then using the other side of each sink as a stream again (this seems to be related to Forwarding from a futures::Stream to a futures::Sink).
I feel that neither of these possibilities is convincing. The first one seems to create unnecessary overhead, and the second one is just not elegant (if it even works at all; I am not sure).
What's a good way of splitting the stream?
At one point I had a similar requirement and wrote a group_by operator for Stream.
I haven't yet published this to crates.io as I didn't really feel it was ready for consumption but feel free to take a look at the code at https://github.com/Lukazoid/lz_stream_tools or attempt to use it for yourself.
Add the following to your cargo.toml:
[dependencies]
lz_stream_tools = { git = "https://github.com/Lukazoid/lz_stream_tools" }
And add extern crate lz_stream_tools; to your bin.rs/lib.rs.
Then from your code you may use it like so:
use lz_stream_tools::StreamTools;
let groups = some_stream.group_by(|x| x.0);
groups will now be a Stream of (u32, Stream<Item = Bytes>) pairs.
You could use channels to represent the index-specific streams. You'd have to spawn one Task that pulls from the original stream and has a map of Senders.

Serializing Flux in Reactor

Is it possible to serialize a Reactor Flux? For example, my Flux is in some state and is currently processing some event, and suddenly the service is terminated. The current state of the Flux is saved to a database or to a file. Then, on restart of the application, I just take all the Fluxes from that file/table and subscribe to them, resuming processing from the last state. Is this possible in Reactor?
No, this is not possible. A Flux is not serializable; it is closer to a chain of functions. It doesn't necessarily have a state[1] but describes what to do given an input (provided by an initial generating Flux).
So in order to "restart" a Flux, you'd have to actually create a new one that gets fed the remaining input the original one would have received upon service termination.
Thus it would be more up to the source of your data to save the last emitted state and allow restarting a new Flux sequence from there.
[1] Although, depending on what operators you chained in, you could have it impact some external state. In that case things will get more complicated, as you'll have to also persist that state.
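A minimal sketch of that idea, assuming a source you can query by position; eventStore, its methods, process(...), and Event are hypothetical application code, not Reactor APIs:
// Persist progress on the source side, then rebuild a fresh Flux from it on restart.
long lastProcessedId = eventStore.loadLastProcessedId();                  // hypothetical

Flux<Event> resumed = eventStore.fetchEventsAfter(lastProcessedId)        // hypothetical, returns Flux<Event>
        .doOnNext(event -> process(event))                                // your processing logic
        .doOnNext(event -> eventStore.saveLastProcessedId(event.getId())); // record progress

resumed.subscribe();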
