Kafka Streams - Join two streams and get values ordered by date

I have two streams, both with the same key. Stream1 values are temperatures and the Stream2 values are types of events, "red" or "green". The key is "station".
KStream<String, String> Stream1 = builder.stream(inputTopic);
KStream<String, String> Stream2 = builder.stream(inputTopic2);
The idea is to join both of those streams and get their values in date order. For example, my producer sent the following records:
topic1: -35º (12:10:10)
topic1: -40º (12:10:11)
topic2: "red" (12:10:12)
topic1: 20º (12:10:13)
topic2: "green" (12:11:00)
topic1: 10º (12:11:05)
topic2: "red" (12:11:10)
topic1: 40º (12:11:20)
I want to have a new stream with all those values in that same order. The goal is to associate the temperature with the previous type of event. For example, 20º (12:10:13) is a red type, 10º (12:11:05) is a green type and 40º (12:11:20) is a red type.
How should I proceed?
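One possible way to approach this (a minimal sketch, assuming String serdes and an output topic name outputTopic that are not part of the question): materialize Stream2 as a KTable holding the latest event type per station, then join Stream1 against it so that each temperature is paired with the most recent "red"/"green" value seen so far.
KTable<String, String> latestEventType = Stream2
        .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
        .reduce((previous, current) -> current);   // keep only the newest event type per station

KStream<String, String> temperatureWithType = Stream1
        .join(latestEventType,
              (temperature, eventType) -> eventType + ":" + temperature,
              Joined.with(Serdes.String(), Serdes.String(), Serdes.String()));

temperatureWithType.to(outputTopic, Produced.with(Serdes.String(), Serdes.String()));
Whether "most recent" really means "previous by timestamp" depends on Kafka Streams synchronizing the two input topics by timestamp; see the KIP-353 discussion in the answer about joins not always being triggered below.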

Related

SoftwareAG webMethods EDI mapping question: How to map one record into multiple records

I am trying to map one record into multiple records using a webMethods Designer flow service:
one row converted into several rows.
Please help me write a webMethods flow service that maps the following using LOOP, REPEAT, MAP, etc.
SourceRecord: DT (record initiator, occurs 1..1)
    DateFields: OrderDate, SalesDate, ExpireDate
TargetRecord: DTM (record initiator, occurs 1..many times)
    Elements: DTM_01, DTM_02
Sample input data (element delimiter "," and segment terminator newline):
DT,20200914,20200916,20230913 <-- DT is the record initiator and "," is the element separator
OrderDate = 20200914
SalesDate = 20200916
ExpireDate = 20230913
Desired output data (multiple rows; DTM is the record initiator, element delimiter "*" and segment terminator newline):
DTM*002*20200914 <-- 002 is the qualifier for OrderDate
DTM*007*20200916 <-- 007 is the qualifier for SalesDate
DTM*036*20230913 <-- 036 is the qualifier for ExpireDate
There is not enough information. Do you have one string with one record of input data, a list of strings, or a document list? Most likely the record comes from a flat file?
Is the output a string list or a document list?
Anyway, the simple solution to your question (assuming one record) is to tokenize the input string (using pub.string:tokenize) and map it to the output object by index, concatenating each date with its preset qualifier.
Then you can build your output string out of that string list using pub.string:makeString, using a newline as the separator.
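For readers without webMethods at hand, here is a minimal plain-Java sketch of the same tokenize-and-concatenate logic (the class name is arbitrary, and it assumes a single DT record with the qualifiers from the desired output above):
import java.util.StringJoiner;

public class DtToDtmSketch {
    public static void main(String[] args) {
        String input = "DT,20200914,20200916,20230913";
        String[] qualifiers = {"002", "007", "036"};   // OrderDate, SalesDate, ExpireDate

        String[] tokens = input.split(",");            // tokens[0] is the DT record initiator
        StringJoiner output = new StringJoiner("\n");  // newline as segment terminator
        for (int i = 1; i < tokens.length; i++) {
            output.add("DTM*" + qualifiers[i - 1] + "*" + tokens[i]);
        }
        System.out.println(output);
    }
}
In webMethods the same index-based mapping is done in the flow service with LOOP/MAP steps over the tokenized string list.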

KafkaStreams - joins are not always triggered

I have a Kafka Streams application where I am trying to join transactions to shops.
This works fine for new events, because shops are always created before transactions happen, but when I try to read "historic" data (events that happened before the application was started), sometimes all transactions are joined to shops and sometimes fewer, around 60%-80%, and once I got only 30%.
This is very weird, because I know that all transactions have a correct shop id, since sometimes all of them do get joined.
The events are in one topic and I use filters to put them into two streams. Then I create a KTable from the shops stream and then I join the transaction stream to the shop table.
final KStream<String, JsonNode> shopStream = eventStream
.filter((key, value) -> "SHOP_CREATED_EVENT".equals(value.path("eventType").textValue()))
.map((key, value) -> KeyValue.pair(value.path("shop_id").textValue(), value)
);
final KStream<String, JsonNode> transactionStream = eventStream
.filter((key, value) -> "TRANSACTION_EVENT".equals(value.path("eventType").textValue()))
.map((key, value) -> KeyValue.pair(value.path("shop_id").textValue(), value)
);
final KTable<String, JsonNode> shopTable = shopStream
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.reduce(genericReducer);
final KStream<String, JsonNode> joinedStream = transactionStream
.join(shopTable, this::joinToShop, Joined.valueSerde(jsonSerde));
I also tried to use stream to stream join instead of stream to table, same result:
final KStream<String, JsonNode> joinedStream = transactionStream
.join(shopStream, this::joinToShop, JoinWindows.of(Duration.ofMinutes(5)), Joined.with(Serdes.String(), jsonSerde, jsonSerde));
Finally I write the joinedStream to an output topic:
joinedStream
.map((key, value) -> KeyValue.pair(value.path("transactionId").textValue(), value))
.to(OUTPUT_TOPIC, Produced.with(Serdes.String(), jsonSerde));
Then I create two keyValue stores to count the number of original transactions and the joined ones:
Materialized.with(Serdes.String(), Serdes.Long());
transactionStream
.map((key, value) -> KeyValue.pair(SOURCE_COUNT_KEY, value))
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.count(as(SOURCE_STORE))
;
joinedStream
.map((key, value) -> KeyValue.pair(TARGET_COUNT_KEY, value))
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.count(as(TARGET_STORE))
;
The filters work because when I output all events in shopStream and transactionStream I can see that all arrive and the joins start only after all events are printed.
I can also see that the shop created event arrives before the transaction events for that shop. What is also weird is that sometimes when I have 10 transactions to the same shop, 7 are joined correctly and 3 are missing (as an example).
Also, the counts in the key-value stores are correct, because it's the same number of events as in the output topic.
The joinToShop() method is not triggered for the missing joins.
So my question is why is this happening? That sometimes it processes all events and sometimes just a part of them? And how can I make sure that all events are joined?
Data is processed based on timestamps. However, in older releases Kafka Streams applies only a best-effort approach when choosing which input topic to read from next based on timestamps.
I would recommend upgrading to version 2.1 (or newer), which improves timestamp synchronization and should avoid the issue (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization).
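If upgrading alone does not fully resolve it, here is a minimal sketch of the knob introduced by KIP-353 (the application id and bootstrap server below are placeholders, and builder stands for the StreamsBuilder behind eventStream): max.task.idle.ms makes a task wait briefly when one of its input partitions has no buffered data, instead of processing the other partitions ahead of it.
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "shop-transaction-join");   // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");       // placeholder broker
// Wait up to 1 second for an empty input partition to receive data before
// processing the other partitions out of timestamp order.
props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 1000L);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();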

How to capture kafka records that don't match the condition of the kafka stream join?

I'm doing data enrichment by joining a KStream with a KTable. The KStream contains messages sent by vehicles and the KTable contains vehicle data.
The problem I have is that I want to capture messages from the stream that don't have a corresponding join key in the table.
Kafka Streams silently skips records that don't have a join match.
Is there some way to emit those records to a different topic, so they can be processed later?
StreamsBuilder builder = new StreamsBuilder();
final KTable<String, VinMappingInfo> vinMappingTable = builder.table(vinInfoTopic, Consumed.with(Serdes.String(), valueSerde));
KStream<String, VehicleMessage> vehicleStream = builder.stream(sourceTopic);
vehicleStream.join(vinMappingTable, (vehicleMsg, vinInfo) -> {
log.info("joining {} with vin info {}", vehicleMsg.getPayload().getId(), vinInfo.data.vin);
vehicleMsg.setVin(vinInfo.data.vin);
return vehicleMsg;
}, Joined.with(null, null, valueSerde))
.to(destinationTopic);
final Topology topology = builder.build();
log.info("The topology of connected processor nodes: \n {}", topology.describe());
KafkaStreams streams = new KafkaStreams(topology, config);
streams.cleanUp();
streams.start();
You can use a left-join:
stream.leftJoin(table,...);
This ensures that all records from the input stream are in the output stream. The ValueJoiner will be called with apply(streamValue, null) in this case.
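Building on that, here is a minimal sketch of routing the unmatched records to a separate topic so they can be processed later. The getVin() accessor and the "unmatched-vehicle-messages" topic name are assumptions, not part of the original code:
KStream<String, VehicleMessage> enriched = vehicleStream.leftJoin(vinMappingTable,
        (vehicleMsg, vinInfo) -> {
            if (vinInfo != null) {                      // matched: enrich with the VIN
                vehicleMsg.setVin(vinInfo.data.vin);
            }
            return vehicleMsg;                          // unmatched: VIN stays unset
        },
        Joined.with(null, null, valueSerde));

enriched.filter((key, msg) -> msg.getVin() != null)    // assumed accessor
        .to(destinationTopic);

enriched.filter((key, msg) -> msg.getVin() == null)    // no match in the KTable
        .to("unmatched-vehicle-messages");              // assumed topic name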

SPARK - Joining two data streams - maintenance of cache

It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases. The reason is that it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams and enrich each object in stream1 with its corresponding object from stream2 in Spark, and save it to HBase.
The implementation would:
maintain a dataset in memory of objects from stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2, save to HBase if a match is found, or put it back on the Kafka stream if not
This question is about exploring Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you build up with something like:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
  globalRDD2 = globalRDD2.union(rdd)
  globalRDD1.join(globalRDD2).foreach(...) // etc, etc
})
A good start would be to look into mapWithState. This is a more efficient replacement for updateStateByKey. These are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
value.foreach(state.update(_))
(key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value seen for each key. You can join this stream with stream1 to perform the further logic for your second point.

Is it possible to read a message from a PubSub and separate its data in different elements of a PCollection<String>? If so, how?

Now, I have the below code:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from Pub/Sub, convert each of them to multiple parts by splitting the message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because this is not a "reading data" problem but a "transforming data you have already read" problem. You simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"))
.apply(ParDo.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
String composite = c.element();
for (String part : composite.split(" ")) {
c.output(part);
}
}}));
I take it you mean that the data you want is present in different elements of the PCollection, and that you want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn will output a key-value pair with the user id as the key and the item bought as the value: (<1234, A>, <1234, B>).
Using the GroupByKey transform you then group the two values together into one element. You can then perform further processing on that element.
This is a very common pattern in big data, called MapReduce.
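A minimal sketch of that approach, written in the same (older) Dataflow SDK style as the code above and assuming the "User <id> bought item <x>" message format from the example:
PCollection<KV<String, Iterable<String>>> itemsPerUser = input_data
    .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
      public void processElement(ProcessContext c) {
        // e.g. "User 1234 bought item A" -> key "1234", value "A"
        String[] words = c.element().split(" ");
        c.output(KV.of(words[1], words[4]));
      }
    }))
    .apply(GroupByKey.<String, String>create());
// itemsPerUser now holds one element per user, e.g. <1234, [A, B]>.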
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-gen data processing platforms, cf. Spark / Flink.

Resources