I have a Kafka Streams application in which I am trying to join transactions to shops.
This works fine for new events, because shops are always created before transactions happen. But when I read "historic" data (events that happened before the application was started), sometimes all transactions are joined to shops and sometimes fewer, typically around 60%-80%, though once I got only 30%.
This is very weird, because I know that all transactions have a correct shop id: on some runs every one of them is joined.
The events are in one topic and I use filters to put them into two streams. Then I create a KTable from the shops stream and then I join the transaction stream to the shop table.
final KStream<String, JsonNode> shopStream = eventStream
.filter((key, value) -> "SHOP_CREATED_EVENT".equals(value.path("eventType").textValue()))
.map((key, value) -> KeyValue.pair(value.path("shop_id").textValue(), value)
);
final KStream<String, JsonNode> transactionStream = eventStream
.filter((key, value) -> "TRANSACTION_EVENT".equals(value.path("eventType").textValue()))
.map((key, value) -> KeyValue.pair(value.path("shop_id").textValue(), value)
);
final KTable<String, JsonNode> shopTable = shopStream
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.reduce(genericReducer);
final KStream<String, JsonNode> joinedStream = transactionStream
.join(shopTable, this::joinToShop, Joined.valueSerde(jsonSerde));
I also tried a stream-stream join instead of the stream-table join, with the same result:
final KStream<String, JsonNode> joinedStream = transactionStream
.join(shopStream, this::joinToShop, JoinWindows.of(Duration.ofMinutes(5)), Joined.with(Serdes.String(), jsonSerde, jsonSerde));
Finally I write the joinedStream to an output topic:
joinedStream
.map((key, value) -> KeyValue.pair(value.path("transactionId").textValue(), value))
.to(OUTPUT_TOPIC, Produced.with(Serdes.String(), jsonSerde));
Then I create two key-value stores to count the original transactions and the joined ones:
Materialized.with(Serdes.String(), Serdes.Long());
transactionStream
.map((key, value) -> KeyValue.pair(SOURCE_COUNT_KEY, value))
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.count(as(SOURCE_STORE))
;
joinedStream
.map((key, value) -> KeyValue.pair(TARGET_COUNT_KEY, value))
.groupByKey(Grouped.with(Serdes.String(), jsonSerde))
.count(as(TARGET_STORE))
;
The filters work: when I print all events in shopStream and transactionStream, I can see that they all arrive, and the joins start only after all events are printed.
I can also see that the shop created event arrives before the transaction events for that shop. What is also weird is that when I have, say, 10 transactions for the same shop, sometimes 7 are joined correctly and 3 are missing.
The counts in the key-value stores are correct too, because they match the number of events in the output topic.
The joinToShop() method is not triggered for the missing joins.
So my question is: why does it sometimes process all events and sometimes only part of them? And how can I make sure that all events are joined?
Data is processed based on timestamps. However, in older releases, Kafka Streams applies only a best-effort approach to reading data from different topics in timestamp order.
I would recommend upgrading to version 2.1 (or newer), which improves timestamp synchronization and should avoid the issue (cf. https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization)
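As far as I know, the knob that KIP-353 added for this is the `max.task.idle.ms` setting: a positive value lets a task wait for data on all of its input partitions before it picks the next record by timestamp. Here is a minimal sketch of building such a configuration with plain `java.util.Properties`; the application id, the bootstrap server and the 5000 ms value are placeholders I made up, so tune them for your setup.

```java
import java.util.Properties;

class StreamsIdleConfigSketch {

    /** Returns a copy of the base config with max.task.idle.ms set to idleMs. */
    static Properties withTaskIdle(Properties base, long idleMs) {
        Properties props = new Properties();
        props.putAll(base);
        // Assumption: the application runs on Kafka Streams 2.1 or newer,
        // where this setting is understood.
        props.setProperty("max.task.idle.ms", Long.toString(idleMs));
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("application.id", "shop-transaction-join"); // hypothetical name
        base.setProperty("bootstrap.servers", "localhost:9092");     // hypothetical address
        Properties props = withTaskIdle(base, 5000L);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting Properties object is what you would hand to the KafkaStreams constructor together with the topology.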
None of the provided DataFlow templates match what I need to do, so I'm trying to write my own. I managed to run example code such as the word count example without issue, so I tried to stitch together parts of separate examples that read from BigQuery and write to Spanner, but there are so many things in the source code that I don't understand and cannot adapt to my own problem.
I'm REALLY lost on this and any help is greatly appreciated!
The goal is to use DataFlow and the Apache Beam SDK to read from a BigQuery table with 3 string fields and 1 integer field, concatenate the content of the 3 string fields into one string, and put that new string in a new field called "key". Then I want to write the key field and the (unchanged) integer field to a Spanner table that already exists, ideally appending rows with a new key and updating the integer field of rows whose key already exists.
I'm trying to do this in Java because there is no i/o connector for Python. Any advice on doing this with Python is much appreciated.
For now I would be super happy if I could just read a table from BigQuery and write whatever I get from that table to a table in Spanner, but I can't even make that happen.
Problems:
I'm using Maven and I don't know what dependencies I need to put in the pom file
I don't know which package and import I need at the beginning of my java file
I don't know if I should use readTableRows() or read(SerializableFunction) to read from BigQuery
I have no idea how to access the string fields in the PCollection to concatenate them or how to make the new PCollection with only the key and integer field
I somehow need to make the PCollection into a Mutation to write to Spanner
I want to use an INSERT UPDATE query to write to the Spanner table, which doesn't seem to be an option in the Spanner i/o connector.
Honestly, I'm almost too embarrassed to show the code I'm trying to run.
public class SimpleTransfer {
public static void main(String[] args) {
// Create and set your PipelineOptions.
DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
// For Cloud execution, set the Cloud Platform project, staging location, and specify DataflowRunner.
options.setProject("myproject");
options.setStagingLocation("gs://mybucket");
options.setRunner(DataflowRunner.class);
// Create the Pipeline with the specified options.
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";
// read whole table from bigquery
rowsFromBigQuery =
p.apply(
BigQueryIO.readTableRows()
.from(tableSpec);
// Hopefully some day add a transform
// Somehow make a Mutation
PCollection<Mutation> mutation = rowsFromBigQuery;
// Only way I found to write to Spanner, not even sure if that works.
SpannerWriteResult result = mutation.apply(
SpannerIO.write().withInstanceId("myinstance").withDatabaseId("mydatabase").grouped());
p.run().waitUntilFinish();
}
}
It's intimidating to deal with these strange data types, but once you get used to the TableRow and Mutation types, you'll be able to code robust pipelines.
The first thing you need to do is take your PCollection of TableRows, and convert those into an intermediate format that is convenient for you. Let's use Beam's KV, which defines a key-value pair. In the following snippet, we're extracting the values from the TableRow, and concatenating the string you want:
rowsFromBigQuery
.apply(
MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow tableRow) -> KV.of(
(String) tableRow.get("myKey1")
+ (String) tableRow.get("myKey2")
+ (String) tableRow.get("myKey3"),
(Integer) tableRow.get("myIntegerField"))))
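One thing to watch with the casts above: as far as I remember, readTableRows() hands INTEGER columns back as strings (matching BigQuery's exported JSON), in which case the (Integer) cast would fail at runtime. Treat that as an assumption and check it against your own rows. The sketch below shows a more defensive way to build the key and the number. It uses a plain Map<String, Object> as a stand-in for TableRow (which is itself a map of field names to values), and the field names are the placeholders from the snippet above.

```java
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;

class RowKeySketch {

    /** Reads a field as text and fails loudly when it is missing. */
    private static String text(Map<String, Object> row, String field) {
        Object raw = row.get(field);
        if (raw == null) {
            throw new IllegalArgumentException("missing field: " + field);
        }
        return raw.toString();
    }

    /** Concatenates the three string fields into the key and parses the integer field. */
    static Map.Entry<String, Long> toKeyValue(Map<String, Object> row) {
        String key = text(row, "myKey1") + text(row, "myKey2") + text(row, "myKey3");
        // Going through toString() works whether the value arrives as "42" or as a Number.
        Long value = Long.valueOf(text(row, "myIntegerField"));
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("myKey1", "a");
        row.put("myKey2", "b");
        row.put("myKey3", "c");
        row.put("myIntegerField", "42"); // integer delivered as a string
        Map.Entry<String, Long> kv = toKeyValue(row);
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

Inside the pipeline the same logic would go into the lambda passed to via(), with TypeDescriptors.longs() in place of TypeDescriptors.integers() if you keep the value as a Long.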
Finally, to write to Spanner, we use Mutation-type objects, which define the kind of mutation that we want to apply to a row in Spanner. We'll do it with another MapElements transform, which takes N inputs, and returns N outputs. We define the insert or update mutations there:
myKvPairsPCollection
.apply(
MapElements.into(TypeDescriptor.of(Mutation.class))
.via((KV<String, Integer> elm) -> Mutation.newInsertOrUpdateBuilder("myTableName")
.set("key").to(elm.getKey())
.set("value").to(elm.getValue())
.build()));
And then you can pass the output of that to SpannerIO.write. The whole pipeline looks something like this:
Pipeline p = Pipeline.create(options);
String tableSpec = "database.mytable";
// read whole table from bigquery
PCollection<TableRow> rowsFromBigQuery =
p.apply(
BigQueryIO.readTableRows().from(tableSpec));
// Take in a TableRow, and convert it into a key-value pair
PCollection<Mutation> mutations = rowsFromBigQuery
// First we make the TableRows into the appropriate key-value
// pair of string key and integer.
.apply(
MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(),
TypeDescriptors.integers()))
.via((TableRow tableRow) -> KV.of(
(String) tableRow.get("myKey1")
+ (String) tableRow.get("myKey2")
+ (String) tableRow.get("myKey3"),
(Integer) tableRow.get("myIntegerField"))))
// Now we construct the mutations
.apply(
MapElements.into(TypeDescriptor.of(Mutation.class))
.via((KV<String, Integer> elm) -> Mutation.newInsertOrUpdateBuilder("myTableName")
.set("key").to(elm.getKey())
.set("value").to(elm.getValue())
.build()));
// Now we pass the mutations to Spanner. SpannerIO.write() takes a
// PCollection<Mutation>; grouped() is meant for PCollection<MutationGroup>.
SpannerWriteResult result = mutations.apply(
SpannerIO.write()
.withInstanceId("myinstance")
.withDatabaseId("mydatabase"));
p.run().waitUntilFinish();
I'm doing data enrichment by joining a kstream with a ktable. The kstream contains messages sent by vehicles and the ktable contains vehicle data.
The problem I have is that I want to capture messages from the stream that don't have a corresponding join key in the table.
Kafka Streams silently skips records that don't have a join match.
Is there some way to emit those records to a different topic, so they can be processed later?
StreamsBuilder builder = new StreamsBuilder();
final KTable<String, VinMappingInfo> vinMappingTable = builder.table(vinInfoTopic, Consumed.with(Serdes.String(), valueSerde));
KStream<String, VehicleMessage> vehicleStream = builder.stream(sourceTopic);
vehicleStream.join(vinMappingTable, (vehicleMsg, vinInfo) -> {
log.info("joining {} with vin info {}", vehicleMsg.getPayload().getId(), vinInfo.data.vin);
vehicleMsg.setVin(vinInfo.data.vin);
return vehicleMsg;
}, Joined.with(null, null, valueSerde))
.to(destinationTopic);
final Topology topology = builder.build();
log.info("The topology of connected processor nodes: \n {}", topology.describe());
KafkaStreams streams = new KafkaStreams(topology, config);
streams.cleanUp();
streams.start();
You can use a left-join:
stream.leftJoin(table,...);
This ensures that all records from the input stream are in the output stream. For a stream record without a match in the table, the ValueJoiner will be called with apply(streamValue, null).
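To make the null handling concrete, here is a plain-Java model of those left-join semantics. It does not use the Kafka Streams API: a Map stands in for the KTable, a BiFunction for the ValueJoiner, and all class and variable names are made up for illustration. The joiner runs for every stream record, and the result is split afterwards on whether the table side was null.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class LeftJoinSketch {

    /** Join result: vin stays null when the table had no entry for the key. */
    static final class Enriched {
        final String message;
        final String vin;

        Enriched(String message, String vin) {
            this.message = message;
            this.vin = vin;
        }
    }

    /** Models stream.leftJoin(table, joiner): the joiner is called for every stream record. */
    static List<Enriched> leftJoin(List<Map.Entry<String, String>> stream,
                                   Map<String, String> table,
                                   BiFunction<String, String, Enriched> joiner) {
        List<Enriched> out = new ArrayList<>();
        for (Map.Entry<String, String> record : stream) {
            // table.get returns null for an absent key, like the null passed to the joiner
            out.add(joiner.apply(record.getValue(), table.get(record.getKey())));
        }
        return out;
    }

    /** Returns the messages whose "had a table match" state equals the flag. */
    static List<String> messages(List<Enriched> joined, boolean matched) {
        List<String> out = new ArrayList<>();
        for (Enriched e : joined) {
            if ((e.vin != null) == matched) {
                out.add(e.message);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> vinTable = new HashMap<>();
        vinTable.put("vehicle-1", "VIN-AAA");

        List<Map.Entry<String, String>> stream = new ArrayList<>();
        stream.add(new SimpleEntry<>("vehicle-1", "msg-1"));
        stream.add(new SimpleEntry<>("vehicle-2", "msg-2"));

        List<Enriched> joined = leftJoin(stream, vinTable, Enriched::new);
        System.out.println("to destination topic: " + messages(joined, true));
        System.out.println("to retry topic: " + messages(joined, false));
    }
}
```

In the real topology the same split can be done on the stream returned by leftJoin, for example with two filter() calls (one keeping records with a VIN, one keeping records without) that each end in a to() for their own topic.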
Consider the following flux
FluxSink<String> sink;
Flux<String> flux1 = Flux
.<String>create(emitter -> {
sink = emitter;
},...)
.cache()
.publish()
.autoConnect();
So to add/subscribe an item, we can do sink.next("4");
flux1.subscribe(item -> log.info("item: " + item));
By filtering flux1, say on element "2", we're not removing that item from the flux.
I know the Flux publisher is immutable.
If we can add to it via sink, how can we remove an item from flux1?
The proper thinking gives proper answers
Think of a Flux as an immutable stream of messages. It is like a river: you can add some water to it, but you can't take back the water that has already flowed downstream. However, you can filter that water further down the stream.
In case you need to "remove" illegal elements from the stream, you can filter them:
flux1.filter(e -> !e.equals("2"))
.subscribe(item -> log.info("item: " + item));
A Flux is not a data structure of the kind we are used to. It is a stream of data: you cannot modify it at the point where the data is supplied, but you can manipulate how the data travels to the endpoint.
It is evident that the out-of-the-box join capability in Spark Streaming does not cover a lot of real-life use cases, because it joins only the data contained in the micro-batch RDDs.
The use case is to join data from two Kafka streams, enrich each object in stream1 with its corresponding object in stream2 in Spark, and save it to HBase.
The implementation would
maintain an in-memory dataset of objects from stream2, adding or replacing objects as and when they are received
for every element in stream1, access the cache to find a matching object from stream2, and save to HBase if a match is found or put the element back on the Kafka stream if not.
This question is about exploring Spark Streaming and its API to find a way to implement the above.
You can join the incoming RDDs to other RDDs -- not just the ones in that micro-batch. Basically you keep a "running total" RDD that you fill something like:
var globalRDD1: RDD[...] = sc.emptyRDD
var globalRDD2: RDD[...] = sc.emptyRDD
dstream1.foreachRDD(rdd => if (!rdd.isEmpty) globalRDD1 = globalRDD1.union(rdd))
dstream2.foreachRDD(rdd => if (!rdd.isEmpty) {
globalRDD2 = globalRDD2.union(rdd)
globalRDD1.join(globalRDD2).foreach(...) // etc, etc
}
A good start would be to look into mapWithState, which is a more efficient replacement for updateStateByKey. Both are defined on PairDStreamFunctions, so assuming your objects of type V in stream2 are identified by some key of type K, your first point would go like this:
def stream2: DStream[(K, V)] = ???
def maintainStream2Objects(key: K, value: Option[V], state: State[V]): (K, V) = {
value.foreach(state.update(_))
(key, state.get())
}
val spec = StateSpec.function(maintainStream2Objects)
val stream2State = stream2.mapWithState(spec)
stream2State is now a stream where each batch contains the (K, V) pairs with the latest value seen for each key. You can do a join on this stream and stream1 to perform the further logic for your second point.
Now, I have the below code:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"));
Looks like you want to read some messages from pubsub and convert each of them to multiple parts by splitting a message on space characters, and then feed the parts to the rest of your pipeline. No special configuration of PubsubIO is needed, because it's not a "reading data" problem - it's a "transforming data you have already read" problem - you simply need to insert a ParDo which takes your "composite" record and breaks it down in the way you want, e.g.:
PCollection<String> input_data =
pipeline
.apply(PubsubIO
.Read
.withCoder(StringUtf8Coder.of())
.named("ReadFromPubSub")
.subscription("/subscriptions/project_name/subscription_name"))
.apply(ParDo.of(new DoFn<String, String>() {
public void processElement(ProcessContext c) {
String composite = c.element();
for (String part : composite.split(" ")) {
c.output(part);
}
}
}));
I take it you mean that the data you want is present in different elements of the PCollection and want to extract and group it somehow.
A possible approach is to write a DoFn function that processes each String in the PCollection. You output a key value pair for each piece of data you want to group. You can then use the GroupByKey transform to group all the relevant data together.
For example you have the following messages from pubsub in your PCollection:
User 1234 bought item A
User 1234 bought item B
The DoFn function will output a key value pair with the user id as key and the item bought as value. ( <1234,A> , <1234, B> ).
Using the GroupByKey transform you group the two values together in one element. You can then perform further processing on that element.
This is a very common pattern in big data called MapReduce.
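To show the shape of that pattern without any pipeline machinery, here is a plain-Java sketch using java.util.stream: the map step plays the role of the DoFn and the groupingBy collector plays the role of GroupByKey. The message layout ("User <id> bought item <name>") is taken from the example above, and the class and method names are made up for illustration.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class GroupByUserSketch {

    /** The "DoFn" step: turns "User 1234 bought item A" into the pair (1234, A). */
    static Map.Entry<String, String> toKeyValue(String message) {
        String[] parts = message.trim().split("\\s+");
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected message: " + message);
        }
        return new AbstractMap.SimpleEntry<>(parts[1], parts[4]);
    }

    /** The "GroupByKey" step: collects all items bought by the same user. */
    static Map<String, List<String>> groupByUser(List<String> messages) {
        return messages.stream()
                .map(GroupByUserSketch::toKeyValue)
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList(
                "User 1234 bought item A",
                "User 1234 bought item B");
        System.out.println(groupByUser(messages)); // prints {1234=[A, B]}
    }
}
```

In a real streaming pipeline the grouping also needs a window or trigger, since an unbounded input never "finishes" the way this in-memory list does.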
You can output an Iterable<A> and then use Flatten to squash it. Unsurprisingly, this is termed flatMap in many next-gen data processing platforms, cf. Spark and Flink.
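For anyone who has not met flatMap before, this is what it looks like in plain Java streams: each input string produces several outputs, and the results are flattened into one sequence. It is only an in-memory illustration of the idea, not pipeline code, and the class name is made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapSketch {

    /** Splits every message on single spaces and flattens the parts into one list. */
    static List<String> splitAll(List<String> messages) {
        return messages.stream()
                .flatMap(message -> Arrays.stream(message.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(splitAll(Arrays.asList("a b", "c"))); // prints [a, b, c]
    }
}
```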