What is the difference between DataStream and KeyedStream in Apache Flink? - join

I am asking in the context of joining two streams with Flink, and I would like to understand how these two kinds of stream differ and how that affects the way Flink processes them.
As a related question, I would also like to understand how a CoProcessFunction differs from a KeyedCoProcessFunction.

A KeyedStream is a DataStream that has been hash partitioned, with the effect that for any given key, every stream element for that key is in the same partition. This guarantees that all messages for a key are processed by the same worker instance. Only keyed streams can use key-partitioned state and timers.
A KeyedCoProcessFunction connects two streams that have been keyed in compatible ways -- the two streams are mapped to the same keyspace -- making it possible for the KeyedCoProcessFunction to have keyed state that relates to both streams. For example, you might want to join a stream of customer transactions with a stream of customer updates -- joining them on the customer_id. You would implement this in Flink (if doing so at a low level) by keying both streams by the customer_id, and connecting those keyed streams with a KeyedCoProcessFunction.
On the other hand, a CoProcessFunction has two inputs, but with no particular relationship between those inputs.
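To make the customer example concrete, here is a minimal sketch of keying both streams and connecting them with a KeyedCoProcessFunction. The Transaction and CustomerUpdate POJOs, their customerId field, and the enriched string output are illustrative assumptions, not something from the original answer.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class CustomerEnrichment {

    // Both streams are keyed on the same keyspace (customerId), so the
    // KeyedCoProcessFunction can keep keyed state relating to both of them.
    public static DataStream<String> enrich(
            DataStream<Transaction> transactions,
            DataStream<CustomerUpdate> updates) {
        return transactions
            .keyBy(t -> t.customerId)
            .connect(updates.keyBy(u -> u.customerId))
            .process(new EnrichmentFunction());
    }

    public static class EnrichmentFunction
            extends KeyedCoProcessFunction<String, Transaction, CustomerUpdate, String> {

        // Keyed state: scoped to the current customerId, visible from both inputs.
        private transient ValueState<CustomerUpdate> latestUpdate;

        @Override
        public void open(Configuration parameters) {
            latestUpdate = getRuntimeContext().getState(
                new ValueStateDescriptor<>("latestUpdate", CustomerUpdate.class));
        }

        @Override
        public void processElement1(Transaction txn, Context ctx, Collector<String> out) throws Exception {
            CustomerUpdate update = latestUpdate.value();
            out.collect(txn.customerId + ": " + (update == null ? "no profile yet" : update.toString()));
        }

        @Override
        public void processElement2(CustomerUpdate update, Context ctx, Collector<String> out) throws Exception {
            latestUpdate.update(update);
        }
    }
}

A plain CoProcessFunction, applied to connected but unkeyed streams, would not have access to this kind of per-key state.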
The Flink training has tutorials covering keyed streams and connected streams, and a related exercise/example.

Related

Flink: How can I emit the outputs immediately in a one-to-many stream join

I have one query stream and one item stream. I want to join these two streams on the query_id; the relation is one-to-many. How can I emit each item to the output stream immediately after it arrives, enriched with some info from the query?
You could do this either with the Table API and a simple join, or you could implement it yourself with a CoFlatMap on streams keyed by the query_id, buffering incoming events in state. You should consider some retention policy, though, to make sure the state won't grow indefinitely.
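A minimal sketch of what that could look like at the DataStream level, assuming hypothetical Query, Item, and EnrichedItem POJOs with a queryId field (this follows the answer's CoFlatMap idea, without the retention policy it recommends):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class QueryItemJoin extends RichCoFlatMapFunction<Query, Item, EnrichedItem> {

    private transient ValueState<Query> queryState;   // the "one" side
    private transient ListState<Item> pendingItems;   // items that arrived before their query

    @Override
    public void open(Configuration parameters) {
        queryState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("query", Query.class));
        pendingItems = getRuntimeContext().getListState(
            new ListStateDescriptor<>("pending-items", Item.class));
    }

    @Override
    public void flatMap1(Query query, Collector<EnrichedItem> out) throws Exception {
        queryState.update(query);
        // Flush any items that were buffered before the query arrived.
        for (Item item : pendingItems.get()) {
            out.collect(new EnrichedItem(item, query));
        }
        pendingItems.clear();
    }

    @Override
    public void flatMap2(Item item, Collector<EnrichedItem> out) throws Exception {
        Query query = queryState.value();
        if (query != null) {
            out.collect(new EnrichedItem(item, query));   // emit immediately on arrival
        } else {
            pendingItems.add(item);                       // wait until the query shows up
        }
    }
}

It would be wired up as queries.keyBy(q -> q.queryId).connect(items.keyBy(i -> i.queryId)).flatMap(new QueryItemJoin()); in practice you would add state TTL (or switch to a KeyedCoProcessFunction with timers) so queryState does not live forever.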

Join of multiple streams with the Python SDK

I would like to join multiple streams on a common key and trigger a result either as soon as all of the streams have contributed at least one element or at the end of the window. CoGroupByKey seems to be the appropriate building block, but there does not seem to be a way to express the early trigger condition (count trigger applies per input collection)?
I believe CoGroupByKey is implemented as Flatten + GroupByKey under the hood. Once multiple streams are flattened into one, a data-driven trigger (or any other trigger) won't have enough control to achieve what you want.
Instead of using CoGroupByKey, you can use Flatten plus a stateful DoFn that fills an object backed by State for each key. In that case the stateful DoFn also gets to decide what to do when stream A already has two elements but stream B has none yet.
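The question is about the Python SDK, which has the same per-key state model; purely to illustrate the Flatten-then-stateful-DoFn shape, here is a rough sketch in the Java SDK. The TaggedEvent type, its streamTag field, and the early-emit policy are assumptions for the example.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.beam.sdk.state.BagState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

// TaggedEvent is a hypothetical wrapper recording which input stream an element came from.
class BufferUntilAllStreamsSeenFn extends DoFn<KV<String, TaggedEvent>, KV<String, List<TaggedEvent>>> {

  private final int expectedStreamCount;

  BufferUntilAllStreamsSeenFn(int expectedStreamCount) {
    this.expectedStreamCount = expectedStreamCount;
  }

  // Per-key buffer of everything seen so far on the flattened stream.
  @StateId("buffer")
  private final StateSpec<BagState<TaggedEvent>> bufferSpec = StateSpecs.bag();

  @ProcessElement
  public void processElement(
      @Element KV<String, TaggedEvent> element,
      @StateId("buffer") BagState<TaggedEvent> buffer,
      OutputReceiver<KV<String, List<TaggedEvent>>> out) {

    buffer.add(element.getValue());

    List<TaggedEvent> buffered = new ArrayList<>();
    Set<String> tagsSeen = new HashSet<>();
    for (TaggedEvent e : buffer.read()) {
      buffered.add(e);
      tagsSeen.add(e.streamTag);
    }

    // Early emit: fire as soon as every input stream has contributed at least one element.
    if (tagsSeen.size() >= expectedStreamCount) {
      out.output(KV.of(element.getKey(), buffered));
    }
  }
}

The end-of-window emission mentioned in the question would be handled with an event-time timer on the same stateful DoFn (not shown here).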
Another potential solution that comes to mind is a (stateless) DoFn that filters the CoGBK results to remove those that don't have at least one occurrence for each joined stream. For the end-of-window result (which does not have the same restriction), it would then be necessary to have a parallel CoGBK whose result would not go through the filter. I don't think there is a way to tag results with the trigger that emitted them?

Kafka Streams wait function with depending objects

I am creating a Kafka Streams application which receives different JSON objects from different topics, and I want to implement some kind of wait function, but I'm not sure how best to implement it.
To simplify the problem I'll use simplified entities in the following section; I hope the problem can be described well with them.
In one of my streams I receive car objects, and every car has an id. In a second stream I receive person objects, and every person also has a car id and is assigned to the car with that id.
With my Kafka Streams application I want to read from both input streams (topics) and enrich each car object with the four persons that have the same car id. A car object should only be forwarded to the next downstream processor once all four persons have been added to it.
I have planned to create an input stream for the car objects and one for the person objects, parse the JSON data into the internal object representation, merge both streams together, and apply a "selectKey" function on the merged stream to extract the keys out of the entities.
After that I would push the data into a custom transformation function which has a state store included. Inside this transform function I would store every arriving car object with its id in the state store. As new person objects arrive, I would add them to the respective car object in the state store (please ignore the case of late-arriving cars here). As soon as four persons are in a car object, I would forward the object to the next stream function and remove the car object from the state store.
Would this be a suitable approach? I'm not sure about scalability, because I have to make sure that, when running multiple instances, the car and person objects with the same id will be processed by the same application instance. I would use the selectKey function for this; would that work?
Thanks!
The basic design looks sound to me.
However, selectKey() by itself will not be sufficient, because transform() (in contrast to DSL operators) does not trigger automatic repartitioning. Thus, you need to repartition manually via through().
stream.selectKey(...)
      .through("user-created-topic")
      .transform(...);
https://docs.confluent.io/current/streams/upgrade-guide.html#auto-repartitioning
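To make the buffering step concrete, here is a rough sketch of such a Transformer following the design described in the question. Car, Person, and CarAssembly are placeholder types assumed for the example, and the store wiring is only summarized afterwards.

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// CarAssembly is a hypothetical value type holding the car plus the persons collected so far.
public class CarAssemblyTransformer implements Transformer<String, Object, KeyValue<String, CarAssembly>> {

    private final String storeName;
    private KeyValueStore<String, CarAssembly> store;

    public CarAssemblyTransformer(String storeName) {
        this.storeName = storeName;
    }

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        store = (KeyValueStore<String, CarAssembly>) context.getStateStore(storeName);
    }

    @Override
    public KeyValue<String, CarAssembly> transform(String carId, Object value) {
        CarAssembly assembly = store.get(carId);
        if (assembly == null) {
            assembly = new CarAssembly();
        }
        if (value instanceof Car) {
            assembly.setCar((Car) value);
        } else if (value instanceof Person) {
            assembly.addPerson((Person) value);
        }
        if (assembly.isComplete()) {            // car present and all four persons collected
            store.delete(carId);
            return KeyValue.pair(carId, assembly);   // forward downstream
        }
        store.put(carId, assembly);
        return null;                             // hold back until complete
    }

    @Override
    public void close() {}
}

The store itself still has to be registered with the topology (for example via StreamsBuilder#addStateStore) and its name passed as the second argument to transform(), and the repartitioning shown above has to happen before this stateful step so that cars and persons with the same id land on the same instance.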

Kapacitor: calculating difference between two streams via join

Full disclosure: I also published a variant of this question here.
I have an embedded device as part of a heating system that is publishing two temperature values, each to an individual MQTT topic, every 5 seconds via a mosquitto MQTT broker. "mydevice/sensor1" is the pre-heated temperature, and "mydevice/sensor2" is post-heating temperature. The values are published at almost the same time, so there's typically never more than half a second of delay between the two messages - but they aren't synchronised exactly.
Telegraf is subscribed to the same broker and is happily putting these measurements into an InfluxDB database called "telegraf.autogen". The measurements both appear under a single measurement called "mqtt_consumer" with a field called "value". In InfluxDB I can differentiate between topic-tagged values by filtering with the "topic" tag:
SELECT mean("value") AS "mean_value" FROM "telegraf"."autogen"."mqtt_consumer" WHERE time > now() - 1m AND "topic"='mydevice/sensor1' GROUP BY time(5s)
This all seems to be working correctly.
What I want to do is calculate the difference between these two topic values, for each pair of incoming values, in order to calculate the temperature differential and eventually calculate the energy being transferred by the heating system (the flow rate is constant and known). I tried to do this with InfluxDB queries in Grafana but it seemed quite difficult (I failed), so I thought I'd try and use TICKscript to break down my process into small steps.
I have been putting together a TICKscript to calculate the difference based on this example:
https://docs.influxdata.com/kapacitor/v1.3/guides/join_backfill/#stream-method
However in my case I don't have two separate measurements. Instead, I create two separate streams from the single "mqtt_consumer" measurement, using the topic tag as a filter. Then I attempt to join these with a 1s tolerance (values are always published close enough in time). I'm using httpOut to generate a view for debugging (Aside: this only updates every 10 seconds, missing every second value, even though my stream operates at 5 second intervals - why is that? I can see in the new db that the values are all present though).
Once I have them joined, I would evaluate the difference in values, and store this in a new database under a measurement called "diff".
Here's my script so far:
var sensor1 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor1')
        .groupBy(*)
    |httpOut('sensor1')

var sensor2 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor2')
        .groupBy(*)
    |httpOut('sensor2')

sensor1
    |join(sensor2)
        .as('value1', 'value2')
        .tolerance(1s)
    |httpOut('join')
    |eval(lambda: "sensor1.value1" - "sensor1.value2")
        .as('diff')
    |httpOut('diff')
    |influxDBOut()
        .create()
        .database('mydb')
        .retentionPolicy('myrp')
        .measurement('diff')
Unfortunately my script is failing to pass any items through the join node. In kapacitor show I can see that the httpOut nodes are both passing items to the join node, but it isn't passing any on. The kapacitor logs don't show anything obvious either. An HTTP GET for httpOut('join') returns:
{"series":null}
I have two questions:
1. Is this approach (using Kapacitor with a TICKscript to calculate energy based on the difference between two values in a single measurement) valid? Or is there a better/simpler way to do this?
2. Why isn't the join node producing any output? What can I do to debug this further?
Try adding a |mean node, to calculate the mean of the field, in both sensor streams:
var sensor1 = stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('mqtt_consumer')
        .where(lambda: "topic" == 'mydevice/sensor1')
        .groupBy(*)
    |mean('field1')
    |httpOut('sensor1')
After the join, you should use the newly assigned names of the streams, not the original ones:
sensor1
    |join(sensor2)
        .as('value1', 'value2')
        .tolerance(1s)
    |httpOut('join')
    |eval(lambda: "value1.field1" - "value2.field2")
        .as('diff')
    |httpOut('diff')
    |influxDBOut()
        .create()
        .database('mydb')
        .retentionPolicy('myrp')
        .measurement('diff')
Here the mean fields are the ones calculated by the |mean nodes added above. Try it out!
Also, for further debugging, try adding |log() nodes wherever you want to take a closer look.
Hope this helps! Regards

Time Series Databases - Metrics vs. tags

I'm new to TSDBs and I have a lot of temperature sensors whose data I need to store, with one point per second. Is it better to use one unique metric per sensor, or only one metric (temperature, for example) with distinct tags depending on the sensor?
I searched the Internet for the best practice, but I didn't find a good answer...
Thank you! :-)
Edit:
I will have 8 types of measurements (temperature, setpoint, energy, power,...) from 2500 sources
If you are storing your data in InfluxDB, I would recommend storing all the metrics in a single measurement and using tags to differentiate the sources, rather than creating a measurement per source. The reason being that you can trivially merge or decompose the metrics using tags within a measurement, but it is not possible in the newest InfluxDB to merge or join across measurements.
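For example (illustrative names only), the eight measurement types from the edit could be written as fields of one measurement, with the source carried as a tag, using InfluxDB line protocol:

sensors,source=sensor-0042 temperature=21.4,setpoint=22.0,power=1500,energy=3.2 1438560000000000000
sensors,source=sensor-0043 temperature=19.8,setpoint=21.0,power=0,energy=1.1 1438560000000000000

Merging across sources is then a GROUP BY or WHERE on the source tag rather than a cross-measurement join.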
Ultimately the decision rests with both your choice of TSDB and the queries you care most about running.
For comparison purposes, in Axibase Time-Series Database you can store temperature as a metric and the sensor id as the entity name. The ATSD schema has a notion of an entity, which is the name of the system for which the data is being collected. The advantage is more compact storage and the ability to define tags for the entities themselves, for example sensor location, sensor type, etc. This way you can filter and group results not just by sensor id but also by sensor tags.
To give you an example, in this blog article 0601911 stands for entity id - which is EPA station id. This station collects several environmental metrics and at the same time is described with multiple tags in the database: http://axibase.com/environmental-monitoring-using-big-data/.
The bottom line is that you don't have to stage a second database, typically a relational one, just to store extended information about sensors, servers etc. for advanced reporting.
UPDATE 1: Sample network command:
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 m:humidity=72 m:precipitation=44.3
Tags that describe sensor-001, such as location, type, etc., are stored separately, minimizing the storage footprint and speeding up queries. If you're collecting energy/power metrics you often have to attach attributes to a series, such as Status, because the data may not arrive clean/verified. You can use series tags for this purpose.
series e:sensor-001 d:2015-08-03T00:00:00Z m:temperature=42.2 ... t:status=Provisional
You should use one metric per sensor. You probably won't need to aggregate values from different temperature sensors, but you will need to aggregate values of a given sensor (an average over a minute, for instance).
Metrics correspond to data coming from the same source, or at least data you are likely to aggregate. You can create almost as many metrics as you want (up to 16 million metrics in OpenTSDB, for instance).
Tags make distinctions between these pieces of data. For instance, you could tag data differently if it suddenly changes a lot, in order to retrieve only the relevant data when needed, without losing the rest. Although for a temperature sensor producing data every second, the best approach would probably be to filter and only store data when the value changes...
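As an illustration of that layout (names are made up), OpenTSDB's telnet-style put command with one metric per sensor and a tag reserved for such distinctions would look roughly like:

put temperature.sensor0042 1438560000 21.4 quality=ok
put temperature.sensor0043 1438560000 19.8 quality=suspect

(OpenTSDB requires at least one tag per data point, so the quality tag here also satisfies that requirement.)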
Best practices are summed up here
