Creating Custom Windowing Function in Apache Beam - google-cloud-dataflow

I have a Beam pipeline that starts off with reading multiple text files where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file & count of rows later inserted into Bigtable match. For this I am planning to develop a custom Windowing strategy so that lines from a single file get assigned to a single window based on the file name as the key that will be passed to the Windowing function.
Is there any code sample for creating custom Windowing functions?

Although I changed my strategy for confirming the inserted number of rows, for anyone who is interested in windowing elements read from a batch source e.g. FileIO in a batch job, here's the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow>{
private static final long serialVersionUID = -476922142925927415L;
private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);
#Override
public IntervalWindow assignWindow(Instant timestamp) {
Instant end = new Instant(timestamp.getMillis() + 1);
IntervalWindow interval = new IntervalWindow(timestamp, end);
LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
return interval;
}
#Override
public boolean isCompatible(WindowFn<?, ?> other) {
return this.equals(other);
}
#Override
public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
if (!this.isCompatible(other)) {
throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
}
}
#Override
public Coder<IntervalWindow> windowCoder() {
return IntervalWindow.getCoder();
}
}
and then it can be used in the pipeline as below:
p
.apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
.apply("Assign_Window_to_Each_Message", Window.<KV<String,String>>into(new FileWindows())
.withAllowedLateness(Duration.standardMinutes(1))
.discardingFiredPanes());
Please keep in mind that you will need to write the AssignTimestampFn() so that each message carries a timestamp.

Related

Flink Count of Events using metric

I have a topic in kafka where i am getting multiple type of events in json format. I have created a filestreamsink to write these events to S3 with bucketing.
FlinkKafkaConsumer errorTopicConsumer = new FlinkKafkaConsumer(ERROR_KAFKA_TOPICS,
new SimpleStringSchema(),
properties);
final StreamingFileSink<Object> errorSink = StreamingFileSink
.forRowFormat(new Path(outputPath + "/error"), new SimpleStringEncoder<>("UTF-8"))
.withBucketAssigner(new EventTimeBucketAssignerJson())
.build();
env.addSource(errorTopicConsumer)
.name("error_source")
.setParallelism(1)
.addSink(errorSink)
.name("error_sink").setParallelism(1);
public class EventTimeBucketAssignerJson implements BucketAssigner<Object, String> {
#Override
public String getBucketId(Object record, Context context) {
StringBuffer partitionString = new StringBuffer();
Tuple3<String, Long, String> tuple3 = (Tuple3<String, Long, String>) record;
try {
partitionString.append("event_name=")
.append(tuple3.f0).append("/");
String timePartition = TimeUtils.getEventTimeDayPartition(tuple3.f1);
partitionString.append(timePartition);
} catch (Exception e) {
partitionString.append("year=").append(Constants.DEFAULT_YEAR).append("/")
.append("month=").append(Constants.DEFAULT_MONTH).append("/")
.append("day=").append(Constants.DEFAULT_DAY);
}
return partitionString.toString();
}
#Override
public SimpleVersionedSerializer<String> getSerializer() {
return SimpleVersionedStringSerializer.INSTANCE;
}
}
Now i want to publish hourly count of each event as metrics to prometheus and publish a grafana dashboard over that.
So please help me how can i achieve hourly count for each event using flink metrics and publish to prometheus.
Thanks
Normally, this is done by simply creating a counter for requests and then using the rate() function in Prometheus, this will give you the rate of requests in the given time.
If You, however, want to do this on Your own for some reason, then You can do something similar to what has been done in org.apache.kafka.common.metrics.stats.Rate. So You would, in this case, need to gather list of samples with the time at which they were collected, along with the window size You want to use for calculation of the rate, then You could simply do the calculation, i.e. remove samples that went out of scope and has expired and then simply calculate how many samples are in the window.
You could then set the Gauge to the calculated value.

Read from multiple Pubsub subscriptions using ValueProvider

I have multiple subscriptions from Cloud PubSub to read based on certain prefix pattern using Apache Beam. I extend PTransform class and implement expand() method to read from multiple subscriptions and do Flatten transformation to the PCollectionList (multiple PCollection on from each subscription). I have a problem to pass subscription prefix as ValueProvider into the expand() method, since expand() is called on template creation time, not when launching the job. However, if I only use 1 subscription, I can pass ValueProvider into PubsubIO.readStrings().fromSubscription().
Here's some sample code.
public class MultiPubSubIO extends PTransform<PBegin, PCollection<PubsubMessage>> {
private ValueProvider<String> prefixPubsub;
public MultiPubSubIO(#Nullable String name, ValueProvider<String> prefixPubsub) {
super(name);
this.prefixPubsub = prefixPubsub;
}
#Override
public PCollection<PubsubMessage> expand(PBegin input) {
List<String> myList = null;
try {
// prefixPubsub.get() will return error
myList = PubsubHelper.getAllSubscription("projectID", prefixPubsub.get());
} catch (Exception e) {
LogHelper.error(String.format("Error getting list of subscription : %s",e.toString()));
}
List<PCollection<PubsubMessage>> collectionList = new ArrayList<PCollection<PubsubMessage>>();
if(myList != null && !myList.isEmpty()){
for(String subs : myList){
PCollection<PubsubMessage> pCollection = input
.apply("ReadPubSub", PubsubIO.readMessagesWithAttributes().fromSubscription(this.prefixPubsub));
collectionList.add(pCollection);
}
PCollection<PubsubMessage> pubsubMessagePCollection = PCollectionList.of(collectionList)
.apply("FlattenPcollections", Flatten.pCollections());
return pubsubMessagePCollection;
} else {
LogHelper.error(String.format("No subscription with prefix %s found", prefixPubsub));
return null;
}
}
public static MultiPubSubIO read(ValueProvider<String> prefixPubsub){
return new MultiPubSubIO(null, prefixPubsub);
}
}
So I'm thinking of how to use the same way PubsubIO.read().fromSubscription() to read from ValueProvider. Or am I missing something?
Searched links:
extract-value-from-valueprovider-in-apache-beam - Answer talked about using DoFn, while I need PTransform that receives PBegin.
Unfortunately this is not possible currently:
It is not possible for the value of a ValueProvider to affect transform expansion - at expansion time, it is unknown; by the time it is known, the pipeline shape is already fixed.
There is currently no transform like PubsubIO.read() that can accept a PCollection of topic names. Eventually there will be (it is enabled by Splittable DoFn), but it will take a while - nobody is working on this currently.
You can use MultipleReadFromPubSub from apache beam io module https://beam.apache.org/releases/pydoc/2.27.0/_modules/apache_beam/io/gcp/pubsub.html
topic_1 = PubSubSourceDescriptor('projects/myproject/topics/a_topic')
topic_2 = PubSubSourceDescriptor(
'projects/myproject2/topics/b_topic',
'my_label',
'my_timestamp_attribute')
subscription_1 = PubSubSourceDescriptor(
'projects/myproject/subscriptions/a_subscription')
results = pipeline | MultipleReadFromPubSub(
[topic_1, topic_2, subscription_1])

How do I make View's asList() sortable in Google Dataflow SDK?

We have a problem making asList() method sortable.
We thought we could do this by just extending the View class and override the asList method but realized that View class has a private constructor so we could not do this.
Our other attempt was to fork the Google Dataflow code on github and modify the PCollectionViews class to return a sorted list be using the Collections.sort method as shown in the code snippet below
#Override
protected List<T> fromElements(Iterable<WindowedValue<T>> contents) {
Iterable<T> itr = Iterables.transform(
contents,
new Function<WindowedValue<T>, T>() {
#SuppressWarnings("unchecked")
#Override
public T apply(WindowedValue<T> input){
return input.getValue();
}
});
LOG.info("#### About to start sorting the list !");
List<T> tempList = new ArrayList<T>();
for (T element : itr) {
tempList.add(element);
};
Collections.sort((List<? extends Comparable>) tempList);
LOG.info("##### List should now be sorted !");
return ImmutableList.copyOf(tempList);
}
Note that we are now sorting the list.
This seemed to work, when run with the DirectPipelineRunner but when we tried the BlockingDataflowPipelineRunner, it didn't seem like the code change was being executed.
Note: We actually recompiled the dataflow used it in our project but this did not work.
How can we be able to achieve this (as sorted list from the asList method call)?
The classes in PCollectionViews are not intended for extension. Only the primitive view types provided by View.asSingleton, View.asSingleton View.asIterable, View.asMap, and View.asMultimap are supported.
To obtain a sorted list from a PCollectionView, you'll need to sort it after you have read it. The following code demonstrates the pattern.
// Assume you have some PCollection
PCollection<MyComparable> myPC = ...;
// Prepare it for side input as a list
final PCollectionView<List<MyComparable> myView = myPC.apply(View.asList());
// Side input the list and sort it
someOtherValue.apply(
ParDo.withSideInputs(myView).of(
new DoFn<A, B>() {
#Override
public void processElement(ProcessContext ctx) {
List<MyComparable> tempList =
Lists.newArrayList(ctx.sideInput(myView));
Collections.sort(tempList);
// do whatever you want with sorted list
}
}));
Of course, you may not want to sort it repeatedly, depending on the cost of sorting vs the cost of materializing it as a new PCollection, so you can output this value and read it as a new side input without difficulty:
// Side input the list, sort it, and put it in a PCollection
PCollection<List<MyComparable>> sortedSingleton = Create.<Void>of(null).apply(
ParDo.withSideInputs(myView).of(
new DoFn<Void, B>() {
#Override
public void processElement(ProcessContext ctx) {
List<MyComparable> tempList =
Lists.newArrayList(ctx.sideInput(myView));
Collections.sort(tempList);
ctx.output(tempList);
}
}));
// Prepare it for side input as a list
final PCollectionView<List<MyComparable>> sortedView =
sortedSingleton.apply(View.asSingleton());
someOtherValue.apply(
ParDo.withSideInputs(sortedView).of(
new DoFn<A, B>() {
#Override
public void processElement(ProcessContext ctx) {
... ctx.sideInput(sortedView) ...
// do whatever you want with sorted list
}
}));
You may also be interested in the unsupported sorter contrib module for doing larger sorts using both memory and local disk.
We tried to do it the way Ken Knowles suggested. There's a problem for large datasets. If the tempList is large (so sort takes some measurable time as it's proportion to O(n * log n)) and if there are millions of elements in the "someOtherValue" PCollection, then we are unecessarily re-sorting the same list millions of times. We should be able to sort ONCE and FIRST, before passing the list to the someOtherValue.apply's DoFn.

Data reshape operation in Google Dataflow

I'm currently having a streaming pipeline processing events and pushing them to a BigQuery table named EventsTable:
TransactionID EventType
1 typeA
1 typeB
1 typeB
1 typeC
2 typeA
2 typeC
3 typeA
I want to add a branch to my processing pipeline and also "group" together transaction-related data into a TransactionsTable. Roughly, the value in the type columns in the TransactionsTable would be the count of the related eventType for a given transaction. With the previous example events, the output would look like this:
TransactionID typeA typeB typeC
1 1 2 1
2 1 0 1
3 1 0 0
The number of "type" columns would be equal to the number of different eventType that exists within the system.
I'm trying to see how I could do this with Dataflow, but cannot find any clean way to do it. I know that PCollections are immutable, so I cannot store the incoming data in a growing PCollection structure that would queue the incoming events up to the moment where the needed other elements are present and that I could write them to the second BigQuery table. Is there a sort of windowing function that would allow to do this with Dataflow (like queuing the events in a temporary windowed structure with a sort of expiry date or something)?
I could probably do something with batched jobs and PubSub, but this would be far more complex. On the other hand, I do understand that Dataflow is not meant to have ever growing data structures and that the data, once it goes in, has to go through the pipeline and exit (or be discarded). Am I missing something?
In general, the easiest way to do this kind of "aggregate data across many events" is to use a CombineFn which allows you to combine all of the values associated with a specific key. This is typically more efficient than just queueing the events, because it only needs to accumulate the result rather than accumulate all of the events.
For your specific case, you could create a custom CombineFn. The accumulator would be a Map<EventType, Long>. For instance:
public class TypedCountCombineFn
extends CombineFn<EventType, Map<EventType, Long>, TransactionRow> {
#Override
public Map<EventType, Long> createAccumulator() {
return new HashMap<>();
}
#Override
public Map<EventType, Long> addInput(
Map<EventType, Long> accum, EventType input) {
Long count = accum.get(input);
if (count == null) { count = 0; accum.put(input, count); }
count++;
return accum;
}
#Override
public Map<EventType, Long> mergeAccumulators(
Iterable<Map<EventType, Long>> accums) {
// TODO: Sum up all the counts for similar event types
}
#Override
public TransactionRow extractOutput(Map<EventType, Long> accum) {
// TODO: Build an output row from the per-event-type accumulator
}
}
Applying this CombineFn can be done globally (across all transactions in a PCollection) or per-key (such as per transaction ID):
PCollection<EventType> pc = ...;
// Globally
PCollection<TransactionRow> globalCounts = pc.apply(Combine.globally(new TypedCountCombineFn()));
// PerKey
PCollection<KV<Long, EventType>> keyedPC = pc.apply(WithKeys.of(new SerializableFunction<EventType, Long>() {
#Override
public long apply(EventType in) {
return in.getTransactionId();
}
});
PCollection<KV<Long, TransactionRow>> keyedCounts =
keyedPC.apply(Combine.perKey(new TypedCountCombineFn()));

Field and values connection in Storm

I have a fundamental question in storm. I can clearly understand some basic things. For example i have a main class with this code inside:
...
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(SENTENCE_SPOUT_ID, new SentenceSpout());
builder.setBolt(SPLIT_BOLT_ID, new SplitSentenceBolt()).shuffleGrouping(SENTENCE_SPOUT_ID);
builder.setBolt(COUNT_BOLT_ID, new WordCountBolt(), 3).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
builder.setBolt(REPORT_BOLT_ID, new ReportBolt()).globalGrouping(COUNT_BOLT_ID);
...
and i understand that 1st element(ex. "SENTENCE_SPOUT_ID") is the id of the bolt/spout in order to show the connection between 2 of them. The 2nd element(ex.new SentenceSpout()) specifies the spout or bold that we set in our topology. 3rd element marks the num of tasks that we need for this certain bolt spout.
Then we use .fieldsGrouping or .shuffleGrouping etc to specify the type of grouping and then between the parenthesis the 1st element is the connection with the bolt/spout that takes the input and the 2nd (ex. new Fields("word")) determines the fields that we will group by.
Inside the code of one of the bolts:
public class SplitSentenceBolt extends BaseRichBolt{
private OutputCollector collector;
public void prepare(Map config, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
public void execute(Tuple tuple) {
this.collector.emit(a, new Values(word, time, name));
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}
At this.collector.emit(a, new Values(word, time, name)); a is the stream_ID and values(...) are the elements of the tuple.
At declarer.declare(new Fields("word")); word must be one of the previous values. Am i right to all the previous?
So my question is: that in declarer.declare(new Fields("word")); word must be the same with word in this.collector.emit(a, new Values(word, time, name)); and the same with the word in builder.setBolt(COUNT_BOLT_ID, new WordCountBolt(), 3).fieldsGrouping(SPLIT_BOLT_ID, new Fields("word")); ????
The number and order of the fields you declare in declareOutputFields should match the fields you emit.
Two changes I'd recommend:
For now use the default stream by omitting the first parameter: collector.emit(new Values(word, time, name));
Make sure you declare the same number of fields: declarer.declare(new Fields("word", "time", "name"));

Resources