Data reshape operation in Google Dataflow - google-cloud-dataflow

I currently have a streaming pipeline that processes events and pushes them to a BigQuery table named EventsTable:
TransactionID  EventType
1              typeA
1              typeB
1              typeB
1              typeC
2              typeA
2              typeC
3              typeA
I want to add a branch to my processing pipeline and also "group" transaction-related data together into a TransactionsTable. Roughly, the value in each "type" column of the TransactionsTable would be the count of the corresponding eventType for a given transaction. With the previous example events, the output would look like this:
TransactionID  typeA  typeB  typeC
1              1      2      1
2              1      0      1
3              1      0      0
The number of "type" columns would be equal to the number of different eventType that exists within the system.
I'm trying to see how I could do this with Dataflow, but cannot find any clean way to do it. I know that PCollections are immutable, so I cannot store the incoming data in a growing PCollection structure that would queue the incoming events until the other needed elements are present, at which point I could write them to the second BigQuery table. Is there a windowing function that would allow me to do this with Dataflow (like queuing the events in a temporary windowed structure with a sort of expiry date)?
I could probably do something with batched jobs and PubSub, but this would be far more complex. On the other hand, I do understand that Dataflow is not meant to hold ever-growing data structures, and that the data, once it goes in, has to go through the pipeline and exit (or be discarded). Am I missing something?

In general, the easiest way to do this kind of "aggregate data across many events" operation is to use a CombineFn, which allows you to combine all of the values associated with a specific key. This is typically more efficient than queueing the events, because it only needs to accumulate the result rather than accumulate all of the events.
For your specific case, you could create a custom CombineFn. The accumulator would be a Map<EventType, Long>. For instance:
public class TypedCountCombineFn
    extends CombineFn<EventType, Map<EventType, Long>, TransactionRow> {

  @Override
  public Map<EventType, Long> createAccumulator() {
    return new HashMap<>();
  }

  @Override
  public Map<EventType, Long> addInput(
      Map<EventType, Long> accum, EventType input) {
    Long count = accum.get(input);
    accum.put(input, count == null ? 1L : count + 1);
    return accum;
  }

  @Override
  public Map<EventType, Long> mergeAccumulators(
      Iterable<Map<EventType, Long>> accums) {
    // Sum up all the counts for similar event types
    Map<EventType, Long> merged = createAccumulator();
    for (Map<EventType, Long> accum : accums) {
      for (Map.Entry<EventType, Long> entry : accum.entrySet()) {
        Long count = merged.get(entry.getKey());
        merged.put(entry.getKey(),
            count == null ? entry.getValue() : count + entry.getValue());
      }
    }
    return merged;
  }

  @Override
  public TransactionRow extractOutput(Map<EventType, Long> accum) {
    // TODO: Build an output row from the per-event-type accumulator
  }
}
Applying this CombineFn can be done globally (across all transactions in a PCollection) or per-key (such as per transaction ID):
PCollection<EventType> pc = ...;

// Globally
PCollection<TransactionRow> globalCounts =
    pc.apply(Combine.globally(new TypedCountCombineFn()));

// Per key
PCollection<KV<Long, EventType>> keyedPC =
    pc.apply(WithKeys.of(new SerializableFunction<EventType, Long>() {
      @Override
      public Long apply(EventType in) {
        return in.getTransactionId();
      }
    }));
PCollection<KV<Long, TransactionRow>> keyedCounts =
    keyedPC.apply(Combine.perKey(new TypedCountCombineFn()));
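To land these results in the TransactionsTable, the combined output still needs to be converted to TableRows and written with BigQueryIO. The following is only a rough sketch: the TransactionRow accessors, the table spec, and the transactionsSchema variable are assumptions, not part of the answer above.

// Hypothetical conversion of the combined counts into BigQuery rows;
// TransactionRow and its per-type accessors are assumed names.
PCollection<TableRow> bqRows = keyedCounts.apply(
    ParDo.of(new DoFn<KV<Long, TransactionRow>, TableRow>() {
      @Override
      public void processElement(ProcessContext c) {
        TransactionRow counts = c.element().getValue();
        c.output(new TableRow()
            .set("TransactionID", c.element().getKey())
            .set("typeA", counts.getTypeACount())   // assumed accessor
            .set("typeB", counts.getTypeBCount())   // assumed accessor
            .set("typeC", counts.getTypeCCount())); // assumed accessor
      }
    }));

// Placeholder table spec; transactionsSchema is a TableSchema defined elsewhere.
bqRows.apply(BigQueryIO.Write
    .to("my-project:my_dataset.TransactionsTable")
    .withSchema(transactionsSchema));

Note that in a streaming pipeline the Combine.perKey step also needs a non-global windowing or triggering strategy so that results are actually emitted; that choice is independent of the write itself.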

Related

Flink Count of Events using metric

I have a topic in Kafka where I am getting multiple types of events in JSON format. I have created a StreamingFileSink to write these events to S3 with bucketing.
FlinkKafkaConsumer errorTopicConsumer = new FlinkKafkaConsumer(ERROR_KAFKA_TOPICS,
    new SimpleStringSchema(),
    properties);

final StreamingFileSink<Object> errorSink = StreamingFileSink
    .forRowFormat(new Path(outputPath + "/error"), new SimpleStringEncoder<>("UTF-8"))
    .withBucketAssigner(new EventTimeBucketAssignerJson())
    .build();

env.addSource(errorTopicConsumer)
    .name("error_source")
    .setParallelism(1)
    .addSink(errorSink)
    .name("error_sink").setParallelism(1);

public class EventTimeBucketAssignerJson implements BucketAssigner<Object, String> {

  @Override
  public String getBucketId(Object record, Context context) {
    StringBuffer partitionString = new StringBuffer();
    Tuple3<String, Long, String> tuple3 = (Tuple3<String, Long, String>) record;
    try {
      partitionString.append("event_name=")
          .append(tuple3.f0).append("/");
      String timePartition = TimeUtils.getEventTimeDayPartition(tuple3.f1);
      partitionString.append(timePartition);
    } catch (Exception e) {
      partitionString.append("year=").append(Constants.DEFAULT_YEAR).append("/")
          .append("month=").append(Constants.DEFAULT_MONTH).append("/")
          .append("day=").append(Constants.DEFAULT_DAY);
    }
    return partitionString.toString();
  }

  @Override
  public SimpleVersionedSerializer<String> getSerializer() {
    return SimpleVersionedStringSerializer.INSTANCE;
  }
}
Now I want to publish the hourly count of each event as a metric to Prometheus and build a Grafana dashboard on top of it.
So please help me: how can I achieve an hourly count for each event using Flink metrics and publish it to Prometheus?
Thanks
Normally, this is done by simply creating a counter for requests and then using the rate() function in Prometheus; this will give you the rate of requests over the given time window.
If you, however, want to do this on your own for some reason, then you can do something similar to what has been done in org.apache.kafka.common.metrics.stats.Rate. In that case you would need to gather a list of samples with the time at which they were collected, along with the window size you want to use for the rate calculation; then you could simply do the calculation, i.e. remove the samples that went out of scope and have expired, and count how many samples are left in the window.
You could then set a Gauge to the calculated value.
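As a starting point for the counter approach, here is a minimal sketch (not your exact pipeline): a RichFlatMapFunction that registers a Flink Counter and increments it for every event passing through, so that the Prometheus reporter can expose it for scraping. The Tuple3 event type and the metric name are assumptions.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Counter;
import org.apache.flink.util.Collector;

public class EventCountingFunction
    extends RichFlatMapFunction<Tuple3<String, Long, String>, Tuple3<String, Long, String>> {

  private transient Counter eventCounter;

  @Override
  public void open(Configuration parameters) {
    // Register a counter under this operator's metric group; one counter per
    // event type could be registered the same way if per-type counts are needed.
    eventCounter = getRuntimeContext().getMetricGroup().counter("events_processed");
  }

  @Override
  public void flatMap(Tuple3<String, Long, String> event,
                      Collector<Tuple3<String, Long, String>> out) {
    eventCounter.inc();
    out.collect(event);
  }
}

With the PrometheusReporter configured in flink-conf.yaml, an hourly count can then be computed in PromQL with something like increase(...events_processed[1h]); the exact metric name depends on the configured scope, since Flink prefixes it with job/task/operator information.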

Apache Flink: Stream Join Window is not triggered

I'm trying to join two streams in Apache Flink to get some results.
The current state of my project is that I am fetching Twitter data and mapping it into a 2-tuple, where the language of the user and the sum of tweets in a defined time window are saved.
I do this both for the number of tweets per language and for the number of retweets per language. The tweet/retweet aggregation works fine in other processes.
I now want to get the percentage of the number of retweets to the number of all tweets in a time window.
Therefore I use the following code:
Time windowSize = Time.seconds(15);

// Sum up tweets per language
DataStream<Tuple2<String, Integer>> tweetsLangSum = tweets
    .flatMap(new TweetLangFlatMap())
    .keyBy(0)
    .timeWindow(windowSize)
    .sum(1);

// Get retweets out of all tweets per language
DataStream<Tuple2<String, Integer>> retweetsLangMap = tweets
    .keyBy(new KeyByTweetPostId())
    .flatMap(new RetweetLangFlatMap());

// Sum up retweets per language
DataStream<Tuple2<String, Integer>> retweetsLangSum = retweetsLangMap
    .keyBy(0)
    .timeWindow(windowSize)
    .sum(1);

tweetsLangSum.join(retweetsLangSum)
    .where(new KeySelector<Tuple2<String, Integer>, String>() {
      @Override
      public String getKey(Tuple2<String, Integer> tweet) throws Exception {
        return tweet.f0;
      }
    })
    .equalTo(new KeySelector<Tuple2<String, Integer>, String>() {
      @Override
      public String getKey(Tuple2<String, Integer> tweet) throws Exception {
        return tweet.f0;
      }
    })
    .window(TumblingEventTimeWindows.of(windowSize))
    .apply(new JoinFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple4<String, Integer, Integer, Double>>() {
      @Override
      public Tuple4<String, Integer, Integer, Double> join(Tuple2<String, Integer> in1, Tuple2<String, Integer> in2) throws Exception {
        String lang = in1.f0;
        Double percentage = (double) in1.f1 / in2.f1;
        return new Tuple4<>(in1.f0, in1.f1, in2.f1, percentage);
      }
    })
    .print();
When I print tweetsLangSum or retweetsLangSum, the output seems to be fine. My problem is that I never get any output from the join. Does anyone have an idea why? Or am I using the window function wrong in the first aggregation step when it comes to the join?
This might be caused by a mix of different time semantics. The KeyedStream.timeWindow() method is a shortcut that creates a window operator based on the configured time characteristic, i.e., an event-time window if event time is enabled or a processing-time window otherwise. For the join, you explicitly define an event-time window.
Did you enable event-time processing?
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
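If you switch to event time, the tweets stream also needs timestamps and watermarks assigned, otherwise the event-time join window will never fire. A minimal sketch, assuming a hypothetical Tweet type with an epoch-millis getTimestamp() accessor:

// Uses org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
DataStream<Tweet> tweetsWithTimestamps = tweets
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Tweet>(Time.seconds(5)) {
          @Override
          public long extractTimestamp(Tweet tweet) {
            return tweet.getTimestamp(); // assumed epoch-millis accessor
          }
        });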

Creating Custom Windowing Function in Apache Beam

I have a Beam pipeline that starts off by reading multiple text files where each line in a file represents a row that gets inserted into Bigtable later in the pipeline. The scenario requires confirming that the count of rows extracted from each file and the count of rows later inserted into Bigtable match. For this I am planning to develop a custom windowing strategy so that lines from a single file get assigned to a single window, based on the file name as the key that is passed to the windowing function.
Is there any code sample for creating custom Windowing functions?
Although I changed my strategy for confirming the inserted number of rows, for anyone who is interested in windowing elements read from a batch source (e.g. FileIO) in a batch job, here's the code for creating a custom windowing strategy:
public class FileWindows extends PartitioningWindowFn<Object, IntervalWindow> {

  private static final long serialVersionUID = -476922142925927415L;
  private static final Logger LOG = LoggerFactory.getLogger(FileWindows.class);

  @Override
  public IntervalWindow assignWindow(Instant timestamp) {
    Instant end = new Instant(timestamp.getMillis() + 1);
    IntervalWindow interval = new IntervalWindow(timestamp, end);
    LOG.info("FileWindows >> assignWindow(): Window assigned with Start: {}, End: {}", timestamp, end);
    return interval;
  }

  @Override
  public boolean isCompatible(WindowFn<?, ?> other) {
    return this.equals(other);
  }

  @Override
  public void verifyCompatibility(WindowFn<?, ?> other) throws IncompatibleWindowException {
    if (!this.isCompatible(other)) {
      throw new IncompatibleWindowException(other, String.format("Only %s objects are compatible.", FileWindows.class.getSimpleName()));
    }
  }

  @Override
  public Coder<IntervalWindow> windowCoder() {
    return IntervalWindow.getCoder();
  }
}
and then it can be used in the pipeline as below:
p
  .apply("Assign_Timestamp_to_Each_Message", ParDo.of(new AssignTimestampFn()))
  .apply("Assign_Window_to_Each_Message", Window.<KV<String,String>>into(new FileWindows())
      .withAllowedLateness(Duration.standardMinutes(1))
      .discardingFiredPanes());
Please keep in mind that you will need to write the AssignTimestampFn() so that each message carries a timestamp.
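For reference, a minimal sketch of what such an AssignTimestampFn could look like; the KV<String, String> element type and the use of processing time as the timestamp are assumptions, and whatever timestamp you emit is what FileWindows will partition on:

public class AssignTimestampFn extends DoFn<KV<String, String>, KV<String, String>> {

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Re-emit each element with an explicit timestamp so the custom
    // FileWindows WindowFn has something to partition on.
    c.outputWithTimestamp(c.element(), Instant.now());
  }
}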

How do I make View's asList() sortable in Google Dataflow SDK?

We have a problem making the asList() method sortable.
We thought we could do this by just extending the View class and overriding the asList method, but realized that the View class has a private constructor, so we could not do this.
Our other attempt was to fork the Google Dataflow code on GitHub and modify the PCollectionViews class to return a sorted list by using the Collections.sort method, as shown in the code snippet below:
@Override
protected List<T> fromElements(Iterable<WindowedValue<T>> contents) {
  Iterable<T> itr = Iterables.transform(
      contents,
      new Function<WindowedValue<T>, T>() {
        @SuppressWarnings("unchecked")
        @Override
        public T apply(WindowedValue<T> input) {
          return input.getValue();
        }
      });
  LOG.info("#### About to start sorting the list !");
  List<T> tempList = new ArrayList<T>();
  for (T element : itr) {
    tempList.add(element);
  }
  Collections.sort((List<? extends Comparable>) tempList);
  LOG.info("##### List should now be sorted !");
  return ImmutableList.copyOf(tempList);
}
Note that we are now sorting the list.
This seemed to work when run with the DirectPipelineRunner, but when we tried the BlockingDataflowPipelineRunner, it didn't seem like the code change was being executed.
Note: We actually recompiled the Dataflow SDK and used it in our project, but this did not work.
How can we achieve this (a sorted list from the asList method call)?
The classes in PCollectionViews are not intended for extension. Only the primitive view types provided by View.asSingleton, View.asIterable, View.asList, View.asMap, and View.asMultimap are supported.
To obtain a sorted list from a PCollectionView, you'll need to sort it after you have read it. The following code demonstrates the pattern.
// Assume you have some PCollection
PCollection<MyComparable> myPC = ...;

// Prepare it for side input as a list
final PCollectionView<List<MyComparable>> myView = myPC.apply(View.asList());

// Side input the list and sort it
someOtherValue.apply(
    ParDo.withSideInputs(myView).of(
        new DoFn<A, B>() {
          @Override
          public void processElement(ProcessContext ctx) {
            List<MyComparable> tempList =
                Lists.newArrayList(ctx.sideInput(myView));
            Collections.sort(tempList);
            // do whatever you want with the sorted list
          }
        }));
Of course, you may not want to sort it repeatedly, depending on the cost of sorting vs the cost of materializing it as a new PCollection, so you can output this value and read it as a new side input without difficulty:
// Side input the list, sort it, and put it in a PCollection
PCollection<List<MyComparable>> sortedSingleton = Create.<Void>of(null).apply(
    ParDo.withSideInputs(myView).of(
        new DoFn<Void, List<MyComparable>>() {
          @Override
          public void processElement(ProcessContext ctx) {
            List<MyComparable> tempList =
                Lists.newArrayList(ctx.sideInput(myView));
            Collections.sort(tempList);
            ctx.output(tempList);
          }
        }));

// Prepare it for side input as a singleton
final PCollectionView<List<MyComparable>> sortedView =
    sortedSingleton.apply(View.asSingleton());

someOtherValue.apply(
    ParDo.withSideInputs(sortedView).of(
        new DoFn<A, B>() {
          @Override
          public void processElement(ProcessContext ctx) {
            ... ctx.sideInput(sortedView) ...
            // do whatever you want with the sorted list
          }
        }));
You may also be interested in the unsupported sorter contrib module for doing larger sorts using both memory and local disk.
We tried to do it the way Ken Knowles suggested. There's a problem for large datasets. If the tempList is large (so the sort takes measurable time, since it is O(n * log n)) and there are millions of elements in the "someOtherValue" PCollection, then we are unnecessarily re-sorting the same list millions of times. We should be able to sort ONCE and FIRST, before passing the list to someOtherValue.apply's DoFn.

Is there a limit on the number of side outputs in Google Cloud Dataflow?

We have a Cloud Dataflow Job that takes in a BigQuery table, transforms it and then writes each record out to a different table depending on the month/year in the timestamp for that record. So when we run our job over a table with 12 months of data there should be 12 output tables. The first month will be the main output and the other 11 months will be the side outputs.
We have found that a job will fail when we run it over 10 or more months (9 or more side outputs).
Is this a limit on Cloud Dataflow or is it a bug?
I noticed in the execution graph, when it was running with more than 8 side outputs, that some of the outputs said "running" but they didn't seem to be writing any records.
Here are some of our job ids:
2015-06-14_23_58_06-14457541029573485807 (8 side outputs - passed)
2015-06-14_23_48_43-15277609445992188388 (9 side outputs - failed)
2015-06-14_23_11_46-10500077558949649888 (7 side outputs - passed)
2015-06-14_22_38_48-1428211312699949403 (3 side outputs - passed)
2015-06-14_21_44_27-16273252623089185131 (11 side outputs - failed)
This is the code that processes the data. There is no caching involved. (TressOutputManager only holds a cache of TupleTag<TableRow>)
public class TressDenormalizationDoFn extends DoFn<TableRow, TableRow> {

  @Inject
  @Named("tress.mappers")
  private Set<CPTMapper> mappers;

  @Inject
  private TressOutputManager tuples;

  @Override
  public void processElement(ProcessContext c) throws Exception {
    TableRow row = c.element().clone();
    for (CPTMapper mapper : mappers) {
      String mapped = mapper.map((String) row.get("event"));
      if (mapped != null) {
        row.set(mapper.getId(), mapped);
      }
    }
    // places the record in the correct month based on the timestamp
    String timeStamp = (String) row.get("time_local");
    if (timeStamp != null) {
      timeStamp = timeStamp.substring(0, 7).replaceAll("-", "_");
      if (tuples.isMainOutput(timeStamp)) {
        c.output(row);
      } else {
        c.sideOutput(tuples.getTuple(timeStamp), row);
      }
    }
  }
}