Marking a key as complete in a GroupBy | Dataflow Streaming Pipeline - google-cloud-dataflow

To our Streaming pipeline, we want to submit unique GCS files, each file containing multiple event information, each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker to device_id affinity (more background on why we want to do it is in this another SO question. Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark
file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example,
this would allow us to trigger per-hour or per-day completeness which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based Streaming model seems to fit better. We would still like to explore Dataflow if possible but it seems that we wont be able to achieve it without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks

This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.

Related

Flink: Are multiple execution environments supported?

Is it OK to create multiple ExecutionEnvironments in a Flink program? More specifically, create one ExecutionEnvironment and one StreamExecutionEnvironment in the same main method, so that one can work with batch and later transit to streaming without problems?
I guess that the other possibility would be to split the program in two, but for my testing purposes this seems better. Is Flink prepared for this scenario?
All seems to work fine, except I am currently having problems with no output when joining two streams on a common index and using window(TumblingProcessingTimeWindows.of(Time.seconds(1))). I have already called setStreamTimeCharacteristic(TimeCharacteristic.EventTime) on the StreamExecutionEnvironment and even tried assigning custom watermarks on both joined streams with assignTimestampsAndWatermarks where I just return System.currentTimeMillis() as the timestamp of each record.
Since it finishes really quickly, both streams should fit in that 1-second window, no? Both streams print just fine right before the join. I can try supplying the important parts of code (it's rather lengthy) if anyone's interested.
UPDATE: OK, so I separated the two environments (put each inside a main method) and then I simply call the first main from the second main method. The described problem no longer occurs.
No, this not supported, and won't really work.
At least up through Flink 1.9, a given application must either have an ExecutionEnvironment and use the DataSet API, or a StreamExecutionEnvironment and use the DataStream API. You cannot mix the two in one application.
There is ongoing work to more completely unify batch and streaming, but that's a work in progress. To understand this better you might want to watch the video for this recent Flink Forward talk when it becomes available.

How are Dataflow bundles created after GroupBy/Combine?

Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a group by/combine operation. I would expect the bundle to be created for every window after the combine. But apparently, a bundle can contain more than 2 occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent of the runner?
Is deduplication an option? if so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows belong to single elements; but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements of the same key in different windows would be processed in the same bundle.
*- when can Data freshness jump suddenly? A stream with a single element with a very old timestamp, and that is very slow to process may hold the watermark for a long time. Once this element is processed, the watermark may jump a lot, to the next oldest element (Check out this lecture on watermarks ; )).

Kafka KTables missing data when joining KStream to KTable

Has anyone posted a response to this problem? There have been other posts with no answers. Our situation is that we are pushing messages onto a topic that is backing a KTable in the first step of our stream process. We are then pulling a small amount of data from those messages and passing them along. We are doing multiple computations on that smaller amount of data for grouping and aggregation. At the end of the streaming process, we simply want to join back to that original topic via a KTable to pick up the full message content again. The results of the join are only a subset of the data because it can not find the entries in the KTable.
This is just the beginning of the problem. In another case, we are using KTables as indexes for lookups meant to enrich the data coming in. Think of these lookups as identifying whether we have seen a specific pattern in the streaming message before. If we have seen the pattern we want to tag it with an ID (used for grouping) pulled from an existing KTable. If we have not seen the pattern before we would assign it an ID and place it back into the KTable to be used to tag future messages. What we have found is that there is no guaranty that the information will be present in the KTable for future messages. This lack of guaranty seems to make KTables useless. We can not figure out why there is a very little discussion of this on the forums.
Finally, none of this seemed to be a problem when running with a single instance of the streams application. However, as soon as our data got large and we were forced to have 10 instances of the app, everything broke. As well, there is no way that we could use things like GlobalKTables because there is too much data to be loaded into a single machine's memory.
What can we do? We are currently planning to abandon KTables all together and use something like Hazelcast to store the lookup data. Should we just move to Hazelcast Jet and drop Kafka streams all together?
Adding flow:
Kafka data flow
I'm sorry for this non-answer answer, but I don't have enough points to comment...
The behavior you describe is definitely inconsistent with my understanding and experience with streams. If you can share the topology (or a simplified one) that is causing the problem, there might be a simple mistake we can point out.
Once we get more info, I can edit this into a "real" answer...
Thanks!
-John

In Dataflow with PubsubIO is there any possibility of late data in a global window?

I was going to start developing programs in Google cloud Pubsub. Just wanted to confirm this once.
From the beam documentation the data loss can only occur if data was declared late by Pubsub. Is it safe to assume that the data will always be delivered without any message drops (Late data) when using a global window?
From the concepts of watermark and lateness I have come to a conclusion that these metrics are critical in conditions where custom windowing is applied over the data being received with event based triggers.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as they arrive) using triggers. Therefore, you can no longer define data as "late" (neither "early" or "on time" for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.

Use data from different pipeline

I have two pipelines, "gameEngineEvents" and "userEvents" that consume from equivalent pubsub topics. A userEvent might have one or many gameEngineEvents.
When a gameEngineEvent happens I want to check if there is a userEvent that has a reference to that gameEngineEvent, run some logic and then publish a new message to a third pubsub topic.
So, is it possible to do something like this only in dataflow?
This is certainly possible. What you will want to use here is a CoGroupByKey which will shuffle the "gameEngineEvent"s with a certain key to the same machine as the "userEvent" with that key in order to process them together a perform a certain logic on them. You will end up with 2 iterables for that key which you can use in your processing.
More info on the specifics of CoGroupByKey can be found here.
Since these are PubSub topics and you are probably coping with an unbounded source you'll probably also want to look at Windowing, in order to set borders on which events you want to perform the processing on.

Resources