I'm trying to figure out how we "seed" the window state for some of our streaming dataflow jobs. The scenario: we have a stream of forum messages, and we want to emit a running count of messages for each topic over all time, so we have a streaming dataflow job with a global window and triggers that emit each time a record for a topic comes in. All good so far. But prior to the stream source we have a large file which we'd like to process to get our historical counts. Also, because topics live forever, we need the historical count to inform the outputs from the stream source. So we kind of need the same logic to run over the file, then start running over the stream source when the file is exhausted, all while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that: reads over the file until it's exhausted and then starts reading from the stream. Not appealing, because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow. Then have a streaming version of the logic start up that reads from both the state stream and the data stream and somehow combines the two. This seems to make some sense, but we're not sure how to ensure that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
Pipe the historical data into a stream and write a job that reads from both streams. Same problem as the second solution: we're not sure how to ensure one stream is "consumed" first.
EDIT: The latest option, and what we're going with, is to write the calculation job so that it doesn't matter at all what order the events arrive in; then we can just push the archive to the Pub/Sub topic and it will all work. That works in this case, but it obviously affects the downstream consumer (it needs to support either updates or retractions), so I'd be interested to know what other solutions people have for seeding their window states.
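For what it's worth, the order-independent version boils down to making the aggregation commutative and associative. A minimal sketch (plain Java with hypothetical names, not actual Beam code) of a per-topic count whose final state does not depend on arrival order:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a per-topic running count where the result is
// independent of the order in which events arrive, so archived and
// live events can be pushed into the same Pub/Sub topic in any order.
public class TopicCounts {
    private final Map<String, Long> counts = new HashMap<>();

    // Counting is commutative and associative, so merging events in
    // any order yields the same final state.
    public void add(String topic) {
        counts.merge(topic, 1L, Long::sum);
    }

    public long countFor(String topic) {
        return counts.getOrDefault(topic, 0L);
    }
}
```

The trade-off mentioned above still applies: because intermediate panes can be emitted in any order, the downstream consumer has to treat each output as an update (or handle retractions) rather than as a final value.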
You can do what you suggested in option 2: run two pipelines (in the same main), with the first one populating a Pub/Sub topic from the large file. This is similar to what the StreamingWordExtract example does.
Related
Using JSR-352 batch jobs with Java EE, I'm trying to process items in chunks from a partitioned source. On a retriable exception I want to be able to return to a past checkpoint, so that I can recover items already read from the source.
The nature of the source is such that, in a parallel environment, I cannot request the same chunk of items twice. The only feasible way to get the exact same items on a second read is to restart the whole job.
I need to write a generic ItemReader that can manage sources of this kind (so that it is reusable). Essentially, I'm looking for a clean design/implementation for such a reader.
To achieve the required ItemReader behavior, what I currently do is fetch the items at the beginning of readItem() if they have not yet been fetched for the current chunk, and then iterate through them one by one. To handle retriable exceptions, I'm trying to use the checkpoint properties of the ItemReader.
The problem I'm facing is that checkpoints are loaded in the open(...) method, before readItem(), and saved only after the chunk has succeeded. This makes it impossible to save all of the chunk's items into a valid checkpoint before I have to retry the chunk after a retriable exception.
My question: is there a way to augment the checkpoint behavior so that checkpoints are saved after the initial readItem()? Or do you know of another clean strategy that avoids additional listeners or userTransientData, which would make the reader hard to integrate into other batch job steps with the same read behavior?
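One common workaround for a non-replayable source is to carry the fetched chunk itself inside the checkpoint object, so a retry replays the buffered items instead of re-reading the source. A sketch in plain Java (hypothetical names, standing in for the javax.batch ItemReader contract rather than implementing it):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch of a chunk-buffering reader: the source cannot
// serve the same items twice, so each chunk is fetched eagerly and the
// buffer plus position are kept in the "checkpoint" state. On retry,
// open() receives the old checkpoint back and replays buffered items.
public class BufferingReader {
    public static class Checkpoint {
        List<String> buffer = new ArrayList<>();
        int position = 0; // next index within the buffer
    }

    private final Supplier<List<String>> fetchChunk; // non-replayable source
    private Checkpoint cp;

    public BufferingReader(Supplier<List<String>> fetchChunk) {
        this.fetchChunk = fetchChunk;
    }

    public void open(Checkpoint checkpoint) {
        this.cp = (checkpoint != null) ? checkpoint : new Checkpoint();
    }

    public String readItem() {
        if (cp.position >= cp.buffer.size()) {
            cp.buffer = fetchChunk.get(); // hit the real source only once per chunk
            cp.position = 0;
            if (cp.buffer.isEmpty()) return null; // source exhausted
        }
        return cp.buffer.get(cp.position++);
    }

    public Checkpoint checkpointInfo() {
        return cp; // contains the buffered items, so a retry can replay them
    }
}
```

This doesn't change *when* the batch runtime persists the checkpoint, but it does mean that whatever checkpoint the runtime does hand back on retry contains enough to re-serve the chunk without touching the source again.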
I'm about to start developing programs with Google Cloud Pub/Sub, and I just wanted to confirm one thing.
From the Beam documentation, data loss can only occur if data is declared late by Pub/Sub. Is it safe to assume that data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I've concluded that these metrics only really matter when custom windowing with event-time-based triggers is applied to the incoming data.
When you're working with streaming data, choosing a global window basically means that you are going to completely ignore event time. Instead, you will be taking snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer classify data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
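As a toy illustration of the point above (plain Java, not Beam), a processing-time trigger amounts to snapshotting accumulated state as elements arrive, with each event's own timestamp playing no role at all:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a global window with a processing-time trigger: state
// accumulates across all time, and a "pane" is emitted every `every`
// arrivals. Event timestamps are ignored entirely, so nothing is ever
// "late", "early", or "on time".
public class ProcessingTimeSnapshots {
    private final int every;
    private long count = 0;
    private final List<Long> panes = new ArrayList<>();

    public ProcessingTimeSnapshots(int every) {
        this.every = every;
    }

    // eventTimestamp is accepted but deliberately unused: only the
    // order of arrival (processing time) determines when panes fire.
    public void observe(long eventTimestamp) {
        count++;
        if (count % every == 0) {
            panes.add(count); // trigger fires: snapshot the running count
        }
    }

    public List<Long> panes() {
        return panes;
    }
}
```

Feeding in events with wildly out-of-order timestamps produces exactly the same panes as feeding them in sorted order, which is the defining property of processing-time windowing.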
Let's say we have log data with timestamps that can either be streamed into BigQuery or stored as files in Google Storage, but not streamed directly to the unbounded collection source types that Dataflow supports.
We want to analyse this data based on timestamp, either relatively or absolutely, e.g. "how many hits in the last 1 hour?" and "how many hits between 3pm and 4pm on 5th Feb 2018"?
Having read the documentation on windows and triggers, it's not clear how we would divide our incoming data into batches in a way that is supported by Dataflow if we want to have a large window - potentially we want to aggregate over the last day, 30 days, 3 months, etc.
For example, if our batched source is a BigQuery query, run every 5 mins, for the last 5 mins worth of data, will Dataflow keep the windows open between job runs, even though the data is arriving in 5 min chunks?
Similarly, if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved to the bucket, the same question applies - is the job stopped and started, and all knowledge of previous jobs discarded, or does the large window (e.g. up to a month) remain open for new events?
How do we change/modify this pipeline without disturbing the existing state?
Apologies if these are basic questions, even a link to some docs would be appreciated.
It sounds like you want arbitrary interactive aggregation queries on your data. Beam / Dataflow are not a good fit for this per se, however one of the most common use cases of Dataflow is to ingest data into BigQuery (e.g. from GCS files or from Pubsub), which is a very good fit for that.
A few more comments on your question:
it's not clear how we would divide our incoming data into batches
Windowing in Beam is simply a way to specify the aggregation scope in the time dimension. E.g. if you're using sliding windows of size 15 minutes every 5 minutes, then a record whose event-time timestamp is 14:03 counts towards aggregations in three windows: 13:50..14:05, 13:55..14:10, 14:00..14:15.
So: same way as you don't need to divide your incoming data into "keys" when grouping by a key (the data processing framework performs the group-by-key for you), you don't divide it into windows either (the framework performs group-by-window implicitly as part of every aggregating operation).
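To make the window-assignment arithmetic concrete, here is a small stand-alone computation (plain Java, not the Beam API) of which sliding windows a given timestamp falls into:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// For sliding windows of the given size and period, return the start
// times of every window containing the timestamp. With size=15 and
// period=5 (minutes), a timestamp of 14:03 falls into the windows
// starting at 13:50, 13:55, and 14:00.
public class SlidingWindowAssigner {
    public static List<Long> windowStartsFor(long ts, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % period); // latest start at or before ts
        for (long start = lastStart; start > ts - size; start -= period) {
            starts.add(start);
        }
        Collections.sort(starts);
        return starts;
    }
}
```

With minutes-since-midnight as the unit, 14:03 is 843, and the three returned starts (830, 835, 840) are exactly the 13:50..14:05, 13:55..14:10, and 14:00..14:15 windows from the example above.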
will Dataflow keep the windows open between job runs
I'm hoping this is addressed by the previous point, but to clarify more: No. Stopping a Dataflow job discards all of its state. However, you can "update" a job with new code (e.g. if you've fixed a bug or added an extra processing step) - in that case state is not discarded, but I think that's not what you're asking.
if the log files are rotated every 5 mins, and we start Dataflow as a new file is saved
It sounds like you want to ingest data continuously. The way to do that is to write a single continuously running streaming pipeline that ingests the data continuously, rather than to start a new pipeline every time new data arrives. In the case of files arriving into a bucket, you can use TextIO.read().watchForNewFiles() if you're reading text files, or its various analogues if you're reading some other kind of files (most general is FileIO.matchAll().continuously()).
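Conceptually, the continuous-ingestion behavior of watchForNewFiles() is a repeated poll over the bucket or directory that emits only paths not seen in a previous poll. A plain-Java sketch of that idea (hypothetical names; local filesystem standing in for GCS):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Conceptual model of TextIO.read().watchForNewFiles(): each poll lists
// the directory and returns only the files not seen by an earlier poll,
// so a long-running pipeline can keep ingesting as files arrive.
public class NewFileWatcher {
    private final Set<Path> seen = new HashSet<>();

    public List<Path> poll(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.filter(Files::isRegularFile)
                        .filter(seen::add) // add() returns false if already seen
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

The actual Beam transform additionally handles termination conditions and emits the matched files into the pipeline, but the "one continuously running job discovers new input itself" shape is the same.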
To our streaming pipeline, we want to submit unique GCS files, each containing information about multiple events, each event carrying a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (possibly as multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, it would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. DRAIN mode tries to achieve this, but as mentioned, if the entire pipeline is broken and draining itself cannot make progress, there is no way to know where in the source the restart should begin.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without storing these checkpoints externally to the application. If there is an alternative way of providing these guarantees from Dataflow, that would be great. The idea behind broadening this question is to see whether we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
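The counting idea above can be sketched like this (plain Java with hypothetical names, standing in for a Beam stateful DoFn keyed by file_id; the expected per-file counts play the role of the side input):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the stateful-DoFn idea: for each file_id the expected
// number of elements is known up front; a per-key counter increments as
// elements are processed, and completion is reported exactly when the
// count reaches the expected total for that file.
public class FileCompletionTracker {
    private final Map<String, Long> expected;             // side-input stand-in
    private final Map<String, Long> processed = new HashMap<>(); // per-key state

    public FileCompletionTracker(Map<String, Long> expectedPerFile) {
        this.expected = expectedPerFile;
    }

    // Returns true exactly once per file: when its last element arrives.
    public boolean process(String fileId) {
        long n = processed.merge(fileId, 1L, Long::sum);
        return n == expected.getOrDefault(fileId, Long.MAX_VALUE);
    }
}
```

In the real pipeline the per-key counter would live in Beam's state API (e.g. a ValueState per key), so the same logic survives worker failures and rebalancing; the comparison against the expected total is what substitutes for the per-key watermark that Beam doesn't provide.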
We're trying to use dataflow's processing-time independence to start up a new streaming job and replay all of our data into it via Pub/Sub but are running into the following problem:
The first stage of the pipeline is a groupby on a transaction id, with a session window of 10s, discarding fired panes and no allowed lateness. So if we don't specify the timestampLabel of our replay Pub/Sub topic, then when we replay into Pub/Sub all of the event timestamps are the same, and the groupby tries to group all of our archived data into transaction ids for all time. No good.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay say 1d at a time into the pub/sub topic then it works for the first day's worth of events, but then as soon as those are exhausted the data watermark for the replay pub/sub somehow jumps forward to the current time, and all subsequent replayed days are dropped as late data. I don't really understand why that happens, as it seems to violate the idea that dataflow logic is independent of the processing time.
If we set the timestampLabel to be the actual event timestamp from the archived data, and replay all of it into the pub/sub topic, and then start the streaming job to consume it, the data watermark never seems to advance, and nothing ever seems to come out of the groupby. I don't really understand what's going on with that either.
Your approaches #2 and #3 are suffering from different issues:
Approach #3 (write all data, then start consuming): since data is written to the Pub/Sub topic out of order, the watermark really cannot advance until all (or most) of the data is consumed. The watermark is a soft guarantee that further items you receive are unlikely to have an event time earlier than it, but due to out-of-order publishing there is no correspondence whatsoever between publish time and event time. So your pipeline is effectively stuck until it finishes processing all of this data.
Approach #2 technically suffers from the same problem within each particular day, but I suppose the amount of data within one day is not that large, so the pipeline is able to process it. However, after that, the Pub/Sub channel stays empty for a long time, and in that case the current implementation of PubsubIO advances the watermark to real time; that's why further days of data are declared late. The documentation explains this in more detail.
In general, quickly catching up with a large backlog, e.g. by using historic data to "seed" the pipeline and then continuing to stream in new data, is an important use case that we currently don't support well.
Meanwhile I have a couple of recommendations for you:
(better) Use a variation on approach #2, but try timing it against the streaming pipeline so that the pubsub channel doesn't stay empty.
Use approach #3, but with more workers and more disk per worker (your current job appears to be using autoscaling with max 8 workers - try something much larger, like 100? It will downscale after it catches up)