I am processing a windowed stream of PubSub messages and I would like to archive them to GCS. I'd like the archived files to have a prefix that's derived from the window timestamp (something like gs://bucket/messages/2015/01/messages-2015-01-01.json). Is this possible with TextIO.Write, or do I need to implement my own FileBasedSink?
This can be done with the recently added feature for windowed writes in TextIO. Please see the documentation for TextIO, in particular see withWindowedWrites and to(FilenamePolicy). This feature is also present in AvroIO.
Are you simply looking for the function TextIO.Write.Bound<String>.withSuffix() or TextIO.Write.Bound<String>.to()? It seems these would allow you to provide a suffix or prefix for the output filename.
Right now, TextIO.Write does not support operation in streaming mode – writing to GCS is tricky, e.g., because you can't write to a file concurrently from multiple workers and you can't append to files once they close. We have plans to add streaming support to TextIO.
You'll get the best support for this today using BigQuery rather than GCS – because we already support BigQuery writes during streaming, and you choose which table you write to based on the window name, and BigQuery supports writes from many different workers at once.
TextIO.Write ought to work. No need for custom filesink.
In your case, you want to write your PubSub messages to an output text file - not locally, but on remote GS. You ought to be able to use:
PCollection .apply.TextIO.Write().to(
Since you are processing a stream of PubSub messages, your window is unbounded and your PubSub data source already provides a timestamp for each element in the PCollection.
If you wish to assign a timestamp, your ParDo transform needs to use a DoFn that outputs elements using ProcessContext.outputWithTimestamp().
In summary, you can use TextIO.Write aftre ensuring the elements in your PCollection are output with timestamp.
Related
Setup:
read from pubsub -> window of 30s -> group by user -> combine -> write to cloud datastore
Problem:
I'm seeing DataStoreIO writer errors as objects with similar keys are present in the same transaction.
Question:
I want to understand how my pipeline combines results into bundles after a group by/combine operation. I would expect the bundle to be created for every window after the combine. But apparently, a bundle can contain more than 2 occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent of the runner?
Is deduplication an option? if so, how would I best approach that?
Note that I'm not looking for a replacement for the datastore writer at the end of the pipeline, I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the User ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will have one separate element for every pair of user+window.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to use the window as part of the key you use to insert to datastore. Does that help / make sense?
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows belong to single elements; but a bundle may contain elements from multiple windows. As an example, if the data freshness metric makes a big jump*, a number of windows may be triggered - and elements of the same key in different windows would be processed in the same bundle.
*- when can Data freshness jump suddenly? A stream with a single element with a very old timestamp, and that is very slow to process may hold the watermark for a long time. Once this element is processed, the watermark may jump a lot, to the next oldest element (Check out this lecture on watermarks ; )).
I have two pipelines, "gameEngineEvents" and "userEvents" that consume from equivalent pubsub topics. A userEvent might have one or many gameEngineEvents.
When a gameEngineEvent happens I want to check if there is a userEvent that has a reference to that gameEngineEvent, run some logic and then publish a new message to a third pubsub topic.
So, is it possible to do something like this only in dataflow?
This is certainly possible. What you will want to use here is a CoGroupByKey which will shuffle the "gameEngineEvent"s with a certain key to the same machine as the "userEvent" with that key in order to process them together a perform a certain logic on them. You will end up with 2 iterables for that key which you can use in your processing.
More info on the specifics of CoGroupByKey can be found here.
Since these are PubSub topics and you are probably coping with an unbounded source you'll probably also want to look at Windowing, in order to set borders on which events you want to perform the processing on.
To our Streaming pipeline, we want to submit unique GCS files, each file containing multiple event information, each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker to device_id affinity (more background on why we want to do it is in this another SO question. Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is because we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by the device_id and then grouped at the end by the file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly_once guarantees which means all the events will be eventually processed but is there a way to set a deterministic trigger to say all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark
file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example,
this would allow us to trigger per-hour or per-day completeness which are critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once, there will be cases where the entire pipeline needs to be restarted since something went horribly wrong - in those cases, it is almost impossible to restart from the correct input marker since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark since its micro-batch based Streaming model seems to fit better. We would still like to explore Dataflow if possible but it seems that we wont be able to achieve it without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow have a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
I'm trying to figure out how we "seed" the window state for some of our streaming dataflow jobs. Scenario is we have a stream of forum messages, we want to emit a running count of messages for each topic for all time, so we have a streaming dataflow job with a global window and triggers to emit each time a record for a topic comes in. All good so far. But prior to the stream source, we have a large file which we'd like to process to get our historical counts, also, because topics live forever, we need the historical count to inform the outputs from the stream source, so we kind've need the same logic to run over the file, then start running over the stream source when the file is exhausted, while keeping the window state.
Current ideas:
Write a custom unbounded source that does just that. Reads over the file until it's exhausted and then starts reading from the stream. Not much fun because writing custom sources is not much fun.
Run the logic in batch mode over the file, and as the last step emit the state to a stream sink somehow, then have a streaming version of the logic start up that reads from both the state stream and the data stream, and somehow combines the two. This seems to make some sense, but not sure how to make sure that the streaming job reads everything from the state source, to initialise, before reading from the data stream.
Pipe the historical data into a stream, write a job that reads from both the streams. Same problems as the second solution, not sure how to make sure one stream is "consumed" first.
EDIT: Latest option, and what we're going with, is to write the calculation job such that it doesn't matter at all what order the events arrive in, so we'll just push the archive to the pub/sub topic and it will all work. That works in this case, but obviously it affects the downstream consumer (need to either support updates or retractions) so I'd be interested to know what other solutions people have for seeding their window states.
You can do what you suggested in bullet point 2 --- run two pipelines (in the same main), with the first that populates a pubsub topic from the large file. This is similar to what the StreamingWordExtract example does.
Is it possible to move a file in GCS after the dataflow pipeline has finished running? If so, how? Should be the last .apply? I can't imagine that being the case.
The case here is that we are importing a lot of .csv's from a client. We need to keep those CSV's indefinitely, so we either need to "mark the CSV as being already handled", or alternatively, move them out of the initial folder that TextIO uses to find the csv's. The only thing I can currently think of is storing the file name (I'm not sure how I'd get this even, I'm a DF newbie) in BigQuery perhaps, and then excluding files that have already been stored from the execution pipeline somehow? But there has to be a better approach.
Is this possible? What should I check out?
Thanks for any help!
You can try using BlockingDataflowPipelineRunner and run arbitrary logic in your main program after p.run() (it will wait for the pipeline to finish).
See Specifying Execution Parameters, specifically the section "Blocking execution".
However, in general, it seems that you really want a continuously running pipeline that watches the directory with CSV files and imports new files as they appear, never importing the same file twice. This would be a great case for a streaming pipeline: you could write a custom UnboundedSource (see also Custom Sources and Sinks) that would watch a directory and return filenames in it (i.e. the T would probably be String or GcsPath):
p.apply(Read.from(new DirectoryWatcherSource(directory)))
.apply(ParDo.of(new ReadCSVFileByName()))
.apply(the rest of your pipeline)
where DirectoryWatcherSource is your UnboundedSource, and ReadCSVFileByName is also a transform you'll need to write that takes a file path and reads it as a CSV file, returning the records in it (unfortunately right now you cannot use transforms like TextIO.Read in the middle of a pipeline, only at the beginning - we're working on fixing this).
It may be somewhat tricky, and as I said we have some features in the works to make it a lot simpler and we're considering creating a built-in source like that, but it's possible that for now this would still be easier than "pinball jobs". Please give it a try and let us know at dataflow-feedback#google.com if anything is unclear!
Meanwhile, you can also store information about which files you have or haven't processed in Cloud Bigtable - it'd be a better fit for that than BigQuery, because it's more suited for random writes and lookups, while BigQuery is more suited for large bulk writes and queries over the full dataset.