In Cloud Dataflow, an element may be assigned to multiple windows when sliding windows are used, since a sliding window has both a size and a step. Suppose we have a sliding window with a large size and a very small step; the elements in two adjacent windows would then be almost identical, differing only by the sliding step.
Would a computation over each sliding window simply load all the elements in that window and compute over them from scratch? Or could adjacent windows reuse partial results to avoid duplicate computation? And is an element copied when it is assigned to multiple windows?
Dataflow does not have any special handling for SlidingWindows like this. The element occurs in every window to which it is assigned.
We typically haven't seen performance problems using regular SlidingWindows with a CombineFn afterwards. We would suggest trying that first, and following up with more details on what you're trying to compute and the specifics of your windowing if you run into problems.
Automatically doing this as an optimization doesn't work well in the presence of user-defined windowing, triggering, out-of-order data, and other optimizations already present in the system.
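For reference, here is a minimal sketch of that suggestion in the Beam Java SDK, assuming an input PCollection of timestamped KV<String, Long> elements; the one-hour size, one-minute period, and Sum combiner are illustrative choices, not part of the original question:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

static PCollection<KV<String, Long>> slidingSums(PCollection<KV<String, Long>> input) {
  return input
      // A large window with a small step: each element lands in ~60 windows.
      .apply(Window.<KV<String, Long>>into(
          SlidingWindows.of(Duration.standardHours(1))
                        .every(Duration.standardMinutes(1))))
      // The CombineFn (Sum here) is applied per key and per window.
      .apply(Sum.longsPerKey());
}
```

As noted above, each element is simply processed in every window it is assigned to; there is no cross-window result sharing you can rely on.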
Say I have 3 unrelated time-series. Each written row key starts with the current timestamp: timestamp#.....
Having each time-series in a separate table will cause hotspotting because new rows are always added at one extremity (latest timestamp).
If we instead put all 3 time-series in one Bigtable table with prefixes:
series1#timestamp#....
series2#timestamp#....
series3#timestamp#....
Does that avoid hotspotting? Will each cluster node handle one time-series?
I'm assuming that there are 3 nodes per cluster and that each of the 3 time-series will receive similar load and will grow in size evenly.
If yes, is there any disadvantage to having multiple unrelated time-series in one Bigtable table?
Because you have a timestamp as the first part of your rowkey, I think you're going to get hotspots either way.
In a Bigtable instance, your data is split into groups of contiguous rowkeys (called tablets), and those tablets are distributed evenly across the nodes. To get the best performance out of Bigtable, your writes need to be spread across the nodes and across the tablets within each node. You get hotspotting when you keep writing to the same row or to a contiguous set of rows, since all of that work lands on a single tablet. If the timestamp is a prominent leading part of the key, you will keep writing to the same tablet until it fills up and then move on to the next one, rather than writing to multiple tablets within a node.
The Bigtable documentation has a guide for time-series schema design which recommends a few solutions for a use case like yours:
Field promotion: add an additional field to the rowkey before your timestamp to separate out a group of data (USER_ID#timestamp#...)
Salting: take a hash of the timestamp, divide it by the number of nodes, and use the remainder as a prefix in the rowkey (SALT_RESULT#timestamp#...); see the sketch after this list
Reverse timestamps: if neither of those works, reverse the timestamp. This works best if your most common query is for the latest values, but it can make other queries more difficult
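As an illustration of the salting option, here is a minimal sketch in Java; the three-bucket salt count and the exact rowkey layout are assumptions for the example, not a prescribed schema:

```java
/** Builds a salted Bigtable row key of the form SALT#timestamp#seriesId. */
public class SaltedRowKey {

  // Roughly match the number of nodes so that writes for consecutive
  // timestamps land in different tablets.
  private static final int SALT_BUCKETS = 3;

  public static String build(long timestampMillis, String seriesId) {
    // Hash the timestamp and take the remainder, so the same timestamp
    // always maps to the same bucket.
    int salt = Math.floorMod(Long.hashCode(timestampMillis), SALT_BUCKETS);
    return salt + "#" + timestampMillis + "#" + seriesId;
  }

  public static void main(String[] args) {
    // Prints something like "1#1700000000000#series1" (salt depends on the timestamp).
    System.out.println(build(1700000000000L, "series1"));
  }
}
```

The trade-off is on the read side: a scan over a time range now has to fan out across all salt buckets and merge the results.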
Edit:
Your approach is definitely similar to salting, but since your data is already in separate tables you're not actually getting any extra benefit: the hotspotting happens at the tablet level.
To draw it out more, let's say you have this data in separate tables and start writing. Each table is composed of tablets covering, say, timestamps 0-10, 11-20, and so on. Those tablets are automatically distributed among the nodes for the best performance. If the loads are all similar, the 0-10 tablets should all be on separate nodes, the 11-20 tablets on separate nodes, and so on.
With the way your schema is set up, you are constantly writing to the latest tablet: if the time is now 91, you are only writing to the 91-100 tablet and ignoring all the other tablets on that node. Since that 91-100 tablet is the only one getting work, the node isn't giving you its full performance, and this is what we refer to as hotspotting: a single tablet gets a spike, and there isn't enough time for the load balancer to correct it.
If you put everything in the same table, we can focus on just one node: series1#0-10 gets slammed first, then series1#11-20, then series1#21-30. There is always one tablet receiving too much load, so the full node never gets used.
There is some more information about load balancing in the documentation.
Setup:
read from Pub/Sub -> 30s window -> group by user -> combine -> write to Cloud Datastore
Problem:
I'm seeing DatastoreIO writer errors because objects with the same key are present in the same transaction.
Question:
I want to understand how my pipeline groups results into bundles after a group-by/combine operation. I would expect a bundle to be created for every window after the combine. But apparently a bundle can contain two or more occurrences of the same user?
Can re-execution (retries) of bundles cause this behavior?
Is this bundling dependent of the runner?
Is deduplication an option? If so, how would I best approach that?
Note that I'm not looking for a replacement for the Datastore writer at the end of the pipeline; I already know that we can use a different strategy. I'm merely trying to understand how the bundling happens.
There are two answers to your question. One is specific to your use case, and the other is in general about bundling / windowing in streaming.
Specific to your pipeline
I am assuming that the 'key' for Datastore is the user ID? In that case, if you have events from the same user in more than one window, your GroupByKey or Combine operations will produce a separate element for every user+window pair.
So the question is: What are you trying to insert into datastore?
An individual user's resulting aggregate over all time? In that case, you'd need to use a Global Window.
A user's resulting aggregate for every 30 seconds in time? Then you need to make the window part of the key you use to insert into Datastore; see the sketch below. Does that help / make sense?
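For the second option, here is a minimal sketch in the Beam Java SDK of folding the window into the key used for the Datastore entity. The class name and key format are illustrative, and you would adapt the output to however you build your entities:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;

// Rekeys (userId, aggregate) so the 30s window is part of the Datastore key,
// making each user+window pair a distinct entity.
class KeyByUserAndWindowFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    String datastoreKey =
        c.element().getKey() + "#" + window.maxTimestamp().getMillis();
    c.output(KV.of(datastoreKey, c.element().getValue()));
  }
}
```

With the window encoded in the key, each user produces a distinct entity per 30-second window, so two windows for the same user no longer collide in a single commit.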
Happy to help you design your pipeline to do what you want. Chat with me in the comments or via SO chat.
The larger question about bundling of data
Bundling strategies will vary by runner. In Dataflow, you should consider the following two factors:
Every worker is assigned a key range. Elements for the same key will be processed by the same worker.
Windows are a property of individual elements, but a bundle may contain elements from multiple windows. For example, if the data freshness metric makes a big jump*, a number of windows may be triggered, and elements with the same key in different windows would then be processed in the same bundle.
* When can data freshness jump suddenly? A single element with a very old timestamp that is very slow to process may hold back the watermark for a long time. Once that element is processed, the watermark may jump far ahead, to the next-oldest element (check out this lecture on watermarks).
I was about to start developing programs against Google Cloud Pub/Sub. I just wanted to confirm one thing.
From the Beam documentation, data loss can only occur if data is declared late. Is it safe to assume that the data will always be delivered without any message drops (late data) when using a global window?
From the concepts of watermarks and lateness, I have come to the conclusion that these are critical in situations where custom windowing with event-time-based triggers is applied to the incoming data.
When you're working with streaming data, choosing a global window basically means that you are going to ignore event time completely. Instead, you take snapshots of your data in processing time (that is, as it arrives) using triggers. Therefore, you can no longer define data as "late" (nor "early" or "on time", for that matter).
You should choose this approach if you are not interested in the time at which these events actually happened but, instead, you just want to group them according to the order in which they were observed. I would suggest that you go through this great article on streaming data processing, especially the part under When/Where: Processing-time windows which includes some nice visuals comparing different windowing strategies.
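As a rough illustration of that approach in the Beam Java SDK (the 30-second processing-time trigger, discarding mode, and Sum combiner are placeholder choices, not anything prescribed by the question):

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

static PCollection<KV<String, Long>> processingTimeSnapshots(
    PCollection<KV<String, Long>> input) {
  return input
      // A single global window: event time is effectively ignored, and output
      // is driven purely by a processing-time trigger firing every 30 seconds.
      .apply(Window.<KV<String, Long>>into(new GlobalWindows())
          .triggering(Repeatedly.forever(
              AfterProcessingTime.pastFirstElementInPane()
                  .plusDelayOf(Duration.standardSeconds(30))))
          .discardingFiredPanes()
          .withAllowedLateness(Duration.ZERO))
      .apply(Sum.longsPerKey());
}
```

Because everything lives in one global window, nothing is dropped as late; the trigger only controls when snapshots of the accumulated data are emitted.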
We want to submit unique GCS files to our streaming pipeline, each file containing multiple events, and each event containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (possibly as multiple files).
The reason we want to do the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that, since the data is shuffled by device_id and only grouped by file_id at the end, there is no way to guarantee that all data from a specific file_id has finished processing.
Is there something we could do about this? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, it would allow us to signal per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (the GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong. In those cases it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. Drain mode tries to achieve this, but as mentioned, if the entire pipeline is broken and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems we won't be able to achieve this without the application storing these checkpoints externally. If there is an alternative way of providing these guarantees from within Dataflow, that would be great. The idea behind broadening this question was to see whether we are missing an alternative perspective that would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
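A rough sketch of that idea in the Beam Java SDK, keyed by file_id. It assumes each element carries the expected element count for its file (stamped on when the file is read); FileEvent, its fields, and the class names are hypothetical:

```java
import org.apache.beam.sdk.state.CombiningState;
import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;

// Hypothetical event carrying the total number of elements in its source file.
class FileEvent implements java.io.Serializable {
  String fileId;
  long expectedCount;
}

// Counts processed elements per file_id and emits the file_id once the
// expected number of elements has been seen.
class FileCompletionFn extends DoFn<KV<String, FileEvent>, String> {

  @StateId("count")
  private final StateSpec<CombiningState<Long, long[], Long>> countSpec =
      StateSpecs.combining(Sum.ofLongs());

  @ProcessElement
  public void processElement(
      ProcessContext c,
      @StateId("count") CombiningState<Long, long[], Long> count) {
    count.add(1L);
    if (count.read() >= c.element().getValue().expectedCount) {
      c.output(c.element().getKey());  // signal that this file is complete
    }
  }
}
```

Note that state is scoped per key (and window), so the counter is tracked independently for each file_id; in a real pipeline FileEvent would also need a registered coder.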
I'd like to recognize and discard incomplete windows (independent of sliding) at the start of pipeline execution. For example:
If I'm counting the number of events hourly and I start at :55 past the hour, then I should expect ~1/12th the value in the first window and then a smooth ramp-up to the "correct" averages.
Since data could be arbitrarily late in a user-defined way, the time you start the pipeline up and the windows that are guaranteed to be missing data might be only loosely connected.
You'll need some out-of-band way of indicating which windows they are. If I were implementing such a thing, I would consider a few approaches, in this order I think:
Discarding outliers based on not enough data points. Seems that it would be robust to lots of data issues, if your data set can tolerate it (a statistician might disagree)
Discarding outliers based on data points not distributed in the window (ditto)
Discarding outliers based on some characteristic of the result instead of the input (statisticians will be even more likely to say don't do this, since you are already averaging)
Using a custom pipeline option to indicate a minimum start/end time for interesting windows (a sketch of this option follows below).
One reason to choose a more robust approach than just "start time" is downtime of your data producer or any intermediate system, etc. (even with delivery guarantees, the watermark may have moved on and made all that data droppable).
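For the last option in the list, here is a minimal sketch in the Beam Java SDK; the option name, the cutoff semantics, and applying the filter after the windowed aggregation are all assumptions for illustration:

```java
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Instant;

// Hypothetical option: windows ending before this time are considered incomplete.
interface IncompleteWindowOptions extends PipelineOptions {
  @Description("Epoch millis; per-window results whose window ends before this are dropped")
  Long getMinWindowEndMillis();
  void setMinWindowEndMillis(Long value);
}

// Applied after the windowed aggregation: drops per-window results whose
// window ends before the configured cutoff.
class DropIncompleteWindowsFn extends DoFn<KV<String, Long>, KV<String, Long>> {
  @ProcessElement
  public void processElement(ProcessContext c, BoundedWindow window) {
    Instant cutoff = new Instant(
        c.getPipelineOptions().as(IncompleteWindowOptions.class).getMinWindowEndMillis());
    if (!window.maxTimestamp().isBefore(cutoff)) {
      c.output(c.element());
    }
  }
}
```

This only acts on "the pipeline started mid-window"; the caveat above about producer or intermediate-system downtime still applies, which is why the more data-driven approaches come first in the list.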