What should I use as the Key for GroupIntoBatches.withShardedKey - google-cloud-dataflow

I want to batch calls to an external service in my streaming Dataflow job over an unbounded source. I used windowing + attaching a dummy key + GroupByKey, as below:
messages
    // 1. Windowing
    .apply("window-5-seconds",
        Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
                    .orFinally(AfterWatermark.pastEndOfWindow())))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    // 2. attach arbitrary key
    .apply("attach-arbitrary-key", ParDo.of(new MySink.AttachDummyKey()))
    // 3. group by key
    .apply(GroupByKey.create())
    // 4. call my service
    .apply("call-my-service", ParDo.of(new MySink(myClient)));
This implementation caused performance issues because attaching the same dummy key to all the messages meant the transform did not execute in parallel at all. After reading this answer, I switched to the GroupIntoBatches transform, as below.
messages
    // 1. Windowing
    .apply("window-5-seconds",
        Window.<Message>into(FixedWindows.of(Duration.standardSeconds(5)))
            .triggering(
                Repeatedly.forever(AfterPane.elementCountAtLeast(1000)
                    .orFinally(AfterWatermark.pastEndOfWindow())))
            .withAllowedLateness(Duration.ZERO)
            .discardingFiredPanes())
    // 2. attach sharding key
    .apply("attach-sharding-key", ParDo.of(new MySink.AttachShardingKey()))
    // 3. group by key into batches
    .apply("group-into-batches",
        GroupIntoBatches.<String, MessageWrapper>ofSize(1000)
            .withMaxBufferingDuration(Duration.standardSeconds(5))
            .withShardedKey())
    // 4. call my service
    .apply("call-my-service", ParDo.of(new MySink(myClient)));
The documentation states that withShardedKey increases parallelism by spreading one key over multiple threads, but the question is: what would be a good key when using withShardedKey?
If this truly is runner-determined sharding, would it make sense to use a single dummy key? Or would the same problem occur, just as with GroupByKey? Currently I do not have a good key to use; I was thinking of creating a hash based on some fields of the message. If I do pick a key that evenly distributes the traffic, would it still make sense to use withShardedKey? Or might each shard then not include enough data for GroupIntoBatches to actually be useful?
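For reference, this is roughly what I had in mind for the hash-based sharding key; the getSourceId() field and shard count are just placeholders:
static class AttachShardingKey extends DoFn<Message, KV<String, MessageWrapper>> {
  private static final int NUM_SHARDS = 16; // just a guess, would need tuning
  @ProcessElement
  public void processElement(ProcessContext c) {
    Message msg = c.element();
    // getSourceId() is a placeholder for whichever fields would go into the hash.
    String key = String.valueOf(Math.floorMod(msg.getSourceId().hashCode(), NUM_SHARDS));
    c.output(KV.of(key, new MessageWrapper(msg))); // wrapping is simplified here
  }
}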

Usually the key would be a natural key, but since you mentioned that there's no such key, I feel there are a few trade-offs to consider.
You can apply a static key, but the parallelism will then just depend on the number of threads (the GroupIntoBatches semantics), which is runner-specific:
Outputs batched elements associated with sharded input keys. By default, keys are sharded such that input elements with the same key are spread to all available threads executing the transform. Runners may override the default sharding to do a better load balancing during the execution time.
If your pipeline can afford more calls (with possibly not-full batches, depending on the distribution), applying a random key from a small range (you would have to experiment to find the right balance) instead of a static key may provide better guarantees.
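A minimal sketch of attaching such a random key; the range of 10 and the DoFn name are only illustrative:
static class AttachRandomKey extends DoFn<MessageWrapper, KV<String, MessageWrapper>> {
  private static final int NUM_KEYS = 10; // small range, tune for your workload
  @ProcessElement
  public void processElement(ProcessContext c) {
    // A random key in a small range spreads elements across NUM_KEYS batches.
    int shard = ThreadLocalRandom.current().nextInt(NUM_KEYS);
    c.output(KV.of(String.valueOf(shard), c.element()));
  }
}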
I recommend watching this session which provides some relevant information: Beam Summit 2021 - Autoscaling your transforms with auto-sharded GroupIntoBatches

Related

Slowness / Lag in beam streaming pipeline in group by key stage

Context
Hi all, I have been using Apache Beam pipelines to generate columnar DB files to store in GCS. I have a data stream coming in from Kafka and use a window of 1 minute.
I want to transform all data of that 1-minute window into a columnar DB file (ORC in my case, but it could be Parquet or anything else), and I have written a pipeline for this transformation.
Problem
I am experiencing general slowness. I suspect it could be due to the group-by-key transformation, as I have only one key. Is there really a need for that? If not, what should be done instead? I read that Combine isn't very useful here, as my pipeline isn't really aggregating the data but creating a merged file. What I exactly need is an iterable list of objects per window, which will be transformed into ORC files.
Pipeline Representation
input -> window -> group by key (only 1 key) -> pardo (to create DB) -> IO (to write to GCS)
What I have tried
I have tried using the profiler and scaling horizontally/vertically. Using the profiler, I saw more than 50% of the time going into the group-by-key operation. I do believe the problem is hot keys, but I am unable to find a solution for what should be done. When I removed the group-by-key operation, my pipeline keeps up with the Kafka lag (i.e., it doesn't seem to be an issue at the Kafka end).
Code Snippet
p.apply("ReadLines", KafkaIO.<Long, byte[]>read().withBootstrapServers("myserver.com:9092")
.withTopic(options.getInputTopic())
.withTimestampPolicyFactory(MyTimePolicy.myTimestampPolicyFactory())
.withConsumerConfigUpdates(Map.of("group.id", "mygroup-id")).commitOffsetsInFinalize()
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(ByteArrayDeserializer.class).withoutMetadata())
.apply("UncompressSnappy", ParDo.of(new UncompressSnappy()))
.apply("DecodeProto", ParDo.of(new DecodePromProto()))
.apply("MapTSSample", ParDo.of(new MapTSSample()))
.apply(Window.<TSSample>into(FixedWindows.of(Duration.standardMinutes(1)))
.withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
.apply(WithKeys.<Integer, TSSample>of(1))
.apply(GroupByKey.<Integer, TSSample>create())
.apply("CreateTSORC", ParDo.of(new CreateTSORC()))
.apply(new WriteOneFilePerWindow(options.getOutput(), 1));
Wall Time Profile
https://gist.github.com/anandsinghkunwar/4cc26f7e3da7473af66ce9a142a74c35
The problem indeed seems to be a hot-key issue; I had to change my pipeline to create a custom IO for ORC files and bump up the number of shards to 50 for my case. I removed the GroupByKey entirely. Since Beam doesn't yet auto-determine the number of shards for FileIO.write(), you'll have to manually choose a number that suits your workload.
Also, enabling the Streaming Engine in Google Dataflow sped up the ingestion even more.
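For anyone running into the same thing, the write step ended up looking roughly like this (sketch only; OrcSink is a placeholder for my custom FileIO.Sink<TSSample> implementation):
windowedTSSamples.apply("WriteORC",
    FileIO.<TSSample>write()
        .via(new OrcSink()) // placeholder for the custom ORC FileIO.Sink
        .to(options.getOutput())
        .withSuffix(".orc")
        .withNumShards(50));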

Can Google dataflow GroupByKey handle hot keys?

Input is PCollection<KV<String,String>>
I have to write files by key, with each line being one value from the KV group.
In order to group based on the key, I have 2 options:
1. GroupByKey --> PCollection<KV<String, Iterable<String>>>
2. Combine.perKey.withHotKeyFanout --> PCollection<KV<String, CustomStringObJ>>, where the value is the accumulated Strings from all pairs (Combine.CombineFn<String, List<String>, CustomStringObJ>).
I can have a million records per key. The collection of keyed data is optimised using windows and triggers, but it can still have thousands of entries per key.
I worry that the maximum size of a String will cause issues if Combine.perKey.withHotKeyFanout is used to create a CustomStringObJ which has a List<String> as a member to be written to the file.
If we use GroupByKey, how do we handle hot keys?
You should use the approach with GroupByKey, not use Combine to concatenate a large string. The actual implementation (not unique to Dataflow) is that elements are shuffled according to their key and in the output KV<K, Iterable<V>> the iterable of values is a particular lazy/streamed view on the elements shuffled to that key. There is no actual iterable constructed - this is just as good as routing each element to the worker that owns each file and writing it directly.
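A rough sketch of that approach, where input is the PCollection<KV<String, String>> from the question and writeLine(key, value) is a hypothetical helper that appends one line to the key's file:
input
    .apply(GroupByKey.<String, String>create())
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, Void>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String key = c.element().getKey();
        // The Iterable is a lazy view, so values are streamed, not materialized.
        for (String value : c.element().getValue()) {
          writeLine(key, value); // hypothetical file-append helper
        }
      }
    }));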
Your use of windows and triggers might actually force buffering and make this less efficient. You should only use event-time windowing if it is part of your business case; it isn't a mechanism for controlling performance. Triggers are good for managing how data is batched up and sent downstream, but they are most useful for aggregations, where triggering less frequently saves a lot of data volume. For a raw grouping of the elements, triggers tend to be less useful.

Dataflow fusion + windows/triggers

I'm aware that Dataflow can modify a pipeline's execution graph through Fusion Optimization.
Do windows/triggers factor into fusion optimization at all?
Does a streaming pipeline and/or unbounded sources (Pub/Sub) influence that behavior at all?
All the complex operations of the Beam programming model, including evaluation of windowing/triggering and such, end up being translated to a low-level graph of (possibly stateful) ParDo and GroupByKey operations (a.k.a. Map and Reduce :) ).
E.g.
You can think of window assignment (Window.into()) as a ParDo that takes an element and returns a list of (element, window) pairs, one for each window into which the element's timestamp maps
A GroupByKey by a key (or a Combine) in your original pipeline gets translated into a GroupByKey by a composite key (user key, window)
Evaluation of triggers happens as a stateful ParDo that gets inserted immediately after any GroupByKey and reacts to new values arriving for a given key/window by buffering the new value and deciding whether, according to the trigger, it's already time to emit the accumulated values or not.
This is not an exact correspondence (the semantics of windows are a little more complex than that); it's just to give you an idea.
Fusion operates on this low-level graph of ParDo and GroupByKey, collapsing some chains of ParDo's into a single ParDo. Fusion doesn't care whether some of the ParDos play a role related to windowing, or that a GroupByKey groups by a composite key, etc.
I believe that in the Dataflow streaming runner, fusion is in practice more aggressive (it always collapses chains of ParDos) than in the batch runner (which collapses only in cases where it seems beneficial according to data size estimates, based on the FlumeJava paper), but this can change as we make improvements to both runners.

Too many 'steps' when executing a pipeline

We have a large data set which needs to be partitioned into 1,000 separate files, and the simplest implementation we wanted to use is to apply a PartitionFn which, given an element of the data set, returns a random integer between 1 and 1,000.
The problem with this approach is that it ends up creating 1,000 PCollections, and the pipeline does not launch, as there seems to be a hard limit on the number of 'steps' (which correspond to the boxes shown in the execution graph on the job monitoring UI).
Is there a way to increase this limit (and what is the limit)?
The solution we are using to get around this issue is to partition the data into smaller subsets first (say 50 subsets), and for each subset we run another layer of partitioning pipelines to produce 20 subsets of each (so the end result is 1,000 subsets), but it would be nice if we could avoid this extra layer (as it ends up creating 1 + 50 pipelines and incurs the extra cost of writing and reading the intermediate data).
Rather than using the Partition transform and introducing many steps in the pipeline, consider either of the following approaches:
1. Many sinks support specifying the number of output shards. For example, TextIO has a withNumShards method. If you pass it 1000, it will produce 1000 separate shards in the specified directory (see the sketch after this list).
2. Use the shard number as a key and use a GroupByKey + a DoFn to write the results.
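For example, option 1 with TextIO could look roughly like this, where lines is a PCollection<String> and the output prefix is illustrative:
// Writes the elements of "lines" as text, split into 1000 shards.
lines.apply("WriteSharded",
    TextIO.write()
        .to("gs://my-bucket/output/part") // illustrative output prefix
        .withNumShards(1000));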

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on IDs in a windowed fashion. The stream we receive carries IDs, and we want to remove records with matching IDs within N-hour time windows. A straightforward approach is to use an external key store (Bigtable or something similar) where we look up keys and write them if required, but our QPS is extremely large, which makes maintaining such a service pretty hard. The alternative approach I came up with was to group by key within a time window so that all data for a user within a time window falls into the same group, and then, within each group, use a separate key-store service to look up duplicates by the key. So, I have a few questions about this approach:
[1] If I run a GroupByKey transform, is there any guarantee that each group will be processed on the same worker? If so, we can group by the user ID and then, within each group, compare the session ID for each user.
[2] If that is feasible, my next question is whether we can run such other services on each of the worker machines that run the job - in the example above, I would like to have a local Redis running which each group can then use to look up or write an ID.
The idea seems outside what Dataflow is supposed to do, but I believe such use cases should be common - so if there is a better model for approaching this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible, given the amount of data we have.
1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for a key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) You're welcome to run other services on the Dataflow VMs since they are general purpose, but note that you will have to contend with the resource requirements of the other applications on the host, which could potentially cause out-of-memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your use case.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
    Window.<T>into(Sessions.withGapDuration(Duration.standardHours(N))));
Each new element will keep extending the length of the window, so it won't close until the gap passes. You can also apply a speculative AfterCount trigger of 1 together with an AfterWatermark trigger on a downstream GroupByKey; the trigger will then fire as soon as it can, which is once it has seen at least one element, and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out any element which isn't from an early firing, based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description; can you provide more details?
Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have an incoming data stream, each item of which can be uniquely identified by its fields. We also know that duplicates occur pretty far apart and, for now, we care about those within a 6-hour window. Regarding the volume of data, we have at least 100K events every second, spanning a million different users - so within this 6-hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<Key, T>> windowed_pc = pc.apply(
    Window.<KV<Key, T>>into(Sessions.withGapDuration(Duration.standardHours(6))));
where the key is a combination of the 3 IDs I mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per key. This would mean that Dataflow would have a huge number of open sessions at any given time waiting for them to close, and I was worried whether it would scale or run into bottlenecks.
[2] Once I perform the sessioning as above, I have already removed the duplicates based on the firings, since I only care about the first firing in each session, which already destroys duplicates. I no longer need the RemoveDuplicates transform, which I found is a combination of the (WithKeys, Combine.PerKey, Values) transforms in that order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] is going to be a problem, the alternative is to reduce the key for sessioning to just user-id, session-id (ignoring the sequence-id) and then run a RemoveDuplicates by sequence-id on top of each resulting window. This might reduce the number of open sessions, but would still leave a lot of them (#users * #sessions per user), which can easily run into the millions. FWIW, I don't think we can session only by user-id, since then the session might never close, as different sessions for the same user could keep coming in, and determining the session gap in that scenario also becomes infeasible.
Hope my problem is a little clearer this time. Please let me know whether any of my approaches make the best use of Dataflow or if I am missing something.
Thanks
I tried out this solution at a larger scale, and as long as I provide sufficient workers and disks, the pipeline scales well, although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks at c.pane().getTiming() and rejects anything other than an EARLY firing. I tried counting both EARLY and ON_TIME firings in this ParDo, and it looks like the on-time panes are actually deduped more precisely than the early ones. I mean, the early firings still contain some duplicates, whereas the on-time firings count is lower and has more duplicates removed. Is there any reason this could happen? Also, is my approach to deduping with a Combine + ParDo the right one, or could I do something better?
events
    .apply(
        WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
          @Override
          public String apply(EventInfo input) {
            return input.getUniqueKey();
          }
        }))
    .apply(
        Window.named("sessioner").<KV<String, EventInfo>>into(
                Sessions.withGapDuration(mSessionGap))
            .triggering(
                AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes());
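For completeness, the pane-filtering ParDo I mentioned looks roughly like this (a sketch only; combined stands for the output of the Combine.perKey step, with the combined value type shown as EventInfo for simplicity):
combined.apply(ParDo.of(new DoFn<KV<String, EventInfo>, EventInfo>() {
  @Override
  public void processElement(ProcessContext c) {
    // Keep only the speculative (EARLY) pane; ON_TIME/LATE panes are rejected.
    if (c.pane().getTiming() == PaneInfo.Timing.EARLY) {
      c.output(c.element().getValue());
    }
  }
}));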
