I'm using Apache Beam 2.28.0 on Google Cloud DataFlow (with Scio SDK). I have a large input PCollection (bounded) and I want to limit / sample it to a fixed number of elements, but I want to start the downstream processing as soon as possible.
Currently, when my input PCollection has e.g. 20M elements and I want to limit it to 1M by using https://beam.apache.org/releases/javadoc/2.28.0/org/apache/beam/sdk/transforms/Sample.html#any-long-
input.apply(Sample.<String>any(1000000))
it waits until all of the 20M elements are read, which takes a long time.
How can I efficiently limit the number of elements to a fixed size and start downstream processing as soon as the limit is reached, discarding the rest of the input?
OK, so my initial solution is to use a stateful DoFn like this (I'm using Scio's Scala SDK, as mentioned in the question):
import java.lang.{Long => JLong}

import org.apache.beam.sdk.state.{StateSpecs, ValueState}
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, StateId}
import org.apache.beam.sdk.values.KV

class MyLimitFn[T](limit: Long) extends DoFn[KV[String, T], KV[String, T]] {
  @StateId("count") private val count = StateSpecs.value[JLong]()

  @ProcessElement
  def processElement(context: DoFn[KV[String, T], KV[String, T]]#ProcessContext, @StateId("count") count: ValueState[JLong]): Unit = {
    // ValueState reads null before the first write, so treat null as 0.
    val current = Option(count.read()).map(_.longValue()).getOrElse(0L)
    if (current < limit) {
      count.write(current + 1L)
      context.output(context.element())
    }
  }
}
The downside of this solution is that I need to synthetically add the same key (e.g. an empty string) to all elements before using it (see the sketch below). So far, it's much faster than Sample.<String>any().
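For reference, a minimal sketch of that wiring with the plain Beam Java API (illustrative only; input is the original PCollection<String> and MyLimitFn is the DoFn above):

PCollection<String> limited = input
    .apply("AttachDummyKey", WithKeys.of(""))                        // same synthetic key for every element
    .apply("LimitPerKey", ParDo.of(new MyLimitFn<String>(1000000L))) // stateful limit per key
    .apply("DropDummyKey", Values.create());                         // strip the synthetic key again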
I'm still looking forward to better / more efficient solutions.
Related
I'm creating a library for building data processing workflows using Reactor 3. Each task will have an input flux and an output flux. The input flux is provided by the user; the output flux is created by the library. Tasks can be chained to form a DAG. Something like this (it's in Kotlin):
val base64 = task<String, String>("base64") {
    input { Flux.just("a", "b", "c", "d", "e") }
    outputFn { ... get the output values ... }
    scriptFn { ... do some stuff ... }
}

val step2 = task<List<String>, String>("step2") {
    input { base64.output.buffer(3) }
    outputFn { ... }
    scriptFn { ... }
}
I have a requirement to limit concurrency for the whole workflow: only a configured number of inputs can be processed at once. In the example above, with a limit of 3, task base64 would run with inputs "a", "b", and "c" first, then wait for each to complete before processing "d", "e", and the "step2" tasks.
How can I apply such limitations when creating output fluxes from input fluxes? Could a TopicProcessor somehow be applied? Maybe some sort of custom scheduler or processor? How would back-pressure work? Do I need to worry about creating a buffer?
Backpressure propagates from the final subscriber up, across the whole chain. But operators in the chain can ask for data in advance (prefetch) or even "rewrite" the request. For example, in the case of buffer(3), if that operator receives a request(1) it will perform a request(3) upstream ("1 buffer == max 3 elements, so I can request enough from my source to fill the 1 buffer I was asked for").
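To make the prefetch behaviour concrete, here is a small illustrative sketch (not from the original post): placing log() between the source and buffer(3) shows the downstream request for a single buffer being rewritten into request(3) upstream.

import java.util.List;

import org.reactivestreams.Subscription;
import reactor.core.publisher.BaseSubscriber;
import reactor.core.publisher.Flux;

public class PrefetchDemo {
    public static void main(String[] args) {
        Flux.range(1, 9)
            .log()     // logs the requests that buffer(3) sends upstream
            .buffer(3) // one downstream buffer == three upstream elements
            .subscribe(new BaseSubscriber<List<Integer>>() {
                @Override
                protected void hookOnSubscribe(Subscription subscription) {
                    request(1); // ask for a single buffer; the log shows request(3) upstream
                }

                @Override
                protected void hookOnNext(List<Integer> batch) {
                    System.out.println("got " + batch);
                    request(1); // ask for the next buffer
                }
            });
    }
}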
If the input is always provided by the user, this will be hard to abstract away...
There is no easy way to rate limit sources across multiple pipelines or even multiple subscriptions to a given pipeline (a Flux).
Using a shared Scheduler in multiple publishOn will not work because publishOn selects a Worker thread and sticks to it.
However, if your question is more specifically about the base64 task being limited, maybe the effect can be obtained from flatMap's concurrency parameter?
input.flatMap(someString -> asyncProcess(someString), 3, 1);
This lets at most 3 invocations of asyncProcess run at a time, and each time one terminates, a new one starts with the next value from input.
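For illustration, a small self-contained sketch of that concurrency knob (asyncProcess here is a stand-in for real asynchronous work, not code from the original post):

import java.time.Duration;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class ConcurrencyLimitDemo {
    // Stand-in for a real asynchronous operation.
    static Mono<String> asyncProcess(String s) {
        return Mono.just(s.toUpperCase()).delayElement(Duration.ofMillis(100));
    }

    public static void main(String[] args) {
        Flux.just("a", "b", "c", "d", "e")
            .flatMap(ConcurrencyLimitDemo::asyncProcess, 3, 1) // at most 3 in flight, prefetch 1
            .doOnNext(System.out::println)
            .blockLast(); // block only for the sake of the demo
    }
}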
I have 3 GroupBys in my Java pipeline, and after each GroupBy the program runs some computation between stages. These groups become larger and larger blocks; the only thing the program adds is a new key to each block.
The last GroupBy deals with a smaller number of large blocks. The pipeline works for a small number of items, but it fails at the second or third GroupBy for a large number of items.
I played with Xms and Xmx and even chose much larger instances ('n1-standard-64'), but it didn't work. For the failed example, I'm sure the output is smaller than 5 GB, so is there any other way to control memory in Dataflow per map/reduce task?
If Dataflow can handle the first GroupBy, then it should be able to reduce the number of tasks to allocate more heap memory and handle the large blocks in the next stage.
Any suggestion will be appreciated!
UPDATE:
.apply(ParDo.named("Sort Bins").of(
    new DoFn<KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>,
             KV<Integer, KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>>>>() {
      @Override
      public void processElement(ProcessContext c) {
        KV<Integer, Iterable<KV<Long, Iterable<TableRow>>>> e = c.element();
        Integer Secondary_key = e.getKey();
        // Get a modifiable list.
        ArrayList<KV<Long, Iterable<TableRow>>> records = Lists.newArrayList(e.getValue());
        Collections.sort(records, BinID_COMPARATOR);
        Integer Primary_key = ...; // a simple function of the secondary key
        c.output(KV.of(Primary_key,
            KV.of(Secondary_key, (Iterable<KV<Long, Iterable<TableRow>>>) records)));
      }
    }));
The error is reported for the last line (c.output).
Is there any way to check if a PCollection is empty?
I haven't found anything relevant in the documentation of Dataflow and Apache Beam.
You didn't specify which SDK you're using, so I assumed Python. The code is easily portable to Java.
You can apply a global count of the elements and then map the numeric value to a boolean with a simple comparison. You can then side-input this value using the pvalue.AsSingleton function, like this:
import apache_beam as beam
from apache_beam import pvalue

is_empty_check = (your_pcollection
                  | "Count" >> beam.combiners.Count.Globally()
                  | "Is empty?" >> beam.Map(lambda n: n == 0)
                  )

another_pipeline_branch = (
    p
    | beam.Map(do_something, is_empty=pvalue.AsSingleton(is_empty_check))
)
Usage of the side input is as follows:
def do_something(element, is_empty):
    if is_empty:
        ...  # yes, the PCollection is empty
    else:
        ...  # no, it has elements
There is no way to check the size of a PCollection without applying a PTransform to it (such as Count.globally() or a Combine transform), because a PCollection is not like a typical Collection in the Java SDK or elsewhere.
It is an abstraction of a bounded or unbounded collection of data, where data is fed into the collection for an operation to be applied to it (e.g. a PTransform). It is also parallelized (as the P at the beginning of the class name suggests).
Therefore you need a mechanism to get the counts of elements from each worker/node and combine them into a single value. Whether it is 0 or n cannot be known until the end of that transformation.
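For completeness, a rough Java sketch of the same idea (illustrative only; yourPCollection and otherInput are placeholder names): count globally, map the count to a boolean, expose it as a singleton view, and read it as a side input.

PCollectionView<Boolean> isEmptyView = yourPCollection
    .apply("Count", Count.globally())
    .apply("IsEmpty", MapElements.into(TypeDescriptors.booleans()).via((Long n) -> n == 0))
    .apply(View.<Boolean>asSingleton());

otherInput.apply(ParDo.of(new DoFn<String, String>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        boolean isEmpty = c.sideInput(isEmptyView);
        // branch on isEmpty here
        c.output(c.element());
    }
}).withSideInputs(isEmptyView));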
Is it possible to join two separate unbounded PubsubIO PCollections using a key present in both of them? I'm trying to accomplish the task with something like:
Read(FirstStream) & Read(SecondStream) -> Flatten -> Generate key to use in joining -> Use session windowing to gather them together -> Group by key, then rewindow with fixed-size windows -> AvroIO write to disk using windowing.
EDIT:
Here is the pipeline code I created. I experience two problems:
Nothing gets written to disk.
The pipeline becomes really unstable - it randomly slows down the processing of certain steps, especially the group-by, and it can't keep up with the ingestion speed even when I use 10 Dataflow workers.
I need to handle ~10,000 sessions a second. Each session comprises 1 or 2 events and then needs to be closed.
PubsubIO.Read<String> auctionFinishedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
    .fromTopic("projects/authentic-genre-152513/topics/auction_finished");
PubsubIO.Read<String> auctionAcceptedReader = PubsubIO.readStrings().withTimestampAttribute(TIMESTAMP_ATTRIBUTE)
    .fromTopic("projects/authentic-genre-152513/topics/auction_accepted");

PCollection<String> auctionFinishedStream = p.apply("ReadAuctionFinished", auctionFinishedReader);
PCollection<String> auctionAcceptedStream = p.apply("ReadAuctionAccepted", auctionAcceptedReader);

PCollection<String> combinedEvents = PCollectionList.of(auctionFinishedStream)
    .and(auctionAcceptedStream).apply(Flatten.pCollections());

PCollection<KV<String, String>> keyedAuctionFinishedStream = combinedEvents
    .apply("AddKeysToAuctionFinished", WithKeys.of(new GenerateKeyForEvent()));

PCollection<KV<String, Iterable<String>>> sessions = keyedAuctionFinishedStream
    .apply(Window.<KV<String, String>>into(Sessions.withGapDuration(Duration.standardMinutes(1)))
        .withTimestampCombiner(TimestampCombiner.END_OF_WINDOW))
    .apply(GroupByKey.create());

PCollection<SodaSession> values = sessions
    .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, SodaSession>() {
        @ProcessElement
        public void processElement(ProcessContext c, BoundedWindow window) {
            c.output(new SodaSession("auctionid", "stattedat"));
        }
    }));

PCollection<SodaSession> windowedEventStream = values
    .apply("ApplyWindowing", Window.<SodaSession>into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane()
                .plusDelayOf(Duration.standardMinutes(1))
        ))
        .withAllowedLateness(Duration.ZERO)
        .discardingFiredPanes()
    );

AvroIO.Write<SodaSession> avroWriter = AvroIO
    .write(SodaSession.class)
    .to("gs://storage/")
    .withWindowedWrites()
    .withFilenamePolicy(new EventsToGCS.PerWindowFiles("sessionsoda"))
    .withNumShards(3);

windowedEventStream.apply("WriteToDisk", avroWriter);
I've found an efficient solution. As one of my collections was disproportionate in size compared to the other one, I used a side input to speed up the grouping operation. Here is an overview of my solution:
1. Read both event streams.
2. Flatten them into a single PCollection.
3. Use a sliding window sized (closable session duration + maximum session length), sliding every closable session duration.
4. Partition the collections again.
5. Create a PCollectionView from the smaller PCollection.
6. Join both streams using a side input with the view created in the previous step (a rough sketch follows below).
7. Write sessions to disk.
It handles joining a 4000 events/sec stream (the larger one) with a 60 events/sec stream on 1-2 Dataflow workers, versus ~15 workers when I used session windowing along with GroupByKey.
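A rough sketch of steps 5-6 (placeholder names like smallKeyedStream and largeKeyedStream, not the author's exact code); both streams are assumed to carry KV<String, String> and to share the sliding windowing from step 3:

PCollectionView<Map<String, Iterable<String>>> smallSideInput =
    smallKeyedStream.apply(View.<String, String>asMultimap());

PCollection<SodaSession> joined = largeKeyedStream.apply(
    ParDo.of(new DoFn<KV<String, String>, SodaSession>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Map<String, Iterable<String>> lookup = c.sideInput(smallSideInput);
            Iterable<String> matches = lookup.get(c.element().getKey());
            if (matches != null) {
                // build and output the joined SodaSession here
            }
        }
    }).withSideInputs(smallSideInput));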
I am reading data (GPS coordinates with timestamps) from an unbounded Pub/Sub data source and need to calculate the distance between all those points. My idea is to have, let's say, 1-minute windows and do a ParDo with the whole collection as a side input, where I use the side input to look up the next point and calculate the distance inside the ParDo.
If I run the pipeline, I can see the View.asList step is not producing any output, and calcDistance never produces any output either. Are there any examples of how to use a FixedWindows collection as a side input?
Pipeline:
PCollection<Timepoint> inputWindow = pipeline.apply(PubsubIO.Read.topic(""))
    .apply(ParDo.of(new ExtractTimestamps()))
    .apply(Window.<Timepoint>into(FixedWindows.of(Duration.standardMinutes(1))));

final PCollectionView<List<Timepoint>> SideInputWindowed = inputWindow.apply(View.<Timepoint>asList());

inputWindow.apply(ParDo.named("Add Timestamp " + teams[i]).of(new AddTimeStampAsKey()))
    .apply(ParDo.of(new CalcDistanceTest(SideInputWindowed)).withSideInputs(SideInputWindowed));
ParDo:
static class CalcDistance extends DoFn<KV<String, Timepoint>, Timepoint> {
    private final PCollectionView<List<Timepoint>> pCollectionView;

    public CalcDistance(PCollectionView pCollectionView) {
        this.pCollectionView = pCollectionView;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        LOG.info("starting to calculate distance");
        Timepoint input = c.element().getValue();
        // doing distance calculation
        c.output(input);
    }
}
The overall issue is that Dataflow does not know the timestamp of each element when reading from Pub/Sub, since in your use case it is an attribute of the data.
You'll want to ensure that when reading from Pub/Sub, you ingest the records with a timestamp label, as discussed here.
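For illustration, a hedged sketch using the newer Beam-style PubsubIO API (the question's snippet uses the older Dataflow 1.x API); the topic and attribute name are placeholders:

PCollection<String> gpsPoints = pipeline.apply(
    "ReadGpsPoints",
    PubsubIO.readStrings()
        .fromTopic("projects/my-project/topics/gps-points") // placeholder topic
        .withTimestampAttribute("ts"));                      // attribute carrying the event timestamp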
Finally, the GameStats example uses a side input to find spammy users. In your case, instead of calculating a globalMeanScore per window, you'll just place all your Timepoints into the side input.