Using FinishBundle in Apache Beam Go SDK - google-cloud-dataflow

I am attempting to use FinishBundle() to batch requests in Beam on Dataflow. These requests fetch information and emit it for further processing downstream in the pipeline, à la:
type BatchRpcFn struct {
    client        RpcClient
    bufferRequest *RpcRequest
}

func (f *BatchRpcFn) Setup(ctx context.Context) {
    // set up the RPC client
}

func (f *BatchRpcFn) ProcessElement(ctx context.Context, id string, emit func(string, bool)) error {
    f.bufferRequest.Ids = append(f.bufferRequest.Ids, id)
    if len(f.bufferRequest.Ids) > bufferLimit {
        return f.performRequestAndEmit(ctx, emit)
    }
    return nil
}

func (f *BatchRpcFn) FinishBundle(ctx context.Context, emit func(string, bool)) error {
    return f.performRequestAndEmit(ctx, emit)
}
In unit tests, this function works as expected; however, when running on Dataflow, I get this error:
panic: interface conversion: typex.Window is window.GlobalWindow, not window.IntervalWindow
//...
github.com/apache/beam/sdks/v2/go/pkg/beam/core/runtime/exec.(*intervalWindowEncoder).EncodeSingle()
The documentation on FinishBundle() is a little sparse, so it wasn't clear to me whether this is possible. Most of the examples I see of FinishBundle() flush data to some sink rather than adding to the resultant PCollection.
Is this a bug, or am I using FinishBundle incorrectly here?

I think the processing should be done in ProcessElement() itself, which produces the resultant PCollection. StartBundle() and FinishBundle() are called once per bundle; their common use case is connecting/disconnecting to an external service or database, etc.
A stateful DoFn may be a good way to batch the requests: for example, do the processing after five elements have been observed, with an onTimer() callback to process the remaining elements at the end of the window.
However, only state support has been added to the Go SDK as of the 2.42.0 release; timers are yet to be implemented.
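To make the state-based suggestion concrete, here is a minimal sketch against the state API that shipped in Go SDK 2.42.0 (RpcClient and bufferLimit carry over from the question, and performRequestAndEmit is assumed to take the buffered ids explicitly; a stateful DoFn requires keyed input, and since timers are not yet available, a count threshold is the only flush trigger shown):

import (
    "context"

    "github.com/apache/beam/sdks/v2/go/pkg/beam/core/state"
)

type statefulBatchFn struct {
    client RpcClient
    Buffer state.Bag[string]
}

func (f *statefulBatchFn) ProcessElement(ctx context.Context, sp state.Provider, key, id string, emit func(string, bool)) error {
    if err := f.Buffer.Add(sp, id); err != nil {
        return err
    }
    ids, ok, err := f.Buffer.Read(sp)
    if err != nil {
        return err
    }
    if ok && len(ids) >= bufferLimit {
        // perform the RPC with the buffered ids, emit each result, then reset the buffer
        if err := f.performRequestAndEmit(ctx, ids, emit); err != nil {
            return err
        }
        return f.Buffer.Clear(sp)
    }
    return nil
}

Note that without timers, a tail of fewer than bufferLimit elements per key would never be flushed by this sketch alone; that gap is exactly what the onTimer() callback would close once timers land in the Go SDK.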

Related

Combine framework: how to process each element of array asynchronously before proceeding

I'm having a bit of a mental block using the iOS Combine framework.
I'm converting some code from "manual" fetching from a remote API to using Combine. Basically, the API is SQL and REST (in actual fact it's Salesforce, but that's irrelevant to the question). What the code used to do is call a REST query method that takes a completion handler. What I'm doing is replacing this everywhere with a Combine Future. So far, so good.
The problem arises when the following scenario happens (and it happens a lot):
We do a REST query and get back an array of "objects".
But these "objects" are not completely populated. Each one of them needs additional data from some related object. So for each "object", we do another REST query using information from that "object", thus giving us another array of "objects".
This might or might not allow us to finish populating the first "objects"; or else, we might have to do another REST query using information from each of the second "objects", and so on.
The result was a lot of code structured like this (this is pseudocode):
func fetchObjects(completion: @escaping ([Object]) -> Void) {
    let restQuery = ...
    RESTClient.performQuery(restQuery) { results in
        let partialObjects = results.map { ... }
        let group = DispatchGroup()
        for partialObject in partialObjects {
            let restQuery = ... // something based on partialObject
            group.enter()
            RESTClient.performQuery(restQuery) { results in
                let partialObjects2 = results.map { ... }
                partialObject.property1 = // something from partialObjects2
                partialObject.property2 = // something from partialObjects2
                // and we could go down yet _another_ level in some cases
                group.leave()
            }
        }
        group.notify(queue: .main) {
            completion(partialObjects)
        }
    }
}
Every time "results in" appears in the pseudocode, that's the completion handler of an asynchronous networking call.
Okay, well, I see well enough how to chain asynchronous calls in Combine, for example by using Futures and flatMap (pseudocode again):
let future1 = Future...
future1.map {
    // do something
}.flatMap {
    let future2 = Future...
    return future2.map {
        // do something
    }
}
// ...
In that code, the way we form future2 can depend upon the value we received from the execution of future1, and in the map on future2 we can modify what we received from upstream before it gets passed on down the pipeline. No problem. It's all quite beautiful.
But that doesn't give me what I was doing in the pre-Combine code, namely the loop. Here I was, doing multiple asynchronous calls in a loop, held in place by a DispatchGroup before proceeding. The question is:
What is the Combine pattern for doing that?
Remember the situation. I've got an array of some object. I want to loop through that array, doing an asynchronous call for each object in the loop, fetching new info asynchronously and modifying that object on that basis, before proceeding on down the pipeline. And each loop might involve a further nested loop gathering even more information asynchronously:
Fetch info from online database, it's an array
|
V
For each element in the array, fetch _more_ info, _that's_ an array
|
V
For each element in _that_ array, fetch _more_ info
|
V
Loop thru the accumulated info and populate that element of the original array
The old code for doing this was horrible-looking, full of nested completion handlers and loops held in place by DispatchGroup enter/leave/notify. But it worked. I can't get my Combine code to work the same way. How do I do it? Basically my pipeline output is an array of something, I feel like I need to split up that array into individual elements, do something asynchronously to each element, and put the elements back together into an array. How?
The way I've been solving this works, but doesn't scale, especially when an asynchronous call needs information that arrived several steps back in the pipeline chain. I've been doing something like this (I got this idea from https://stackoverflow.com/a/58708381/341994):
An array of objects arrives from upstream.
I enter a flatMap and map the array to an array of publishers, each headed by a Future that fetches further online stuff related to one object, and followed by a pipeline that produces the modified object.
Now I have an array of pipelines, each producing a single object. I merge that array and produce that publisher (a MergeMany) from the flatMap.
I collect the resulting values back into an array.
But this still seems like a lot of work, and even worse, it doesn't scale when each sub-pipeline itself needs to spawn an array of sub-pipelines. It all becomes incomprehensible, and information that used to arrive easily into a completion block (because of Swift's scoping rules) no longer arrives into a subsequent step in the main pipeline (or arrives only with difficulty because I pass bigger and bigger tuples down the pipeline).
There must be some simple Combine pattern for doing this, but I'm completely missing it. Please tell me what it is.
With your latest edit and this comment below:
I literally am asking is there a Combine equivalent of "don't proceed to the next step until this step, involving multiple asynchronous steps, has finished"
I think this pattern can be achieved with a .flatMap to an array publisher (Publishers.Sequence), which emits the elements one by one and completes, followed by whatever per-element async processing is needed, and finalized with a .collect(), which waits for all elements to complete before proceeding.
So, in code, assuming we have these functions:
func getFoos() -> AnyPublisher<[Foo], Error>
func getPartials(for: Foo) -> AnyPublisher<[Partial], Error>
func getMoreInfo(for: Partial, of: Foo) -> AnyPublisher<MoreInfo, Error>
We can do the following:
getFoos()
    .flatMap { fooArr in
        fooArr.publisher.setFailureType(to: Error.self)
    }
    // per-foo element async processing
    .flatMap { foo in
        getPartials(for: foo)
            .flatMap { partialArr in
                partialArr.publisher.setFailureType(to: Error.self)
            }
            // per-partial of foo async processing
            .flatMap { partial in
                getMoreInfo(for: partial, of: foo)
                    // build completed partial with more info
                    .map { moreInfo in
                        var newPartial = partial
                        newPartial.moreInfo = moreInfo
                        return newPartial
                    }
            }
            .collect()
            // build completed foo with all partials
            .map { partialArr in
                var newFoo = foo
                newFoo.partials = partialArr
                return newFoo
            }
    }
    .collect()
(Deleted the old answer)
Using the accepted answer, I wound up with this structure:
head // [Entity]
    .flatMap { entities -> AnyPublisher<Entity, Error> in
        Publishers.Sequence(sequence: entities).eraseToAnyPublisher()
    }
    .flatMap { entity -> AnyPublisher<Entity, Error> in
        self.makeFuture(for: entity) // [Derivative]
            .flatMap { derivatives -> AnyPublisher<Derivative, Error> in
                Publishers.Sequence(sequence: derivatives).eraseToAnyPublisher()
            }
            .flatMap { derivative -> AnyPublisher<Derivative2, Error> in
                self.makeFuture(for: derivative).eraseToAnyPublisher() // Derivative2
            }
            .collect()
            .map { derivative2s -> Entity in
                self.configuredEntity(entity, from: derivative2s)
            }
            .eraseToAnyPublisher()
    }
    .collect()
That has exactly the elegant tightness I was looking for! So the idea is:
We receive an array of something, and we need to process each element asynchronously. The old way would have been a DispatchGroup and a for...in loop. The Combine equivalent is:
The equivalent of the for...in line is flatMap and Publishers.Sequence.
The equivalent of the DispatchGroup (dealing with asynchronicity) is a further flatMap (on the individual element) and some publisher. In my case I start with a Future based on the individual element we just received.
The equivalent of the right curly brace at the end is collect(), waiting for all elements to be processed and putting the array back together again.
So to sum up, the pattern is:
flatMap the array to a Sequence.
flatMap the individual element to a publisher that launches the asynchronous operation on that element.
Continue the chain from that publisher as needed.
collect back into an array.
By nesting that pattern, we can take advantage of Swift scoping rules to keep the thing we need to process in scope until we have acquired enough information to produce the processed object.
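To see the pattern in isolation, here's a minimal self-contained sketch (doubleAsync is a hypothetical stand-in for any per-element asynchronous call; note that with an unbounded flatMap the collected order may differ from the input order):

import Combine
import Foundation

// Hypothetical async operation: doubles a number after a short delay.
func doubleAsync(_ n: Int) -> AnyPublisher<Int, Never> {
    Just(n * 2)
        .delay(for: .milliseconds(50), scheduler: DispatchQueue.global())
        .eraseToAnyPublisher()
}

let cancellable = Just([1, 2, 3, 4])
    .flatMap { array in
        array.publisher              // the for...in: explode the array
    }
    .flatMap { element in
        doubleAsync(element)         // the DispatchGroup: async work per element
    }
    .collect()                       // the closing brace: reassemble the array
    .sink { print($0) }              // e.g. [2, 4, 6, 8]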

How to parallelize HTTP requests within an Apache Beam step?

I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple:
It reads individual JSON objects from Pub/Sub
Parses them
And sends them via HTTP to some API
This API requires me to send the items in batches of 75. So I built a DoFn that accumulates events in a list and publishes them via this API once I get 75. This turned out to be too slow, so I thought of executing those HTTP requests in different threads using a thread pool instead.
The implementation of what I have right now looks like this:
private class WriteFn : DoFn<TheEvent, Void>() {
    @Transient
    lateinit var api: TheApi
    @Transient
    lateinit var currentBatch: MutableList<TheEvent>
    @Transient
    lateinit var executor: ExecutorService

    @Setup
    fun setup() {
        api = buildApi()
        executor = Executors.newCachedThreadPool()
    }

    @StartBundle
    fun startBundle() {
        currentBatch = mutableListOf()
    }

    @ProcessElement
    fun processElement(processContext: ProcessContext) {
        val record = processContext.element()
        currentBatch.add(record)
        if (currentBatch.size >= 75) {
            flush()
        }
    }

    private fun flush() {
        val payloadTrack = currentBatch.toList()
        executor.submit {
            api.sendToApi(payloadTrack)
        }
        currentBatch.clear()
    }

    @FinishBundle
    fun finishBundle() {
        if (currentBatch.isNotEmpty()) {
            flush()
        }
    }

    @Teardown
    fun teardown() {
        executor.shutdown()
        executor.awaitTermination(30, TimeUnit.SECONDS)
    }
}
This seems to work "fine" in the sense that data is making it to the API. But I don't know if this is the right approach and I have the sense that this is very slow.
The reason I think it's slow is that when load testing (by sending a few million events to Pub/Sub), it takes the pipeline up to 8 times longer to forward those messages to the API (which has response times of under 8 ms) than it takes my laptop to feed them into Pub/Sub.
Is there any problem with my implementation? Is this the way I should be doing this?
Also... am I required to wait for all the requests to finish in my @FinishBundle method (i.e. by getting the futures returned by the executor and waiting on them)?
You have two interrelated questions here:
Are you doing this right / do you need to change anything?
Do you need to wait in @FinishBundle?
The second answer: yes. But actually you need to flush more thoroughly, as will become clear.
Once your @FinishBundle method succeeds, a Beam runner will assume the bundle has completed successfully. But your @FinishBundle only sends the requests - it does not ensure they have succeeded. So you could lose data that way if the requests subsequently fail. Your @FinishBundle method should actually block and wait for confirmation of success from TheApi. Incidentally, all of the above should be idempotent, since after finishing the bundle, an earthquake could strike and cause a retry ;-)
So to answer the first question: should you change anything? Just the above. The practice of batching requests this way can work as long as you are sure the results are committed before the bundle is committed.
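Concretely, a minimal sketch of that change to the question's WriteFn (pendingResults is a new list of java.util.concurrent.Future, created in @StartBundle alongside currentBatch):

private fun flush() {
    val payloadTrack = currentBatch.toList()
    // keep the Future instead of fire-and-forget
    pendingResults.add(executor.submit { api.sendToApi(payloadTrack) })
    currentBatch.clear()
}

@FinishBundle
fun finishBundle() {
    if (currentBatch.isNotEmpty()) {
        flush()
    }
    // Future.get() blocks and rethrows failures, so a failed request fails the
    // bundle and causes a retry instead of silently dropping data.
    pendingResults.forEach { it.get() }
    pendingResults.clear()
}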
You may find that doing so will cause your pipeline to slow down, because @FinishBundle happens more frequently than @Setup. To batch up requests across bundles you need to use the lower-level features of state and timers. I wrote up a contrived version of your use case at https://beam.apache.org/blog/2017/08/28/timely-processing.html. I would be quite interested in how this works for you.
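The shape of that state-and-timers approach, compressed into a sketch (illustrative names; it assumes the input has been keyed, e.g. as KV<String, TheEvent>, since state and timers are per-key):

class BatchAcrossBundlesFn : DoFn<KV<String, TheEvent>, Void>() {
    @field:StateId("batch")
    private val batchSpec: StateSpec<BagState<TheEvent>> = StateSpecs.bag()

    @field:TimerId("flush")
    private val flushSpec: TimerSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME)

    @ProcessElement
    fun process(
        ctx: ProcessContext,
        @StateId("batch") batch: BagState<TheEvent>,
        @TimerId("flush") flushTimer: Timer
    ) {
        batch.add(ctx.element().value)
        // ensure buffered elements are flushed even if fewer than 75 ever arrive
        flushTimer.offset(Duration.standardSeconds(10)).setRelative()
    }

    @OnTimer("flush")
    fun onFlush(@StateId("batch") batch: BagState<TheEvent>) {
        // send batch.read() to the API synchronously, then start over
        batch.clear()
    }
}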
It may simply be that the extremely low latency you are expecting, in the low millisecond range, is not available when there is a durable shuffle in your pipeline.

iOS RxSwift how to prevent sequence from being disposed on (throw error)?

I have a sequence made up of multiple operators. There are total of 7 places where errors can be generated during this sequence processing. I'm running into an issue where the sequence does not behave as I expected and I'm looking for an elegant solution around the problem:
let inputRelay = PublishRelay<Int>()
let outputRelay = PublishRelay<Result<Int>>()

inputRelay
    .map { /* may throw multiple errors */ }
    .flatMap { /* may throw error */ }
    .map {}
    .filter {}
    .map { _ -> Result<Int> in ... }
    .catchError {}
    .bind(to: outputRelay)
I thought that catchError would simply catch the error, allow me to convert it to failure result, but prevent the sequence from being deallocated. However, I see that the first time an error is caught, the entire sequence is deallocated and no more events go through.
Without this behavior, I'm left with a fugly Results<> all over the place, and have to branch my sequence multiple times to direct the Result.failure(Error) to the output. There are non-recoverable errors, so retry(n) is not an option:
let firstOp = inputRelay
    .map { /* may throw multiple errors */ }
    .share()

//--Handle first error results--
firstOp
    .filter { /* errorResults only */ }
    .bind(to: outputRelay)

let secondOp = firstOp
    .flatMap { /* may throw error */ }
    .share()

//--Handle second error results--
secondOp
    .filter { /* errorResults only */ }
    .bind(to: outputRelay)

secondOp
    .map {}
    .filter {}
    .map { _ -> Result<Int> in ... }
    .catchError {}
    .bind(to: outputRelay)
^ Which is very bad, because there are around 7 places where errors can be thrown and I cannot just keep branching the sequence each time.
How can RxSwift operators catch all errors and emit a failure result at the end, but NOT dispose the entire sequence on first error?
The first trick that comes to mind is using materialize. This converts every Observable<T> into an Observable<Event<T>>, so an Error becomes just a .next(.error(Error)) and won't terminate the sequence.
In this specific case, though, another trick is needed: putting your entire "trigger" chain within a flatMap as well, and materializing that specific piece. This is needed because a materialized sequence can still complete, which would terminate a regular chain, but does not terminate a flatMapped chain (as complete == successfully done, inside a flatMap).
inputRelay
    .flatMapLatest { val in
        return Observable.just(val)
            .map { value -> Int in
                if value == 1 { throw SomeError.randomError }
                return value + value
            }
            .flatMap { value in
                return Observable<String>.just("hey\(value)")
            }
            .materialize()
    }
    .debug("k")
    .subscribe()

inputRelay.accept(1)
inputRelay.accept(2)
inputRelay.accept(3)
inputRelay.accept(4)
This will output the following for k:
k -> subscribed
k -> Event next(error(randomError))
k -> Event next(next(hey4))
k -> Event next(completed)
k -> Event next(next(hey6))
k -> Event next(completed)
k -> Event next(next(hey8))
k -> Event next(completed)
Now all you have to do is filter just the "next" events from the materialized sequence.
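Without any extra dependency, that filtering can be hand-rolled with compactMap, since Event<T> exposes its wrapped value and error (a sketch; stream stands for the materialized chain above):

let nextEvents = stream.compactMap { $0.element }  // unwraps .next(.next(value))
let errorEvents = stream.compactMap { $0.error }   // unwraps .next(.error(error))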
If you have RxSwiftExt, you can simply use the errors() and elements() operators:
stream.elements()
    .debug("elements")
    .subscribe()

stream.errors()
    .debug("errors")
    .subscribe()
This will provide the following output:
errors -> Event next(randomError)
elements -> Event next(hey4)
elements -> Event next(hey6)
elements -> Event next(hey8)
When using this strategy, don't forget to add share() after your flatMap, so multiple subscriptions don't cause the processing to run more than once.
You can read more about why you should use share in this situation here: http://adamborek.com/how-to-handle-errors-in-rxswift/
Hope this helps!
Yes, it's a pain. I've thought about the idea of making a new library where the grammar doesn't require the stream to end on an error, but trying to reproduce the entire Rx ecosystem for it seems pointless.
There are reactive libraries that allow you to specify Never as the error type (meaning an error can't be emitted at all), and in RxCocoa you can use Driver (which can't error out), but you are still left with the whole Result dance. "Monads in my Monads!".
To deal with it properly, you need a set of Monad transformers. With these, you can do all the mapping/flatMapping you want and not worry about looking at the errors until the very end.
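As a taste of what such a transformer-style helper could look like in RxSwift (a hypothetical sketch, not an established API): an operator that maps over the success case of an Observable<Result<T, E>> without unwrapping it, leaving failures untouched.

import RxSwift

extension ObservableType {
    // Hypothetical helper: transform the success value; failures pass through as-is.
    func mapSuccess<T, U, E: Error>(_ transform: @escaping (T) -> U) -> Observable<Result<U, E>>
        where Element == Result<T, E> {
        map { $0.map(transform) }
    }
}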

Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?

Background
We have a pipeline that starts by receiving messages from PubSub, each with the name of a file. These files are exploded to line level, parsed to JSON object nodes and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to Table Rows and written to Big Query.
It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a backlog when many messages are sent at once. This means that lines associated with a PubSub message can take some time to arrive at the decoding service. As a result, PubSub received no acknowledgement and resent the messages. My first attempt to remedy this was adding an attribute to each PubSub message that is passed to the Reader using withIdAttribute(). However, on testing, this only prevented duplicates that arrived close together.
My second attempt was to add a fusion breaker (example) after the PubSub read. This simply performs a needless GroupByKey and then ungroups, the idea being that the GroupByKey forces Dataflow to acknowledge the PubSub message.
The Problem
The fusion breaker discussed above works in that it prevents PubSub from resending messages, but I am finding that this GroupByKey outputs more elements than it receives.
To try and diagnose this, I have removed parts of the pipeline to get a simple pipeline that still exhibits this behavior. The behavior remains even when:
PubSub is replaced by some dummy transforms that send out a fixed list of messages with a slight delay between each one.
The Writing transforms are removed.
All Side Inputs/Outputs are removed.
The behavior I have observed is:
Some number of the received messages pass straight through the GroupByKey.
After a certain point, messages are 'held' by the GroupByKey (presumably due to the backlog after the GroupByKey).
These messages eventually exit the GroupByKey (in groups of size one).
After a short delay (about 3 minutes), the same messages exit the GroupByKey again (still in groups of size one). This may happen several times (I suspect it is proportional to the time they spend waiting to enter the GroupByKey).
An example job id is 2017-10-11_03_50_42-6097948956276262224. I have not run the pipeline on any other runner.
The Fusion Breaker is below:
@Slf4j
public class FusionBreaker<T> extends PTransform<PCollection<T>, PCollection<T>> {

    @Override
    public PCollection<T> expand(PCollection<T> input) {
        return group(window(input.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break in")))))
                .apply("Getting iterables after breaking fusion", Values.create())
                .apply("Flattening iterables after breaking fusion", Flatten.iterables())
                .apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break out")));
    }

    private PCollection<T> window(PCollection<T> input) {
        return input.apply("Windowing before breaking fusion", Window.<T>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .discardingFiredPanes());
    }

    private PCollection<KV<Integer, Iterable<T>>> group(PCollection<T> input) {
        return input.apply("Keying with random number", ParDo.of(new RandomKeyFn<>()))
                .apply("Grouping by key to break fusion", GroupByKey.create());
    }

    private static class RandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
        private Random random;

        @Setup
        public void setup() {
            random = new Random();
        }

        @ProcessElement
        public void processElement(ProcessContext context) {
            context.output(KV.of(random.nextInt(), context.element()));
        }
    }
}
The PassthroughLoggers simply log the elements passing through (I use these to confirm that elements are indeed repeated, rather than there being an issue with the counts).
I suspect this is something to do with windows/triggers, but my understanding is that elements should never be repeated when .discardingFiredPanes() is used - regardless of the windowing setup. I have also tried FixedWindows with no success.
First, the Reshuffle transform is equivalent to your Fusion Breaker, but has some additional performance improvements that should make it preferable.
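For example, the whole FusionBreaker could be replaced with a single apply (a sketch; Reshuffle.viaRandomKey() lives in org.apache.beam.sdk.transforms in current SDKs):

// Drop-in replacement for the hand-rolled fusion breaker above.
PCollection<T> unfused = input.apply("Breaking fusion", Reshuffle.viaRandomKey());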
Second, both counters and logging may see an element multiple times if it is retried. As described in the Beam Execution Model, an element at a step may be retried if anything that is fused into it is retried.
Have you actually observed duplicates in what is written as the output of the pipeline?

Grails CSV Plugin - Concurrency

I am using the plugin: Grails CSV Plugin in my application with Grails 2.5.3.
I need to implement the concurrency functionality with for example: GPars, but I don't know how I can do it.
Now, the configuration is sequential processing. Example of my code fragment:
Thanks.
Implementing concurrency in this case may not give you much of a benefit. It really depends on where the bottleneck is. For example, if the bottleneck is in reading the CSV file, then there would be little advantage because the file can only be read in sequential order. With that out of the way, here's the simplest example I could come up with:
import groovyx.gpars.GParsPool

def tokens = csvFileLoad.inputStream.toCsvReader(['separatorChar': ';', 'charset': 'UTF-8', 'skipLines': 1]).readAll()
def failedSaves = GParsPool.withPool {
    tokens.parallel
        .map { it[0].trim() }
        .filter { !Department.findByName(it) }
        .map { new Department(name: it) }
        .map { customImportService.saveRecordCSVDepartment(it) }
        .map { it ? 0 : 1 }
        .sum()
}

if (failedSaves > 0) transactionStatus.setRollbackOnly()
if(failedSaves > 0) transactionStatus.setRollbackOnly()
As you can see, the entire file is read first; hence the main bottleneck. The majority of the processing is done concurrently with the map(), filter(), and sum() methods. At the very end, the transaction is rolled back if any of the Departments failed to save.
Note: I chose a map()-sum() pair instead of anyParallel() to avoid having to convert the parallel array produced by map() into a regular Groovy collection just to call anyParallel(), which would create another parallel array and then convert it back into a Groovy collection.
Improvements
As I already mentioned, in my example the CSV file is read completely before the concurrent execution begins. It also attempts to save all of the Department instances, even if one fails to save. You may want that (which is what you demonstrated) or not.
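If the up-front readAll() proves to be the bottleneck, one alternative is to hand each row to the pool as it is read. This is only a sketch: it assumes the plugin's eachLine iteration and GPars' callAsync enhancement on closures, and GORM calls from pooled threads may need their own session.

import groovyx.gpars.GParsPool

GParsPool.withPool {
    def futures = []
    csvFileLoad.inputStream
        .toCsvReader(['separatorChar': ';', 'charset': 'UTF-8', 'skipLines': 1])
        .eachLine { tokens ->
            def name = tokens[0].trim()
            // schedule the save as soon as the line is parsed
            futures << { ->
                Department.findByName(name) ? true : customImportService.saveRecordCSVDepartment(new Department(name: name))
            }.callAsync()
        }
    // wait for every scheduled save, then count the failures
    def failedSaves = futures.collect { it.get() }.count { !it }
    if (failedSaves > 0) transactionStatus.setRollbackOnly()
}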
