How to parallelize HTTP requests within an Apache Beam step?

I have an Apache Beam pipeline running on Google Dataflow whose job is rather simple:
It reads individual JSON objects from Pub/Sub
Parses them
And sends them via HTTP to some API
This API requires me to send the items in batches of 75. So I built a DoFn that accumulates events in a list and publishes them via this API once it has collected 75. This turned out to be too slow, so I thought of executing those HTTP requests on different threads using a thread pool.
The implementation of what I have right now looks like this:
private class WriteFn : DoFn<TheEvent, Void>() {
    @Transient lateinit var api: TheApi
    @Transient lateinit var currentBatch: MutableList<TheEvent>
    @Transient lateinit var executor: ExecutorService

    @Setup
    fun setup() {
        api = buildApi()
        executor = Executors.newCachedThreadPool()
    }

    @StartBundle
    fun startBundle() {
        currentBatch = mutableListOf()
    }

    @ProcessElement
    fun processElement(processContext: ProcessContext) {
        val record = processContext.element()
        currentBatch.add(record)
        if (currentBatch.size >= 75) {
            flush()
        }
    }

    private fun flush() {
        val payloadTrack = currentBatch.toList()
        executor.submit {
            api.sendToApi(payloadTrack)
        }
        currentBatch.clear()
    }

    @FinishBundle
    fun finishBundle() {
        if (currentBatch.isNotEmpty()) {
            flush()
        }
    }

    @Teardown
    fun teardown() {
        executor.shutdown()
        executor.awaitTermination(30, TimeUnit.SECONDS)
    }
}
This seems to work "fine" in the sense that data is making it to the API. But I don't know if this is the right approach and I have the sense that this is very slow.
The reason I think it's slow: when load testing (by sending a few million events to Pub/Sub), it takes the pipeline up to 8 times longer to forward those messages to the API (which has response times of under 8 ms) than it takes my laptop to feed them into Pub/Sub.
Is there any problem with my implementation? Is this the way I should be doing this?
Also... am I required to wait for all the requests to finish in my @FinishBundle method (i.e. by getting the futures returned by the executor and waiting on them)?

You have two interrelated questions here:
Are you doing this right / do you need to change anything?
Do you need to wait in @FinishBundle?
The second answer: yes. But actually you need to flush more thoroughly, as will become clear.
Once your @FinishBundle method succeeds, a Beam runner will assume the bundle has completed successfully. But your @FinishBundle only sends the requests - it does not ensure they have succeeded. So you could lose data that way if the requests subsequently fail. Your @FinishBundle method should actually be blocking and waiting for confirmation of success from TheApi. Incidentally, all of the above should be idempotent, since after finishing the bundle, an earthquake could strike and cause a retry ;-)
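A minimal sketch of that change, assuming the WriteFn from the question: flush() keeps the Futures returned by the executor, and @FinishBundle blocks on them before returning.
private val futures = mutableListOf<Future<*>>()

private fun flush() {
    val payloadTrack = currentBatch.toList()
    // Keep the Future so the bundle only commits after the request succeeds.
    futures.add(executor.submit { api.sendToApi(payloadTrack) })
    currentBatch.clear()
}

@FinishBundle
fun finishBundle() {
    if (currentBatch.isNotEmpty()) {
        flush()
    }
    // Future.get() rethrows any failure, failing the bundle so the runner
    // retries it instead of silently dropping data.
    futures.forEach { it.get() }
    futures.clear()
}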
So to answer the first question: should you change anything? Just the above. The practice of batching requests this way can work as long as you are sure the results are committed before the bundle is committed.
You may find that doing so will cause your pipeline to slow down, because @FinishBundle happens more frequently than @Setup. To batch up requests across bundles you need to use the lower-level features of state and timers. I wrote up a contrived version of your use case at https://beam.apache.org/blog/2017/08/28/timely-processing.html. I would be quite interested in how this works for you.
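For reference, here is a condensed, untested Java sketch of the state-and-timers pattern from that post. The input must be keyed (hence the KV), and TheEvent, TheApi, and sendToApi are the names from the question:
public class BatchToApiFn extends DoFn<KV<String, TheEvent>, Void> {
    private static final int BATCH_SIZE = 75;

    private transient TheApi api; // built in @Setup, as in the question

    @StateId("batch")
    private final StateSpec<BagState<TheEvent>> batchSpec = StateSpecs.bag();

    @StateId("count")
    private final StateSpec<ValueState<Integer>> countSpec = StateSpecs.value();

    // A processing-time timer ensures a partially filled batch still goes out.
    @TimerId("flush")
    private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

    @ProcessElement
    public void processElement(ProcessContext context,
                               @StateId("batch") BagState<TheEvent> batch,
                               @StateId("count") ValueState<Integer> count,
                               @TimerId("flush") Timer flush) {
        batch.add(context.element().getValue());
        Integer stored = count.read();
        int size = (stored == null ? 0 : stored) + 1;
        count.write(size);
        // Re-arm the timer so stragglers are flushed within 10 seconds.
        flush.offset(Duration.standardSeconds(10)).setRelative();
        if (size >= BATCH_SIZE) {
            sendBatch(batch, count);
        }
    }

    @OnTimer("flush")
    public void onFlush(@StateId("batch") BagState<TheEvent> batch,
                        @StateId("count") ValueState<Integer> count) {
        sendBatch(batch, count);
    }

    private void sendBatch(BagState<TheEvent> batch, ValueState<Integer> count) {
        List<TheEvent> payload = new ArrayList<>();
        batch.read().forEach(payload::add);
        if (!payload.isEmpty()) {
            api.sendToApi(payload); // blocking, per the discussion above
        }
        batch.clear();
        count.clear();
    }
}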
It may simply be that the extremely low latency you are expecting, in the low millisecond range, is not available when there is a durable shuffle in your pipeline.

Related

Using FinishBundle in Apache Beam Go SDK

I am attempting to use FinishBundle() to batch requests in beam on dataflow. These requests are fetching information and emitting it for further processing downstream in the pipeline, a la:
type BatchRpcFn struct {
    client        RpcClient
    bufferRequest *RpcRequest
}

func (f *BatchRpcFn) Setup(ctx context.Context) {
    // setup client
}

func (f *BatchRpcFn) ProcessElement(ctx context.Context, id string, emit func(string, bool)) error {
    f.bufferRequest.Ids = append(f.bufferRequest.Ids, id)
    if len(f.bufferRequest.Ids) > bufferLimit {
        return f.performRequestAndEmit(ctx, emit)
    }
    return nil
}

func (f *BatchRpcFn) FinishBundle(ctx context.Context, emit func(string, bool)) error {
    return f.performRequestAndEmit(ctx, emit)
}
In unit tests, this function works as expected, however when running on dataflow, I get this error:
panic: interface conversion: typex.Window is window.GlobalWindow, not window.IntervalWindow
//...
github.com/apache/beam/sdks/v2/go/pkg/beam/core/runtime/exec.(*intervalWindowEncoder).EncodeSingle()
The documentation on FinishBundle() is a little sparse, so it wasn't clear to me if this is possible. Most of the examples I see of using FinishBundle() are flushing data to some sink instead of adding to the resultant PCollection.
Is this a bug, or am I using FinishBundle incorrectly here?
I think that the processing should be done in ProcessElement() itself, which produces the resultant PCollection. StartBundle() and FinishBundle() are called once per bundle, and their common use case is connecting to and disconnecting from an external service or database.
I guess that having a stateful DoFn to batch the requests may be a good way to do this: for example, do the processing after five elements have been observed, with an onTimer() callback to process the remaining elements at the end of the window.
However, only state support has been added to the Go SDK as of the 2.42.0 release; timers are yet to be implemented.

Can Python gRPC do computation when sending messages out?

Suppose I need to send a large amount of data from the client to the server using Python gRPC, and I want to continue with the rest of the computation while the message is being sent, instead of blocking on it. Is there any way to implement this?
I will illustrate the question with an example using modified code from greeter_client.py:
for i in range(5):
    res = computation()
    response = stub.SayHello(helloworld_pb2.HelloRequest(data=res))
I want the computation of the next iteration to continue while the "res" of the last iteration is being sent. To this end, I have tried async/await, which looks like this:
async with aio.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)
    for j in range(5):
        res = computation()
        response = await stub.SayHello(helloworld_pb2.HelloRequest(data=res))
But the running time is actually the same as for the version without async/await; the async/await does not help. I am wondering whether there is anything wrong with my code, or whether there are other ways?
Concurrency is different from parallelism. AsyncIO allows multiple coroutines to run on the same thread, but they are not actually computed at the same time. If the thread is given CPU-heavy work like computation() in your snippet, it doesn't yield control back to the event loop, so there won't be any progress on other coroutines.
Besides, in the snippet, the RPC depends on the result of computation(), which means the work will be serialized for each RPC. But we can still gain some concurrency from AsyncIO by handing the RPCs over to the event loop with asyncio.gather():
async with aio.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)

    async def one_hello():
        res = computation()
        response = await stub.SayHello(helloworld_pb2.HelloRequest(data=res))

    await asyncio.gather(*(one_hello() for _ in range(5)))
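If computation() itself is the bottleneck, concurrency alone won't help. One option, an assumption rather than anything from the original snippet, is to push computation() onto a worker pool with run_in_executor so the event loop stays free to drive the RPCs (a process pool sidesteps the GIL, but computation and its result must be picklable):
import asyncio
from concurrent.futures import ProcessPoolExecutor

pool = ProcessPoolExecutor()

async with aio.insecure_channel('localhost:50051') as channel:
    stub = helloworld_pb2_grpc.GreeterStub(channel)

    async def one_hello():
        loop = asyncio.get_running_loop()
        # computation() runs in a worker process while the event loop
        # keeps other coroutines (and their RPCs) moving.
        res = await loop.run_in_executor(pool, computation)
        response = await stub.SayHello(helloworld_pb2.HelloRequest(data=res))

    await asyncio.gather(*(one_hello() for _ in range(5)))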

Why is GroupByKey in beam pipeline duplicating elements (when run on Google Dataflow)?

Background
We have a pipeline that starts by receiving messages from PubSub, each with the name of a file. These files are exploded to line level, parsed to JSON object nodes and then sent to an external decoding service (which decodes some encoded data). Object nodes are eventually converted to Table Rows and written to Big Query.
It appeared that Dataflow was not acknowledging the PubSub messages until they arrived at the decoding service. The decoding service is slow, resulting in a backlog when many messages are sent at once. This means that lines associated with a PubSub message can take some time to arrive at the decoding service. As a result, PubSub was receiving no acknowledgement and resending the message. My first attempt to remedy this was adding an attribute to each PubSub message that is passed to the Reader using withIdAttribute(). However, on testing, this only prevented duplicates that arrived close together.
My second attempt was to add a fusion breaker (example) after the PubSub read. This simply performs a needless GroupByKey and then ungroups, the idea being that the GroupByKey forces Dataflow to acknowledge the PubSub message.
The Problem
The fusion breaker discussed above works in that it prevents PubSub from resending messages, but I am finding that this GroupByKey outputs more elements than it receives.
To try and diagnose this I have removed parts of the pipeline to get a simple pipeline that still exhibits this behavior. The behavior remains even when
PubSub is replaced by some dummy transforms that send out a fixed list of messages with a slight delay between each one.
The Writing transforms are removed.
All Side Inputs/Outputs are removed.
The behavior I have observed is:
Some number of the received messages pass straight through the GroupByKey.
After a certain point, messages are 'held' by the GroupByKey (presumably due to the backlog after the GroupByKey).
These messages eventually exit the GroupByKey (in groups of size one).
After a short delay (about 3 minutes), the same messages exit the GroupByKey again (still in groups of size one). This may happen several times (I suspect it is proportional to the time they spend waiting to enter the GroupByKey).
Example job id is 2017-10-11_03_50_42-6097948956276262224. I have not run the pipeline on any other runner.
The Fusion Breaker is below:
@Slf4j
public class FusionBreaker<T> extends PTransform<PCollection<T>, PCollection<T>> {

    @Override
    public PCollection<T> expand(PCollection<T> input) {
        return group(window(input.apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break in")))))
                .apply("Getting iterables after breaking fusion", Values.create())
                .apply("Flattening iterables after breaking fusion", Flatten.iterables())
                .apply(ParDo.of(new PassthroughLogger<>(PassthroughLogger.Level.Info, "Fusion break out")));
    }

    private PCollection<T> window(PCollection<T> input) {
        return input.apply("Windowing before breaking fusion", Window.<T>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .discardingFiredPanes());
    }

    private PCollection<KV<Integer, Iterable<T>>> group(PCollection<T> input) {
        return input.apply("Keying with random number", ParDo.of(new RandomKeyFn<>()))
                .apply("Grouping by key to break fusion", GroupByKey.create());
    }

    private static class RandomKeyFn<T> extends DoFn<T, KV<Integer, T>> {
        private Random random;

        @Setup
        public void setup() {
            random = new Random();
        }

        @ProcessElement
        public void processElement(ProcessContext context) {
            context.output(KV.of(random.nextInt(), context.element()));
        }
    }
}
The PassthroughLoggers simply log the elements passing through (I use these to confirm that elements are indeed repeated, rather than there being an issue with the counts).
I suspect this is something to do with windows/triggers, but my understanding is that elements should never be repeated when .discardingFiredPanes() is used - regardless of the windowing setup. I have also tried FixedWindows with no success.
First, the Reshuffle transform is equivalent to your Fusion Breaker, but has some additional performance improvements that should make it preferable.
Second, both counters and logging may see an element multiple times if it is retried. As described in the Beam Execution Model, an element at a step may be retried if anything that is fused into it is retried.
Have you actually observed duplicates in what is written as the output of the pipeline?
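For reference, swapping the custom transform for Reshuffle would look roughly like this (Reshuffle.viaRandomKey() performs the random-key, group, and ungroup steps internally; the logging would need to stay as separate ParDos):
PCollection<T> unfused = input.apply("Breaking fusion", Reshuffle.viaRandomKey());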

Query timeout in Neo4j 3.0.6

It looks like the previously working approach is now deprecated:
unsupported.dbms.executiontime_limit.enabled=true
unsupported.dbms.executiontime_limit.time=1s
According to the documentation, new variables are responsible for timeout handling:
dbms.transaction.timeout
dbms.transaction_timeout
At the same time, the new variables appear to be related to transactions.
The new timeout variables do not appear to work. They were set in neo4j.conf as follows:
dbms.transaction_timeout=5s
dbms.transaction.timeout=5s
A slow Cypher query isn't terminated.
Then a Neo4j plugin was added to model a slow query with a transaction:
@Procedure("test.slowQuery")
public Stream<Res> slowQuery(@Name("delay") Number delay) {
    ArrayList<Res> res = new ArrayList<>();
    try (Transaction tx = db.beginTx()) {
        Thread.sleep(delay.intValue(), 0);
        tx.success();
    } catch (Exception e) {
        System.out.println(e);
    }
    return res.stream();
}
The function served by the plugin is executed with the neoism Golang package, and the timeout isn't triggered either.
The timeout is only honored if your procedure code either invokes operations on the graph, like reading nodes and rels, or explicitly checks whether the current transaction is marked as terminated.
For the latter, see https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/master/src/main/java/apoc/util/Utils.java#L41-L51 as an example.
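For illustration, here is a sketch of that idea applied to the test procedure above: sleep in short slices and check for termination on each pass. Note that the TerminationGuard injection is an assumption on my part; it is only available in later 3.x releases:
@Context
public TerminationGuard guard;

@Procedure("test.slowQuery")
public Stream<Res> slowQuery(@Name("delay") Number delay) {
    long remaining = delay.longValue();
    while (remaining > 0) {
        guard.check(); // throws if the transaction has been terminated
        long slice = Math.min(remaining, 100);
        try {
            Thread.sleep(slice);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
        remaining -= slice;
    }
    return Stream.empty();
}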
According to the documentation, the transaction guard is interested in orphaned transactions only.
The server guards against orphaned transactions by using a timeout. If there are no requests for a given transaction within the timeout period, the server will roll it back. You can configure the timeout in the server configuration, by setting dbms.transaction_timeout to the number of seconds before timeout. The default timeout is 60 seconds.
I haven't found a way to trigger the timeout for a query that isn't orphaned using native functionality.
@StefanArmbruster pointed in a good direction. The timeout-triggering functionality can be obtained by creating a wrapper function in a Neo4j plugin, as is done in apoc.

Groovy PromiseMap - Can I limit the asynchronous thread pool?

I am making a (quick and dirty) Batching API that allows the UI to send a selection of REST API calls and get results for all of them at once.
I am using PromiseMap to make some asynchronous REST calls to the relevant services, which get collected afterward.
There could be a large number of threads that need to run, and I would like to throttle the number of threads that run at the same time, similar to Executor's thread pool.
Is this possible without physically separating the threads into multiple PromiseMaps and chaining them? I haven't found anything online describing limiting the thread pool.
//get requested calls
JSONArray callsToMake = request.JSON as JSONArray
//registers calls in promise map
def promiseMap = new PromiseMap()
//Can I limit this Map as a thread pool to, say, run 10 at a time until finished
data.each {
    def tempVar = it
    promiseMap[tempVar.id] = { makeCall(tempVar.method, "${basePath}${tempVar.to}" as String, tempVar.body) }
}
def result = promiseMap.get()
def resultList = parseResults(result)
response.status = HttpStatusCodes.ACCEPTED
render resultList as JSON
I'm hoping there's a fairly straight-forward setting that I may be ignorant of.
Thank you.
The default Async implementation in Grails is GPars. To configure the number of threads you need to use a GParsPool. See:
http://gpars.org/guide/guide/dataParallelism.html#dataParallelism_parallelCollections_GParsPool
Example:
withPool(10) {...}
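Applied to the question's code, that might look something like the following sketch (makeCall and the request fields are the ones from the question; collectParallel is GPars' pooled variant of collect):
import static groovyx.gpars.GParsPool.withPool

def results = withPool(10) {
    // The closures are scheduled on a fixed pool of 10 threads;
    // results come back in the original order.
    data.collectParallel { req ->
        [id: req.id, result: makeCall(req.method, "${basePath}${req.to}" as String, req.body)]
    }
}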
withPool didn't seem to work for me. Just in case anyone is looking to limit threads, here is what I did: create a custom group with a custom thread pool and specify the number of threads.
def customGroup = new DefaultPGroup(new DefaultPool(true, 5))
try {
    Dataflow.usingGroup(customGroup, {
        def promises = new PromiseList()
        (1..100).each { number ->
            promises << {
                log.info "Performing Task ${number}"
                Thread.sleep(200)
                number++
            }
        }
        def result = promises.get()
    })
}
finally {
    customGroup.shutdown()
}
Use
runtime 'org.grails:grails-async-gpars'
in build.gradle, and then something like the following in your service:
GParsExecutorsPool.withPool(10) { service ->
    Shop.list().each { shop ->
        Item.list().each { item ->
            service.submit({ createOrder(shop, item) } as Runnable)
        }
    }
}
