SCDF processor reads one message and outputs an array of objects, but the sink can only handle a single item - spring-cloud-dataflow

My Processor processes one payload and produces a List:
@StreamListener(Processor.INPUT)
@SendTo(Processor.OUTPUT)
public List<XYZObject> getAll(XYZInput inp) {
    List<XYZObject> xyzs = dbService.findAllByDataType(inp.getDataType());
    return xyzs;
}
The stream uses RabbitMQ as middleware, and my sink looks like this:
@StreamListener(Sink.INPUT)
public void writeToX(XYZInput input) {
    ....
}
I took a look at a similar discussion (Similar Problem with Kafka Binder). How can I achieve this with the Rabbit binder?
Is it achievable with RabbitMQ as the binder?

This is a Spring Cloud Stream concern; it is controlled by the spring.cloud.stream.bindings.<binding-name>.consumer.batch-mode property.
Please see the Batch consumers/producers section of the reference guide to learn more.
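For illustration only, here is a minimal sketch of what the consumer side can look like with batch mode (assumptions: Spring Cloud Stream 3.x, the functional programming model, and a binding named writeToX-in-0 derived from the function name; adjust the binding name to your configuration):
// imports: java.util.List, java.util.function.Consumer, org.springframework.context.annotation.Bean
// application.properties (assumed binding name):
// spring.cloud.stream.bindings.writeToX-in-0.consumer.batch-mode=true
@Bean
public Consumer<List<XYZObject>> writeToX() {
    return batch -> batch.forEach(item -> {
        // ... handle each XYZObject individually ...
    });
}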

Related

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation, the recommended way to write simple sources is by using Read transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo, but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {
    @Override
    public PCollection<Event> expand(PBegin input) {
        return input.apply(ParDo.of(new ReadFn()));
    }

    private static class ReadFn extends DoFn<PBegin, Event> {
        @ProcessElement
        public void process(@TimerId("poll") Timer pollTimer) {
            Event testEvent = new Event(...);
            // custom logic, this can happen infinitely
            for (...) {
                context.output(testEvent);
            }
        }
    }
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety of these provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source, you should use the Read transform. Since you're using timers, you likely want to create an unbounded source (a stream of elements).
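For example, if the elements are already available in memory, a minimal sketch of the Create approach (assuming a List<Event> named events, which is not part of the original post) would be:
// import org.apache.beam.sdk.transforms.Create;
// A coder may need to be set explicitly if Event has no default coder.
PCollection<Event> eventColl = pipeline.apply(Create.of(events));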

JSR 352: How to collect data from the Writer of each Partition of a Partitioned Step?

So, I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and using the StepContext/JobContext to get it in afterStep() of the StepListener. However, when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use a PartitionCollector/Reducer/Analyzer?
I am using Java Batch on WebSphere Liberty, and I am developing in Eclipse.
I was thinking of using a static variable in the Writer and using the StepContext/JobContext to get it in afterStep() of the StepListener. However, when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared for all partitions because the batch runtime maintains a StepContext clone per partition. You should rather use JobContext.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a collectPartitionData() method to collect data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer to analyze it. Note that there are:
N PartitionCollector per step (1 per partition)
N StepContext per step (1 per partition)
1 PartitionAnalyzer per step
The records written can be passed via the StepContext's transientUserData. Since each StepContext is reserved for its own step partition, the transient user data won't be overwritten by another partition.
Here's the implementation:
MyItemWriter:
@Inject
private StepContext stepContext;

@Override
public void writeItems(List<Object> items) throws Exception {
    // ... write the items ...
    Object userData = stepContext.getTransientUserData();
    int partRowCount = (userData != null ? (int) userData : 0) + items.size();
    stepContext.setTransientUserData(partRowCount);
}
MyPartitionCollector:
@Inject
private StepContext stepContext;

@Override
public Serializable collectPartitionData() throws Exception {
    // get transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    return partRowCount;
}
MyPartitionAnalyzer:
private int rowCount = 0;

@Override
public void analyzeCollectorData(Serializable fromCollector) throws Exception {
    rowCount += (int) fromCollector;
    System.out.printf("%d rows processed (all partitions).%n", rowCount);
}
Reference: JSR352 v1.0 Final Release.pdf
Let me offer a bit of an alternative to the accepted answer and add some comments.
PartitionAnalyzer variant - Use analyzeStatus() method
Another technique would be to use analyzeStatus(), which only gets called once at the end of each partition and is passed the partition-level exit status.
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, the analyzeCollectorData approach in the answer above gets called at the end of each chunk of each partition.
E.g.
public class MyItemWriteListener extends AbstractItemWriteListener {

    @Inject
    StepContext stepCtx;

    private int newCount = 0;

    @Override
    public void afterWrite(List<Object> items) throws Exception {
        newCount += items.size();   // update 'newCount' based on items.size()
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}
Obviously this only works if you weren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to have to keep track of).
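To complete the picture, here is a minimal sketch of a matching analyzer that totals the per-partition exit statuses in analyzeStatus() (my own illustration, not part of the original answer; the class name MyStatusAnalyzer is made up):
// imports: java.io.Serializable, javax.batch.api.partition.PartitionAnalyzer, javax.batch.runtime.BatchStatus
public class MyStatusAnalyzer implements PartitionAnalyzer {

    private int totalRows = 0;

    @Override
    public void analyzeCollectorData(Serializable data) throws Exception {
        // not used in this variant
    }

    @Override
    public void analyzeStatus(BatchStatus batchStatus, String exitStatus) throws Exception {
        if (exitStatus != null) {
            totalRows += Integer.parseInt(exitStatus);  // each partition's exit status holds its row count
        }
    }
}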
Comments
The API is designed to facilitate an implementation dispatching individual partitions across JVMs (Liberty, for example, supports this). But using a static variable ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see in batch.

How to send messages from Google Dataflow (Apache Beam) on the Flink runner to Kafka

I’m trying to write a proof-of-concept which takes messages from Kafka, transforms them using Beam on Flink, then pushes the results onto a different Kafka topic.
I’ve used the KafkaWindowedWordCountExample as a starting point, and that’s doing the first part of what I want to do, but it outputs to text files as opposed to Kafka. FlinkKafkaProducer08 looks promising, but I can’t figure out how to plug it into the pipeline. I was thinking that it would be wrapped with an UnboundedFlinkSink, or some such, but that doesn’t seem to exist.
Any advice or thoughts on what I’m trying to do?
I’m running the latest incubator-beam (as of last night from Github), Flink 1.0.0 in cluster mode and Kafka 0.9.0.1, all on Google Compute Engine (Debian Jessie).
There is currently no UnboundedSink class in Beam. Most unbounded sinks are implemented using a ParDo.
You may wish to check out the KafkaIO connector. This is a Kafka reader that works in all Beam runners, and implements the parallel reading, checkpointing, and other UnboundedSource APIs. The pull request that added it also includes a crude sink, used in the TopHashtags example pipeline, which writes to Kafka in a ParDo:
class KafkaWriter extends DoFn<String, Void> {

    private final String topic;
    private final Map<String, Object> config;
    private transient KafkaProducer<String, String> producer = null;

    public KafkaWriter(Options options) {
        this.topic = options.getOutputTopic();
        this.config = ImmutableMap.<String, Object>of(
            "bootstrap.servers", options.getBootstrapServers(),
            "key.serializer", StringSerializer.class.getName(),
            "value.serializer", StringSerializer.class.getName());
    }

    @Override
    public void startBundle(Context c) throws Exception {
        if (producer == null) { // in Beam, startBundle might be called multiple times.
            producer = new KafkaProducer<String, String>(config);
        }
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        producer.close();
    }

    @Override
    public void processElement(ProcessContext ctx) throws Exception {
        producer.send(new ProducerRecord<String, String>(topic, ctx.element()));
    }
}
Of course, we would like to add sink support in KafkaIO as well. It would effectively be the same as the KafkaWriter above, but much simpler to use.
A sink transform for writing to Kafka was added to Apache Beam / Dataflow in 2016. See the JavaDoc for KafkaIO in Apache Beam for a usage example.
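For reference, a minimal sketch of that sink usage (assuming a PCollection<KV<String, String>> named messages and a broker at localhost:9092; both are placeholders, adjust to your setup):
// imports: org.apache.beam.sdk.io.kafka.KafkaIO, org.apache.kafka.common.serialization.StringSerializer
messages.apply(KafkaIO.<String, String>write()
    .withBootstrapServers("localhost:9092")
    .withTopic("output-topic")
    .withKeySerializer(StringSerializer.class)
    .withValueSerializer(StringSerializer.class));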

Dynamic table name when writing to BQ from dataflow pipelines

As a followup question to the following question and answer:
https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the 3rd option proposed by Eugene is at all possible with Google Dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that a ParDo/DoFn processes each element; how could we specify a table name (a function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Thanks.
Updated with a DoFn, which obviously does not work since c.element().getValue() is not a PCollection.
PCollection<KV<String, Iterable<String>>> output = ...;

public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {

    private final PCollectionView<List<String>> keysAsSideinputs;

    public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
        this.keysAsSideinputs = keysAsSideinputs;
    }

    @Override
    public void processElement(ProcessContext c) {
        List<String> keys = c.sideInput(keysAsSideinputs);
        String key = c.element().getKey();

        // The below is not working!!! How could we write the value out to a sink,
        // be it a GCS file or a BQ table???
        c.element().getValue().apply(ParDo.of(new FormatLineFn()))
            .apply(TextIO.Write.to(key));

        c.output(1);
    }
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. This could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can batch up the requests using finishBundle().
You can see how the Dataflow runner does the streaming import here:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/BigQueryTableInserter.java
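As an illustration of the batching idea, here is a rough sketch only: it follows the older Dataflow SDK DoFn style used in the KafkaWriter example above, and insertRows(...) is a hypothetical placeholder for whatever BigQuery client call you make (it is not a library method).
// imports: java.util.*, com.google.cloud.dataflow.sdk.transforms.DoFn, com.google.cloud.dataflow.sdk.values.KV
class DynamicTableWriterFn extends DoFn<KV<String, String>, Void> {

    // Buffer of rows keyed by computed table name, flushed once per bundle.
    private transient Map<String, List<String>> buffer;

    @Override
    public void startBundle(Context c) throws Exception {
        buffer = new HashMap<>();
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        String tableName = c.element().getKey();   // table name computed per element
        buffer.computeIfAbsent(tableName, t -> new ArrayList<>()).add(c.element().getValue());
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        for (Map.Entry<String, List<String>> entry : buffer.entrySet()) {
            insertRows(entry.getKey(), entry.getValue());   // hypothetical streaming-insert helper
        }
        buffer.clear();
    }

    private void insertRows(String table, List<String> rows) {
        // Placeholder: issue a BigQuery tabledata.insertAll request against "table" here.
    }
}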

Using Guice's @SessionScoped with Netty

How do I implement @SessionScoped in a Netty-based TCP server? Creating custom scopes is documented in the Guice manual, but it seems that the solution only works for thread-based servers and not for asynchronous IO servers.
Is it enough to create the Channel Pipeline between scope.enter() and scope.exit()?
Disclaimer: this answer is for Netty 3. I've not had the opportunity to try Netty 4 yet, so I don't know whether what follows can be applied to the newer version.
Netty is asynchronous on the network side, but unless you explicitly submit tasks to Executors, or change threads by some other means, the handling of ChannelEvents by the ChannelHandlers on a pipeline is synchronous and sequential. For instance, if you use Netty 3 and have an ExecutionHandler on the pipeline, the scope handler should be upstream of the ExecutionHandler; for Netty 4, see Trustin Lee's comment.
Thus, you can put a handler near the beginning of your pipeline that manages the session scope, for example:
public class ScopeHandler implements ChannelUpstreamHandler {

    @Override
    public void handleUpstream(ChannelHandlerContext ctx, ChannelEvent e) {
        if (e instanceof WriteCompletionEvent || e instanceof ExceptionEvent) {
            ctx.sendUpstream(e);
            return; // pass these events along without entering the scope
        }
        Session session = ...; // get session, presumably using e.getChannel()
        scope.enter();
        try {
            scope.seed(Key.get(Session.class), session);
            ctx.sendUpstream(e);
        } finally {
            scope.exit();
        }
    }

    private SessionScope scope;
}
A couple of quick remarks:
You will want to filter some event types out, especially WriteCompletionEvent and ExceptionEvent, which the framework will put at the downstream end of the pipeline during event processing and will cause reentrancy issues if not excluded. In our application, we use this kind of handler but actually only consider UpstreamMessageEvents.
The try/finally construct is not actually necessary as Netty will catch any Throwables and fire an ExceptionEvent, but it feels more idiomatic this way.
HTH
