Any way to use Pipeline and DataflowPipelineOptions instances inside ParDo? - google-cloud-dataflow

It appears that Pipeline and DataflowPipelineOptions instances can only be used inside a ParDo when they are referenced from a static context. Since processElement() is not static, a ParDo cannot accommodate any logic inside this method that deals with Pipeline or DataflowPipelineOptions objects.
Sample:
public void processElement(ProcessContext c)
{
    PCollection<String> dummy = p.apply("dummy"); // It will fail with a SerializableException (what I want to do)
}
So I just wanted to check whether the same can be made possible by calling a static function from the ParDo itself.
public void processElement(ProcessContext c)
{
    logicCaller(); // It's a static function
}
The logicCaller() static method can contain anything, from the creation of a simple PCollection to the execution of a ParDo block with some logical steps inside.
I tried this approach and the function was indeed called successfully. The odd behaviour was that everything inside that static function executed, except that the ParDo part was skipped without any error or exception.
Any insights?

Related

Side inputs vs normal constructor parameters in Apache Beam

I have a general question on side inputs and broadcasting in the context of Apache Beam. Do any additional variables, lists, or maps that are needed for computation during processElement have to be passed as side inputs? Is it OK if they are passed as normal constructor arguments to the DoFn? For example, suppose I have some fixed (not computed) variables (constants, like a start date and an end date) that I want to use during the per-element computation in processElement. I could make singleton PCollectionViews out of each of those variables separately and pass them to the DoFn constructor as side inputs. However, instead of doing that, can I not just pass each of those constants as normal constructor arguments to the DoFn? Am I missing anything subtle here?
In terms of code, when should I do:
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {

    // these are singleton views
    private final PCollectionView<LocalDateTime> dateStartView;
    private final PCollectionView<LocalDateTime> dateEndView;

    public MyFilter(PCollectionView<LocalDateTime> dateStartView,
                    PCollectionView<LocalDateTime> dateEndView) {
        this.dateStartView = dateStartView;
        this.dateEndView = dateEndView;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // extract date values from the singleton views here and use them
As opposed to:
public static class MyFilter extends DoFn<KV<String, Iterable<MyData>>, KV<String, Iterable<MyData>>> {

    private final LocalDateTime dateStart;
    private final LocalDateTime dateEnd;

    public MyFilter(LocalDateTime dateStart,
                    LocalDateTime dateEnd) {
        this.dateStart = dateStart;
        this.dateEnd = dateEnd;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // use the passed-in date values directly here
Notice that in these examples, dateStart and dateEnd are fixed values and not the dynamic results of any previous computation in the pipeline.
When you call something like pipeline.apply(ParDo.of(new MyFilter(...))), the DoFn gets instantiated in the main program that you use to launch the pipeline. It then gets serialized and passed to the runner for execution. The runner then decides where to execute it, e.g. on a fleet of 100 VMs, each of which receives its own copy of the code and the serialized data. If the member variables are serializable and you don't mutate them at execution time, this is fine: the DoFn will be deserialized on each node with all of its fields populated and will be executed as expected. However, you don't control the number of instances or, to some extent, their lifecycle, so mutate them at your own risk.
The benefit of PCollections and side inputs is that you are not limited to static values, but for a couple of simple immutable values, plain constructor arguments are fine.
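To make the two wirings concrete, here is a minimal sketch of both approaches; the transform names, the upstream PCollection (input), and the java.time dates are illustrative assumptions, not taken from the question:
// Option 1: fixed, immutable, serializable values passed as plain constructor arguments
LocalDateTime dateStart = LocalDateTime.of(2020, 1, 1, 0, 0);
LocalDateTime dateEnd = LocalDateTime.of(2020, 12, 31, 0, 0);
input.apply("Filter by date", ParDo.of(new MyFilter(dateStart, dateEnd)));

// Option 2: the same values wrapped in singleton views and declared as side inputs
// (an explicit coder for LocalDateTime may be needed)
PCollectionView<LocalDateTime> dateStartView =
        pipeline.apply("Start date", Create.of(dateStart)).apply(View.<LocalDateTime>asSingleton());
PCollectionView<LocalDateTime> dateEndView =
        pipeline.apply("End date", Create.of(dateEnd)).apply(View.<LocalDateTime>asSingleton());
input.apply("Filter by date",
        ParDo.of(new MyFilter(dateStartView, dateEndView))
             .withSideInputs(dateStartView, dateEndView));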

Apply not applicable with ParDo and DoFn using Apache Beam

I am implementing a Pub/Sub to BigQuery pipeline. It looks similar to How to create read transform using ParDo and DoFn in Apache Beam, but here I already have a PCollection created.
I am following what is described in the Apache Beam documentation to implement a ParDo operation to prepare a table row using the following pipeline:
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage message = c.element();
        // Retrieve data from message
        String rawData = message.getData();
        Instant timestamp = new Instant(new Date());
        // Prepare TableRow
        TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp);
        c.output(row);
    }
}
// Read input from Pub/Sub
pipeline.apply("Read from Pub/Sub", PubsubIO.readMessagesWithAttributes().fromTopic(topicPath))
    .apply("Prepare raw data for insertion", ParDo.of(new convertToTableRowFn()))
    .apply("Insert in Big Query", BigQueryIO.writeTableRows().to(BQTable));
I found the DoFn function in a gist.
I keep getting the following error:
The method apply(String, PTransform<? super PCollection<PubsubMessage>,OutputT>) in the type PCollection<PubsubMessage> is not applicable for the arguments (String, ParDo.SingleOutput<PubsubMessage,TableRow>)
I always understood that a ParDo/DoFn operation is an element-wise PTransform; am I wrong? I never got this type of error in Python, so I'm a bit confused about why this is happening.
You're right, ParDos are element-wise transforms and your approach looks correct.
What you're seeing is a compilation error. Something like this happens when the argument type of the apply() method that the Java compiler infers doesn't match the type of the actual argument, e.g. convertToTableRowFn.
From the error you're seeing, it looks like Java infers that the second parameter of apply() is of type PTransform<? super PCollection<PubsubMessage>,OutputT>, while you're passing a subclass of ParDo.SingleOutput<PubsubMessage,TableRow> instead (your convertToTableRowFn). Looking at the definition of SingleOutput, your convertToTableRowFn is basically a PTransform<PCollection<? extends PubsubMessage>, PCollection<TableRow>>, and Java fails to use it in apply(), where it expects PTransform<? super PCollection<PubsubMessage>,OutputT>.
What looks suspicious is that Java didn't infer OutputT as PCollection<TableRow>. One reason it would fail to do so is if you have other errors. Are you sure you don't have other errors as well?
For example, looking at convertToTableRowFn, you're calling message.getData(), which doesn't exist when I try it, and compilation fails there. In my case I need to do something like this instead: rawData = new String(message.getPayload(), Charset.defaultCharset()). Also, .to(BQTable) expects a string (e.g. a string representing the BQ table name) as an argument, and you're passing some unknown symbol BQTable (maybe it exists somewhere in your program, though, in which case this is not a problem for you).
After I fix these two errors, your code compiles for me; apply() is fully inferred and the types are compatible.
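For reference, a minimal sketch of the DoFn and pipeline with those two fixes applied; the concrete table name string is an assumption standing in for BQTable:
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage message = c.element();
        // getPayload() returns the message body as bytes; decode it into a String
        // (requires java.nio.charset.Charset)
        String rawData = new String(message.getPayload(), Charset.defaultCharset());
        Instant timestamp = new Instant(new Date());
        TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp);
        c.output(row);
    }
}

pipeline.apply("Read from Pub/Sub", PubsubIO.readMessagesWithAttributes().fromTopic(topicPath))
    .apply("Prepare raw data for insertion", ParDo.of(new convertToTableRowFn()))
    .apply("Insert in Big Query", BigQueryIO.writeTableRows().to("my-project:my_dataset.my_table"));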

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation, the recommended way to write simple sources is by using Read transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {
    @Override
    public PCollection<Event> expand(PBegin input) {
        return input.apply(ParDo.of(new ReadFn()));
    }

    private static class ReadFn extends DoFn<PBegin, Event> {
        @ProcessElement
        public void process(@TimerId("poll") Timer pollTimer) {
            Event testEvent = new Event(...);
            // custom logic, this can happen infinitely
            for (...) {
                context.output(testEvent);
            }
        }
    }
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source you should use the Read transform. Since you're using timers, you likely want to create an Unbounded Source (a stream of elements).
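For instance, a minimal sketch of the Create approach for a fixed in-memory collection; the Event constructor arguments are elided as in the question, and the assumption that Event implements Serializable (for SerializableCoder) is illustrative:
@Override
public PCollection<Event> expand(PBegin input) {
    // Create turns a fixed in-memory list into a PCollection;
    // SerializableCoder assumes Event implements Serializable
    List<Event> events = Arrays.asList(new Event(...), new Event(...));
    return input.apply(Create.of(events).withCoder(SerializableCoder.of(Event.class)));
}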

Access side input in a non-anonymous DoFn

How to access the elements of a side input if I have my class extend DoFn?
For example:
Say I have a ParDo transform like:
PCollection<String> data = myData.apply("Get data",
    ParDo.of(new MyClass()).withSideInputs(myDataView));
And I have a class:
static class MyClass extends DoFn<String, String>
{
    // How to access the side input here?
}
c.sideInput() isn't working in this case.
Thanks.
In this case, the problem is that the processElement method in your DoFn does not have access to the PCollectionView instance in your main method.
You can pass the PCollectionView to the DoFn in the constructor:
class MyClass extends DoFn<String, String>
{
    private final PCollectionView<..> mySideInput;

    public MyClass(PCollectionView<..> mySideInput) {
        // List, or Map, or anything:
        this.mySideInput = mySideInput;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException
    {
        // List or Map or any type you need:
        List<..> sideInputList = c.sideInput(mySideInput);
    }
}
You would then pass the side input to the class when you instantiate it, and indicate it as a side input like so:
myData.apply(ParDo.of(new MyClass(mySideInput)).withSideInputs(mySideInput));
The explanation for this is that when you use an anonymous DoFn, the process method has a closure with access to all the objects within the scope that encloses the DoFn (among them is the PCollectionView). When you're not using an anonymous DoFn, there is no closure, and you need another way of passing the PCollectionView.
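For contrast, a minimal sketch of the anonymous-DoFn version, where the closure gives processElement direct access to the view; the List<String> element type of myDataView is an assumption:
PCollection<String> data = myData.apply("Get data",
    ParDo.of(new DoFn<String, String>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            // myDataView is captured from the enclosing scope, so it can be used directly
            List<String> sideInputList = c.sideInput(myDataView);
            c.output(c.element());
        }
    }).withSideInputs(myDataView));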
Although the answer above is correct, it is still a little incomplete. Once you have implemented it, you need to execute your pipeline like this:
myData.apply(ParDo.of(new MyClass(mySideInput)).withSideInputs(mySideInput));
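For completeness, a sketch of how a view such as mySideInput might be created upstream; the source PCollection (sideData) and the List<String> element type are assumptions:
PCollectionView<List<String>> mySideInput =
    sideData.apply("Prepare side input", View.<String>asList());
The same view object is then passed both to the MyClass constructor and to withSideInputs().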

JSR 352 :How to collect data from the Writer of each Partition of a Partitioned Step?

So, I have 2 partitions in a step which writes into a database. I want to record the number of rows written in each partition, get the sum, and print it to the log.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null. I am able to get these values in close() of the Reader.
Is this the right way to go about it? Or should I use Partition Collector/Reducer/ Analyzer?
I am using Java Batch in WebSphere Liberty, and I am developing in Eclipse.
I was thinking of using a static variable in the Writer and use Step Context/Job Context to get it in afterStep() of the Step Listener. However when I tried it I got null.
The ItemWriter might already be destroyed at this point, but I'm not sure.
Is this the right way to go about it?
Yes, it should be good enough. However, you need to ensure the total row count is shared across all partitions, because the batch runtime maintains a StepContext clone per partition; you should rather use the JobContext.
I think using PartitionCollector and PartitionAnalyzer is a good choice, too. The PartitionCollector interface has a method collectPartitionData() to collect the data coming from its partition. Once collected, the batch runtime passes this data to the PartitionAnalyzer to analyze it. Notice that there are:
N PartitionCollector per step (1 per partition)
N StepContext per step (1 per partition)
1 PartitionAnalyzer per step
The records written can be passed via the StepContext's transientUserData. Since the StepContext is reserved for its own step-partition, the transient user data won't be overwritten by another partition.
Here's the implementation:
MyItemWriter:
@Inject
private StepContext stepContext;

@Override
public void writeItems(List<Object> items) throws Exception {
    // ... write the items ...
    // accumulate the per-partition row count in the transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = (userData != null ? (int) userData : 0) + items.size();
    stepContext.setTransientUserData(partRowCount);
}
MyPartitionCollector
@Inject
private StepContext stepContext;

@Override
public Serializable collectPartitionData() throws Exception {
    // get transient user data
    Object userData = stepContext.getTransientUserData();
    int partRowCount = userData != null ? (int) userData : 0;
    return partRowCount;
}
MyPartitionAnalyzer
private int rowCount = 0;

@Override
public void analyzeCollectorData(Serializable fromCollector) throws Exception {
    rowCount += (int) fromCollector;
    System.out.printf("%d rows processed (all partitions).%n", rowCount);
}
Reference: JSR 352 v1.0 Final Release.pdf
Let me offer a bit of an alternative to the accepted answer and add some comments.
PartitionAnalyzer variant - Use analyzeStatus() method
Another technique would be to use analyzeStatus which only gets called at the end of each entire partition, and is passed the partition-level exit status.
public void analyzeStatus(BatchStatus batchStatus, String exitStatus)
In contrast, the above answer using analyzeCollectorData gets called at the end of each chunk on each partition.
E.g.
public class MyItemWriteListener extends AbstractItemWriteListener {

    @Inject
    StepContext stepCtx;

    private int newCount = 0;

    @Override
    public void afterWrite(List<Object> items) throws Exception {
        // update 'newCount' based on items.size() and expose it as the partition exit status
        newCount += items.size();
        stepCtx.setExitStatus(Integer.toString(newCount));
    }
}
Obviously this only works if you weren't using the exit status for some other purpose. You can set the exit status from any artifact (though this freedom might be one more thing to have to keep track of).
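A minimal sketch of the matching analyzer for this variant, assuming each partition's exit status carries its row count as set by the listener above (the class name is hypothetical):
public class MyExitStatusAnalyzer implements PartitionAnalyzer {

    private int totalRowCount = 0;

    @Override
    public void analyzeCollectorData(Serializable fromCollector) throws Exception {
        // not used in this variant
    }

    @Override
    public void analyzeStatus(BatchStatus batchStatus, String exitStatus) throws Exception {
        // called once per partition with that partition's exit status
        totalRowCount += Integer.parseInt(exitStatus);
        System.out.printf("%d rows written (all partitions so far).%n", totalRowCount);
    }
}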
Comments
The API is designed to facilitate implementations that dispatch individual partitions across JVMs (e.g. Liberty can do this). But using a static variable ties you to a single JVM, so it's not a recommended approach.
Also note that both the JobContext and the StepContext are implemented in the "thread-local"-like fashion we see elsewhere in batch.
