Apply not applicable with ParDo and DoFn using Apache Beam - google-cloud-dataflow

I am implementing a Pub/Sub to BigQuery pipeline. It looks similar to How to create read transform using ParDo and DoFn in Apache Beam, but here I already have a PCollection created.
I am following what is described in the Apache Beam documentation to implement a ParDo operation to prepare a table row using the following pipeline:
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
@ProcessElement
public void processElement(ProcessContext c) {
PubsubMessage message = c.element();
// Retrieve data from message
String rawData = message.getData();
Instant timestamp = new Instant(new Date());
// Prepare TableRow
TableRow row = new TableRow().set("message", rawData).set("ts_reception", timestamp);
c.output(row);
}
}
// Read input from Pub/Sub
pipeline.apply("Read from Pub/Sub",PubsubIO.readMessagesWithAttributes().fromTopic(topicPath))
.apply("Prepare raw data for insertion", ParDo.of(new convertToTableRowFn()))
.apply("Insert in Big Query", BigQueryIO.writeTableRows().to(BQTable));
I found the DoFn function in a gist.
I keep getting the following error:
The method apply(String, PTransform<? super PCollection<PubsubMessage>,OutputT>) in the type PCollection<PubsubMessage> is not applicable for the arguments (String, ParDo.SingleOutput<PubsubMessage,TableRow>)
I always understood that a ParDo/DoFn operation is an element-wise PTransform operation; am I wrong? I never got this type of error in Python, so I'm a bit confused about why this is happening.

You're right, ParDos are element-wise transforms and your approach looks correct.
What you're seeing is a compilation error. It happens when the argument type that the Java compiler infers for apply() doesn't match the type of the argument you actually pass, e.g. your convertToTableRowFn.
From the error it looks like Java infers the second parameter of apply() to be of type PTransform<? super PCollection<PubsubMessage>, OutputT>, while you're passing a ParDo.SingleOutput<PubsubMessage, TableRow> (built from your convertToTableRowFn). Looking at the definition of SingleOutput, your ParDo.of(new convertToTableRowFn()) is basically a PTransform<PCollection<? extends PubsubMessage>, PCollection<TableRow>>, and Java fails to use it where apply() expects a PTransform<? super PCollection<PubsubMessage>, OutputT>.
What looks suspicious is that Java didn't infer OutputT to be PCollection<TableRow>. One reason it would fail to do so is if you have other compilation errors. Are you sure you don't have other errors as well?
For example, looking at convertToTableRowFn, you're calling message.getData(), which doesn't exist; when I try it, compilation fails right there. In my case I need to do something like this instead: rawData = new String(message.getPayload(), Charset.defaultCharset()). Also, .to(BQTable) expects a string (e.g. a string representing the BigQuery table name) as an argument, and you're passing some unknown symbol BQTable (though maybe it exists somewhere in your program, in which case this is not a problem in your case).
After I fix these two errors your code compiles for me, apply() is fully inferred and the types are compatible.
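For reference, here is a hedged sketch of the DoFn after those two fixes, kept close to the original snippet (the string passed to .to(...) and the exact timestamp formatting are still up to you):
static class convertToTableRowFn extends DoFn<PubsubMessage, TableRow> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        PubsubMessage message = c.element();
        // PubsubMessage exposes its body as bytes via getPayload()
        String rawData = new String(message.getPayload(), Charset.defaultCharset());
        Instant timestamp = new Instant(new Date());
        // Prepare TableRow; the timestamp is stringified so BigQuery can parse it
        TableRow row = new TableRow()
            .set("message", rawData)
            .set("ts_reception", timestamp.toString());
        c.output(row);
    }
}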

Related

Saxon - s9api - setParameter as node and access in transformation

We are trying to add parameters to a transformation at runtime. The only way we have found so far is to set every single parameter individually, not a node. We don't know yet how to create a node for setParameter.
Current setParameter:
QName TEST XdmAtomicValue 24
Expected setParameter:
<TempNode> <local>Value1</local> </TempNode>
We searched and tried to create an XdmNode and an XdmItem.
If you want to create an XdmNode by parsing XML, the best way to do it is:
DocumentBuilder db = processor.newDocumentBuilder();
XdmNode node = db.build(new StreamSource(
new StringReader("<doc><elem/></doc>")));
You could also pass a string containing lexical XML as the parameter value, and then convert it to a tree by calling the XPath parse-xml() function.
If you want to construct the XdmNode programmatically, there are a number of options:
DocumentBuilder.newBuildingStreamWriter() gives you an instance of BuildingStreamWriter, which extends XMLStreamWriter; you create the document by writing events to it using methods such as writeStartElement, writeCharacters, and writeEndElement, and at the end you call getDocumentNode() on the BuildingStreamWriter, which gives you an XdmNode. This has the advantage that XMLStreamWriter is a standard API, though it's not actually a very nice one, because the documentation isn't very good and as a result implementations vary in their behaviour.
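For illustration, a minimal sketch of that approach, reusing the processor and db from above (exception handling for XMLStreamException/SaxonApiException omitted):
BuildingStreamWriter writer = db.newBuildingStreamWriter();
writer.writeStartDocument();
writer.writeStartElement("TempNode");
writer.writeStartElement("local");
writer.writeCharacters("Value1");
writer.writeEndElement();   // </local>
writer.writeEndElement();   // </TempNode>
writer.writeEndDocument();
XdmNode node = writer.getDocumentNode();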
Another event-based API is Saxon's Push class; this differs from most push-based event APIs in that rather than having a flat sequence of methods like:
builder.startElement("x");
builder.characters("abc");
builder.endElement();
you have a nested sequence:
Element x = Document.elem("x");
x.text("abc");
x.close();
As mentioned by Martin, there is the "sapling" API: Saplings.doc().withChild(elem(...).withChild(elem(...)) etc. This API is rather radically different from anything you might be familiar with (though it's influenced by the LINQ API for tree construction on .NET), but once you've got used to it, it reads very well. The Sapling API constructs a very light-weight tree in memory (hence the name) and converts it to a fully-fledged XDM tree with a final call of SaplingDocument.toXdmNode().
If you're familiar with DOM, JDOM2, or XOM, you can construct a tree using any of those libraries and then convert it for use by Saxon. That's a bit convoluted and only really intended for applications that are already using a third-party tree model heavily (or for users who love these APIs and prefer them to anything else).
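As an illustration of that last route, here is a hedged sketch (assuming the same Saxon Processor named processor as above, plus the JDK's DOM classes): build a DOM tree programmatically and hand it to Saxon via a DOMSource:
// Build the tree with the standard JDK DOM API
org.w3c.dom.Document dom = javax.xml.parsers.DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
org.w3c.dom.Element root = dom.createElement("TempNode");
org.w3c.dom.Element local = dom.createElement("local");
local.setTextContent("Value1");
root.appendChild(local);
dom.appendChild(root);
// Convert the DOM tree into an XdmNode that Saxon can use
XdmNode fromDom = processor.newDocumentBuilder()
        .build(new javax.xml.transform.dom.DOMSource(dom));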
In the Saxon Java s9api, you can construct temporary trees as SaplingNode/SaplingElement/SaplingDocument, see https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingDocument.html and https://www.saxonica.com/html/documentation12/javadoc/net/sf/saxon/sapling/SaplingElement.html.
To give you a simple example constructing from a Map, as you seem to want to do:
Processor processor = new Processor();
Map<String, String> xsltParameters = new HashMap<>();
xsltParameters.put("foo", "value 1");
xsltParameters.put("bar", "value 2");
SaplingElement saplingElement = new SaplingElement("Test");
for (Map.Entry<String, String> param : xsltParameters.entrySet())
{
saplingElement = saplingElement.withChild(new SaplingElement(param.getKey()).withText(param.getValue()));
}
XdmNode paramNode = saplingElement.toXdmNode(processor);
System.out.println(paramNode);
outputs e.g. <Test><bar>value 2</bar><foo>value 1</foo></Test>.
So the key is to understand that withChild() returns a new SaplingElement.
The code can be compacted using streams e.g.
XdmNode paramNode2 = Saplings.elem("root").withChild(
xsltParameters
.entrySet()
.stream()
.map(p -> Saplings.elem(p.getKey()).withText(p.getValue()))
.collect(Collectors.toList())
.toArray(SaplingElement[]::new))
.toXdmNode(processor);
System.out.println(paramNode2);
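To close the loop on the original question, a hedged sketch of actually passing such a node as a stylesheet parameter (this assumes an XsltTransformer named transformer and a global <xsl:param name="TempNode"/> declared in the stylesheet):
// Pass the constructed node as the value of the stylesheet parameter "TempNode"
transformer.setParameter(new QName("TempNode"), paramNode2);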

How to solve Duplicate values exception when I create PCollectionView<Map<String,String>>

I'm setting up a slow-changing lookup Map in my Apache Beam pipeline. It continuously updates the lookup map. For each key in the lookup map, I retrieve the latest value in the global window with accumulating mode.
But it always meets Exception :
org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.lang.IllegalArgumentException: Duplicate values for mykey
Is anything wrong with this snippet code?
If I use .discardingFiredPanes() instead, I will lose information in the last emit.
pipeline
.apply(GenerateSequence.from(0).withRate(1, Duration.standardMinutes(1L)))
.apply(
Window.<Long>into(new GlobalWindows())
.triggering(Repeatedly.forever(
AfterProcessingTime.pastFirstElementInPane()))
.accumulatingFiredPanes())
.apply(new ReadSlowChangingTable())
.apply(Latest.perKey())
.apply(View.asMap());
Example Input Trigger:
t1 : KV<k1,v1> KV<k2,v2>
t2 : KV<k1,v1>
accumulatingFiredPanes => expected result at t2 => KV(k1,v1), KV(k2,v2), but it fails due to the duplicate-values exception
discardingFiredPanes => expected result at t2 => KV(k1,v1), which succeeds
Specifically, with regard to the View.asMap and accumulating-panes discussion in the comments:
If you would like to make use of the View.asMap side input (for example, when the source of the map elements is itself distributed, often because you are creating a side input from the output of a previous transform), there are some other factors that will need to be taken into consideration: View.asMap is itself an aggregation; it will inherit triggering and accumulate its input. In this specific pattern, setting the pipeline to accumulating-panes mode before this transform will result in duplicate key errors even if a transform such as Latest.perKey is used before the View.asMap transform.
Given that the read updates the whole map, I think View.asSingleton would be a better approach for this use case.
Some general notes around this pattern, which will hopefully be useful for others as well:
For this pattern we can use the GenerateSequence source transform to emit a value periodically, for example once a day. Pass this value into a global window via a data-driven trigger that activates on each element. In a DoFn, use this as the trigger to pull data from your bounded source, then create your side input for use in downstream transforms.
It's important to note that because this pattern uses a global-window side input triggering on processing time, matching to elements being processed in event time will be nondeterministic. For example, if the main pipeline is windowed on event time, the version of the side-input view that those windows see will depend on the latest trigger that has fired in processing time, rather than on the event time.
It's also important to note that, in general, the side input should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below the side input is updated at very short intervals so that the effects can be easily seen; the expectation in practice is that the side input updates slowly, for example every few hours or once a day.
In the example code below we make use of a Map, created in a DoFn, which becomes the View.asSingleton; this is the recommended approach for this pattern.
The sample below illustrates the pattern; please note that the View.asSingleton is rebuilt on every counter update.
public static void main(String[] args) {
// Create pipeline
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(PipelineOptions.class);
// Using View.asSingleton, this pipeline uses a dummy external service as illustration.
// Run in debug mode to see the output
Pipeline p = Pipeline.create(options);
// Create slowly updating sideinput
PCollectionView<Map<String, String>> map = p
.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(Window.<Long>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
@ProcessElement public void process(@Element Long input,
OutputReceiver<Map<String, String>> o) {
// Do any external reads needed here...
// We will make use of our dummy external service.
// Every time this triggers, the complete map will be replaced with that read from
// the service.
o.output(DummyExternalService.readDummyData());
}
})).apply(View.asSingleton());
// ---- Consume slowly updating sideinput
// GenerateSequence is only used here to generate dummy data for this illustration.
// You would use your real source for example PubSubIO, KafkaIO etc...
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
@ProcessElement public void process(ProcessContext c) {
Map<String, String> keyMap = c.sideInput(map);
c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
LOG.debug("Value is {} key A is {} and key B is {}"
, c.element(), keyMap.get("Key_A"),keyMap.get("Key_B"));
}
}).withSideInputs(map));
p.run();
}
public static class DummyExternalService {
public static Map<String, String> readDummyData() {
Map<String, String> map = new HashMap<>();
Instant now = Instant.now();
DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss");
map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
return map;
}
}

How to create read transform using ParDo and DoFn in Apache Beam

According to the Apache Beam documentation, the recommended way to write simple sources is by using Read transforms and ParDo. Unfortunately, the Apache Beam docs have let me down here.
I'm trying to write a simple unbounded data source which emits events using a ParDo but the compiler keeps complaining about the input type of the DoFn object:
message: 'The method apply(PTransform<? super PBegin,OutputT>) in the type PBegin is not applicable for the arguments (ParDo.SingleOutput<PBegin,Event>)'
My attempt:
public class TestIO extends PTransform<PBegin, PCollection<Event>> {
@Override
public PCollection<Event> expand(PBegin input) {
return input.apply(ParDo.of(new ReadFn()));
}
private static class ReadFn extends DoFn<PBegin, Event> {
@ProcessElement
public void process(@TimerId("poll") Timer pollTimer) {
Event testEvent = new Event(...);
//custom logic, this can happen infinitely
for(...) {
context.output(testEvent);
}
}
}
}
A DoFn performs element-wise processing. As written, ParDo.of(new ReadFn()) will have type PTransform<PCollection<PBegin>, PCollection<Event>>. Specifically, the ReadFn indicates it takes an element of type PBegin and returns 0 or more elements of type Event.
Instead, you should use an actual Read operation. There are a variety provided. You can also use Create if you have a specific set of in-memory collections to use.
If you need to create a custom source you should use the Read transform. Since you're using timers, you likely want to create an Unbounded Source (a stream of elements).
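For illustration, a minimal sketch of both alternatives (not the asker's code; it assumes a Pipeline named pipeline, that Event has a usable coder, and MyUnboundedEventSource is a hypothetical UnboundedSource<Event, ?> implementation):
// A fixed, in-memory set of elements
PCollection<Event> fromMemory = pipeline.apply(Create.of(event1, event2, event3));
// A custom unbounded source wrapped in a Read transform
PCollection<Event> fromSource = pipeline.apply(Read.from(new MyUnboundedEventSource()));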

Dataflow output parameterized type to avro file

I have a pipeline that successfully outputs an Avro file as follows:
@DefaultCoder(AvroCoder.class)
class MyOutput_T_S {
T foo;
S bar;
Boolean baz;
public MyOutput_T_S() {}
}
@DefaultCoder(AvroCoder.class)
class T {
String id;
public T() {}
}
@DefaultCoder(AvroCoder.class)
class S {
String id;
public S() {}
}
...
PCollection<MyOutput_T_S> output = input.apply(myTransform);
output.apply(AvroIO.Write.to("/out").withSchema(MyOutput_T_S.class));
How can I reproduce this exact behavior, except with a parameterized output MyOutput<T, S> (where T and S are both Avro-codable using reflection)?
The main issue is that Avro reflection doesn't work for parameterized types. So based on these responses:
Setting Custom Coders & Handling Parameterized types
Using Avrocoder for Custom Types with Generics
1) I think I need to write a custom CoderFactory, but I am having difficulty figuring out exactly how this works (I'm having trouble finding examples). Oddly enough, a completely naive coder factory appears to let me run the pipeline and inspect proper output using DataflowAssert:
cr.registerCoder(MyOutput.class, new CoderFactory() {
@Override
public Coder<?> create(List<? extends Coder<?>> componentCoders) {
Schema schema = new Schema.Parser().parse("{\"type\":\"record\","
+ "\"name\":\"MyOutput\","
+ "\"namespace\":\"mypackage\","
+ "\"fields\":[]}");
return AvroCoder.of(MyOutput.class, schema);
}
@Override
public List<Object> getInstanceComponents(Object value) {
MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
List components = new ArrayList();
return components;
}
});
While I can successfully assert against the output now, I expect this will not cut it for writing to a file. I haven't figured out how I'm supposed to use the provided componentCoders to generate the correct schema, and if I try to just shove the schema of T or S into the fields I get:
java.lang.IllegalArgumentException: Unable to get field id from class null
2) Assuming I figure out how to encode MyOutput. What do I pass to AvroIO.Write.withSchema? If I pass either MyOutput.class or the schema I get type mismatch errors.
I think there are two questions (correct me if I am wrong):
How do I enable the coder registry to provide coders for various parameterizations of MyOutput<T, S>?
How do I write values of MyOutput<T, S> to a file using AvroIO.Write?
The first question is to be solved by registering a CoderFactory as in the linked question you found.
Your naive coder is probably allowing you to run the pipeline without issues because serialization is being optimized away. Certainly an Avro schema with no fields will result in those fields being dropped in a serialization+deserialization round trip.
But assuming you fill in the schema with the fields, your approach to CoderFactory#create looks right. I don't know the exact cause of the message java.lang.IllegalArgumentException: Unable to get field id from class null, but the call to AvroCoder.of(MyOutput.class, schema) should work, for an appropriately assembled schema. If there is an issue with this, more details (such as the rest of the stack trace) would be helpful.
However, your override of CoderFactory#getInstanceComponents should return a list of values, one per type parameter of MyOutput. Like so:
@Override
public List<Object> getInstanceComponents(Object value) {
MyOutput<Object, Object> myOutput = (MyOutput<Object, Object>) value;
return ImmutableList.of(myOutput.foo, myOutput.bar);
}
The second question can be answered using some of the same support code as the first, but otherwise is independent. AvroIO.Write.withSchema always explicitly uses the provided schema. It does use AvroCoder under the hood, but this is actually an implementation detail. Providing a compatible schema is all that is necessary - such a schema will have to be composed for each value of T and S for which you want to output MyOutput<T, S>.
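To make that concrete, here is a hedged sketch of composing a schema for one parameterization of MyOutput<T, S>, using org.apache.avro.SchemaBuilder and org.apache.avro.reflect.ReflectData (the field names foo/bar/baz come from the question; the rest is illustrative and assumes output is a PCollection of the parameterized MyOutput):
// Build a record schema for MyOutput<T, S>, reusing the reflected schemas of the concrete type parameters
Schema outputSchema = SchemaBuilder.record("MyOutput").namespace("mypackage")
    .fields()
    .name("foo").type(ReflectData.get().getSchema(T.class)).noDefault()
    .name("bar").type(ReflectData.get().getSchema(S.class)).noDefault()
    .name("baz").type().booleanType().noDefault()
    .endRecord();
// The same schema can back both the coder and the Avro sink
Coder<?> coder = AvroCoder.of(MyOutput.class, outputSchema);
output.apply(AvroIO.Write.to("/out").withSchema(outputSchema));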

Dynamic table name when writing to BQ from dataflow pipelines

As a followup question to the following question and answer:
https://stackoverflow.com/questions/31156774/about-key-grouping-with-groupbykey
I'd like to confirm with the Google Dataflow engineering team (@jkff) whether the 3rd option proposed by Eugene is at all possible with Google Dataflow:
"have a ParDo that takes these keys and creates the BigQuery tables, and another ParDo that takes the data and streams writes to the tables"
My understanding is that a ParDo/DoFn processes each element; how could we specify a table name (a function of the keys passed in from side inputs) when writing out from processElement of a ParDo/DoFn?
Thanks.
Updated with a DoFn, which obviously isn't working, since c.element().getValue() is not a PCollection.
PCollection<KV<String, Iterable<String>>> output = ...;
public class DynamicOutput2Fn extends DoFn<KV<String, Iterable<String>>, Integer> {
private final PCollectionView<List<String>> keysAsSideinputs;
public DynamicOutput2Fn(PCollectionView<List<String>> keysAsSideinputs) {
this.keysAsSideinputs = keysAsSideinputs;
}
@Override
public void processElement(ProcessContext c) {
List<String> keys = c.sideInput(keysAsSideinputs);
String key = c.element().getKey();
//the below is not working!!! How could we write the value out to a sink, be it gcs file or bq table???
c.element().getValue().apply(Pardo.of(new FormatLineFn()))
.apply(TextIO.Write.to(key));
c.output(1);
}
}
The BigQueryIO.Write transform does not support this. The closest thing you can do is to use per-window tables, and encode whatever information you need to select the table in the window objects by using a custom WindowFn.
If you don't want to do that, you can make BigQuery API calls directly from your DoFn. With this, you can set the table name to anything you want, as computed by your code. This could be looked up from a side input, or computed directly from the element the DoFn is currently processing. To avoid making too many small calls to BigQuery, you can batch up the requests using finishBundle().
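A rough sketch of that batching idea, in old-SDK style (hedged: createBigqueryClient and insertRows are hypothetical placeholder helpers, not real SDK methods, and the buffer sizing/error handling is omitted):
class DynamicBigQueryWriteFn extends DoFn<KV<String, TableRow>, Void> {
    // Rows buffered per destination table name during the bundle
    private transient Map<String, List<TableRow>> buffer;
    private transient Bigquery bigquery;

    @Override
    public void startBundle(Context c) throws Exception {
        buffer = new HashMap<>();
        bigquery = createBigqueryClient(); // hypothetical helper that builds an authorized client
    }

    @Override
    public void processElement(ProcessContext c) {
        // The key selects the destination table; it could equally be computed from a side input
        String tableName = c.element().getKey();
        List<TableRow> rows = buffer.get(tableName);
        if (rows == null) {
            rows = new ArrayList<>();
            buffer.put(tableName, rows);
        }
        rows.add(c.element().getValue());
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        for (Map.Entry<String, List<TableRow>> entry : buffer.entrySet()) {
            // hypothetical helper wrapping bigquery.tabledata().insertAll(...)
            insertRows(bigquery, entry.getKey(), entry.getValue());
        }
        buffer.clear();
    }
}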
You can see how the Dataflow runner does the streaming import here:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/util/BigQueryTableInserter.java
