I'm unit testing (with TestStream and PAssert) a DoFn that resets event timers. The test hangs forever if the DoFn resets timers, and the behavior seems specific to event-time-domain timers.
Is this a bug in the Beam testing facilities, or expected timer behavior?
Here is a toy example that reproduces the behavior with the Beam 2.3 SDK.
static class KeyElements extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext context) {
    final String[] parts = context.element().split(":");
    if (parts.length == 2) {
      context.output(KV.of(parts[0], parts[1]));
    }
  }
}

static class TimerDoFn extends DoFn<KV<String, String>, KV<String, String>> {
  @TimerId("expiry")
  private final TimerSpec timerSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void processElement(ProcessContext context, @TimerId("expiry") Timer timer) {
    // Every element (re)sets the timer to one hour past its own timestamp.
    timer.set(context.timestamp().plus(Duration.standardHours(1)));
    final KV<String, String> e = context.element();
    context.output(KV.of(e.getKey(), e.getValue() + "_output"));
  }

  @OnTimer("expiry")
  public void onExpiry(OnTimerContext context) {
    // do nothing
  }
}
@Rule
public TestPipeline p = TestPipeline.create();

@Test
public void testTimerDoFn() {
  TestStream<String> stream = TestStream
      .create(StringUtf8Coder.of())
      .addElements(
          TimestampedValue.of("a:0", new Instant(0)),
          TimestampedValue.of("a:1", new Instant(1)),
          TimestampedValue.of("a:2", new Instant(2)),
          TimestampedValue.of("a:3", new Instant(3)))
      .advanceWatermarkToInfinity();

  PCollection<KV<String, String>> result = p
      .apply(stream)
      .apply(ParDo.of(new KeyElements()))
      .apply(ParDo.of(new TimerDoFn()));

  PAssert.that(result).containsInAnyOrder(
      KV.of("a", "0_output"),
      KV.of("a", "1_output"),
      KV.of("a", "2_output"),
      KV.of("a", "3_output"));

  p.run();
}
The above test hangs if the input elements are a:1, b:2, c:3, d:4.
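For reference, a sketch of that hanging variant, reusing the same KeyElements and TimerDoFn (the timestamps and expected outputs here just follow the DoFn logic above):

TestStream<String> multiKeyStream = TestStream
    .create(StringUtf8Coder.of())
    .addElements(
        TimestampedValue.of("a:1", new Instant(1)),
        TimestampedValue.of("b:2", new Instant(2)),
        TimestampedValue.of("c:3", new Instant(3)),
        TimestampedValue.of("d:4", new Instant(4)))
    .advanceWatermarkToInfinity();

PCollection<KV<String, String>> multiKeyResult = p
    .apply(multiKeyStream)
    .apply(ParDo.of(new KeyElements()))
    .apply(ParDo.of(new TimerDoFn()));

PAssert.that(multiKeyResult).containsInAnyOrder(
    KV.of("a", "1_output"),
    KV.of("b", "2_output"),
    KV.of("c", "3_output"),
    KV.of("d", "4_output"));

p.run(); // never completes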
I'm working with Apache Beam, trying to enrich data (based on this), but it seems Beam has changed since then, as GroupByKey does not work with unbounded sources (like PubSub) without windowing.
This is what I've got (overly simplified):
PCollection<String> input = pipeline
    .apply("Read pubsub", PubsubIO.readStrings().fromTopic(options.getInputTopic()))
    .apply("Log element", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        System.out.println(String.format("incoming %s", c.element()));
        c.output(c.element());
      }
    }))
    .apply(Window.into(FixedWindows.of(Duration.standardSeconds(5))));

PCollection<KV<String, String>> incomingData = input
    .apply("Apply Random Key", MapElements
        .via(new SimpleFunction<String, KV<String, String>>() {
          public KV<String, String> apply(String json) {
            JSONObject jsonObject = new JSONObject(json);
            System.out.println(String.format("JSON: %s, %s", jsonObject.getString("id"), jsonObject.get("usageRules")));
            return KV.of(jsonObject.getString("id"), json);
          }
        })
    );

PCollection<KV<String, String>> enrichedData = incomingData
    .apply("Search in db",
        JdbcIO.<KV<String, String>, KV<String, String>>readAll()
            .withDataSourceConfiguration(config)
            .withQuery("SELECT * FROM myTable WHERE id = ?")
            .withParameterSetter((element, preparedStatement) ->
                preparedStatement.setString(1, element.getKey())
            )
            .withRowMapper(resultSet -> {
              System.out.println(String.format("Result from db: %s", resultSet.getString("id")));
              return KV.of(resultSet.getString("id"), resultSet.getString("id"));
            })
            .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

GroupByKey.applicableTo(enrichedData);

TupleTag<String> CREATE_TAG = new TupleTag<>();
TupleTag<String> UPDATE_TAG = new TupleTag<>();

KeyedPCollectionTuple
    .of(CREATE_TAG, incomingData)
    .and(UPDATE_TAG, enrichedData)
    .apply("Combine", CoGroupByKey.create())
    .apply("Show data?", ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
      @ProcessElement
      public void processElement(ProcessContext context) {
        System.out.println("Print from CoGbkResult");
        System.out.println(context.element().getKey());
        System.out.println(context.element().getValue());
      }
    }));
At the moment, with windowing, getting the incoming data, transforming it into a JSONObject, and searching in the DB all work fine; the problem is that any .apply done after the JdbcIO.readAll does nothing at all. The line "Print from CoGbkResult" just never gets printed.
I've tried modifying the window, adding other triggers, and outputting a result immediately, but it simply stops at the RowMapper.
Thanks for your help.
I've tested side inputs in a streaming pipeline with the DirectRunner and the DataflowRunner with this code:
public class Testsideinput {
  private static final Logger LOG = LoggerFactory.getLogger(Testsideinput.class);

  static class RefreshCache extends DoFn<Long, String> {
    private static final long serialVersionUID = 1;
    private static final Random RANDOM = new Random();

    @ProcessElement
    public void processElement(ProcessContext c) {
      c.output("A" + c.element());
      c.output("B" + c.element());
      c.output("C" + c.element());
      c.output("D" + c.element());
      c.output("E" + c.element());
      c.output("F" + c.element());
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline pipeline = Pipeline.create(options);

    final PCollectionView<List<String>> sideInput2 =
        pipeline.apply("TextIO", TextIO.read().from("<Put your gs://>"))
            .apply("viewTags", View.asList());

    final PCollectionView<List<String>> sideInput =
        pipeline.apply("GenerateSequence",
                GenerateSequence
                    .from(0)
                    .withRate(1, Duration.standardSeconds(1)))
            .apply("Window GenerateSequence",
                Window.into(FixedWindows.of(Duration.standardSeconds(5))))
            .apply("Counts", Combine.globally(Sum.ofLongs()).withoutDefaults())
            .apply("RefreshCache", ParDo.of(new RefreshCache()))
            .apply("viewTags", View.asList());

    final PubsubIO.Read<PubsubMessage> pubsubRead =
        PubsubIO.readMessages()
            .withIdAttribute("id")
            .withTimestampAttribute("ts")
            .fromTopic("<put your topic>");

    // PCollection<KV<String,Long>> taxi =;
    PCollection<String> taxi =
        pipeline.apply("Read from", pubsubRead)
            .apply("Window Fixed",
                Window.into(FixedWindows.of(Duration.standardSeconds(15))))
            .apply(MapElements.via(new PubSubToTableRow()))
            .apply("key rides by rideid",
                MapElements
                    .into(TypeDescriptors
                        .kvs(TypeDescriptors.strings(),
                            TypeDescriptor.of(TableRow.class)))
                    .via(ride -> KV.of(ride.get("ride_id").toString(), ride)))
            .apply("Count Per Element", Count.perKey())
            .apply(
                ParDo.of(new DoFn<KV<String, Long>, String>() {
                  @ProcessElement
                  public void processElement(
                      @Element KV<String, Long> value,
                      OutputReceiver<String> out, ProcessContext c) {
                    // In our DoFn, access the side inputs.
                    List<String> sideinput = c.sideInput(sideInput);
                    List<String> sideinput2 = c.sideInput(sideInput2);
                    LOG.info("sideinput " + sideinput.toString());
                    LOG.info("sideinput2 " + sideinput2.toString());
                    LOG.info("value " + value);
                    out.output("test");
                  }
                }).withSideInputs(sideInput, sideInput2));

    pipeline.run();
  }
}
I get all the values of my side inputs (list and map) on the DirectRunner, but I get no values with the DataflowRunner (there is no output from the View.CreatePCollectionView/ParDo(StreamingPCollectionViewWriter) step).
Do you have an idea how to solve this?
I am trying to get the value of a property that is passed from a Cloud Function to a Dataflow template. I am getting errors because the value being passed is a wrapper, and using the .get() method fails during the compile, with this error:
An exception occurred while executing the Java class. null: InvocationTargetException: Not called from a runtime context.
public interface MyOptions extends DataflowPipelineOptions {
  ...
  @Description("schema of csv file")
  ValueProvider<String> getHeader();
  void setHeader(ValueProvider<String> header);
  ...
}

public static void main(String[] args) throws IOException {
  ...
  List<String> sideInputColumns = Arrays.asList(options.getHeader().get().split(","));
  ...
  // ultimately use the headers as side inputs
  PCollection<String> input = p.apply(Create.of(sideInputColumns));
  final PCollectionView<List<String>> finalColumnView = input.apply(View.asList());
}
How do I extract the value from the ValueProvider type?
The value of a ValueProvider is not available during pipeline construction. As such, you need to organize your pipeline so that it always has the same structure, and serializes the ValueProvider. At runtime, the individual transforms within your pipeline can inspect the value to determine how to operate.
Based on your example, you may need to do something like the following. It creates a single element, and then uses a DoFn that is evaluated at runtime to expand the headers:
public static class HeaderDoFn extends DoFn<String, String> {
  private final ValueProvider<String> header;

  public HeaderDoFn(ValueProvider<String> header) {
    this.header = header;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Ignore the input element -- there should be exactly one.
    for (String column : this.header.get().split(",")) {
      c.output(column);
    }
  }
}
public static void main(String[] args) throws IOException {
  PCollection<String> input = p
      .apply(Create.of("one")) // create a single element
      .apply(ParDo.of(new HeaderDoFn(options.getHeader()))); // expand headers at runtime

  // Note that the order of this list is not guaranteed.
  final PCollectionView<List<String>> finalColumnView =
      input.apply(View.asList());
}
Another option would be to use a NestedValueProvider to create a ValueProvider<List<String>> from the option, and pass that ValueProvider<List<String>> to the necessary DoFns rather than using a side input.
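A minimal sketch of that option, assuming the MyOptions interface above (the UseColumnsFn and columnsProvider names are illustrative, not part of your code):

// A DoFn that receives the ValueProvider directly instead of a side input.
public static class UseColumnsFn extends DoFn<String, String> {
  private final ValueProvider<List<String>> columns;

  public UseColumnsFn(ValueProvider<List<String>> columns) {
    this.columns = columns;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    List<String> columnNames = columns.get(); // available at runtime
    // ... use columnNames while processing c.element() ...
    c.output(c.element());
  }
}

// In main(): derive a ValueProvider<List<String>> from the header option.
// The split only runs at execution time, when get() is called.
ValueProvider<List<String>> columnsProvider =
    NestedValueProvider.of(
        options.getHeader(),
        (String header) -> Arrays.asList(header.split(",")));

PCollection<String> processed = input
    .apply(ParDo.of(new UseColumnsFn(columnsProvider)));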
When trying to run a large transform on ~800,000 files, I get the above error message when trying to run the pipeline.
Here is the code:
public static void main(String[] args) {
  Pipeline p = Pipeline.create(
      PipelineOptionsFactory.fromArgs(args).withValidation().create());
  GcsUtil u = getUtil(p.getOptions());

  try {
    List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
    List<String> strPaths = new ArrayList<String>();
    for (GcsPath pa : paths) {
      strPaths.add(pa.toUri().toString());
    }

    p.apply(Create.of(strPaths))
        .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
    p.run();
  } catch (IOException io) {
    //
  }
}
I thought that's exactly what Google Dataflow is for: handling large amounts of files / data?
Is there a way to split the load in order to make it work?
Thanks & BR
Phil
Dataflow is good at handling large amounts of data, but has limitations in terms of how large the description of the pipeline can be. Data passed to Create.of() is currently embedded in the pipeline description, so you can't pass very large amounts of data there - instead, large amounts of data should be read from external storage, and the pipeline should specify only their locations.
Think of it as the distinction between the amount of data a program can process vs. the size of the program's code itself.
You can get around this issue by making the expansion happen in a ParDo:
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFn()))
.apply(...fusion break (see below)...)
.apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))
where ExpandFn is something like the following:
private static class ExpandFn extends DoFn<String, String> {
  @ProcessElement
  public void process(ProcessContext c) throws IOException {
    GcsUtil util = getUtil(c.getPipelineOptions());
    for (GcsPath path : util.expand(GcsPath.fromUri(c.element()))) {
      c.output(path.toUri().toString());
    }
  }
}
and by fusion break I'm referring to this (basically, ParDo(add unique key) + group by key + Flatten.iterables() + Values.create()). It's not very convenient and there are discussions happening about adding a built-in transform to do this (see this PR and this thread).
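For illustration, a minimal sketch of such a fusion break in current Beam style, applied to a PCollection<String> of expanded paths (the expandedPaths variable and the key range are mine, not part of the original answer):

// Assign an arbitrary key, group, then drop the keys again to break fusion.
PCollection<String> fusionBroken = expandedPaths
    .apply("Add random key",
        WithKeys.of((String path) -> ThreadLocalRandom.current().nextInt(100))
            .withKeyType(TypeDescriptors.integers()))
    .apply(GroupByKey.<Integer, String>create())
    .apply(Values.<Iterable<String>>create())
    .apply(Flatten.<String>iterables());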
Thank you very much! Using your input I solved it like this:
public class ZipPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());

    try {
      p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
          .apply(ParDo.of(new ExpandFN()))
          .apply(ParDo.of(new AddKeyFN()))
          .apply(GroupByKey.<String, String>create())
          .apply(ParDo.of(new FlattenFN()))
          .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
      p.run();
    } catch (Exception e) {
      LOG.error(e.getMessage());
    }
  }

  private static class FlattenFN extends DoFn<KV<String, Iterable<String>>, String> {
    private static final long serialVersionUID = 1L;

    @Override
    public void processElement(ProcessContext c) {
      KV<String, Iterable<String>> kv = c.element();
      for (String s : kv.getValue()) {
        c.output(s);
      }
    }
  }

  private static class ExpandFN extends DoFn<String, String> {
    private static final long serialVersionUID = 1L;

    @Override
    public void processElement(ProcessContext c) throws Exception {
      GcsUtil u = getUtil(c.getPipelineOptions());
      for (GcsPath path : u.expand(GcsPath.fromUri(c.element()))) {
        c.output(path.toUri().toString());
      }
    }
  }

  private static class AddKeyFN extends DoFn<String, KV<String, String>> {
    private static final long serialVersionUID = 1L;

    @Override
    public void processElement(ProcessContext c) {
      String path = c.element();
      String monthKey = path.split("_")[4].substring(0, 6);
      c.output(KV.of(monthKey, path));
    }
  }
}
How can I create my own counters in my DoFns?
In my DoFn I'd like to increment a counter every time a condition is met when processing a record. I'd like this counter to sum the values across all records.
You can use Aggregators, and the total values of the counters will show up in the UI.
Here is an example where I experimented with Aggregators in a pipeline that just sleeps numOutputShards workers for sleepSecs seconds. (The GenFakeInput PTransform at the beginning just returns a flattened PCollection<String> of size numOutputShards):
PCollection<String> output = p
    .apply(new GenFakeInput(options.getNumOutputShards()))
    .apply(ParDo.named("Sleep").of(new DoFn<String, String>() {
      private Aggregator<Long> tSleepSecs;
      private Aggregator<Integer> tWorkers;
      private Aggregator<Long> tExecTime;
      private long startTimeMillis;

      @Override
      public void startBundle(Context c) {
        tSleepSecs = c.createAggregator("Total Slept (sec)", new Sum.SumLongFn());
        tWorkers = c.createAggregator("Num Workers", new Sum.SumIntegerFn());
        tExecTime = c.createAggregator("Total Wallclock (sec)", new Sum.SumLongFn());
        startTimeMillis = System.currentTimeMillis();
      }

      @Override
      public void finishBundle(Context c) {
        tExecTime.addValue((System.currentTimeMillis() - startTimeMillis) / 1000);
      }

      @Override
      public void processElement(ProcessContext c) {
        try {
          LOG.info("Sleeping for {} seconds.", sleepSecs);
          tSleepSecs.addValue(sleepSecs);
          tWorkers.addValue(1);
          TimeUnit.SECONDS.sleep(sleepSecs);
        } catch (InterruptedException e) {
          LOG.info("Ignoring caught InterruptedException during sleep.");
        }
        c.output(c.element());
      }
    }));