I am trying to join two PCollections using a side input. In the ParDo function, while mapping a value, the side input may return multiple matching records as a collection. In such a case, how do I handle that collection and output its values to the resulting PCollection?
It would be good if someone could help solve this case. Here is the code snippet I tried.
PCollection<TableRow> pc1 = ...;
PCollection<Row> pc1Rows = pc1.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc1);
PCollection<KV<Integer, Row>> keyed_pc1Rows = pc1Rows.apply(
WithKeys.of(new SerializableFunction<Row, Integer>() {
public Integer apply(Row s) {
return Integer.parseInt(s.getValue("LOCATION_ID").toString());
}
}));
PCollection<TableRow> pc2 = ...;
PCollection<Row> pc2Rows = pc2.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc2);
PCollection<KV<Integer, Iterable<Row>>> keywordGroups = pc2Rows.apply(
new fnGroupKeyWords());
PCollectionView<Map<Integer, Iterable<Row>>> sideInputView =
keywordGroups.apply("Side Input",
View.<Integer, Iterable<Row>>asMap());
PCollection<Row> finalResultCollection = keyed_pc1Rows.apply("Process",
ParDo.of(new DoFn<KV<Integer,Row>, Row>() {
@ProcessElement
public void processElement(ProcessContext c) {
Integer key = Integer.parseInt(c.element().getKey().toString());
Row leftRow = c.element().getValue();
Map<Integer, Iterable<Row>> key2Rows = c.sideInput(sideInputView);
Iterable<Row> rightRowsIterable = key2Rows.get(key);
for (Iterator<Row> i = rightRowsIterable.iterator(); i.hasNext(); ) {
Row suit = (Row) i.next();
Row targetRow = Row.withSchema(schemaOutput)
.addValues(leftRow.getValues())
.addValues(suit.getValues())
.build();
c.output(targetRow);
}
}
}).withSideInputs(sideInputView));
public static class fnGroupKeyWords extends
PTransform<PCollection<Row>, PCollection<KV<Integer, Iterable<Row>>>> {
@Override
public PCollection<KV<Integer, Iterable<Row>>> expand(
PCollection<Row> rows) {
PCollection<KV<Integer, Row>> kvs = rows.apply(
ParDo.of(new TransferKeyValueFn()));
PCollection<KV<Integer, Iterable<Row>>> group = kvs.apply(
GroupByKey.<Integer, Row> create());
return group;
}
}
public static class TransferKeyValueFn extends
DoFn<Row, KV<Integer, Row>> {
@ProcessElement
public void processElement(ProcessContext c) throws ParseException {
Row tRow = c.element();
c.output(
KV.of(
Integer.parseInt(tRow.getValue("DW_LOCATION_ID").toString()),
tRow));
}
}
If you wish to join two PCollections together using a common key, CoGroupByKey might make more sense. Please consider this approach instead of side inputs.
This blog post has a great explanation as well.
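A minimal sketch of that approach, assuming pc2Rows is keyed by DW_LOCATION_ID the same way TransferKeyValueFn does (the keyedPc2Rows name below is illustrative, not from the original code):
final TupleTag<Row> leftTag = new TupleTag<>();
final TupleTag<Row> rightTag = new TupleTag<>();

PCollection<KV<Integer, CoGbkResult>> coGrouped =
    KeyedPCollectionTuple.of(leftTag, keyed_pc1Rows)
        .and(rightTag, keyedPc2Rows)
        .apply(CoGroupByKey.<Integer>create());

PCollection<Row> joined = coGrouped.apply("Join",
    ParDo.of(new DoFn<KV<Integer, CoGbkResult>, Row>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        CoGbkResult result = c.element().getValue();
        // Emit one output row for every left/right pairing under this key.
        for (Row left : result.getAll(leftTag)) {
          for (Row right : result.getAll(rightTag)) {
            c.output(Row.withSchema(schemaOutput)
                .addValues(left.getValues())
                .addValues(right.getValues())
                .build());
          }
        }
      }
    })).setRowSchema(schemaOutput);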
I think the side input suggestion would perform well if you have a very small collection that can fit into memory. You could use it as a side input with View.asMultimap. Then, in a ParDo processing the larger PCollection (after a GBK, to give you an iterable over all elements for the key), look up the key you are interested in from the side input. A sketch of a pipeline using a multimap side input is shown below.
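This is only a rough sketch of the multimap variant, again assuming a keyedPc2Rows collection of KV<Integer, Row> (pc2Rows keyed by DW_LOCATION_ID, as TransferKeyValueFn does):
PCollectionView<Map<Integer, Iterable<Row>>> rightView =
    keyedPc2Rows.apply("Side Input", View.<Integer, Row>asMultimap());

PCollection<Row> joined = keyed_pc1Rows.apply("Join",
    ParDo.of(new DoFn<KV<Integer, Row>, Row>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        Iterable<Row> rights = c.sideInput(rightView).get(c.element().getKey());
        if (rights == null) {
          return; // no matching rows on the right-hand side for this key
        }
        for (Row right : rights) {
          c.output(Row.withSchema(schemaOutput)
              .addValues(c.element().getValue().getValues())
              .addValues(right.getValues())
              .build());
        }
      }
    }).withSideInputs(rightView)).setRowSchema(schemaOutput);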
However, if your collection is quite large, then using Flatten to combine both PCollections would be a better approach, followed by a GroupByKey, which will give you an iterable over all elements under the same key (a rough sketch follows). The values for a key will still be processed sequentially, though, and I believe you will have performance issues unless you eliminate the hot key. Please see the explanation of using combiners to alleviate this.
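A sketch of the Flatten approach, assuming both sides are keyed by the same location id and share a compatible Row coder (keyedPc2Rows again being illustrative):
PCollection<KV<Integer, Row>> unioned =
    PCollectionList.of(keyed_pc1Rows).and(keyedPc2Rows)
        .apply(Flatten.pCollections());

// Rows from both inputs that share a location id end up in the same iterable.
PCollection<KV<Integer, Iterable<Row>>> grouped =
    unioned.apply(GroupByKey.<Integer, Row>create());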
In Dropwizard there is something like a meter:
https://metrics.dropwizard.io/3.1.0/getting-started/#meters
It lets me measure the rate of events just by invoking the mark() method on the metric.
How can I do that in Micrometer?
I can use timers, but I don't want to pass a Timer.Sample object to every place where I need to call the stop() method.
The other thing missing in Micrometer compared to Dropwizard is a metric that can contain a text message, like a gauge in Dropwizard.
Micrometer leverages the strengths of modern metrics backends, so the specific answer to your question depends on which one you are using. Take Prometheus for example: the backend can calculate the rate for you.
If you are measuring how often something happens, you can determine that using a Counter. Take the logback_events_total counter as an example: it merely counts the number of log messages written.
When alerting or graphing, you can then write a query like rate(logback_events_total[1m]) and you will see the rate at which logs have been written over a 1m window. You can change the window from 1m to 5m or 1h without changing the code.
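For a counter of your own, a minimal sketch might look like this (the meter name and the registry variable are illustrative, not from your code):
// registry is whatever MeterRegistry the application already has configured.
Counter ordersCreated = Counter.builder("orders.created")
        .description("Total number of orders created")
        .register(registry);

ordersCreated.increment(); // call this wherever Dropwizard's meter.mark() used to go
// Prometheus then derives the per-second rate, e.g. rate(orders_created_total[1m])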
Regarding text-based metrics, those aren't useful for alerting (but can be useful when using a join clause). The typical solution in that case is to create a gauge with a value of 1 or 0 and make your text value a tag. For example:
registry.gauge("app.info", Tags.of("version", "1.0.beta3"), this, obj -> 1.0);
We had the same problem. In DropWizard we were able to use meters to get the rate of events per minute, but in Micrometer we could not find a built-in way that worked for us.
We needed rates for counters and percentiles for timers. The PrometheusMeterRegistry gave us percentiles, but no rates.
So we built our own Gauge that tracks a Counter. Every time getValue() is called, it fetches the value from the counter and adds it to the right bucket with the current timestamp. Then from all available measurements it can compute the rate over the last minute.
It looks like this:
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.util.LinkedList;
import java.util.function.Supplier;
public class OneMinuteRateGauge {
private static final int WINDOW_SECONDS = 60;
private final Supplier<Double> valueSupplier;
private final LinkedList<Bucket> buckets;
private final Clock clock;
public OneMinuteRateGauge(String name, Supplier<Double> valueSupplier, MeterRegistry meterRegistry) {
this(name, valueSupplier, meterRegistry, Clock.SYSTEM);
}
public OneMinuteRateGauge(String name, Supplier<Double> valueSupplier, MeterRegistry meterRegistry, Clock clock) {
this.valueSupplier = valueSupplier;
this.buckets = new LinkedList<>();
Gauge.builder(name, this::getValue).register(meterRegistry);
this.clock = clock;
// Collect one measurement so we have a faster start
getValue();
}
public synchronized double getValue() {
// Update the last bucket or create a new one
long now_millis = clock.monotonicTime() / 1_000_000;
long now_seconds = now_millis / 1_000;
short millis = (short) (now_millis - (now_seconds * 1000));
double value = valueSupplier.get();
if (buckets.size() != 0 && buckets.getLast().getSeconds() == now_seconds) {
buckets.getLast().updateValue(millis, value);
} else {
buckets.addLast(new Bucket(now_seconds, millis, value));
}
// Delete all buckets outside the window except one
while (2 < buckets.size() && buckets.get(1).getSeconds() + WINDOW_SECONDS < now_seconds) {
buckets.pollFirst();
}
if (buckets.size() == 1) {
// Not enough data
return 0;
} else if (now_seconds <= buckets.getFirst().getSeconds() + WINDOW_SECONDS) {
// First bucket is inside the window
return buckets.getLast().getValue() - buckets.getFirst().getValue();
} else {
// Find the weighted average between the first two points
Bucket p0 = buckets.get(0);
Bucket p1 = buckets.get(1);
double px = now_millis - (WINDOW_SECONDS * 1000);
double m = (p1.getValue() - p0.getValue()) / (p1.getTimestampInMillis() - p0.getTimestampInMillis());
double py = m * (px - p0.getTimestampInMillis()) + p0.getValue();
return value - py;
}
}
}
public class Bucket {
private long seconds; // Seconds since 1.1.1970, used as bucket ID
private short millis; // 0-999, used for a more exact calculation
private double value;
public Bucket(long seconds, short millis, double value) {
this.seconds = seconds;
this.millis = millis;
this.value = value;
}
public long getSeconds() {
return seconds;
}
public double getValue() {
return value;
}
public long getTimestampInMillis() {
return seconds * 1000 + millis;
}
public void updateValue(short millis, double value) {
this.millis = millis;
this.value = value;
}
}
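A short usage sketch for the classes above (the meter names are just examples): wrap an ordinary Counter and let the gauge read its running total.
MeterRegistry registry = new SimpleMeterRegistry();
Counter requests = registry.counter("http.requests.total");

// Reports the increase of the counter over the last 60 seconds.
new OneMinuteRateGauge("http.requests.one_minute_rate", requests::count, registry);

requests.increment(); // increment as usual; the rate is computed whenever the gauge is read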
An alternative would have been to use a CompositeMeterRegistry at the top level and then add both a PrometheusMeterRegistry and a StepMeterRegistry: Prometheus reports percentiles and the step registry reports rates. Our monitoring system would then have to query two endpoints.
This was a temporary solution until we modified our monitoring system to read the Prometheus endpoint and calculate its own rates.
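For reference, a minimal sketch of that composite setup (the concrete step-based registry depends on your push backend, so it is only indicated by a comment):
CompositeMeterRegistry composite = new CompositeMeterRegistry();
PrometheusMeterRegistry prometheus = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
composite.add(prometheus);
// composite.add(stepRegistry); // any concrete StepMeterRegistry for the push-based backend

// Meters registered against the composite are published to every registry added to it.
Counter requests = composite.counter("http.requests");
requests.increment();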
I am comparing the Apache Beam SDK with the Flink SDK for stream processing, in order to establish the cost/advantages of using Beam as an additional framework.
I have a very simple setup where a stream of data is read from a Kafka source and processed in parallel by a cluster of nodes running Flink.
From my understanding of how these SDKs work, the simplest way to process a stream of data window by window is:
Using Apache Beam (running on Flink):
1.1. Create a Pipeline object.
1.2. Create a PCollection of Kafka records.
1.3. Apply windowing function.
1.4. Transform pipeline to key by window.
1.5. Group records by key (window).
1.6. Apply whatever function is needed to the windowed records.
Using the Flink SDK:
2.1. Create a Data Stream from a Kafka source.
2.2. Transform it into a Keyed Stream by providing a key function.
2.3. Apply windowing function.
2.4. Apply whatever function is needed to the windowed records.
While the Flink solution appears programmatically more succinct, in my experience, it is less efficient at high volumes of data. I can only imagine the overhead is introduced by the key extraction function, since this step is not required by Beam.
My question is: am I comparing like for like? Are these processes not equivalent? What could explain the Beam way being more efficient, since it uses Flink as a runner (and all the other conditions are the same)?
This is the code using the Beam SDK
PipelineOptions options = PipelineOptionsFactory.create();
//Run with Flink
FlinkPipelineOptions flinkPipelineOptions = options.as(FlinkPipelineOptions.class);
flinkPipelineOptions.setRunner(FlinkRunner.class);
flinkPipelineOptions.setStreaming(true);
flinkPipelineOptions.setParallelism(-1); //Pick this up from the user interface at runtime
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(flinkPipelineOptions);
// Create a PCollection of Kafka records
PCollection<KafkaRecord<byte[], byte[]>> kafkaCollection = p.apply(KafkaIO.<Long, String>readBytes()
.withBootstrapServers(KAFKA_IP + ":" + KAFKA_PORT)
.withTopics(ImmutableList.of(REAL_ENERGY_TOPIC, IT_ENERGY_TOPIC))
.updateConsumerProperties(ImmutableMap.of("group.id", CONSUMER_GROUP)));
//Apply Windowing Function
PCollection<KafkaRecord<byte[], byte[]>> windowedKafkaCollection = kafkaCollection.apply(Window.into(SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1))));
//Transform the pipeline to key by window
PCollection<KV<IntervalWindow, KafkaRecord<byte[], byte[]>>> keyedByWindow =
windowedKafkaCollection.apply(
ParDo.of(
new DoFn<KafkaRecord<byte[], byte[]>, KV<IntervalWindow, KafkaRecord<byte[], byte[]>>>() {
@ProcessElement
public void processElement(ProcessContext context, IntervalWindow window) {
context.output(KV.of(window, context.element()));
}
}));
//Group records by key (window)
PCollection<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>> groupedByWindow = keyedByWindow
.apply(GroupByKey.<IntervalWindow, KafkaRecord<byte[], byte[]>>create());
//Process windowed data
PCollection<KV<IIntervalWindowResult, IPueResult>> processed = groupedByWindow
.apply("filterAndProcess", ParDo.of(new PueCalculatorFn()));
// Run the pipeline.
p.run().waitUntilFinish();
And this is the code using the Flink SDK
// Create a Streaming Execution Environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
env.setParallelism(6);
//Connect to Kafka
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", KAFKA_IP + ":" + KAFKA_PORT);
properties.setProperty("group.id", CONSUMER_GROUP);
DataStream<ObjectNode> stream = env
.addSource(new FlinkKafkaConsumer010<>(Arrays.asList(REAL_ENERGY_TOPIC, IT_ENERGY_TOPIC), new JSONDeserializationSchema(), properties));
//Key by id
stream.keyBy((KeySelector<ObjectNode, Integer>) jsonNode -> jsonNode.get("id").asInt())
//Set the windowing function.
.timeWindow(Time.seconds(5L), Time.seconds(1L))
//Process Windowed Data
.process(new PueCalculatorFn(), TypeInformation.of(ImmutablePair.class));
// execute program
env.execute("Using Flink SDK");
Many thanks in advance for any insight.
Edit
I thought I should add some indicators that may be relevant.
Network Received Bytes

Flink SDK
taskmanager.2: 2,644,786,446
taskmanager.3: 2,645,765,232
taskmanager.1: 2,827,676,598
taskmanager.6: 2,422,309,148
taskmanager.4: 2,428,570,491
taskmanager.5: 2,431,368,644

Beam
taskmanager.2: 4,092,154,160
taskmanager.3: 4,435,132,862
taskmanager.1: 4,766,399,314
taskmanager.6: 4,425,190,393
taskmanager.4: 4,096,576,110
taskmanager.5: 4,092,849,114

CPU Utilisation (Max)

Flink SDK
taskmanager.2: 93.00%
taskmanager.3: 92.00%
taskmanager.1: 91.00%
taskmanager.6: 90.00%
taskmanager.4: 90.00%
taskmanager.5: 92.00%

Beam
taskmanager.2: 52.0%
taskmanager.3: 71.0%
taskmanager.1: 72.0%
taskmanager.6: 40.0%
taskmanager.4: 56.0%
taskmanager.5: 26.0%
Beam seems to use a lot more networking, whereas Flink uses significantly more CPU. Could this suggest that Beam is parallelising the processing in a more efficient way?
Edit No2
I am pretty sure that the PueCalculatorFn classes are equivalent, but I will share the code here to see if any obvious discrepancies between the two processes become apparent.
Beam
public class PueCalculatorFn extends DoFn<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>, KV<IIntervalWindowResult, IPueResult>> implements Serializable {
private transient List<IKafkaConsumption> realEnergyRecords;
private transient List<IKafkaConsumption> itEnergyRecords;
@ProcessElement
public void processElement(DoFn<KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>>, KV<IIntervalWindowResult, IPueResult>>.ProcessContext c, BoundedWindow w) {
KV<IntervalWindow, Iterable<KafkaRecord<byte[], byte[]>>> element = c.element();
Instant windowStart = Instant.ofEpochMilli(element.getKey().start().getMillis());
Instant windowEnd = Instant.ofEpochMilli(element.getKey().end().getMillis());
Iterable<KafkaRecord<byte[], byte[]>> records = element.getValue();
//Calculate Pue
IPueResult result = calculatePue(element.getKey(), records);
//Create IntervalWindowResult object to return
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC"));
IIntervalWindowResult intervalWindowResult = new IntervalWindowResult(formatter.format(windowStart),
formatter.format(windowEnd), realEnergyRecords, itEnergyRecords);
//Return Pue keyed by Window
c.output(KV.of(intervalWindowResult, result));
}
private PueResult calculatePue(IntervalWindow window, Iterable<KafkaRecord<byte[], byte[]>> records) {
//Define accumulators to gather readings
final DoubleAccumulator totalRealIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
final DoubleAccumulator totalItIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
//Declare variable to store the result
BigDecimal pue = BigDecimal.ZERO;
//Initialise transient lists
realEnergyRecords = new ArrayList<>();
itEnergyRecords = new ArrayList<>();
//Transform the results into a stream
Stream<KafkaRecord<byte[], byte[]>> streamOfRecords = StreamSupport.stream(records.spliterator(), false);
//Iterate through each reading and add to the increment count
streamOfRecords
.map(record -> {
byte[] valueBytes = record.getKV().getValue();
assert valueBytes != null;
String valueString = new String(valueBytes);
assert !valueString.isEmpty();
return KV.of(record, valueString);
}).map(kv -> {
Gson gson = new GsonBuilder().registerTypeAdapter(KafkaConsumption.class, new KafkaConsumptionDeserialiser()).create();
KafkaConsumption consumption = gson.fromJson(kv.getValue(), KafkaConsumption.class);
return KV.of(kv.getKey(), consumption);
}).forEach(consumptionRecord -> {
switch (consumptionRecord.getKey().getTopic()) {
case REAL_ENERGY_TOPIC:
totalRealIncrement.accumulate(consumptionRecord.getValue().getEnergyConsumed());
realEnergyRecords.add(consumptionRecord.getValue());
break;
case IT_ENERGY_TOPIC:
totalItIncrement.accumulate(consumptionRecord.getValue().getEnergyConsumed());
itEnergyRecords.add(consumptionRecord.getValue());
break;
}
}
);
assert totalRealIncrement.doubleValue() > 0.0;
assert totalItIncrement.doubleValue() > 0.0;
//Beware of division by zero
if (totalItIncrement.doubleValue() != 0.0) {
//Calculate PUE
pue = BigDecimal.valueOf(totalRealIncrement.getThenReset()).divide(BigDecimal.valueOf(totalItIncrement.getThenReset()), 9, BigDecimal.ROUND_HALF_UP);
}
//Create a PueResult object to return
IWindow intervalWindow = new Window(window.start().getMillis(), window.end().getMillis());
return new PueResult(intervalWindow, pue.stripTrailingZeros());
}
@Override
protected void finalize() throws Throwable {
super.finalize();
RecordSenderFactory.closeSender();
WindowSenderFactory.closeSender();
}
}
Flink
public class PueCalculatorFn extends ProcessWindowFunction<ObjectNode, ImmutablePair, Integer, TimeWindow> {
private transient List<KafkaConsumption> realEnergyRecords;
private transient List<KafkaConsumption> itEnergyRecords;
@Override
public void process(Integer integer, Context context, Iterable<ObjectNode> iterable, Collector<ImmutablePair> collector) throws Exception {
Instant windowStart = Instant.ofEpochMilli(context.window().getStart());
Instant windowEnd = Instant.ofEpochMilli(context.window().getEnd());
BigDecimal pue = calculatePue(iterable);
//Create IntervalWindowResult object to return
DateTimeFormatter formatter = DateTimeFormatter.ISO_LOCAL_DATE_TIME.withZone(ZoneId.of("UTC"));
IIntervalWindowResult intervalWindowResult = new IntervalWindowResult(formatter.format(windowStart),
formatter.format(windowEnd), realEnergyRecords
.stream()
.map(e -> (IKafkaConsumption) e)
.collect(Collectors.toList()), itEnergyRecords
.stream()
.map(e -> (IKafkaConsumption) e)
.collect(Collectors.toList()));
//Create PueResult object to return
IPueResult pueResult = new PueResult(new Window(windowStart.toEpochMilli(), windowEnd.toEpochMilli()), pue.stripTrailingZeros());
//Collect result
collector.collect(new ImmutablePair<>(intervalWindowResult, pueResult));
}
protected BigDecimal calculatePue(Iterable<ObjectNode> iterable) {
//Define accumulators to gather readings
final DoubleAccumulator totalRealIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
final DoubleAccumulator totalItIncrement = new DoubleAccumulator((x, y) -> x + y, 0.0);
//Declare variable to store the result
BigDecimal pue = BigDecimal.ZERO;
//Initialise transient lists
realEnergyRecords = new ArrayList<>();
itEnergyRecords = new ArrayList<>();
//Iterate through each reading and add to the increment count
StreamSupport.stream(iterable.spliterator(), false)
.forEach(object -> {
switch (object.get("topic").textValue()) {
case REAL_ENERGY_TOPIC:
totalRealIncrement.accumulate(object.get("energyConsumed").asDouble());
realEnergyRecords.add(KafkaConsumptionDeserialiser.deserialize(object));
break;
case IT_ENERGY_TOPIC:
totalItIncrement.accumulate(object.get("energyConsumed").asDouble());
itEnergyRecords.add(KafkaConsumptionDeserialiser.deserialize(object));
break;
}
});
assert totalRealIncrement.doubleValue() > 0.0;
assert totalItIncrement.doubleValue() > 0.0;
//Beware of division by zero
if (totalItIncrement.doubleValue() != 0.0) {
//Calculate PUE
pue = BigDecimal.valueOf(totalRealIncrement.getThenReset()).divide(BigDecimal.valueOf(totalItIncrement.getThenReset()), 9, BigDecimal.ROUND_HALF_UP);
}
return pue;
}
}
And here is my custom deserialiser used in the Beam example.
KafkaConsumptionDeserialiser
public class KafkaConsumptionDeserialiser implements JsonDeserializer<KafkaConsumption> {
public KafkaConsumption deserialize(JsonElement jsonElement, Type type, JsonDeserializationContext jsonDeserializationContext) throws JsonParseException {
if(jsonElement == null) {
return null;
} else {
JsonObject jsonObject = jsonElement.getAsJsonObject();
JsonElement id = jsonObject.get("id");
JsonElement energyConsumed = jsonObject.get("energyConsumed");
Gson gson = (new GsonBuilder()).registerTypeAdapter(Duration.class, new DurationDeserialiser()).registerTypeAdapter(ZonedDateTime.class, new ZonedDateTimeDeserialiser()).create();
Duration duration = (Duration)gson.fromJson(jsonObject.get("duration"), Duration.class);
JsonElement topic = jsonObject.get("topic");
Instant eventTime = (Instant)gson.fromJson(jsonObject.get("eventTime"), Instant.class);
return new KafkaConsumption(Integer.valueOf(id != null?id.getAsInt():0), Double.valueOf(energyConsumed != null?energyConsumed.getAsDouble():0.0D), duration, topic != null?topic.getAsString():"", eventTime);
}
}
}
Not sure why the Beam pipeline you wrote is faster, but semantically it is not the same as the Flink job. Similar to how windowing works in Flink, once you assign windows in Beam, all following operations automatically take the windowing into account. You don't need to group by window.
Your Beam pipeline definition can be simplified as follows:
// Create the Pipeline object with the options we defined above.
Pipeline p = Pipeline.create(flinkPipelineOptions);
// Create a PCollection of Kafka records
PCollection<KafkaRecord<byte[], byte[]>> kafkaCollection = ...
//Apply Windowing Function
PCollection<KafkaRecord<byte[], byte[]>> windowedKafkaCollection = kafkaCollection.apply(
Window.into(SlidingWindows.of(Duration.standardSeconds(5)).every(Duration.standardSeconds(1))));
//Process windowed data
PCollection<KV<IIntervalWindowResult, IPueResult>> processed = windowedKafkaCollection
.apply("filterAndProcess", ParDo.of(new PueCalculatorFn()));
// Run the pipeline.
p.run().waitUntilFinish();
As for the performance, it depends on many factors but keep in mind that Beam is an abstraction layer on top of Flink. Generally speaking, I would be surprised if you saw increased performance with Beam on Flink.
edit: Just to clarify further, you don't group on the JSON "id" field in the Beam pipeline, which you do in the Flink snippet.
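A hypothetical sketch of keying the Beam pipeline by the same field, assuming the Kafka value bytes hold the JSON payload that KafkaConsumptionDeserialiser parses:
PCollection<KV<Integer, KafkaRecord<byte[], byte[]>>> keyedById =
    windowedKafkaCollection.apply("KeyById",
        WithKeys.of((KafkaRecord<byte[], byte[]> record) -> {
          // Parse the value bytes as JSON and key by "id", mirroring the Flink keyBy.
          JsonObject json = new JsonParser()
              .parse(new String(record.getKV().getValue(), StandardCharsets.UTF_8))
              .getAsJsonObject();
          return json.get("id").getAsInt();
        }).withKeyType(TypeDescriptors.integers()));
A GroupByKey (or a per-key Combine) over keyedById would then mirror the Flink keyBy/window/process chain.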
For what it's worth, if the window processing can be pre-aggregated via reduce() or aggregate(), then the native Flink job should perform better than it currently does.
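As a rough sketch (not the original code), and assuming only the two energy sums are needed for the PUE, an AggregateFunction could keep the window state down to two doubles. PueWindowFunction here is a hypothetical ProcessWindowFunction that formats the final result; note it would no longer have the per-record lists the current implementation collects.
public class EnergySumAggregator implements AggregateFunction<ObjectNode, double[], double[]> {
    @Override
    public double[] createAccumulator() {
        return new double[2]; // [0] = real energy, [1] = IT energy
    }

    @Override
    public double[] add(ObjectNode record, double[] acc) {
        double consumed = record.get("energyConsumed").asDouble();
        if (REAL_ENERGY_TOPIC.equals(record.get("topic").textValue())) {
            acc[0] += consumed;
        } else {
            acc[1] += consumed;
        }
        return acc;
    }

    @Override
    public double[] getResult(double[] acc) {
        return acc;
    }

    @Override
    public double[] merge(double[] a, double[] b) {
        return new double[]{a[0] + b[0], a[1] + b[1]};
    }
}

// Wiring it into the existing stream:
stream.keyBy((KeySelector<ObjectNode, Integer>) node -> node.get("id").asInt())
      .timeWindow(Time.seconds(5L), Time.seconds(1L))
      .aggregate(new EnergySumAggregator(), new PueWindowFunction());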
Many details, such as choice of state backend, serialization, checkpointing, etc. can also have a big impact on performance.
Is the same Flink being used in both cases -- i.e., same version, same configuration?
I want to count the total number of rows in a file.
Please explain your code if possible.
String fileAbsolutePath = "gs://sourav_bucket_dataflow/" + fileName;
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
Now I want to get the value.
There are a variety of sinks that you can use to get data out of your pipeline. https://beam.apache.org/documentation/io/built-in/ has a list of the current built-in IO transforms.
It sort of depends on what you want to do with that number. Assuming you want to use it in your future transformations, you may want to convert it to a PCollectionView object and pass it as a side input to other transformations.
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
final PCollectionView<Long> view = count.apply(View.asSingleton());
A quick example to show you how to use the value as a side input:
data.apply(ParDo.of(new FuncFn(view)).withSideInputs(view));
Where:
class FuncFn extends DoFn<String,String>
{
private final PCollectionView<Long> mySideInput;
public FuncFn(PCollectionView<Long> mySideInput) {
this.mySideInput = mySideInput;
}
@ProcessElement
public void processElement(ProcessContext c) throws IOException
{
Long count = c.sideInput(mySideInput);
//other stuff you may want to do
}
}
Hope that helps!
where "input" in line 1 is the input. This will work.
PCollection<Long> number = input.apply(Count.globally());
number.apply(MapElements.via(new SimpleFunction<Long, Long>()
{
public Long apply(Long total)
{
System.out.println("Length is: " + total);
return total;
}
}));
I don't understand the XsltTransformer class well enough to explain why method f1 is superior to f2. In fact, f1 finishes in about 40 seconds, consuming between 750MB and 1GB of memory. I was expecting f2 to be the better solution, but it never finishes for the same lengthy list of input files. By the time I kill it, it has processed only about 1000 input files while consuming over 4GB of memory.
import java.io.*;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;
public class foreachfile {
private static long f1 (Processor p, XsltExecutable e, Serializer ser, String args[]) {
long maxTotalMemory = 0;
Runtime rt = Runtime.getRuntime();
for (int i=1; i<args.length; i++) {
String xmlfile = args[i];
try {
XsltTransformer t = e.load();
t.setDestination(ser);
t.setInitialContextNode(p.newDocumentBuilder().build(new StreamSource(new File(xmlfile))));
t.transform();
long tm = rt.totalMemory();
if (tm > maxTotalMemory)
maxTotalMemory = tm;
} catch (Throwable ex) {
System.err.println(ex);
}
}
return maxTotalMemory;
}
private static long f2 (Processor p, XsltExecutable e, Serializer ser, String args[]) {
long maxTotalMemory = 0;
Runtime rt = Runtime.getRuntime();
XsltTransformer t = e.load();
t.setDestination(ser);
for (int i=1; i<args.length; i++) {
String xmlfile = args[i];
try {
t.setInitialContextNode(p.newDocumentBuilder().build(new StreamSource(new File(xmlfile))));
t.transform();
long tm = rt.totalMemory();
if (tm > maxTotalMemory)
maxTotalMemory = tm;
} catch (Throwable ex) {
System.err.println(ex);
}
}
return maxTotalMemory;
}
public static void main (String args[]) throws SaxonApiException, Exception {
String usecase = System.getProperty("xslt.usecase");
int uc = Integer.parseInt(usecase);
String xslfile = args[0];
Processor p = new Processor(true);
XsltCompiler c = p.newXsltCompiler();
XsltExecutable e = c.compile(new StreamSource(new File(xslfile)));
Serializer ser = new Serializer();
ser.setOutputStream(System.out);
long maxTotalMemory = uc == 1 ? f1(p, e, ser, args) : f2(p, e, ser, args);
System.err.println(String.format("Max total memory was %d", maxTotalMemory));
}
}
I normally recommend using a new XsltTransformer for each transformation. However, the class is serially reusable (you can perform multiple transformations one after another, but not concurrently). The XsltTransformer keeps certain resources in memory, in case they are needed again: notably, all documents read using the doc() or document() functions. This can be useful, for example, if you want to transform one set of input documents to five different output formats as part of your publishing workflow. But if this reuse of resources doesn't give you any benefits, it merely imposes a cost in memory use, which you can avoid by creating a new transformer each time. The same applies if you use the JAXP interface.