Trying to join two Pcollection using SideInput transform. In the ParDo function while mapping the value, from the sideinput collection we may get the multiple mapping records as a collection. In such a case how to handle the collection and how to return those collection of values to the PCollection.
It would be good if some one help to solve this case. Here is the code snippet that I tried.
PCollection<TableRow> pc1 = ...;
PCollection<Row> pc1Rows = pc1.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc1);
PCollection<KV<Integer, Row>> keyed_pc1Rows = pc1Rows.apply(
WithKeys.of(new SerializableFunction<Row, Integer>() {
public Integer apply(Row s) {
return Integer.parseInt(s.getValue("LOCATION_ID").toString());
}
}));
PCollection<TableRow> pc2 = ...;
PCollection<Row> pc2Rows = pc2.apply(
ParDo.of(new fnConvertTableRowToRow())).setRowSchema(schemaPc2);
PCollection<KV<Integer, Iterable<Row>>> keywordGroups = pc2Rows.apply(
new fnGroupKeyWords());
PCollectionView<Map<Integer, Iterable<Row>>> sideInputView =
keywordGroups.apply("Side Input",
View.<Integer, Iterable<Row>>asMap());
PCollection<Row> finalResultCollection = keyed_pc1Rows.apply("Process",
ParDo.of(new DoFn<KV<Integer,Row>, Row>() {
#ProcessElement
public void processElement(ProcessContext c) {
Integer key = Integer.parseInt(c.element().getKey().toString());
Row leftRow = c.element().getValue();
Map<Integer, Iterable<Row>> key2Rows = c.sideInput(sideInputView);
Iterable<Row> rightRowsIterable = key2Rows.get(key);
for (Iterator<Row> i = rightRowsIterable.iterator(); i.hasNext(); ) {
Row suit = (Row) i.next();
Row targetRow = Row.withSchema(schemaOutput)
.addValues(leftRow.getValues())
.addValues(suit.getValues())
.build();
c.output(targetRow);
}
}
}).withSideInputs(sideInputView));
public static class fnGroupKeyWords extends
PTransform<PCollection<Row>, PCollection<KV<Integer, Iterable<Row>>>> {
#Override
public PCollection<KV<Integer, Iterable<Row>>> expand(
PCollection<Row> rows) {
PCollection<KV<Integer, Row>> kvs = rows.apply(
ParDo.of(new TransferKeyValueFn()));
PCollection<KV<Integer, Iterable<Row>>> group = kvs.apply(
GroupByKey.<Integer, Row> create());
return group;
}
}
public static class TransferKeyValueFn extends
DoFn<Row, KV<Integer, Row>> {
#ProcessElement
public void processElement(ProcessContext c) throws ParseException {
Row tRow = c.element();
c.output(
KV.of(
Integer.parseInt(tRow.getValue("DW_LOCATION_ID").toString()),
tRow));
}
}
If you wish to join two PCollections together using a common key. the CoGroupByKey might make more sense. Please consider this approach instead of side inputs
Also this blog post has a great explanation as well.
I think using the SideInput suggestion would perform well if you have a very small collection which could fit into memory. You could use it as a side input with view.asMultimap. Then in a ParDo processing the larger PCollection (After a GBK, to give you an iterable over all elements for the key), lookup the key you are interested in from the side input. Here is an example test pipeline using a multimap pcollection.
However, if your collection is quite large then using Flatten to combine both pcollections together would be a better approach. Then using a GroupByKey afterward, which will give you an iterable for element under the same key. This will still be processed sequentially. Though, I believe you will will have issues with performance, unless you eliminate the hot key. Please see the explanation of using combiners to alleviate this.
Related
I'd like to accomplish the following using Apache Beam:
calculate every 5 seconds the events that are read from pubsub in the last minute
The goal is to have a semi-realtime view on the rate data comes in. This can then be expanded towards more complex use cases afterwards.
After searching, I've not come across a way to solve this seemingly simple problem. Things that do not work:
global window + repeated triggers (triggers do not fire when there is no input)
sliding window + withoutDefaults (does not allow empty windows to be emitted apparently)
Any suggestion on how to solve this problem?
As already discussed, Beam does not emit data for empty windows. In addition to the reasons given by Rui Wang we can add the challenge of how the latter stages would handle those empty panes.
Anyway, the specific use case that you describe -monitoring rolling count of number of messages - should be possible with some work even if the metric falls down to zero eventually. One possibility would be to publish a steady number of dummy messages which would advance the watermark and fire the panes but are filtered out later within the pipeline. The problem with this approach is that the publishing source needs to be adapted and that might not always be convenient/possible. Another one would involve generating this fake data as another input and co-group it with the main stream. The advantage is that everything can be done in Dataflow without the need to tweak the source or the sink. To illustrate this I provide an example.
The inputs are divided in two streams. For the dummy one, I used GenerateSequence to create a new element every 5 seconds. I then window the PCollection (windowing strategy needs to be compatible with the one for the main stream so I will use the same). Then I map the element to a key-value pair where the value is 0 (we could use other values as we know from which stream the element comes but I want to evince that dummy records are not counted).
PCollection<KV<String,Integer>> dummyStream = p
.apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
.apply("Window Messages - Dummy", Window.<Long>into(
...
.apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
c.output(KV.of("num_messages", 0));
}
}));
For the main stream, that reads from Pub/Sub, I map each record to value 1. Later on, I will add all the ones as in typical word count examples using map-reduce stages.
PCollection<KV<String,Integer>> mainStream = p
.apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
.apply("Window Messages - Data", Window.<String>into(
...
.apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
c.output(KV.of("num_messages", 1));
}
}));
Then we need to join them using a CoGroupByKey (I used the same num_messages key to group counts). This stage will output results when one of the two inputs has elements, therefore unblocking the main issue here (empty windows with no Pub/Sub messages).
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
Finally, we add all the ones to obtain the total number of messages for the window. If there are no elements coming from dataTag then the sum will just default to 0.
public void processElement(ProcessContext c, BoundedWindow window) {
Integer total_sum = new Integer(0);
Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
for (Integer val : dataTagVal) {
total_sum += val;
}
LOG.info("Window: " + window.toString() + ", Number of messages: " + total_sum.toString());
}
This should result in something like:
Note that results from different windows can come unordered (this can happen anyway when writing to BigQuery) and I did not play with the window settings to optimize the example.
Full code:
public class EmptyWindows {
private static final Logger LOG = LoggerFactory.getLogger(EmptyWindows.class);
public static interface MyOptions extends PipelineOptions {
#Description("Input topic")
String getInput();
void setInput(String s);
}
#SuppressWarnings("serial")
public static void main(String[] args) {
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
String topic = options.getInput();
PCollection<KV<String,Integer>> mainStream = p
.apply("Get Messages - Data", PubsubIO.readStrings().fromTopic(topic))
.apply("Window Messages - Data", Window.<String>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Data", ParDo.of(new DoFn<String, KV<String, Integer>>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New data element in main output");
c.output(KV.of("num_messages", 1));
}
}));
PCollection<KV<String,Integer>> dummyStream = p
.apply("Generate Sequence", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5)))
.apply("Window Messages - Dummy", Window.<Long>into(
SlidingWindows.of(Duration.standardMinutes(1))
.every(Duration.standardSeconds(5)))
.triggering(AfterWatermark.pastEndOfWindow())
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes())
.apply("Count Messages - Dummy", ParDo.of(new DoFn<Long, KV<String, Integer>>() {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
//LOG.info("New dummy element in main output");
c.output(KV.of("num_messages", 0));
}
}));
final TupleTag<Integer> dummyTag = new TupleTag<>();
final TupleTag<Integer> dataTag = new TupleTag<>();
PCollection<KV<String, CoGbkResult>> coGbkResultCollection = KeyedPCollectionTuple.of(dummyTag, dummyStream)
.and(dataTag, mainStream).apply(CoGroupByKey.<String>create());
coGbkResultCollection
.apply("Log results", ParDo.of(new DoFn<KV<String, CoGbkResult>, Void>() {
#ProcessElement
public void processElement(ProcessContext c, BoundedWindow window) {
Integer total_sum = new Integer(0);
Iterable<Integer> dataTagVal = c.element().getValue().getAll(dataTag);
for (Integer val : dataTagVal) {
total_sum += val;
}
LOG.info("Window: " + window.toString() + ", Number of messages: " + total_sum.toString());
}
}));
p.run();
}
}
Another way to approach this problem is using a stateful DoFn with a looping Timer that triggers at each 5 second tick. This looping timer generates the default data necessary for the live monitoring, and ensures that each window has at least one event to process.
One issue with the approach described by https://stackoverflow.com/a/54543527/430128 is that, in a system with multiple keys, these "dummy" events need to be generated for every key.
See https://beam.apache.org/blog/looping-timers/. Option 1 and 2 in that article are an external heartbeat source and a generated source in the beam pipeline respectively. Option 3 is the looping timer.
I want to count total number of rows in a file.
Please explain your code if possible.
String fileAbsolutePath = "gs://sourav_bucket_dataflow/" + fileName;
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
Now i want to get the value.
There are a variety of sinks that you can use to get data out of your pipeline. https://beam.apache.org/documentation/io/built-in/ has a list of the current built in IO transforms.
It sort of depends on what you want to do with that number. Assuming you want to use it in your future transformations, you may want to convert it to a PCollectionView object and pass it as a side input to other transformations.
PCollection<String> data = p.apply("Reading Data From File", TextIO.read().from(fileAbsolutePath));
PCollection<Long> count = data.apply(Count.<String>globally());
final PCollectionView<Long> view = count.apply(View.asSingleton());
A quick example to show you how to use the value as a side count:
data.apply(ParDo.of(new FuncFn(view)).withSideInputs(view));
Where:
class FuncFn extends DoFn<String,String>
{
private final PCollectionView<Long> mySideInput;
public FuncFn(PCollectionView<Long> mySideInput) {
this.mySideInput = mySideInput;
}
#ProcessElement
public void processElement(ProcessContext c) throws IOException
{
Long count = c.sideInput(mySideInput);
//other stuff you may want to do
}
}
Hope that helps!
where "input" in line 1 is the input. This will work.
PCollection<Long> number = input.apply(Count.globally());
number.apply(MapElements.via(new SimpleFunction<Long, Long>()
{
public Long apply(Long total)
{
System.out.println("Length is: " + total);
return total;
}
}));
My Dataflow job (using Java SDK 2.1.0) is quite slow and it is going to take more than a day to process just 50GB. I just pull a whole table from BigQuery (50GB), join with one csv file from GCS (100+MB).
https://cloud.google.com/dataflow/model/group-by-key
I use sideInputs to perform join (the latter way in the documentation above) while I think using CoGroupByKey is more efficient, however I'm not sure that is the only reason my job is super slow.
I googled and it looks by default, a cache of sideinputs set as 100MB and I assume my one is slightly over that limit then each worker continuously re-reads sideinputs. To improve it, I thought I can use setWorkerCacheMb method to increase the cache size.
However it looks DataflowPipelineOptions does not have this method and DataflowWorkerHarnessOptions is hidden. Just passing --workerCacheMb=200 in -Dexec.args results in
An exception occured while executing the Java class.
null: InvocationTargetException:
Class interface com.xxx.yyy.zzz$MyOptions missing a property
named 'workerCacheMb'. -> [Help 1]
How can I use this option? Thanks.
My pipeline:
MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);
PCollection<TableRow> rows = p.apply("Read from BigQuery",
BigQueryIO.read().from("project:MYDATA.events"));
// Read account file
PCollection<String> accounts = p.apply("Read from account file",
TextIO.read().from("gs://my-bucket/accounts.csv")
.withCompressionType(CompressionType.GZIP));
PCollection<TableRow> accountRows = accounts.apply("Convert to TableRow",
ParDo.of(new DoFn<String, TableRow>() {
private static final long serialVersionUID = 1L;
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
String line = c.element();
CSVParser csvParser = new CSVParser();
String[] fields = csvParser.parseLine(line);
TableRow row = new TableRow();
row = row.set("account_id", fields[0]).set("account_uid", fields[1]);
c.output(row);
}
}));
PCollection<KV<String, TableRow>> kvAccounts = accountRows.apply("Populate account_uid:accounts KV",
ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
private static final long serialVersionUID = 1L;
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
String uid = (String) row.get("account_uid");
c.output(KV.of(uid, row));
}
}));
final PCollectionView<Map<String, TableRow>> uidAccountView = kvAccounts.apply(View.<String, TableRow>asMap());
// Add account_id from account_uid to event data
PCollection<TableRow> rowsWithAccountID = rows.apply("Join account_id",
ParDo.of(new DoFn<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
TableRow row = c.element();
if (row.containsKey("account_uid") && row.get("account_uid") != null) {
String uid = (String) row.get("account_uid");
TableRow accRow = (TableRow) c.sideInput(uidAccountView).get(uid);
if (accRow == null) {
LOG.warn("accRow null, {}", row.toPrettyString());
} else {
row = row.set("account_id", accRow.get("account_id"));
}
}
c.output(row);
}
}).withSideInputs(uidAccountView));
// Insert into BigQuery
WriteResult result = rowsWithAccountID.apply(BigQueryIO.writeTableRows()
.to(new TableRefPartition(StaticValueProvider.of("MYDATA"), StaticValueProvider.of("dev"),
StaticValueProvider.of("deadletter_bucket")))
.withFormatFunction(new SerializableFunction<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
#Override
public TableRow apply(TableRow row) {
return row;
}
}).withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
p.run();
Historically my system have two identifiers of users, new one (account_id) and old one(account_uid). Now I need to add new account_id to our event data stored in BigQuery retroactively, because old data only has old account_uid. Accounts table (which has relation between account_uid and account_id) is already converted as csv and stored in GCS.
The last func TableRefPartition just store data into BQ's corresponding partition depending on each event timestamp. The job is still running (2017-10-30_22_45_59-18169851018279768913) and bottleneck looks Join account_id part.
That part of throughput (xxx elements/s) goes up and down according to the graph. According to the graph, estimated size of sideInputs is 106MB.
If switching to CoGroupByKey improves performance dramatically, I will do so. I was just lazy and thought using sideInputs is easier to handle event data which doesn't have account info as well.
Try one of:
1) setting the option using some code:
options.as(DataflowWorkerHarnessOptions.class).setWorkerCacheMb(500);
2) having your application register DataflowWorkerHarnessOptions with the PipelineOptionsFactory
3) Having your own options class extend DataflowWorkerHarnessOptions
There's a few things you can do to improve the performance of your code:
Your side input is a Map<String, TableRow>, but you're using only a single field in the TableRow - accRow.get("account_id"). How about making it a Map<String, String> instead, having the value be the account_id itself? That'll likely be quite a bit more efficient than the bulky TableRow object.
You could extract the value of the side input into a lazily initialized member variable in your DoFn, to avoid repeated invocations of .sideInput().
That said, this performance is unexpected and we are investigating whether there's something else going on.
Based on Javadocs and the blog post at https://beam.apache.org/blog/2017/02/13/stateful-processing.html, I tried using a simple de-duplication example using 2.0.0-beta-2 SDK which reads a file from GCS (containing a list of jsons each with a user_id field) and then running it through a pipeline as explained below.
The input data contains about 146K events of which only 50 events are unique. The entire input is about 50MB which should be processable in considerably less time than the 2 min Fixed window. I just placed a window there to make sure the per-key-per-window semantics hold without using a GlobalWindow. I run the windowed data through 3 parallel stages to compare the results, each of which are explained below.
just copies the contents into a new file on GCS - this ensures all the events were being processed as expected and I verified the contents are exactly the same as input
Combine.PerKey on the user_id and pick only the first element from the Iterable - this essentially should deduplicate the data and it works as expected. The resulting file has the exact number of unique items from the original list of events - 50 elements
stateful ParDo which checks if the key has been seen already and emits an output only when its not. Ideally, the result from this should match the deduped data as [2] but all I am seeing is only 3 unique events. These 3 unique events always point to the same 3 user_ids in a few runs I did.
Interestingly, when I just switch from the DataflowRunner to the DirectRunner running this whole process locally, I see that the output from [3] matches [2] having only 50 unique elements as expected. So, I am doubting if there are any issues with the DataflowRunner for the Stateful ParDo.
public class StatefulParDoSample {
private static Logger logger = LoggerFactory.getLogger(StatefulParDoSample.class.getName());
static class StatefulDoFn extends DoFn<KV<String, String>, String> {
final Aggregator<Long, Long> processedElements = createAggregator("processed", Sum.ofLongs());
final Aggregator<Long, Long> skippedElements = createAggregator("skipped", Sum.ofLongs());
#StateId("keyTracker")
private final StateSpec<Object, ValueState<Integer>> keyTrackerSpec =
StateSpecs.value(VarIntCoder.of());
#ProcessElement
public void processElement(
ProcessContext context,
#StateId("keyTracker") ValueState<Integer> keyTracker) {
processedElements.addValue(1l);
final String userId = context.element().getKey();
int wasSeen = firstNonNull(keyTracker.read(), 0);
if (wasSeen == 0) {
keyTracker.write( 1);
context.output(context.element().getValue());
} else {
keyTracker.write(wasSeen + 1);
skippedElements.addValue(1l);
}
}
}
public static void main(String[] args) {
DataflowPipelineOptions pipelineOptions = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
pipelineOptions.setRunner(DataflowRunner.class);
pipelineOptions.setProject("project-name");
pipelineOptions.setStagingLocation(GCS_STAGING_LOCATION);
pipelineOptions.setStreaming(false);
pipelineOptions.setAppName("deduper");
Pipeline p = Pipeline.create(pipelineOptions);
final ObjectMapper mapper = new ObjectMapper();
PCollection<KV<String, String>> keyedEvents =
p
.apply(TextIO.Read.from(GCS_SAMPLE_INPUT_FILE_PATH))
.apply(WithKeys.of(new SerializableFunction<String, String>() {
#Override
public String apply(String input) {
try {
Map<String, Object> eventJson =
mapper.readValue(input, Map.class);
return (String) eventJson.get("user_id");
} catch (Exception e) {
}
return "";
}
}))
.apply(
Window.into(
FixedWindows.of(Duration.standardMinutes(2))
)
);
keyedEvents
.apply(ParDo.of(new StatefulDoFn()))
.apply(TextIO.Write.to(GCS_SAMPLE_OUTPUT_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COPY_FILE_PATH).withNumShards(1));
keyedEvents
.apply(Combine.perKey(new SerializableFunction<Iterable<String>, String>() {
#Override
public String apply(Iterable<String> input) {
return !input.iterator().hasNext() ? "empty" : input.iterator().next();
}
}))
.apply(Values.create())
.apply(TextIO.Write.to(GCS_SAMPLE_COMBINE_FILE_PATH).withNumShards(1));
PipelineResult result = p.run();
result.waitUntilFinish();
}
}
This was a bug in the Dataflow service in batch mode, fixed in the upcoming 0.6.0 Beam release (or HEAD if you track the bleeding edge).
Thank you for bringing it to my attention! For reference, or if anything else comes up, this was tracked by BEAM-1611.
I have following type of sample data.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I'm able to find each of above mention value but couldn't able to find all of those at once. As I'm new to DataFlow, please suggest me appropriate method to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
#Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}
One approach would be to first parse the lines into one PCollection that contains a record per line, and the from that collection create two PCollection of key-value pairs. Let's say you define a record representing a line like this:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
// need a constructor here
}
Now, create a LineToRecordFn that create Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
Note that dataflow does some optimization before running your job. So, while it might look like you're doing two passes over the records PCollection, that may not be true.
The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());