Google Dataflow: Request payload size exceeds the limit: 10485760 bytes - google-cloud-dataflow

When trying to run a large transform on ~800,000 files, I get the above error message when launching the pipeline.
Here is the code:
public static void main(String[] args) {
    Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());
    GcsUtil u = getUtil(p.getOptions());

    try {
        List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
        List<String> strPaths = new ArrayList<String>();
        for (GcsPath pa : paths) {
            strPaths.add(pa.toUri().toString());
        }

        p.apply(Create.of(strPaths))
         .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
        p.run();
    }
    catch (IOException io) {
        //
    }
}
I thought handling large amounts of files and data is exactly what Google Dataflow is for?
Is there a way to split the load in order to make it work?
Thanks & BR
Phil

Dataflow is good at handling large amounts of data, but has limitations in terms of how large the description of the pipeline can be. Data passed to Create.of() is currently embedded in the pipeline description, so you can't pass very large amounts of data there - instead, large amounts of data should be read from external storage, and the pipeline should specify only their locations.
Think of it as the distinction between the amount of data a program can process vs. the size of the program's code itself.
You can get around this issue by making the expansion happen in a ParDo:
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFn()))
.apply(...fusion break (see below)...)
.apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))
where ExpandFn looks something like this:
private static class ExpandFn extends DoFn<String, String> {
    @ProcessElement
    public void process(ProcessContext c) throws IOException {
        GcsUtil util = getUtil(c.getPipelineOptions());
        for (GcsPath path : util.expand(GcsPath.fromUri(c.element()))) {
            c.output(path.toUri().toString());
        }
    }
}
and by "fusion break" I'm referring to this (basically, ParDo(add unique key) + GroupByKey + Values.create() + Flatten.iterables()). It's not very convenient, and there are discussions happening about adding a built-in transform to do this (see this PR and this thread).
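For reference, here is a minimal sketch of such a fusion break, under the assumption that expanded is the output of ParDo.of(new ExpandFn()) above (the key range of 100 is an arbitrary illustrative choice):

// Fusion-break sketch: route the expanded paths through a GroupByKey so the
// downstream work is redistributed across workers instead of staying fused
// with ExpandFn.
PCollection<String> reshuffled = expanded
    .apply("Add random key", ParDo.of(new DoFn<String, KV<Integer, String>>() {
        @ProcessElement
        public void process(ProcessContext c) {
            c.output(KV.of(ThreadLocalRandom.current().nextInt(100), c.element()));
        }
    }))
    .apply(GroupByKey.<Integer, String>create())
    .apply(Values.<Iterable<String>>create())
    .apply(Flatten.<String>iterables());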

Thank you very much! Using your input I solved it like this:
public class ZipPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(
                PipelineOptionsFactory.fromArgs(args).withValidation().create());

        try {
            p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
             .apply(ParDo.of(new ExpandFN()))
             .apply(ParDo.of(new AddKeyFN()))
             .apply(GroupByKey.<String, String>create())
             .apply(ParDo.of(new FlattenFN()))
             .apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
            p.run();
        }
        catch (Exception e) {
            LOG.error(e.getMessage());
        }
    }

    private static class FlattenFN extends DoFn<KV<String, Iterable<String>>, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            KV<String, Iterable<String>> kv = c.element();
            for (String s : kv.getValue()) {
                c.output(s);
            }
        }
    }

    private static class ExpandFN extends DoFn<String, String> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            GcsUtil u = getUtil(c.getPipelineOptions());
            for (GcsPath path : u.expand(GcsPath.fromUri(c.element()))) {
                c.output(path.toUri().toString());
            }
        }
    }

    private static class AddKeyFN extends DoFn<String, KV<String, String>> {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) {
            String path = c.element();
            String monthKey = path.split("_")[4].substring(0, 6);
            c.output(KV.of(monthKey, path));
        }
    }
}

Related

Unexpected behavior change from dataflow 1.9 to 2.0/2.1

We found a very strange difference between Dataflow SDK 1.9 and 2.0/2.1 for a very simple pipeline.
We have a CoGroupByKey step that joins two PCollections by their keys and outputs two PCollections (via TupleTags). For instance, one PCollection may contain {"str1", "str2"} and the other may contain {"str3"}.
These two PCollections are written to GCS (at different locations), and their union (basically, the PCollection produced by applying Flatten on the two PCollections) would be used by subsequent steps in a pipeline. Using the previous example, we will store {"str1", "str2"} and {"str3"} in GCS under respective locations, and the pipeline will further transform their union (Flattened PCollection) {"str1", "str2", "str3"}, and so on.
In Dataflow SDK 1.9, that is exactly what is happening, and we've built our pipelines around this logic.
As we were slowly migrating to 2.0/2.1, we noticed that this behavior is no longer observed. Instead, all the steps following the Flatten step run correctly and as expected, but the two PCollections being flattened are no longer written to GCS, as if they were nonexistent. The steps still show up in the execution graph, though, which is very strange to us.
We were able to reproduce this issue reliably so that we can share the data and code as an example.
We have two text files stored in GCS:
data1.txt:
k1,v1
k2,v2
data2.txt:
k2,w2
k3,w3
We will read these two files to create two PCollections, a PC for each file.
We'll parse each line to create KV<String, String> (so the keys are k1, k2, k3 in this example).
We then apply CoGroupByKey and produce PCollections to be output to GCS.
After the CoGroupByKey step, elements are split into two PCollections depending on whether the number of values associated with each key is even or odd (it's a contrived example, but it demonstrates the issue we are experiencing).
So one of the PCs will contain the keys "k1" and "k3" (with some value strings appended to them, see the code below), as they have one value each, and the other will contain the single key "k2", as it has two values (one from each file).
These two PCs are written to GCS at different locations, and the flattened PC of the two will also be written to GCS (but it could have been further transformed).
The three output files are expected to contain the following contents (rows may not be in order):
output1:
k2: [v2],(w2)
output2:
k3: (w3)
k1: [v1]
outputMerged:
k3: (w3)
k2: [v2],(w2)
k1: [v1]
This is exactly what we see (and expected) in Dataflow SDK 1.9.
In 2.0 and 2.1, however, output1 and output2 come out empty (and the TextIO steps are not even executed, as if no elements were input to them; we verified this by adding a dummy ParDo in between, and it's not invoked at all).
This makes us very curious as to why this behavior change was suddenly made between 1.9 and 2.0/2.1, and what the best way would be for us to achieve what we have been doing with 1.9.
Specifically, we produce output1/2 for archiving purposes, while we flatten the two PCs to transform the data further and produce another output.
Here is the Java code you can run (you will have to add the proper imports, change the bucket name, set the Options properly, etc.).
Working code for 1.9:
// Dataflow SDK 1.9 compatible.
public class TestJob {
    public static void execute(Options options) {
        Pipeline pipeline = Pipeline.create(options);

        PCollection<KV<String, String>> data1 =
                pipeline.apply(TextIO.Read.from(GcsPath.EXPERIMENT_BUCKET + "/data1.txt")).apply(ParDo.of(new doFn()));
        PCollection<KV<String, String>> data2 =
                pipeline.apply(TextIO.Read.from(GcsPath.EXPERIMENT_BUCKET + "/data2.txt")).apply(ParDo.of(new doFn()));

        TupleTag<String> inputTag1 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> inputTag2 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> outputTag1 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> outputTag2 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };

        PCollectionTuple tuple = KeyedPCollectionTuple.of(inputTag1, data1).and(inputTag2, data2)
                .apply(CoGroupByKey.<String>create()).apply(ParDo.of(new doFn2(inputTag1, inputTag2, outputTag2))
                .withOutputTags(outputTag1, TupleTagList.of(outputTag2)));

        PCollection<String> output1 = tuple.get(outputTag1);
        PCollection<String> output2 = tuple.get(outputTag2);
        PCollection<String> outputMerged = PCollectionList.of(output1).and(output2).apply(Flatten.<String>pCollections());

        outputMerged.apply(TextIO.Write.to(GcsPath.EXPERIMENT_BUCKET + "/test-job-1.9/outputMerged").withNumShards(1));
        output1.apply(TextIO.Write.to(GcsPath.EXPERIMENT_BUCKET + "/test-job-1.9/output1").withNumShards(1));
        output2.apply(TextIO.Write.to(GcsPath.EXPERIMENT_BUCKET + "/test-job-1.9/output2").withNumShards(1));

        pipeline.run();
    }

    static class doFn2 extends DoFn<KV<String, CoGbkResult>, String> {
        private static final long serialVersionUID = 1L;

        final TupleTag<String> inputTag1;
        final TupleTag<String> inputTag2;
        final TupleTag<String> outputTag2;

        public doFn2(TupleTag<String> inputTag1, TupleTag<String> inputTag2, TupleTag<String> outputTag2) {
            this.inputTag1 = inputTag1;
            this.inputTag2 = inputTag2;
            this.outputTag2 = outputTag2;
        }

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String key = c.element().getKey();
            List<String> values = new ArrayList<String>();
            int numValues = 0;
            for (String val1 : c.element().getValue().getAll(inputTag1)) {
                values.add(String.format("[%s]", val1));
                numValues++;
            }
            for (String val2 : c.element().getValue().getAll(inputTag2)) {
                values.add(String.format("(%s)", val2));
                numValues++;
            }
            final String line = String.format("%s: %s", key, Joiner.on(",").join(values));
            if (numValues % 2 == 0) {
                c.output(line);
            } else {
                c.sideOutput(outputTag2, line);
            }
        }
    }

    static class doFn extends DoFn<String, KV<String, String>> {
        private static final long serialVersionUID = 1L;

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String[] tokens = c.element().split(",");
            c.output(KV.of(tokens[0], tokens[1]));
        }
    }
}
Working code for 2.0/2.1:
// Dataflow SDK 2.0 and 2.1 compatible.
public class TestJob {
    public static void execute(Options options) {
        Pipeline pipeline = Pipeline.create(options);

        PCollection<KV<String, String>> data1 =
                pipeline.apply(TextIO.read().from(GcsPath.EXPERIMENT_BUCKET + "/data1.txt")).apply(ParDo.of(new doFn()));
        PCollection<KV<String, String>> data2 =
                pipeline.apply(TextIO.read().from(GcsPath.EXPERIMENT_BUCKET + "/data2.txt")).apply(ParDo.of(new doFn()));

        TupleTag<String> inputTag1 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> inputTag2 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> outputTag1 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };
        TupleTag<String> outputTag2 = new TupleTag<String>() {
            private static final long serialVersionUID = 1L;
        };

        PCollectionTuple tuple = KeyedPCollectionTuple.of(inputTag1, data1).and(inputTag2, data2)
                .apply(CoGroupByKey.<String>create()).apply(ParDo.of(new doFn2(inputTag1, inputTag2, outputTag2))
                .withOutputTags(outputTag1, TupleTagList.of(outputTag2)));

        PCollection<String> output1 = tuple.get(outputTag1);
        PCollection<String> output2 = tuple.get(outputTag2);
        PCollection<String> outputMerged = PCollectionList.of(output1).and(output2).apply(Flatten.<String>pCollections());

        outputMerged.apply(TextIO.write().to(GcsPath.EXPERIMENT_BUCKET + "/test-job-2.1/outputMerged").withNumShards(1));
        output1.apply(TextIO.write().to(GcsPath.EXPERIMENT_BUCKET + "/test-job-2.1/output1").withNumShards(1));
        output2.apply(TextIO.write().to(GcsPath.EXPERIMENT_BUCKET + "/test-job-2.1/output2").withNumShards(1));

        PipelineResult pipelineResult = pipeline.run();
        pipelineResult.waitUntilFinish();
    }

    static class doFn2 extends DoFn<KV<String, CoGbkResult>, String> {
        private static final long serialVersionUID = 1L;

        final TupleTag<String> inputTag1;
        final TupleTag<String> inputTag2;
        final TupleTag<String> outputTag2;

        public doFn2(TupleTag<String> inputTag1, TupleTag<String> inputTag2, TupleTag<String> outputTag2) {
            this.inputTag1 = inputTag1;
            this.inputTag2 = inputTag2;
            this.outputTag2 = outputTag2;
        }

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String key = c.element().getKey();
            List<String> values = new ArrayList<String>();
            int numValues = 0;
            for (String val1 : c.element().getValue().getAll(inputTag1)) {
                values.add(String.format("[%s]", val1));
                numValues++;
            }
            for (String val2 : c.element().getValue().getAll(inputTag2)) {
                values.add(String.format("(%s)", val2));
                numValues++;
            }
            final String line = String.format("%s: %s", key, Joiner.on(",").join(values));
            if (numValues % 2 == 0) {
                c.output(line);
            } else {
                c.output(outputTag2, line);
            }
        }
    }

    static class doFn extends DoFn<String, KV<String, String>> {
        private static final long serialVersionUID = 1L;

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String[] tokens = c.element().split(",");
            c.output(KV.of(tokens[0], tokens[1]));
        }
    }
}
Also, in case it is useful, here are the execution graphs (screenshots omitted), and for Google engineers the Job IDs are also specified:
With 1.9: job id 2017-09-29_14_35_42-15149127992051688457
With 2.1: job id 2017-09-29_14_31_59-991964669451027883
TextIO.Write 2 and 3 are not producing any output under 2.0/2.1.
Flatten and its subsequent step work fine.
This is indeed a defect. A fix is in flight and should be documented as available in the Service Release Notes.
A workaround in the meantime is to use the 1.9.1 SDK, as this error only affects 2.x SDKs.
Users interested in picking up the fix early can also use the latest nightly build from Beam (recommended to unblock development, not for production, since it's a daily build). Instructions here.

ValueProvider Issue

I am trying to get the value of a property that is passed from a Cloud Function to a Dataflow template. I am getting errors because the value being passed is a wrapper, and using the .get() method fails while the template is being built, with this error:
An exception occurred while executing the Java class. null: InvocationTargetException: Not called from a runtime context.
public interface MyOptions extends DataflowPipelineOptions {
    ...
    @Description("schema of csv file")
    ValueProvider<String> getHeader();
    void setHeader(ValueProvider<String> header);
    ...
}

public static void main(String[] args) throws IOException {
    ...
    List<String> sideInputColumns = Arrays.asList(options.getHeader().get().split(","));
    ...
    // ultimately use the headers as side inputs
    PCollection<String> input = p.apply(Create.of(sideInputColumns));
    final PCollectionView<List<String>> finalColumnView = input.apply(View.asList());
}
How do I extract the value from the ValueProvider type?
The value of a ValueProvider is not available during pipeline construction. As such, you need to organize your pipeline so that it always has the same structure, and serializes the ValueProvider. At runtime, the individual transforms within your pipeline can inspect the value to determine how to operate.
Based on your example, you may need to do something like the following. It creates a single element, and then uses a DoFn that is evaluated at runtime to expand the headers:
public static class HeaderDoFn extends DoFn<String, String> {
    private final ValueProvider<String> header;

    public HeaderDoFn(ValueProvider<String> header) {
        this.header = header;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Ignore the input element -- there should be exactly one.
        for (String column : this.header.get().split(",")) {
            c.output(column);
        }
    }
}
public static void main(String[] args) throws IOException {
    // p and options are created as in the question's main method.
    PCollection<String> input = p
        .apply(Create.of("one")) // create a single element
        .apply(ParDo.of(new HeaderDoFn(options.getHeader())));

    // Note that the order of this list is not guaranteed.
    final PCollectionView<List<String>> finalColumnView =
        input.apply(View.asList());
}
Another option would be to use a NestedValueProvider to create a ValueProvider<List<String>> from the option, and pass that ValueProvider<List<String>> to the necessary DoFns rather than using a side input.
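For example, a minimal sketch of that NestedValueProvider approach, assuming the MyOptions interface from the question (the headerColumns name is just illustrative):

ValueProvider<List<String>> headerColumns = NestedValueProvider.of(
    options.getHeader(),
    new SerializableFunction<String, List<String>>() {
        @Override
        public List<String> apply(String header) {
            // Evaluated at runtime, when the template actually executes.
            return Arrays.asList(header.split(","));
        }
    });

You would then pass headerColumns into the DoFns that need it and call headerColumns.get() inside their processing methods.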

Dataflow Map side-input issue

I'm having trouble creating a Map PCollectionView with the DataflowRunner.
The pipeline below aggregates an unbounded countingInput together with values from a side-input (containing 10 generated values).
When running the pipeline on GCP, it gets stuck inside the View.asMap() transform.
More specifically, the ParDo(StreamingPCollectionViewWriter) does not have any output.
I tried this with Dataflow 2.0.0-beta3 as well as with beam-0.7.0-SNAPSHOT, without any result. Note that my pipeline runs without any problem when using the local DirectRunner.
Am I doing something wrong?
All help is appreciated, thanks in advance for helping me out!
public class SimpleSideInputPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(SimpleSideInputPipeline.class);

    public interface Options extends DataflowPipelineOptions {}

    public static void main(String[] args) throws IOException {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        Pipeline pipeline = Pipeline.create(options);

        final PCollectionView<Map<Integer, String>> sideInput = pipeline
            .apply(CountingInput.forSubrange(0L, 10L))
            .apply("Create KV<Integer, String>", ParDo.of(new DoFn<Long, KV<Integer, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(KV.of(c.element().intValue(), "TEST"));
                }
            }))
            .apply(View.asMap());

        pipeline
            .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(5)))
            .apply("Aggregate with side-input", ParDo.of(new DoFn<Long, KV<Long, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    Map<Integer, String> map = c.sideInput(sideInput);
                    // get first segment from map
                    Object[] values = map.values().toArray();
                    String firstVal = (String) values[0];
                    LOG.info("Combined: K: " + c.element() + " V: " + firstVal + " MapSize: " + map.size());
                    c.output(KV.of(c.element(), firstVal));
                }
            }).withSideInputs(sideInput));

        pipeline.run();
    }
}
No need to worry that the ParDo(StreamingPCollectionViewWriterFn) does not record any output - what it does is actually write each element to an internal location.
Your code looks OK to me, and this should be investigated. I have filed BEAM-2155.

Sharing BigTable Connection object among DataFlow DoFn sub-classes

I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the BigTable documentation the note that connecting to BigTable is an 'expensive' operation and that it's a good idea to do it only once and to share the connection among the functions that need it.
However, if I declare the Connection object as a public static variable in the main class and first connect to BigTable in the main function, I get a NullPointerException when I subsequently try to reference the connection from the processElement() function of my DoFn subclasses as part of my Dataflow pipeline.
Conversely, if I declare the Connection as a static variable in the actual DoFn class, then the operation works successfully.
What is the best-practice or optimal way to do this?
I'm concerned that if I implement the second option at scale, I will be wasting a lot of time and resources. If I keep the variable as static in the DoFn class, is it enough to ensure that the APIs don't try to re-establish the connection every time?
I realize there is a special BigTable I/O call to sync Dataflow pipeline objects with BigTable, but I think I need to write one on my own to build some special logic into the DoFn processElement() function...
This is what the "working" code looks like:
class DigitizeBT extends DoFn<String, String> {
    private static Connection m_locConn;

    @Override
    public void processElement(ProcessContext c)
    {
        try
        {
            m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
            Table tbl = m_locConn.getTable(TableName.valueOf("TableName"));

            // rowKey is derived elsewhere in the real code
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(
                Bytes.toBytes("CF1"),
                Bytes.toBytes("SomeName"),
                Bytes.toBytes("SomeValue"));
            tbl.put(put);
        }
        catch (IOException e)
        {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
This is what the updated code looks like, FYI:
public void SmallKVJob()
{
    CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
            .withProjectId(DEF.ID_PROJ)
            .withInstanceId(DEF.ID_INST)
            .withTableId(DEF.ID_TBL_UNITS)
            .build();

    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject(DEF.ID_PROJ);
    options.setStagingLocation(DEF.ID_STG_LOC);
    // options.setNumWorkers(3);
    // options.setMaxNumWorkers(5);
    // options.setRunner(BlockingDataflowPipelineRunner.class);
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from(DEF.ID_BAL))
     .apply(ParDo.of(new DoFn1()))
     .apply(ParDo.of(new DoFn2()))
     .apply(ParDo.of(new DoFn3(config)));

    m_log.info("starting to run the job");
    p.run();
    m_log.info("finished running the job");
}
}

class DoFn1 extends DoFn<String, KV<String, Integer>>
{
    @Override
    public void processElement(ProcessContext c)
    {
        c.output(KV.of(c.element().split("\\,")[0], Integer.valueOf(c.element().split("\\,")[1])));
    }
}

class DoFn2 extends DoFn<KV<String, Integer>, KV<String, Integer>>
{
    @Override
    public void processElement(ProcessContext c)
    {
        int max = c.element().getValue();
        String name = c.element().getKey();
        for (int i = 0; i < max; i++)
            c.output(KV.of(name, 1));
    }
}

class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{
    public DoFn3(CloudBigtableConfiguration config)
    {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c)
    {
        try
        {
            Integer max = c.element().getValue();
            for (int i = 0; i < max; i++)
            {
                String owner = c.element().getKey();
                String rnd = UUID.randomUUID().toString();
                Put p = new Put(Bytes.toBytes(owner + "*" + rnd));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS)).put(p);
                c.output("Success");
            }
        } catch (IOException e)
        {
            c.output(e.toString());
            e.printStackTrace();
        }
    }
}
The input .csv file looks something like this:
Mary,3000
John,5000
Peter,2000
So, for each row in the .csv file, I have to put x rows into BigTable, where x is the value in the second cell of that row...
We built AbstractCloudBigtableTableDoFn (Source & Docs) for this purpose. Extend that class instead of DoFn, and call getConnection() instead of creating a Connection yourself.
10,000 small rows should take a second or two of actual work.
EDIT: As per the comments, BufferedMutator should be used instead of Table.put() for best throughput.
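For reference, a rough sketch of what that BufferedMutator variant could look like, reusing the DoFn3 shape and DEF constants from the question (treat this as an illustration rather than a prescribed pattern):

class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String>
{
    public DoFn3(CloudBigtableConfiguration config)
    {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c)
    {
        String owner = c.element().getKey();
        Integer max = c.element().getValue();
        // BufferedMutator batches Puts client-side instead of issuing one RPC per row.
        try (BufferedMutator mutator =
                getConnection().getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS)))
        {
            for (int i = 0; i < max; i++)
            {
                String rnd = UUID.randomUUID().toString();
                Put p = new Put(Bytes.toBytes(owner + "*" + rnd));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                mutator.mutate(p);
            }
            c.output("Success");
        }
        catch (IOException e)
        {
            c.output(e.toString());
            e.printStackTrace();
        }
    }
}

Opening and closing the BufferedMutator per element keeps the sketch short; to get the full batching benefit you would likely want to share it across elements, for example by creating it in startBundle and closing it in finishBundle.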

Sum and Average Aggregation using DataFlow

I have the following sample data.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I'm able to compute each of the above values individually, but I haven't been able to compute all of them at once. As I'm new to Dataflow, please suggest an appropriate way to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
    @Override
    public void processElement(ProcessContext c) {
        String[] words = c.element().split(",");
        if (words.length == 5) {
            Instant timestamp = Instant.parse(words[1].trim());
            KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
            KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
            c.outputWithTimestamp(userTime, timestamp);
            c.outputWithTimestamp(userLevelTime, timestamp);
        }
    }
}

public static void main(String[] args) {
    TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
            .as(TestOptions.class);
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
     .apply(ParDo.of(new ExtractUserAndUserLevelFn()))
     .apply(Window.<KV<String, Long>>into(
             FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
     .apply(GroupByKey.<String, Long>create())
     .apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
         @Override
         public void processElement(ProcessContext c) {
             String key = c.element().getKey();
             Iterable<Long> docsWithThatUrl = c.element().getValue();
             Long sum = 0L;
             for (Long item : docsWithThatUrl)
                 sum += item;
             KV<String, Long> userTime = KV.of(key, sum);
             c.output(userTime);
         }
     }))
     .apply(MapElements.via(new FormatAsTextFn()))
     .apply(TextIO.Write.named("WriteCounts").to(options.getOutput())
             .withNumShards(options.getShardsNumber()));

    p.run();
}
One approach would be to first parse the lines into one PCollection that contains a record per line, and then from that collection create two PCollections of key-value pairs. Let's say you define a record representing a line like this:
static class Record implements Serializable {
    final String user;
    final String role;
    final long duration;
    // need a constructor here
}
Now, create a LineToRecordFn that creates Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
    new SimpleFunction<Record, KV<String,Long>>() {
        @Override
        public KV<String,Long> apply(Record r) {
            return KV.of(r.role, r.duration);
        }
    }));

PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
    new SimpleFunction<Record, KV<String,Long>>() {
        @Override
        public KV<String,Long> apply(Record r) {
            return KV.of(r.user, r.duration);
        }
    }));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
Note that Dataflow does some optimization before running your job, so while it might look like you're doing two passes over the records PCollection, that may not be true.
The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());
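For the third aggregate in the question (total time_span per user), the same pattern applies; a minimal sketch, assuming the keyed-by-user collection (user_duration) from the first answer:

PCollection<KV<String, Long>> sum_by_user = user_duration.apply(
    Sum.<String>longsPerKey());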
