Sum and Average Aggregation using DataFlow

Sum and Average Aggregation using DataFlow - google-cloud-dataflow

I have following type of sample data.
s.n., time, user, time_span, user_level
1, 2016-01-04T1:26:13, Hari, 8, admin
2, 2016-01-04T11:6:13, Gita, 2, admin
3, 2016-01-04T11:26:13, Gita, 0, user
Now I need to find average_time_span/user, average_time_span/user_level and total_time_span/user.
I'm able to find each of above mention value but couldn't able to find all of those at once. As I'm new to DataFlow, please suggest me appropriate method to do so.
static class ExtractUserAndUserLevelFn extends DoFn<String, KV<String, Long>> {
#Override
public void processElement(ProcessContext c) {
String[] words = c.element().split(",");
if (words.length == 5) {
Instant timestamp = Instant.parse(words[1].trim());
KV<String, Long> userTime = KV.of(words[2].trim(), Long.valueOf(words[3].trim()));
KV<String, Long> userLevelTime = KV.of(words[4].trim(), Long.valueOf(words[3].trim()));
c.outputWithTimestamp(userTime, timestamp);
c.outputWithTimestamp(userLevelTime, timestamp);
}
}
}
public static void main(String[] args) {
TestOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(TestOptions.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.named("ReadLines").from(options.getInputFile()))
.apply(ParDo.of(new ExtractUserAndUserLevelFn()))
.apply(Window.<KV<String, Long>>into(
FixedWindows.of(Duration.standardSeconds(options.getMyWindowSize()))))
.apply(GroupByKey.<String, Long>create())
.apply(ParDo.of(new DoFn<KV<String, Iterable<Long>>, KV<String, Long>>() {
public void processElement(ProcessContext c) {
String key = c.element().getKey();
Iterable<Long> docsWithThatUrl = c.element().getValue();
Long sum = 0L;
for (Long item : docsWithThatUrl)
sum += item;
KV<String, Long> userTime = KV.of(key, sum);
c.output(userTime);
}
}))
.apply(MapElements.via(new FormatAsTextFn()))
.apply(TextIO.Write.named("WriteCounts").to(options.getOutput()).
withNumShards(options.getShardsNumber()));
p.run();
}

One approach would be to first parse the lines into one PCollection that contains a record per line, and the from that collection create two PCollection of key-value pairs. Let's say you define a record representing a line like this:
static class Record implements Serializable {
final String user;
final String role;
final long duration;
// need a constructor here
}
Now, create a LineToRecordFn that create Records from the input lines, so that you can do:
PCollection<Record> records = p.apply(TextIO.Read.named("ReadLines")
.from(options.getInputFile()))
.apply(ParDo.of(new LineToRecordFn()));
You can window here, if you want. Whether you window or not, you can then create your keyed-by-role and keyed-by-user PCollections:
PCollection<KV<String,Long>> role_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.role,r.duration);
}
}));
PCollection<KV<String,Long>> user_duration = records.apply(MapElements.via(
new SimpleFunction<Record,KV<String,Long>>() {
#Override
public KV<String,Long> apply(Record r) {
return KV.of(r.user, r.duration);
}
}));
Now, you can get the means and sum in just a few lines:
PCollection<KV<String,Double>> mean_by_user = user_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Double>> mean_by_role = role_duration.apply(
Mean.<String,Long>perKey());
PCollection<KV<String,Long>> sum_by_role = role_duration.apply(
Sum.<String>longsPerKey());
Note that dataflow does some optimization before running your job. So, while it might look like you're doing two passes over the records PCollection, that may not be true.

The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this:
PCollection<KV<String, Double>> meanPerKey =
input.apply(Mean.<String, Integer>perKey());
PCollection<KV<String, Integer>> sumPerKey = input
.apply(Sum.<String>integersPerKey());

Related

Recommendation Engine using Apache Spark MLIB showing up Zero recommendations after processing all operations

I am a newbie when it comes to Implementation of ML Algorithms. I wanted to implement a recommendation Engine and Got to know after little experimenting that collaborative-filtering can be used for the same. I am using Apache Spark for the same. I got help from one of the blogs and tried to implement the same in my local. PFB Code that I tried out. Every time I execute this the Count of Recommendations that is getting printed is always zero. I don see any Evident Error as such. Could someone please help me understand this. Also, please feel free to provide any other reference that can be referred in this regard.
package mllib.example;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;
import scala.Tuple2;
public class RecommendationEngine {
public static void main(String[] args) {
// Create Java spark context
SparkConf conf = new SparkConf().setAppName("Recommendation System Example").setMaster("local[2]").set("spark.executor.memory","1g");
JavaSparkContext sc = new JavaSparkContext(conf);
// Read user-item rating file. format - userId,itemId,rating
JavaRDD<String> userItemRatingsFile = sc.textFile(args[0]);
System.out.println("Count is "+userItemRatingsFile.count());
// Read item description file. format - itemId, itemName, Other Fields,..
JavaRDD<String> itemDescritpionFile = sc.textFile(args[1]);
System.out.println("itemDescritpionFile Count is "+itemDescritpionFile.count());
// Map file to Ratings(user,item,rating) tuples
JavaRDD<Rating> ratings = userItemRatingsFile.map(new Function<String, Rating>() {
public Rating call(String s) {
String[] sarray = s.split(",");
return new Rating(Integer.parseInt(sarray[0]), Integer
.parseInt(sarray[1]), Double.parseDouble(sarray[2]));
}
});
System.out.println("Ratings RDD Object"+ratings.first().toString());
// Create tuples(itemId,ItemDescription), will be used later to get names of item from itemId
JavaPairRDD<Integer,String> itemDescritpion = itemDescritpionFile.mapToPair(
new PairFunction<String, Integer, String>() {
#Override
public Tuple2<Integer, String> call(String t) throws Exception {
String[] s = t.split(",");
return new Tuple2<Integer,String>(Integer.parseInt(s[0]), s[1]);
}
});
System.out.println("itemDescritpion RDD Object"+ratings.first().toString());
// Build the recommendation model using ALS
int rank = 10; // 10 latent factors
int numIterations = Integer.parseInt(args[2]); // number of iterations
MatrixFactorizationModel model = ALS.trainImplicit(JavaRDD.toRDD(ratings),
rank, numIterations);
//ALS.trainImplicit(arg0, arg1, arg2)
// Create user-item tuples from ratings
JavaRDD<Tuple2<Object, Object>> userProducts = ratings
.map(new Function<Rating, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Rating r) {
return new Tuple2<Object, Object>(r.user(), r.product());
}
});
// Calculate the itemIds not rated by a particular user, say user with userId = 1
JavaRDD<Integer> notRatedByUser = userProducts.filter(new Function<Tuple2<Object,Object>, Boolean>() {
#Override
public Boolean call(Tuple2<Object, Object> v1) throws Exception {
if (((Integer) v1._1).intValue() != 0) {
return true;
}
return false;
}
}).map(new Function<Tuple2<Object,Object>, Integer>() {
#Override
public Integer call(Tuple2<Object, Object> v1) throws Exception {
return (Integer) v1._2;
}
});
// Create user-item tuples for the items that are not rated by user, with user id 1
JavaRDD<Tuple2<Object, Object>> itemsNotRatedByUser = notRatedByUser
.map(new Function<Integer, Tuple2<Object, Object>>() {
public Tuple2<Object, Object> call(Integer r) {
return new Tuple2<Object, Object>(0, r);
}
});
// Predict the ratings of the items not rated by user for the user
JavaRDD<Rating> recomondations = model.predict(itemsNotRatedByUser.rdd()).toJavaRDD().distinct();
// Sort the recommendations by rating in descending order
recomondations = recomondations.sortBy(new Function<Rating,Double>(){
#Override
public Double call(Rating v1) throws Exception {
return v1.rating();
}
}, false, 1);
System.out.println("recomondations Total is "+recomondations.count());
// Get top 10 recommendations
JavaRDD<Rating> topRecomondations = sc.parallelize(recomondations.take(10));
// Join top 10 recommendations with item descriptions
JavaRDD<Tuple2<Rating, String>> recommendedItems = topRecomondations.mapToPair(
new PairFunction<Rating, Integer, Rating>() {
#Override
public Tuple2<Integer, Rating> call(Rating t) throws Exception {
return new Tuple2<Integer,Rating>(t.product(),t);
}
}).join(itemDescritpion).values();
System.out.println("recommendedItems count is "+recommendedItems.count());
//Print the top recommendations for user 1.
recommendedItems.foreach(new VoidFunction<Tuple2<Rating,String>>() {
#Override
public void call(Tuple2<Rating, String> t) throws Exception {
System.out.println(t._1.product() + "\t" + t._1.rating() + "\t" + t._2);
}
});
Also, I see that this job is Running for real Long time. Every time it creates a model.Is there a way I can Create the Model once, persist it and Load the same for consecutive Predictions. Can we by any chance improve the Speed of execution of this job
Thanks in Advance

DymanicDestinations in Apache Beam

I have a PCollection [String] say "X" that I need to dump in a BigQuery table.
The table destination and the schema for it is in a PCollection[TableRow] say "Y".
How to accomplish this in the simplest manner?
I tried extracting the table and schema from "Y" and saving it in static global variables (tableName and schema respectively). But somehow oddly the BigQueryIO.writeTableRows() always gets the value of the variable tableName as null. But it gets the schema. I tried logging the values of those variables and I can see the values are there for both.
Here is my pipeline code:
static String tableName;
static TableSchema schema;
PCollection<String> read = p.apply("Read from input file",
TextIO.read().from(options.getInputFile()));
PCollection<TableRow> tableRows = p.apply(
BigQueryIO.read().fromQuery(NestedValueProvider.of(
options.getfilename(),
new SerializableFunction<String, String>() {
#Override
public String apply(String filename) {
return "SELECT table,schema FROM `BigqueryTest.configuration` WHERE file='" + filename +"'";
}
})).usingStandardSql().withoutValidation());
final PCollectionView<List<String>> dataView = read.apply(View.asList());
tableRows.apply("Convert data read from file to TableRow",
ParDo.of(new DoFn<TableRow,TableRow>(){
#ProcessElement
public void processElement(ProcessContext c) {
tableName = c.element().get("table").toString();
String[] schemas = c.element().get("schema").toString().split(",");
List<TableFieldSchema> fields = new ArrayList<>();
for(int i=0;i<schemas.length;i++) {
fields.add(new TableFieldSchema()
.setName(schemas[i].split(":")[0]).setType(schemas[i].split(":")[1]));
}
schema = new TableSchema().setFields(fields);
//My code to convert data to TableRow format.
}}).withSideInputs(dataView));
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to("ProjectID:DatasetID."+tableName)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
Everything works fine. Only BigQueryIO.write operation fails and I get the error TableId is null.
I also tried using SerializableFunction and returning the value from there but i still get null.
Here is the code that I tried for it:
tableRows.apply("write to BigQuery",
BigQueryIO.writeTableRows()
.withSchema(schema)
.to(new GetTable(tableName))
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));
public static class GetTable implements SerializableFunction<String,String> {
String table;
public GetTable() {
this.table = tableName;
}
#Override
public String apply(String arg0) {
return "ProjectId:DatasetId."+table;
}
}
I also tried using DynamicDestinations but I get an error saying schema is not provided. Honestly I'm new to the concept of DynamicDestinations and I'm not sure that I'm doing it correctly.
Here is the code that I tried for it:
tableRows2.apply(BigQueryIO.writeTableRows()
.to(new DynamicDestinations<TableRow, TableRow>() {
private static final long serialVersionUID = 1L;
#Override
public TableDestination getTable(TableRow dest) {
List<TableRow> list = sideInput(bqDataView); //bqDataView contains table and schema
String table = list.get(0).get("table").toString();
String tableSpec = "ProjectId:DatasetId."+table;
String tableDescription = "";
return new TableDestination(tableSpec, tableDescription);
}
public String getSideInputs(PCollectionView<List<TableRow>> bqDataView) {
return null;
}
#Override
public TableSchema getSchema(TableRow destination) {
return schema; //schema is getting added from the global variable
}
#Override
public TableRow getDestination(ValueInSingleWindow<TableRow> element) {
return null;
}
}.getSideInputs(bqDataView)));
Please let me know what I'm doing wrong and which path I should take.
Thank You.

Part of the reason your having trouble is because of the two stages of pipeline execution. First the pipeline is constructed on your machine. This is when all of the applications of PTransforms occur. In your first example, this is when the following lines are executed:
BigQueryIO.writeTableRows()
.withSchema(schema)
.to("ProjectID:DatasetID."+tableName)
The code within a ParDo however runs when your pipeline executes, and it does so on many machines. So the following code runs much later than the pipeline construction:
#ProcessElement
public void processElement(ProcessContext c) {
tableName = c.element().get("table").toString();
...
schema = new TableSchema().setFields(fields);
...
}
This means that neither the tableName nor the schema fields will be set at when the BigQueryIO sink is created.
Your idea to use DynamicDestinations is correct, but you need to move the code to actually generate the schema the destination into that class, rather than relying on global variables that aren't available on all of the machines.

Google Dataflow: Request payload size exceeds the limit: 10485760 bytes

when trying to run a large transform on ~ 800.000 files, I get the above error message when trying to run the pipeline.
Here is the code:
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
GcsUtil u = getUtil(p.getOptions());
try{
List<GcsPath> paths = u.expand(GcsPath.fromUri("gs://tlogdataflow/stage/*.zip"));
List<String> strPaths = new ArrayList<String>();
for(GcsPath pa: paths){
strPaths.add(pa.toUri().toString());
}
p.apply(Create.of(strPaths))
.apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
p.run();
}
catch(IOException io){
//
}
}
I thought thats exactly what google data flow is for? Handling large amounts of files / data?
Is there a way to split the load in order to make it work?
Thanks & BR
Phil

Dataflow is good at handling large amounts of data, but has limitations in terms of how large the description of the pipeline can be. Data passed to Create.of() is currently embedded in the pipeline description, so you can't pass very large amounts of data there - instead, large amounts of data should be read from external storage, and the pipeline should specify only their locations.
Think of it as the distinction between the amount of data a program can process vs. the size of the program's code itself.
You can get around this issue by making the expansion happen in a ParDo:
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFn()))
.apply(...fusion break (see below)...)
.apply(Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")))
where ExpandFn is something like as follows:
private static class ExpandFn extends DoFn<String, String> {
#ProcessElement
public void process(ProcessContext c) {
GcsUtil util = getUtil(c.getPipelineOptions());
for (String path : util.expand(GcsPath.fromUri(c.element()))) {
c.output(path);
}
}
}
and by fusion break I'm referring to this (basically, ParDo(add unique key) + group by key + Flatten.iterables() + Values.create()). It's not very convenient and there are discussions happening about adding a built-in transform to do this (see this PR and this thread).

Thank you very much! Using your input I solved it like this:
public class ZipPipeline {
private static final Logger LOG = LoggerFactory.getLogger(ZipPipeline.class);
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
try{
p.apply(Create.of("gs://tlogdataflow/stage/*.zip"))
.apply(ParDo.of(new ExpandFN()))
.apply(ParDo.of(new AddKeyFN()))
.apply(GroupByKey.<String,String>create())
.apply(ParDo.of(new FlattenFN()))
.apply("Unzip Files", Write.to(new ZipIO.Sink("gs://tlogdataflow/outbox")));
p.run();
}
catch(Exception e){
LOG.error(e.getMessage());
}
}
private static class FlattenFN extends DoFn<KV<String,Iterable<String>>, String>{
private static final long serialVersionUID = 1L;
#Override
public void processElement(ProcessContext c){
KV<String,Iterable<String>> kv = c.element();
for(String s: kv.getValue()){
c.output(s);
}
}
}
private static class ExpandFN extends DoFn<String,String>{
private static final long serialVersionUID = 1L;
#Override
public void processElement(ProcessContext c) throws Exception{
GcsUtil u = getUtil(c.getPipelineOptions());
for(GcsPath path : u.expand(GcsPath.fromUri(c.element()))){
c.output(path.toUri().toString());
}
}
}
private static class AddKeyFN extends DoFn<String, KV<String,String>>{
private static final long serialVersionUID = 1L;
#Override
public void processElement(ProcessContext c){
String path = c.element();
String monthKey = path.split("_")[4].substring(0, 6);
c.output(KV.of(monthKey, path));
}
}

DataflowAssert doesn't pass TableRow test

We don't know why when running this simple test, DataflowAssert fails:
#Test
#Category(RunnableOnService.class)
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
PCollection<TableRow> pCollectionTable1 = p.apply("a",Create.of(TABLEROWS_ARRAY_1));
PCollection<TableRow> pCollectionTable2 = p.apply("b",Create.of(TABLEROWS_ARRAY_2));
PCollection<TableRow> joinedTables = Table.join(pCollectionTable1, pCollectionTable2);
DataflowAssert.that(joinedTables).containsInAnyOrder(TABLEROW_TEST);
p.run();
}
We are getting the following exception:
Sep 25, 2015 10:42:50 AM com.google.cloud.dataflow.sdk.testing.DataflowAssert$TwoSideInputAssert$CheckerDoFn processElement
SEVERE: DataflowAssert failed expectations.
java.lang.AssertionError:
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
at org.junit.Assert.assertThat(Assert.java:865)
at org.junit.Assert.assertThat(Assert.java:832)
at ...
In order to simplify the DataflowAssert test we hardcoded the output of Table.join to match DataflowAssert,having:
private static final TableRow TABLEROW_TEST = new TableRow()
.set("id", "x");
static PCollection<TableRow> join(PCollection<TableRow> pCollectionTable1,
PCollection<TableRow> pCollectionTable2) throws Exception {
final TupleTag<String> pCollectionTable1Tag = new TupleTag<String>();
final TupleTag<String> pCollectionTable2Tag = new TupleTag<String>();
PCollection<KV<String, String>> table1Data = pCollectionTable1
.apply(ParDo.of(new ExtractTable1DataFn()));
PCollection<KV<String, String>> table2Data = pCollectionTable2
.apply(ParDo.of(new ExtractTable2DataFn()));
PCollection<KV<String, CoGbkResult>> kvpCollection = KeyedPCollectionTuple
.of(pCollectionTable1Tag, table1Data).and(pCollectionTable2Tag, table2Data)
.apply(CoGroupByKey.<String> create());
PCollection<KV<String, String>> resultCollection = kvpCollection
.apply(ParDo.named("Process join")
.of(new DoFn<KV<String, CoGbkResult>, KV<String, String>>() {
private static final long serialVersionUID = 0;
#Override
public void processElement(ProcessContext c) {
// System.out.println(c);
KV<String, CoGbkResult> e = c.element();
String key = e.getKey();
String value = null;
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));
}
}));
PCollection<TableRow> formattedResults = resultCollection.apply(
ParDo.named("Format join").of(new DoFn<KV<String, String>, TableRow>() {
private static final long serialVersionUID = 0;
public void processElement(ProcessContext c) {
TableRow row = new TableRow().set("id", "x");
c.output(row);
}
}));
return formattedResults;
}
Does anyone know what we are doing wrong?

I think the error message is telling you that the actual collection contains more copies of that element than the expectation.
Expected: iterable over [<{id=x}>] in any order
but: Not matched: <{id=x}>
This is hamcrest indicating that you wanted an iterable over a single element, but the actual collection had an item which wasn't matched. Since all of the items coming out of "format join" have the same value, it made this harder to read than it should have been.
Specifically, this is the message produced when I run the following test, which checks to see if the collection with two copies of row is the contains exactly one copy of row:
#Category(RunnableOnService.class)
#Test
public void testTableRow() throws Exception {
Pipeline p = TestPipeline.create();
TableRow row = new TableRow().set("id", "x");
PCollection<TableRow> rows = p.apply(Create.<TableRow>of(row, row));
DataflowAssert.that(rows).containsInAnyOrder(row);
p.run();
}
In order to get that result with your code, I had to take advantage of the fact that you only iterate over entries in table2. Specifically:
// Use these as the input tables.
table1 = [("keyA", "A1a"), ("keyA", "A1b]
table2 = [("keyA", "A2a"), ("keyA", "A2b"), ("keyB", "B2")]
// The CoGroupByKey returns
[("keyA", (["A1a", "A1b"], ["A2a", "A2b"])),
("keyB", ([], ["B2"]))]
// When run through "Process join" this produces.
// For details on why see the next section.
["A2b,A2b",
"B2,B2"]
// When run through "Format join" this becomes the following.
[{id=x}, {id=x}]
Note that the DoFn for "Process join" may not produce the expected results as commented below:
String key = e.getKey();
String value = null;
// NOTE: Both table1Value and table2Value iterate over pCollectionTable2Tag
for (String table1Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
for (String table2Value : c.element().getValue().getAll(pCollectionTable2Tag)) {
// NOTE: this updates value, and doesn't output it. So for each
// key there will be a single output with the *last* value
// rather than one for each pair.
value = table1Value + "," + table2Value;
}
}
c.output(KV.of(key, value));

How do I create user defined counters in Dataflow?

How can I create my own counters in my DoFns?
In my DoFn I'd like to increment a counter every time a condition is met when processing a record. I'd like this counter to sum the values across all records.

You can use Aggregators, and the total values of the counters will show up in the UI.
Here is an example where I experimented with Aggregators in a pipeline that just sleeps numOutputShards workers for sleepSecs seconds. (The GenFakeInput PTransform at the beginning just returns a flattened PCollection<String> of size numOutputShards):
PCollection<String> output = p
.apply(new GenFakeInput(options.getNumOutputShards()))
.apply(ParDo.named("Sleep").of(new DoFn<String, String>() {
private Aggregator<Long> tSleepSecs;
private Aggregator<Integer> tWorkers;
private Aggregator<Long> tExecTime;
private long startTimeMillis;
#Override
public void startBundle(Context c) {
tSleepSecs = c.createAggregator("Total Slept (sec)", new Sum.SumLongFn());
tWorkers = c.createAggregator("Num Workers", new Sum.SumIntegerFn());
tExecTime = c.createAggregator("Total Wallclock (sec)", new Sum.SumLongFn());
startTimeMillis = System.currentTimeMillis();
}
#Override
public void finishBundle(Context c) {
tExecTime.addValue((System.currentTimeMillis() - startTimeMillis)/1000);
}
#Override
public void processElement(ProcessContext c) {
try {
LOG.info("Sleeping for {} seconds.", sleepSecs);
tSleepSecs.addValue(sleepSecs);
tWorkers.addValue(1);
TimeUnit.SECONDS.sleep(sleepSecs);
} catch (InterruptedException e) {
LOG.info("Ignoring caught InterruptedException during sleep.");
}
c.output(c.element());
}}));

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Sum and Average Aggregation using DataFlow - google-cloud-dataflow

The Mean and Sum transforms look like they would work well for this use case. Basic usage looks like this: PCollection<KV<String, Double>> meanPerKey = input.apply(Mean.<String, Integer>perKey()); PCollection<KV<String, Integer>> sumPerKey = input .apply(Sum.<String>integersPerKey());

Related

Recommendation Engine using Apache Spark MLIB showing up Zero recommendations after processing all operations

DymanicDestinations in Apache Beam

Google Dataflow: Request payload size exceeds the limit: 10485760 bytes

DataflowAssert doesn't pass TableRow test

How do I create user defined counters in Dataflow?

Categories

Resources