How can I access the key in a subclass of CombineFn when combining a PCollection of KV pairs? - google-cloud-dataflow

I'm implementing the CombinePerKeyExample using a subclass of CombineFn instead of an implementation of SerializableFunction:
package me.examples;

import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.transforms.Combine.CombineFn;

import java.util.HashSet;
import java.util.Set;

public class ConcatWordsCombineFn extends CombineFn<String, ConcatWordsCombineFn.Accumulator, String> {

    @DefaultCoder(AvroCoder.class)
    public static class Accumulator {
        HashSet<String> plays;
    }

    @Override
    public Accumulator createAccumulator() {
        Accumulator accumulator = new Accumulator();
        accumulator.plays = new HashSet<>();
        return accumulator;
    }

    @Override
    public Accumulator addInput(Accumulator accumulator, String input) {
        accumulator.plays.add(input);
        return accumulator;
    }

    @Override
    public Accumulator mergeAccumulators(Iterable<Accumulator> accumulators) {
        Accumulator mergeAccumulator = new Accumulator();
        mergeAccumulator.plays = new HashSet<>();
        for (Accumulator accumulator : accumulators) {
            mergeAccumulator.plays.addAll(accumulator.plays);
        }
        return mergeAccumulator;
    }

    @Override
    public String extractOutput(Accumulator accumulator) {
        // how to access the key here?
        return String.join(",", accumulator.plays);
    }
}
The pipeline is composed of a ReadFromBigQuery, ExtractAllPlaysOfWords (code below), and WriteToBigQuery.
package me.examples;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PlaysForWord extends PTransform<PCollection<TableRow>, PCollection<TableRow>> {

    @Override
    public PCollection<TableRow> expand(PCollection<TableRow> input) {
        PCollection<KV<String, String>> largeWords = input.apply("ExtractLargeWords", ParDo.of(new ExtractLargeWordsFn()));
        PCollection<KV<String, String>> wordNPlays = largeWords.apply("CombinePlays", Combine.perKey(new ConcatWordsCombineFn()));
        wordNPlays.setCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));
        PCollection<TableRow> rows = wordNPlays.apply("FormatToRow", ParDo.of(new FormatShakespeareOutputFn()));
        return rows;
    }
}
I would like to access the key in ConcatWordsCombineFn in order to do the final accumulation based on it. For example, join the words with "," if the key begins with an "a", or with ";" otherwise.
Looking at the programming guide:
"If you need the combining strategy to change based on the key (for example, MIN for some users and MAX for other users), you can define a KeyedCombineFn to access the key within the combining strategy."
However, I couldn't find KeyedCombineFn in org.apache.beam.sdk.transforms.Combine.
I'm using Apache Beam 2.12.0 and Google Dataflow as the runner.

I don't think there is a built-in way to solve this. The straightforward workaround (not perfect, I know) is to wrap your string into another KV: KV<String, KV<String, String>> where both keys are the same.
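For illustration, here is a minimal sketch of that workaround. The class name ConcatWordsWithKeyCombineFn and the assumption that an upstream step emits KV.of(word, KV.of(word, play)) are mine, not from the original question; the idea is simply that the value carries the key a second time, so the CombineFn can stash it in the accumulator and read it in extractOutput:

import java.util.HashSet;

import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.apache.beam.sdk.values.KV;

// Sketch only: assumes the upstream step emits KV<word, KV<word, play>>,
// i.e. the key is duplicated inside the value.
public class ConcatWordsWithKeyCombineFn
        extends CombineFn<KV<String, String>, ConcatWordsWithKeyCombineFn.Accumulator, String> {

    @DefaultCoder(AvroCoder.class)
    public static class Accumulator {
        @Nullable String key;
        HashSet<String> plays = new HashSet<>();
    }

    @Override
    public Accumulator createAccumulator() {
        return new Accumulator();
    }

    @Override
    public Accumulator addInput(Accumulator accumulator, KV<String, String> input) {
        accumulator.key = input.getKey(); // remember the key carried in the value
        accumulator.plays.add(input.getValue());
        return accumulator;
    }

    @Override
    public Accumulator mergeAccumulators(Iterable<Accumulator> accumulators) {
        Accumulator merged = createAccumulator();
        for (Accumulator accumulator : accumulators) {
            if (accumulator.key != null) {
                merged.key = accumulator.key;
            }
            merged.plays.addAll(accumulator.plays);
        }
        return merged;
    }

    @Override
    public String extractOutput(Accumulator accumulator) {
        // the key is now available here
        String separator = accumulator.key != null && accumulator.key.startsWith("a") ? "," : ";";
        return String.join(separator, accumulator.plays);
    }
}

Combine.perKey(new ConcatWordsWithKeyCombineFn()) would then be applied to a PCollection<KV<String, KV<String, String>>> and still produce a PCollection<KV<String, String>>.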

Related

How to read log messages for CombineFn function in GCP Dataflow?

I am creating an Apache Beam streaming pipeline to run in GCP Dataflow. I have a number of transforms that extend DoFn and CombineFn. Logs from the DoFn transforms show up fine in the LOGS window of the Dataflow job details; however, the logs from the CombineFn transforms are not shown.
I tried different log levels, and the logs also show up fine when using the DirectRunner.
Here is some sample code. I changed the input and output to String for brevity; there are some custom classes in my code.
import java.io.Serializable;

import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AverageSpv extends CombineFn<String, AverageSpv.Accum, String> {
    private static final Logger LOG = LoggerFactory.getLogger(AverageSpv.class);

    @DefaultCoder(AvroCoder.class)
    public static class Accum implements Serializable {
        @Nullable String id;
    }

    @Override
    public Accum createAccumulator() {
        return new Accum();
    }

    @Override
    public Accum addInput(Accum accumulator, String input) {
        LOG.info("Add input: id {}", input);
        accumulator.id = input;
        return accumulator;
    }

    @Override
    public Accum mergeAccumulators(Iterable<Accum> accumulators) {
        LOG.info("Merging accumulator");
        Accum merged = createAccumulator();
        for (Accum accumulator : accumulators) {
            merged.id = accumulator.id;
        }
        return merged;
    }

    @Override
    public String extractOutput(Accum accumulator) {
        LOG.info("Extracting accumulator");
        LOG.info("Extract output: id {}", accumulator.id);
        return accumulator.id;
    }
}
Apache Beam CombineFn operations are executed across several steps in Dataflow. (Specifically, as much pre-combining as possible happens before all results are shuffled to a single key, and then the partial results are merged together into a final result in a subsequent post-GBK step.) The fact that there's no single execution "step" corresponding to the original Combine step in the graph is probably what's preventing the logs from being found.
This is a bug and should be fixed. As mentioned, a workaround is to look at all logs from the pipeline.

Why can't Apache Beam infer the default coder when using KV<String, String>?

I'm implementing the CombinePerKeyExample using a subclass of CombineFn instead of using an implementation of SerializableFunction.
package me.examples;

import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.transforms.Combine.CombineFn;

import java.util.HashSet;
import java.util.Set;

public class ConcatWordsCombineFn extends CombineFn<String, ConcatWordsCombineFn.Accumulator, String> {

    @DefaultCoder(AvroCoder.class)
    public static class Accumulator {
        HashSet<String> plays;
    }

    @Override
    public Accumulator createAccumulator() {
        Accumulator accumulator = new Accumulator();
        accumulator.plays = new HashSet<>();
        return accumulator;
    }

    @Override
    public Accumulator addInput(Accumulator accumulator, String input) {
        accumulator.plays.add(input);
        return accumulator;
    }

    @Override
    public Accumulator mergeAccumulators(Iterable<Accumulator> accumulators) {
        Accumulator mergeAccumulator = new Accumulator();
        mergeAccumulator.plays = new HashSet<>();
        for (Accumulator accumulator : accumulators) {
            mergeAccumulator.plays.addAll(accumulator.plays);
        }
        return mergeAccumulator;
    }

    @Override
    public String extractOutput(Accumulator accumulator) {
        return String.join(",", accumulator.plays);
    }
}
The pipeline is composed of a ReadFromBigQuery, ExtractAllPlaysOfWords (code below), and WriteToBigQuery.
package me.examples;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.transforms.Combine;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PlaysForWord extends PTransform<PCollection<TableRow>, PCollection<TableRow>> {

    @Override
    public PCollection<TableRow> expand(PCollection<TableRow> input) {
        PCollection<KV<String, String>> largeWords = input.apply("ExtractLargeWords", ParDo.of(new ExtractLargeWordsFn()));
        //PCollection<KV<String, String>> wordNPlays = largeWords.apply("CombinePlays", Combine.perKey(new ConcatWordsCombineFunction()));
        //using CombineFn instead
        PCollection<KV<String, String>> wordNPlays = largeWords.apply("CombinePlays", Combine.perKey(new ConcatWordsCombineFn()));
        wordNPlays.setCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));
        PCollection<TableRow> rows = wordNPlays.apply("FormatToRow", ParDo.of(new FormatShakespeareOutputFn()));
        return rows;
    }
}
If I don't add this line in the code above:
wordNPlays.setCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()));
I get an exception:
Exception in thread "main" java.lang.IllegalStateException: Unable to return a default Coder for ExtractAllPlaysOfWords/CombinePlays/Combine.GroupedValues/ParDo(Anonymous)/ParMultiDo(Anonymous).output [PCollection]. Correct one of the following root causes:
No Coder has been manually specified; you may do so using .setCoder().
Inferring a Coder from the CoderRegistry failed: Cannot provide coder for parameterized type org.apache.beam.sdk.values.KV<K, OutputT>: Unable to provide a Coder for K.
Building a Coder using a registered CoderProvider failed.
See suppressed exceptions for detailed failures.
Using the default output Coder from the producing PTransform failed: PTransform.getOutputCoder called.
at org.apache.beam.vendor.guava.v20_0.com.google.common.base.Preconditions.checkState(Preconditions.java:444)
at org.apache.beam.sdk.values.PCollection.getCoder(PCollection.java:278)
at org.apache.beam.sdk.values.PCollection.finishSpecifying(PCollection.java:115)
at org.apache.beam.sdk.runners.TransformHierarchy.finishSpecifyingInput(TransformHierarchy.java:191)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:536)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370)
at me.examples.PlaysForWord.expand(PlaysForWord.java:21)
at me.examples.PlaysForWord.expand(PlaysForWord.java:10)
at org.apache.beam.sdk.Pipeline.applyInternal(Pipeline.java:537)
at org.apache.beam.sdk.Pipeline.applyTransform(Pipeline.java:488)
at org.apache.beam.sdk.values.PCollection.apply(PCollection.java:370)
at me.examples.Main.main(Main.java:41)
From the stack trace, I think the pipeline is not able to get a coder for the String type of the KV object. Why is that? Isn't it supposed to be a "known" type for Apache Beam? And why does it work without specifying the coder when using the subclass of SerializableFunction in Combine.perKey?
In addition, when I try to get the default coder for String from the coder registry, I do get StringUtf8Coder:
Coder coder = null;
try {
    coder = pipeline.getCoderRegistry().getCoder(String.class);
    logger.info("coder is " + coder);
} catch (Exception e) {
    logger.info("exception " + e.getMessage() + "\n coder is " + coder);
}
/* result
INFO: coder is StringUtf8Coder
*/
I used Apache Beam 2.12.0 and ran it on Google Dataflow.

How to Batch By N Elements in Streaming Pipeline With Small Bundles?

I've implemented batching by N elements as described in this answer:
Can datastore input in google dataflow pipeline be processed in a batch of N entries at a time?
package com.example.dataflow.transform;

import com.example.dataflow.event.ClickEvent;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
import org.joda.time.Instant;

import java.util.ArrayList;
import java.util.List;

public class ClickToClicksPack extends DoFn<ClickEvent, List<ClickEvent>> {
    public static final int BATCH_SIZE = 10;

    private List<ClickEvent> accumulator;

    @StartBundle
    public void startBundle() {
        accumulator = new ArrayList<>(BATCH_SIZE);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        ClickEvent clickEvent = c.element();
        accumulator.add(clickEvent);
        if (accumulator.size() >= BATCH_SIZE) {
            c.output(accumulator);
            accumulator = new ArrayList<>(BATCH_SIZE);
        }
    }

    @FinishBundle
    public void finishBundle(FinishBundleContext c) {
        if (accumulator.size() > 0) {
            ClickEvent clickEvent = accumulator.get(0);
            long time = clickEvent.getClickTimestamp().getTime();
            c.output(accumulator, new Instant(time), GlobalWindow.INSTANCE);
        }
    }
}
But when I run the pipeline in streaming mode, there are a lot of batches with just 1 or 2 elements. As I understand it, this is because of small bundle sizes. After running for a day, the average number of elements in a batch is roughly 4. I really need it to be closer to 10 for better performance of the next steps.
Is there a way to control the bundle size?
Or should I use the GroupIntoBatches transform for this purpose? In that case it's not clear to me what should be selected as the key.
UPDATE:
Is it a good idea to use the Java thread id or the VM hostname as the key for the GroupIntoBatches transform?
I ended up writing a composite transform with GroupIntoBatches inside.
The following answer contains recommendations regarding key selection:
https://stackoverflow.com/a/44956702/4888849
In my current implementation I'm using random keys to achieve parallelism, and I'm windowing the events in order to emit results regularly even if there are fewer than BATCH_SIZE events for a given key.
package com.example.dataflow.transform;

import com.example.dataflow.event.ClickEvent;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupIntoBatches;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

import java.util.Random;

/**
 * Batch clicks into packs of BATCH_SIZE size
 */
public class ClickToClicksPack extends PTransform<PCollection<ClickEvent>, PCollection<Iterable<ClickEvent>>> {
    public static final int BATCH_SIZE = 10;
    // Define window duration.
    // After the window's end, elements are emitted even if there are fewer than BATCH_SIZE of them
    public static final int WINDOW_DURATION_SECONDS = 1;
    private static final int DEFAULT_SHARDS_NUMBER = 20;
    // Determines the possible parallelism level
    private int shardsNumber = DEFAULT_SHARDS_NUMBER;

    public ClickToClicksPack() {
        super();
    }

    public ClickToClicksPack(int shardsNumber) {
        super();
        this.shardsNumber = shardsNumber;
    }

    @Override
    public PCollection<Iterable<ClickEvent>> expand(PCollection<ClickEvent> input) {
        return input
                // assign keys, as "GroupIntoBatches" works only with key-value pairs
                .apply(ParDo.of(new AssignRandomKeys(shardsNumber)))
                .apply(Window.into(FixedWindows.of(Duration.standardSeconds(WINDOW_DURATION_SECONDS))))
                .apply(GroupIntoBatches.ofSize(BATCH_SIZE))
                .apply(ParDo.of(new ExtractValues()));
    }

    /**
     * Assigns to each click a random integer between zero and shardsNumber
     */
    private static class AssignRandomKeys extends DoFn<ClickEvent, KV<Integer, ClickEvent>> {
        private int shardsNumber;
        private Random random;

        AssignRandomKeys(int shardsNumber) {
            super();
            this.shardsNumber = shardsNumber;
        }

        @Setup
        public void setup() {
            random = new Random();
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            ClickEvent clickEvent = c.element();
            KV<Integer, ClickEvent> kv = KV.of(random.nextInt(shardsNumber), clickEvent);
            c.output(kv);
        }
    }

    /**
     * Extracts the values from the KV
     */
    private static class ExtractValues extends DoFn<KV<Integer, Iterable<ClickEvent>>, Iterable<ClickEvent>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            KV<Integer, Iterable<ClickEvent>> kv = c.element();
            c.output(kv.getValue());
        }
    }
}
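For reference, a minimal sketch of how this transform might be wired into a pipeline; the clicks collection and the shard count of 40 are assumptions for illustration, not part of the original code:

// "clicks" is assumed to be a PCollection<ClickEvent> produced earlier in the pipeline.
PCollection<Iterable<ClickEvent>> batches =
        clicks.apply("BatchClicks", new ClickToClicksPack(40)); // 40 shards, as an example
// Downstream steps then receive batches of up to BATCH_SIZE clicks per element.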

Common Cloud Dataflow pattern - is there a better way?

We are finding ourselves frequently using the following pattern in Dataflow:
1. Perform a key extract ParDo from a BigQuery TableRow
2. Perform a GroupByKey on the result of 1
3. Perform a flatten ParDo on the result of 2
Is there an operation in Dataflow to achieve this in one hit (at least from the API perspective)?
I've had a look at the Combine operation, but that seems more suited to calculating values, e.g. sums/averages etc.
Without many details in your question I can only give general advice.
You could create a PTransform that combines the above pattern into a single composite transform. This lets you put the frequently used operations together into a single reusable component.
The following code should give you an idea of what I mean:
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.transforms.*;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

class ExtractKeyFn extends DoFn<TableRow, KV<String, TableRow>> {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        TableRow row = c.element();
        Object key = row.get("key");
        if (key != null) {
            c.output(KV.of(key.toString(), row));
        }
    }
}

class CompositeTransform extends PTransform<PCollection<TableRow>, PCollection<TableRow>> {
    public CompositeTransform(String name) {
        super(name);
    }

    public static CompositeTransform named(String name) {
        return new CompositeTransform(name);
    }

    @Override
    public PCollection<TableRow> apply(PCollection<TableRow> input) {
        return input.apply(ParDo.named("parse").of(new ExtractKeyFn()))
                .apply(GroupByKey.create())
                // potentially more transformations
                .apply(Values.create()) // get only the values (because we have a KV)
                .apply(Flatten.iterables()); // flatten them out
    }
}

public class Main {
    public static void run(PipelineOptions options) {
        Pipeline p = Pipeline.create(options);
        // read input
        p.apply(BigQueryIO.Read.from("inputTable...").named("inputFromBigQuery"))
                // apply fancy transform
                .apply(CompositeTransform.named("FancyKeyGroupAndFlatten"))
                // write output
                .apply(BigQueryIO.Write.to("outputTable...").named("outputToBigQuery"));
        p.run();
    }
}

Run JesterRecommenderEvaluationRunner, but get no results of evaluation

I downloaded the Jester example code in Mahout and tried to run it on the Jester dataset to see the evaluation results. The run finishes successfully, but the console only shows:
log4j:WARN No appenders could be found for logger (org.apache.mahout.cf.taste.impl.model.file.FileDataModel).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
I expected to see an evaluation score in the range 0 to 10. Can anyone help me figure out how to get the score?
I am using mahout-core-0.6.jar and the following is the code:
JesterDataModel.java:
package Jester;

import java.io.File;
import java.io.IOException;
import java.util.Collection;
import java.util.regex.Pattern;

import com.google.common.collect.Lists;
import org.apache.mahout.cf.taste.example.grouplens.GroupLensDataModel;
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.model.GenericDataModel;
import org.apache.mahout.cf.taste.impl.model.GenericPreference;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.common.iterator.FileLineIterator;
//import org.apache.mahout.cf.taste.impl.common.FileLineIterable;

public final class JesterDataModel extends FileDataModel {
    private static final Pattern COMMA_PATTERN = Pattern.compile(",");

    private long userBeingRead;

    public JesterDataModel() throws IOException {
        this(GroupLensDataModel.readResourceToTempFile("\\jester-data-1.csv"));
    }

    public JesterDataModel(File ratingsFile) throws IOException {
        super(ratingsFile);
    }

    @Override
    public void reload() {
        userBeingRead = 0;
        super.reload();
    }

    @Override
    protected DataModel buildModel() throws IOException {
        FastByIDMap<Collection<Preference>> data = new FastByIDMap<Collection<Preference>>();
        FileLineIterator iterator = new FileLineIterator(getDataFile(), false);
        FastByIDMap<FastByIDMap<Long>> timestamps = new FastByIDMap<FastByIDMap<Long>>();
        processFile(iterator, data, timestamps, false);
        return new GenericDataModel(GenericDataModel.toDataMap(data, true));
    }

    @Override
    protected void processLine(String line,
                               FastByIDMap<?> rawData,
                               FastByIDMap<FastByIDMap<Long>> timestamps,
                               boolean fromPriorData) {
        FastByIDMap<Collection<Preference>> data = (FastByIDMap<Collection<Preference>>) rawData;
        String[] jokePrefs = COMMA_PATTERN.split(line);
        int count = Integer.parseInt(jokePrefs[0]);
        Collection<Preference> prefs = Lists.newArrayListWithCapacity(count);
        for (int itemID = 1; itemID < jokePrefs.length; itemID++) { // yes, skip the first one, it's just a count
            String jokePref = jokePrefs[itemID];
            if (!"99".equals(jokePref)) {
                float jokePrefValue = Float.parseFloat(jokePref);
                prefs.add(new GenericPreference(userBeingRead, itemID, jokePrefValue));
            }
        }
        data.put(userBeingRead, prefs);
        userBeingRead++;
    }
}
JesterRecommenderEvaluatorRunner.java
package Jester;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public final class JesterRecommenderEvaluatorRunner {
    private static final Logger log = LoggerFactory.getLogger(JesterRecommenderEvaluatorRunner.class);

    private JesterRecommenderEvaluatorRunner() {
        // do nothing
    }

    public static void main(String... args) throws IOException, TasteException {
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
        DataModel model = new JesterDataModel();
        double evaluation = evaluator.evaluate(new JesterRecommenderBuilder(),
                null,
                model,
                0.9,
                1.0);
        log.info(String.valueOf(evaluation));
    }
}
Mahout 0.7 is old, and 0.6 is very old. Use at least 0.7, or better, a later build from SVN.
I think the problem is exactly what you identified: you don't have any slf4j bindings on your classpath. If you use the ".job" files in Mahout you will have all dependencies packaged, and then you will actually see output.
