Google Dataflow Apache Beam

I am trying to use Apache Beam to create a Dataflow pipeline, but I am not able to follow the documentation and cannot find any examples.
The pipeline is simple:
1. Create a pipeline.
2. Read from a Pub/Sub topic.
3. Write to Spanner.
Currently, I am stuck at step 2. I cannot find any example of how to read from Pub/Sub and consume it.
This is the code I have so far:
class ExtractFlowInfoFn extends DoFn<PubsubMessage, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(KV.of("key", "value"));
    }
}

public static void main(String[] args) {
    Pipeline p = Pipeline.create(
        PipelineOptionsFactory.fromArgs(args).withValidation().create());
    p.apply("ReadFromPubSub", PubsubIO.readMessages().fromSubscription("test"))
        .apply("ConvertToKeyValuePair", ParDo.of(new ExtractFlowInfoFn()))
        .apply("WriteToLog", /* stuck here */);
}
I was able to come up with the code by following multiple examples. To be honest, I have no idea what I am doing here.
Please either help me understand this or link me to the correct documentation.

Example of pulling messages from Pub/Sub and writing to Cloud Spanner:
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;

class MessageToMutationDoFn extends DoFn<PubsubMessage, Mutation> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // TODO: create Mutation object from PubsubMessage
        Mutation mutation = Mutation.newInsertBuilder("users_backup2")
            .set("column_1").to("value_1")
            .set("column_2").to("value_2")
            .set("column_3").to("value_3")
            .build();
        c.output(mutation);
    }
}
public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    p.apply("ReadFromPubSub", PubsubIO.readMessages().fromSubscription("test"))
        .apply("MessageToMutation", ParDo.of(new MessageToMutationDoFn()))
        .apply("WriteToSpanner", SpannerIO.write()
            .withProjectId("projectId")
            .withInstanceId("spannerInstanceId")
            .withDatabaseId("spannerDatabaseId"));
    p.run();
}
Reference: Apache Beam SpannerIO, Apache Beam PubsubIO
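To fill in the TODO above, here is a minimal sketch of consuming the message payload, assuming a hypothetical UTF-8 CSV payload of the form "value_1,value_2,value_3" (your actual format will differ):

import java.nio.charset.StandardCharsets;

// Inside processElement: PubsubMessage.getPayload() returns the raw bytes.
String payload = new String(c.element().getPayload(), StandardCharsets.UTF_8);
String[] fields = payload.split(",");
Mutation mutation = Mutation.newInsertBuilder("users_backup2")
    .set("column_1").to(fields[0])
    .set("column_2").to(fields[1])
    .set("column_3").to(fields[2])
    .build();
c.output(mutation);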

Related

How to read log messages for CombineFn function in GCP Dataflow?

I am creating an Apache Beam streaming pipeline to run in GCP Dataflow. I have a number of transforms that extend DoFn and CombineFn. Logs from the DoFn transforms are visualised fine in the LOGS window of the Dataflow job details, but the logs from the CombineFn transforms are not shown.
I tried different log levels, and the logs also show up fine when using the DirectRunner.
Here is some sample code. I changed the input and output types to String for brevity; there are some custom classes in my actual code.
import java.io.Serializable;
import org.apache.avro.reflect.Nullable;
import org.apache.beam.sdk.coders.AvroCoder;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.transforms.Combine.CombineFn;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class AverageSpv extends CombineFn<String, AverageSpv.Accum, String> {
    private static final Logger LOG = LoggerFactory.getLogger(AverageSpv.class);

    @DefaultCoder(AvroCoder.class)
    public static class Accum implements Serializable {
        @Nullable String id;
    }

    @Override
    public Accum createAccumulator() {
        return new Accum();
    }

    @Override
    public Accum addInput(Accum accumulator, String input) {
        LOG.info("Add input: id {}", input);
        accumulator.id = input;
        return accumulator;
    }

    @Override
    public Accum mergeAccumulators(Iterable<Accum> accumulators) {
        LOG.info("Merging accumulator");
        Accum merged = createAccumulator();
        for (Accum accumulator : accumulators) {
            merged.id = accumulator.id;
        }
        return merged;
    }

    @Override
    public String extractOutput(Accum accumulator) {
        LOG.info("Extracting accumulator");
        LOG.info("Extract output: id {}", accumulator.id);
        return accumulator.id;
    }
}
Apache Beam CombineFn operations are executed across several steps in Dataflow. (Specifically, as much pre-combining as possible happens before the shuffle, and the partial results are then merged into a final result in a subsequent post-GroupByKey step.) The fact that there's no single execution "step" corresponding to the original Combine step in the graph is probably what's preventing the logs from being found.
This is a bug and should be fixed. As mentioned, a workaround is to look at all logs from the pipeline.
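As a hedged example of that workaround: in the Cloud Logs Explorer you can drop the per-step filter and query all worker logs for the job, with a filter along these lines (the job ID is a placeholder):

resource.type="dataflow_step"
resource.labels.job_id="<your_job_id>"
"Add input"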

Constructing a pipeline with ValueProvider.RuntimeProvider

I have a Google Dataflow job using library version 1.9.1. The job takes runtime arguments, and we used TextIO.read().from().withoutValidation(). Since we migrated to Google Dataflow 2.0.0, withoutValidation has been removed in 2.0.0. The release notes page doesn't mention this: https://cloud.google.com/dataflow/release-notes/release-notes-java-2
We tried to pass the input as a ValueProvider.RuntimeProvider, but during pipeline construction we get the following error. If we pass it as a ValueProvider, pipeline creation tries to validate the value provider. How do I provide a runtime value provider for a TextIO input in Google Cloud Dataflow 2.0.0?
java.lang.RuntimeException: Method getInputFile should not have return type RuntimeValueProvider, use ValueProvider instead.
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:505)
I'm going to assume you are using templated pipelines, and that your pipeline is consuming runtime parameters. Here's a working example using the Cloud Dataflow SDK version 2.1.0. It reads a file from GCS (passed to the template at runtime), turns each row into a TableRow and writes to BigQuery. It's a trivial example, but it works with 2.1.0.
Program args are as follows:
--project=<your_project_id>
--runner=DataflowRunner
--templateLocation=gs://<your_bucket>/dataflow_pipeline
--stagingLocation=gs://<your_bucket>/jars
--tempLocation=gs://<your_bucket>/tmp
Program code is as follows:
public class TemplatePipeline {
    public static void main(String[] args) {
        PipelineOptionsFactory.register(TemplateOptions.class);
        TemplateOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(TemplateOptions.class);
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("READ", TextIO.read().from(options.getInputFile()).withCompressionType(TextIO.CompressionType.GZIP))
                .apply("TRANSFORM", ParDo.of(new WikiParDo()))
                .apply("WRITE", BigQueryIO.writeTableRows()
                        .to(String.format("%s:dataset_name.wiki_demo", options.getProject()))
                        .withCreateDisposition(CREATE_IF_NEEDED)
                        .withWriteDisposition(WRITE_TRUNCATE)
                        .withSchema(getTableSchema()));
        pipeline.run();
    }

    private static TableSchema getTableSchema() {
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("year").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("month").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("day").setType("INTEGER"));
        fields.add(new TableFieldSchema().setName("wikimedia_project").setType("STRING"));
        fields.add(new TableFieldSchema().setName("language").setType("STRING"));
        fields.add(new TableFieldSchema().setName("title").setType("STRING"));
        fields.add(new TableFieldSchema().setName("views").setType("INTEGER"));
        return new TableSchema().setFields(fields);
    }

    public interface TemplateOptions extends DataflowPipelineOptions {
        @Description("GCS path of the file to read from")
        ValueProvider<String> getInputFile();

        void setInputFile(ValueProvider<String> value);
    }

    private static class WikiParDo extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            String[] split = c.element().split(",");
            TableRow row = new TableRow();
            for (int i = 0; i < split.length; i++) {
                TableFieldSchema col = getTableSchema().getFields().get(i);
                row.set(col.getName(), split[i]);
            }
            c.output(row);
        }
    }
}
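Once the template has been staged with the arguments above, it can be launched with runtime parameters. A hedged example using the gcloud CLI (job name, bucket, and input path are placeholders):

gcloud dataflow jobs run wiki-demo \
    --gcs-location gs://<your_bucket>/dataflow_pipeline \
    --parameters inputFile=gs://<your_bucket>/input/pagecounts.csv.gz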

ValueProvider Issue

I am trying to get the value of a property that is passed from a Cloud Function to a Dataflow template. I am getting errors because the value being passed is a wrapper, and calling the .get() method during pipeline construction fails with this error:
An exception occurred while executing the Java class. null: InvocationTargetException: Not called from a runtime context.
public interface MyOptions extends DataflowPipelineOptions {
    ...
    @Description("schema of csv file")
    ValueProvider<String> getHeader();

    void setHeader(ValueProvider<String> header);
    ...
}
public static void main(String[] args) throws IOException {
    ...
    List<String> sideInputColumns = Arrays.asList(options.getHeader().get().split(","));
    ...
    // ultimately use the getHeaders as side inputs
    PCollection<String> input = p.apply(Create.of(sideInputColumns));
    final PCollectionView<List<String>> finalColumnView = input.apply(View.asList());
}
How do I extract the value from the ValueProvider type?
The value of a ValueProvider is not available during pipeline construction. As such, you need to organize your pipeline so that it always has the same structure, and serializes the ValueProvider. At runtime, the individual transforms within your pipeline can inspect the value to determine how to operate.
Based on your example, you may need to do something like the following. It creates a single element, and then uses a DoFn that is evaluated at runtime to expand the headers:
public static class HeaderDoFn extends DoFn<String, String> {
    private final ValueProvider<String> header;

    public HeaderDoFn(ValueProvider<String> header) {
        this.header = header;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        // Ignore input element -- there should be exactly one
        for (String column : this.header.get().split(",")) {
            c.output(column);
        }
    }
}
public static void main(String[] args) throws IOException {
    PCollection<String> input = p
        .apply(Create.of("one")) // create a single element
        .apply(ParDo.of(new HeaderDoFn(options.getHeader())));
    // Note that the order of this list is not guaranteed.
    final PCollectionView<List<String>> finalColumnView =
        input.apply(View.asList());
}
Another option would be to use a NestedValueProvider to create a ValueProvider<List<String>> from the option, and pass that ValueProvider<List<String>> to the necessary DoFns rather than using a side input.
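A minimal sketch of that NestedValueProvider approach, assuming the same getHeader() option as above (the transformation function runs only when .get() is called at runtime):

import java.util.Arrays;
import java.util.List;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.options.ValueProvider.NestedValueProvider;
import org.apache.beam.sdk.transforms.SerializableFunction;

// Derive a ValueProvider<List<String>> from the raw header option;
// pass "columns" to the DoFns that need the list.
ValueProvider<List<String>> columns = NestedValueProvider.of(
    options.getHeader(),
    new SerializableFunction<String, List<String>>() {
        @Override
        public List<String> apply(String header) {
            return Arrays.asList(header.split(","));
        }
    });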

Dataflow Map side-input issue

I'm having trouble creating a Map PCollectionView with the DataflowRunner.
The pipeline below aggregates an unbounded countingInput together with values from a side input (containing 10 generated values).
When running the pipeline on GCP, it gets stuck inside the View.asMap() transform.
More specifically, the ParDo(StreamingPCollectionViewWriter) does not have any output.
I tried this with Dataflow 2.0.0-beta3, as well as with beam-0.7.0-SNAPSHOT, with the same result. Note that my pipeline runs without any problem when using the local DirectRunner.
Am I doing something wrong?
All help is appreciated, thanks in advance for helping me out!
public class SimpleSideInputPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(SimpleSideInputPipeline.class);

    public interface Options extends DataflowPipelineOptions {}

    public static void main(String[] args) throws IOException {
        Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
        Pipeline pipeline = Pipeline.create(options);

        final PCollectionView<Map<Integer, String>> sideInput = pipeline
            .apply(CountingInput.forSubrange(0L, 10L))
            .apply("Create KV<Integer, String>", ParDo.of(new DoFn<Long, KV<Integer, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(KV.of(c.element().intValue(), "TEST"));
                }
            }))
            .apply(View.asMap());

        pipeline
            .apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(5)))
            .apply("Aggregate with side-input", ParDo.of(new DoFn<Long, KV<Long, String>>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    Map<Integer, String> map = c.sideInput(sideInput);
                    // get first segment from map
                    Object[] values = map.values().toArray();
                    String firstVal = (String) values[0];
                    LOG.info("Combined: K: " + c.element() + " V: " + firstVal + " MapSize: " + map.size());
                    c.output(KV.of(c.element(), firstVal));
                }
            }).withSideInputs(sideInput));

        pipeline.run();
    }
}
No need to worry that the ParDo(StreamingPCollectionViewWriterFn) does not record any output - what it does is actually write each element to an internal location.
Your code looks OK to me, and this should be investigated. I have filed BEAM-2155.

Sharing BigTable Connection object among DataFlow DoFn sub-classes

I am setting up a Java Pipeline in DataFlow to read a .csv file and to create a bunch of BigTable rows based on the content of the file. I see in the BigTable documentation the note that connecting to BigTable is an 'expensive' operation and that it's a good idea to do it only once and to share the connection among the functions that need it.
However, if I declare the Connection object as a public static variable in the main class and first connect to BigTable in the main function, I get a NullPointerException when I subsequently try to reference the connection in the DoFn sub-classes' processElement() functions as part of my DataFlow pipeline.
Conversely, if I declare the Connection as a static variable in the actual DoFn class, then the operation works successfully.
What is the best-practice or optimal way to do this?
I'm concerned that if I implement the second option at scale, I will be wasting a lot of time and resources. If I keep the variable as static in the DoFn class, is it enough to ensure that the APIs don't try to re-establish the connection every time?
I realize there is a special BigTable I/O call to sync DataFlow pipeline objects with BigTable, but I think I need to write one on my own to build some special logic into the DoFn processElement() function...
This is what the "working" code looks like:
class DigitizeBT extends DoFn<String, String> {
    private static Connection m_locConn;

    @Override
    public void processElement(ProcessContext c) {
        try {
            m_locConn = BigtableConfiguration.connect("projectID", "instanceID");
            Table tbl = m_locConn.getTable(TableName.valueOf("TableName"));
            String rowKey = c.element(); // derive the row key from the input element
            Put put = new Put(Bytes.toBytes(rowKey));
            put.addColumn(
                Bytes.toBytes("CF1"),
                Bytes.toBytes("SomeName"),
                Bytes.toBytes("SomeValue"));
            tbl.put(put);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
    }
}
This is what updated code looks like, FYI:
public void SmallKVJob() {
    CloudBigtableScanConfiguration config = new CloudBigtableScanConfiguration.Builder()
        .withProjectId(DEF.ID_PROJ)
        .withInstanceId(DEF.ID_INST)
        .withTableId(DEF.ID_TBL_UNITS)
        .build();

    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    options.setProject(DEF.ID_PROJ);
    options.setStagingLocation(DEF.ID_STG_LOC);
    // options.setNumWorkers(3);
    // options.setMaxNumWorkers(5);
    // options.setRunner(BlockingDataflowPipelineRunner.class);
    options.setRunner(DirectPipelineRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(TextIO.Read.from(DEF.ID_BAL))
        .apply(ParDo.of(new DoFn1()))
        .apply(ParDo.of(new DoFn2()))
        .apply(ParDo.of(new DoFn3(config)));

    m_log.info("starting to run the job");
    p.run();
    m_log.info("finished running the job");
}
class DoFn1 extends DoFn<String, KV<String, Integer>> {
    @Override
    public void processElement(ProcessContext c) {
        c.output(KV.of(c.element().split("\\,")[0], Integer.valueOf(c.element().split("\\,")[1])));
    }
}

class DoFn2 extends DoFn<KV<String, Integer>, KV<String, Integer>> {
    @Override
    public void processElement(ProcessContext c) {
        int max = c.element().getValue();
        String name = c.element().getKey();
        for (int i = 0; i < max; i++)
            c.output(KV.of(name, 1));
    }
}

class DoFn3 extends AbstractCloudBigtableTableDoFn<KV<String, Integer>, String> {
    public DoFn3(CloudBigtableConfiguration config) {
        super(config);
    }

    @Override
    public void processElement(ProcessContext c) {
        try {
            Integer max = c.element().getValue();
            for (int i = 0; i < max; i++) {
                String owner = c.element().getKey();
                String rnd = UUID.randomUUID().toString();
                Put p = new Put(Bytes.toBytes(owner + "*" + rnd));
                p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
                getConnection().getTable(TableName.valueOf(DEF.ID_TBL_UNITS)).put(p);
                c.output("Success");
            }
        } catch (IOException e) {
            c.output(e.toString());
            e.printStackTrace();
        }
    }
}
The input .csv file looks something like this:
Mary,3000
John,5000
Peter,2000
So, for each row in the .csv file, I have to put in x number of rows into BigTable, where x is the second cell in the .csv file...
We built AbstractCloudBigtableTableDoFn (source & docs) for this purpose. Extend that class instead of DoFn, and call getConnection() instead of creating a Connection yourself.
10,000 small rows should take a second or two of actual work.
EDIT: As per the comments, BufferedMutator should be used instead of Table.put() for best throughput.
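A minimal sketch of that BufferedMutator variant inside a DoFn3-style subclass (table and column-family constants reused from the code above; lifecycle handling simplified):

// Sketch only: a BufferedMutator batches mutations for better throughput
// than per-element Table.put(). getBufferedMutator() and mutate() are
// standard HBase client APIs available via getConnection().
BufferedMutator mutator =
    getConnection().getBufferedMutator(TableName.valueOf(DEF.ID_TBL_UNITS));
Put p = new Put(Bytes.toBytes(owner + "*" + rnd));
p.addColumn(Bytes.toBytes(DEF.ID_CF1), Bytes.toBytes("Owner"), Bytes.toBytes(owner));
mutator.mutate(p);
// Flush (or close) the mutator when the bundle finishes so buffered
// writes are actually committed.
mutator.flush();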
