I am trying to create a template for cloud data flow job that reads json file from cloud storage and writes to Big Query. I am passing 2 runtime arguments : 1. InputFile for GCS location 2. Dataset and Table Id of BigQuery.
JsonTextToBqTemplate code:
public class JsonTextToBqTemplate {
private static final Logger logger =
LoggerFactory.getLogger(TextToBQTemplate.class);
private static Gson gson = new GsonBuilder().create();
public static void main(String[] args) throws Exception {
JsonToBQTemplateOptions options =
PipelineOptionsFactory.fromArgs(args).withValidation()
.as(JsonToBQTemplateOptions.class);
String jobName = options.getJobName();
try {
logger.info("PIPELINE-INFO: jobName={} message={} ",
jobName, "starting pipeline creation");
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("ReadLines", TextIO.read().from(options.getInputFile()))
.apply("Converting to TableRows", ParDo.of(new DoFn<String, TableRow>() {
private static final long serialVersionUID = 0;
#ProcessElement
public void processElement(ProcessContext c) {
String json = c.element();
TableRow tableRow = gson.fromJson(json, TableRow.class);
c.output(tableRow);
}
}))
.apply(BigQueryIO.writeTableRows().to(options.getTableSpec())
.withCreateDisposition(CreateDisposition.CREATE_NEVER)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
logger.info("PIPELINE-INFO: jobName={} message={} ", jobName, "pipeline started");
State state = pipeline.run().waitUntilFinish();
logger.info("PIPELINE-INFO: jobName={} message={} ", jobName, "pipeline status" + state);
} catch (Exception exception) {
throw exception;
}
}
}
Options Code:
public interface JsonToBQTemplateOptions extends PipelineOptions{
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
ValueProvider<String> getErrorOutput();
void setErrorOutput(ValueProvider<String> value);
ValueProvider<String> getTableSpec();
void setTableSpec(ValueProvider<String> value);
}
Maven command to create template:
mvn -X compile exec:java \
-Dexec.mainClass=com.xyz.adp.pipeline.template.JsonTextToBqTemplate \
-Dexec.args="--project=xxxxxx-12356 \
--stagingLocation=gs://xxx-test/template/staging/jsontobq/ \
--tempLocation=gs://xxx-test/temp/ \
--templateLocation=gs://xxx-test/template/templates/jsontobq \
--errorOutput=gs://xxx-test/template/output"
Error:
Caused by: java.lang.IllegalStateException: Cannot estimate size of a FileBasedSource with inaccessible file pattern: {}. [RuntimeValueProvider{propertyName=inputFile, default=null, value=null}]
at org.apache.beam.sdk.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:518)
at org.apache.beam.sdk.io.FileBasedSource.getEstimatedSizeBytes(FileBasedSource.java:199)
at org.apache.beam.runners.direct.BoundedReadEvaluatorFactory$InputProvider.getInitialInputs(BoundedReadEvaluatorFactory.java:207)
at org.apache.beam.runners.direct.ReadEvaluatorFactory$InputProvider.getInitialInputs(ReadEvaluatorFactory.java:87)
at org.apache.beam.runners.direct.RootProviderRegistry.getInitialInputs(RootProviderRegistry.java:62)
Mvn Build was successful when I pass values for inputFile and tableSpec as below.
mvn -X compile exec:java \
-Dexec.mainClass=com.ihm.adp.pipeline.template.JsonTextToBqTemplate \
-Dexec.args="--project=xxxxxx-123456 \
--stagingLocation=gs://xxx-test/template/staging/jsontobq/ \
--tempLocation=gs://xxx-test/temp/ \
--templateLocation=gs://xxx-test/template/templates/jsontobq \
--inputFile=gs://xxx-test/input/bqtest.json \
--tableSpec=xxx_test.jsontobq_test \
--errorOutput=gs://xxx-test/template/output"
But it won't create any template in Cloud dataflow.
Is there a way to create a template without validating these runtime arguments during maven execution?
I think the problem here is that you are not specifying a runner. By default, this is attempting to use the DirectRunner. Try to pass
--runner=TemplatingDataflowPipelineRunner
as part of your -Dexec.args. After this you also should not need to specify the ValueProvider template arguments like inputFile, etc.
More info here:
https://cloud.google.com/dataflow/docs/templates/creating-templates
If you are using Dataflow SDK version 1.x, then you need to specify the following arguments:
--runner=TemplatingDataflowPipelineRunner
--dataflowJobFile=gs://xxx-test/template/templates/jsontobq/
If you are using Dataflow SDK version 2.x (Apache Beam), then you need to specify the following arguments:
--runner=DataflowRunner
--templateLocation=gs://xxx-test/template/templates/jsontobq/
It looks like you're using Dataflow SDK version 2.x and not specifying DataflowRunner for the runner argument.
Reference: https://cloud.google.com/dataflow/docs/templates/creating-templates
Related
I have a simple SCDF stream that looks like this:
http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp
The mvmn-transform is a simple custom transformer that looks like this:
#SpringBootApplication
#EnableBinding(Processor.class)
#EnableConfigurationProperties(ScdfTestTransformerProperties.class)
#Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
#Autowired
protected ScdfTestTransformerProperties config;
#Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
public Object transform(Message<?> message) {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", config.getSomeConfigProp());
return result;
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
This works fine.
But I've read that Spring Cloud Function should allow me to implement such apps without a necessity to specify binding and transformer annotations, so I've changed it to this:
#SpringBootApplication
// #EnableBinding(Processor.class)
#EnableConfigurationProperties(ScdfTestTransformerProperties.class)
#Configuration
public class ScdfTestTransformer {
public static void main(String args[]) {
SpringApplication.run(ScdfTestTransformer.class, args);
}
#Autowired
protected ScdfTestTransformerProperties config;
// #Transformer(inputChannel = Processor.INPUT, outputChannel = Processor.OUTPUT)
#Bean
public Function<Message<?>, Map<String, Object>> transform(
// Message<?> message
) {
return message -> {
Object payload = message.getPayload();
Map<String, Object> result = new HashMap<>();
Map<String, String> headersStr = new HashMap<>();
message.getHeaders().forEach((k, v) -> headersStr.put(k, v != null ? v.toString() : null));
result.put("headers", headersStr);
result.put("payload", payload);
result.put("configProp", "Config prop val: " + config.getSomeConfigProp());
return result;
};
}
// See https://stackoverflow.com/questions/59155689/could-not-decode-json-type-for-key-file-name-in-a-spring-cloud-data-flow-stream
#Bean("kafkaBinderHeaderMapper")
public KafkaHeaderMapper kafkaBinderHeaderMapper() {
BinderHeaderMapper mapper = new BinderHeaderMapper();
mapper.setEncodeStrings(true);
return mapper;
}
}
And now I have a problem - SCDF source and target topic names are ignored by Spring-Cloud-Function apparently, and topics transform-in-0 and transform-out-0 are created instead.
SCDF creates topics that have names like <stream-name>.<app-name> eg something like TestStream123.http and TestStream123.mvmn-transform
Previously they were used for transformer - as it should be, since it is a part of the SCDF stream. But now they are ignored by Spring-Cloud-Function and transform-in-0 and transform-out-0 are created instead.
Thus my transformer no longer receives any input, as it expects it on a wrong Kafka topic. And would probably produce no output to the stream as well, since it outputs to the wrong Kafka topic also.
P.S. Just in case, full project code on GitHub: https://github.com/mvmn/scdftest-transformer/tree/scfunc
In order to run locally start up Kafka, Skipper, SCDF and SCDF console, do mvn clean install in the app folder and then do app register --name mvmn-transform-1 --type processor --uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOT --metadata-uri maven://x.mvmn.study.scdf.scdftest:scdftest-transformer:0.1.1-SNAPSHOTin the coonsole. Then you can deploy stream using definition http --port=12346 | mvmn-transform | file --name=tmp.txt --directory=/tmp
Since you are using the functional model of writing Spring Cloud Stream applications, when you deploy this app, you need to pass two properties on the custom processor to restore the Spring Cloud Data Flow behavior.
spring.cloud.stream.function.bindings.transform-in-0=input
spring.cloud.stream.function.bindings.transform-out-0=output
Can you try that and see if that makes a difference?
I am trying to use Apache Beam to create a Dataflow pipeline and I am not able to follow the documentation and cannot find any examples.
The pipeline is simple.
Create a pipeline
Read from a pub/sub topic
Write to spanner.
Currently, I am stuck at step 2. I cannot find any example of how to read from pub/sub and and consume it.
This is the code I have so far and would like to
class ExtractFlowInfoFn extends DoFn<PubsubMessage, KV<String, String>> {
public void processElement(ProcessContext c) {
KV.of("key", "value");
}
}
public static void main(String[] args) {
Pipeline p = Pipeline.create(
PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply("ReadFromPubSub", PubsubIO.readMessages().fromSubscription("test"))
.apply("ConvertToKeyValuePair", ParDo.of(new ExtractFlowInfoFn()))
.apply("WriteToLog", ));
};
I was able to come up with the code by following multiple examples. To be honest, I have no idea what I am doing here.
Please, either help me understand this or link me to the correct documentation.
Example of pulling messages from Pub/Sub and writing to Cloud Spanner:
import com.google.cloud.spanner.Mutation;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.transforms.DoFn.ProcessElement;
class MessageToMutationDoFn extends DoFn<PubsubMessage, Mutation> {
#ProcessElement
public void processElement(ProcessContext c) {
// TODO: create Mutation object from PubsubMessage
Mutation mutation = Mutation.newInsertBuilder("users_backup2")
.set("column_1").to("value_1")
.set("column_2").to("value_2")
.set("column_3").to("value_3")
.build();
c.output(mutation);
}
}
public static void main(String[] args) {
Pipeline p = Pipeline.create();
p.apply("ReadFromPubSub", PubsubIO.readMessages().fromSubscription("test"))
.apply("MessageToMutation", ParDo.of(new MessageToMutationDoFn()))
.apply("WriteToSpanner", SpannerIO.write()
.withProjectId("projectId")
.withInstanceId("spannerInstanceId")
.withDatabaseId("spannerDatabaseId"));
p.run();
}
Reference: Apache Beam SpannerIO, Apache Beam PubsubIO
I have a Google Dataflow job using library version 1.9.1 job, The job was taking runtime arguments. We used the TextIO.read().from().withoutValidation(). Since we migrated to google dataflow 2.0.0 , the withoutValidation is removed in 2.0.0. The Release notes page doesnt talk about this https://cloud.google.com/dataflow/release-notes/release-notes-java-2 .
We tried to pass the input as a ValueProvider.RuntimeProvider. But During pipeline construction, we get the following error. If pass it as ValueProvider the pipeline creation is trying to validate the value provider. How do I provide a runtime value provider for a TextIO input in google cloud dataflow 2.0.0?
java.lang.RuntimeException: Method getInputFile should not have return type RuntimeValueProvider, use ValueProvider instead.
at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:505)
I'm going to assume you are using templated pipelines, and that your pipeline is consuming runtime parameters. Here's a working example using the Cloud Dataflow SDK version 2.1.0. It reads a file from GCS (passed to the template at runtime), turns each row into a TableRow and writes to BigQuery. It's a trivial example, but it works with 2.1.0.
Program args are as follows:
--project=<your_project_id>
--runner=DataflowRunner
--templateLocation=gs://<your_bucket>/dataflow_pipeline
--stagingLocation=gs://<your_bucket>/jars
--tempLocation=gs://<your_bucket>/tmp
Program code is as follows:
public class TemplatePipeline {
public static void main(String[] args) {
PipelineOptionsFactory.register(TemplateOptions.class);
TemplateOptions options = PipelineOptionsFactory
.fromArgs(args)
.withValidation()
.as(TemplateOptions.class);
Pipeline pipeline = Pipeline.create(options);
pipeline.apply("READ", TextIO.read().from(options.getInputFile()).withCompressionType(TextIO.CompressionType.GZIP))
.apply("TRANSFORM", ParDo.of(new WikiParDo()))
.apply("WRITE", BigQueryIO.writeTableRows()
.to(String.format("%s:dataset_name.wiki_demo", options.getProject()))
.withCreateDisposition(CREATE_IF_NEEDED)
.withWriteDisposition(WRITE_TRUNCATE)
.withSchema(getTableSchema()));
pipeline.run();
}
private static TableSchema getTableSchema() {
List<TableFieldSchema> fields = new ArrayList<>();
fields.add(new TableFieldSchema().setName("year").setType("INTEGER"));
fields.add(new TableFieldSchema().setName("month").setType("INTEGER"));
fields.add(new TableFieldSchema().setName("day").setType("INTEGER"));
fields.add(new TableFieldSchema().setName("wikimedia_project").setType("STRING"));
fields.add(new TableFieldSchema().setName("language").setType("STRING"));
fields.add(new TableFieldSchema().setName("title").setType("STRING"));
fields.add(new TableFieldSchema().setName("views").setType("INTEGER"));
return new TableSchema().setFields(fields);
}
public interface TemplateOptions extends DataflowPipelineOptions {
#Description("GCS path of the file to read from")
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
}
private static class WikiParDo extends DoFn<String, TableRow> {
#ProcessElement
public void processElement(ProcessContext c) throws Exception {
String[] split = c.element().split(",");
TableRow row = new TableRow();
for (int i = 0; i < split.length; i++) {
TableFieldSchema col = getTableSchema().getFields().get(i);
row.set(col.getName(), split[i]);
}
c.output(row);
}
}
}
I am trying to get the value of a property that is passed from a cloud function to a dataflow template. I am getting errors because the value being passed is a wrapper, and using the .get() method fails during the compile. with this error
An exception occurred while executing the Java class. null: InvocationTargetException: Not called from a runtime context.
public interface MyOptions extends DataflowPipelineOptions {
...
#Description("schema of csv file")
ValueProvider<String> getHeader();
void setHeader(ValueProvider<String> header);
...
}
public static void main(String[] args) throws IOException {
...
List<String> sideInputColumns = Arrays.asList(options.getHeader().get().split(","));
...
//ultimately use the getHeaders as side inputs
PCollection<String> input = p.apply(Create.of(sideInputColumns));
final PCollectionView<List<String>> finalColumnView = input.apply(View.asList());
}
How do I extract the value from the ValueProvider type?
The value of a ValueProvider is not available during pipeline construction. As such, you need to organize your pipeline so that it always has the same structure, and serializes the ValueProvider. At runtime, the individual transforms within your pipeline can inspect the value to determine how to operate.
Based on your example, you may need to do something like the following. It creates a single element, and then uses a DoFn that is evaluated at runtime to expand the headers:
public static class HeaderDoFn extends DoFn<String, String> {
private final ValueProvider<String> header;
public HeaderDoFn(ValueProvider<String> header) {
this.header = header;
}
#ProcessElement
public void processElement(ProcessContext c) {
// Ignore input element -- there should be exactly one
for (String column : this.header().get().split(",")) {
c.output(column);
}
}
}
public static void main(String[] args) throws IOException {
PCollection<String> input = p
.apply(Create.of("one")) // create a single element
.apply(ParDo.of(new DoFn<String, String>() {
#ProcessElement
public void processElement(ProcessContext c) {
}
});
// Note that the order of this list is not guaranteed.
final PCollectionView<List<String>> finalColumnView =
input.apply(View.asList());
}
Another option would be to use a NestedValueProvider to create a ValueProvider<List<String>> from the option, and pass that ValueProvider<List<String>> to the necessary DoFns rather than using a side input.
I'm having trouble creating a Map PCollectionView with the DataflowRunner.
The pipeline below aggregates an unbouded countingInput together with values from a side-input (containing 10 generated values).
When running the pipeline on gcp it get's stuck inside the View.asMap() transform.
More specifially, the ParDo(StreamingPCollectionViewWriter) does not have any output.
I tried this with dataflow 2.0.0-beta3, as well as with beam-0.7.0-SNAPSHOT, without any result. Note that my pipeline is running without any problem when using the local DirectRunner.
Am I doing something wrong?
All help is appreciated, thanks in advance for helping me out!
public class SimpleSideInputPipeline {
private static final Logger LOG = LoggerFactory.getLogger(SimpleSideInputPipeline.class);
public interface Options extends DataflowPipelineOptions {}
public static void main(String[] args) throws IOException {
Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline pipeline = Pipeline.create(options);
final PCollectionView<Map<Integer, String>> sideInput = pipeline
.apply(CountingInput.forSubrange(0L, 10L))
.apply("Create KV<Integer, String>",ParDo.of(new DoFn<Long, KV<Integer, String>>() {
#ProcessElement
public void processElement(ProcessContext c) {
c.output(KV.of(c.element().intValue(), "TEST"));
}
}))
.apply(View.asMap());
pipeline
.apply(CountingInput.unbounded().withRate(1, Duration.standardSeconds(5)))
.apply("Aggregate with side-input",ParDo.of(new DoFn<Long, KV<Long, String>>() {
#ProcessElement
public void processElement(ProcessContext c) {
Map<Integer, String> map = c.sideInput(sideInput);
//get first segment from map
Object[] values = map.values().toArray();
String firstVal = (String) values[0];
LOG.info("Combined: K: "+ c.element() + " V: " + firstVal + " MapSize: " + map.size());
c.output(KV.of(c.element(), firstVal));
}
}).withSideInputs(sideInput));
pipeline.run();
}
}
No need to worry that the ParDo(StreamingPCollectionViewWriterFn) does not record any output - what it does is actually write each element to an internal location.
You code looks OK to me, and this should be investigated. I have filed BEAM-2155.