Google dataflow: AvroIO read from file in google storage passed as runtime parameter - google-cloud-dataflow

I want to read Avro files in my Dataflow pipeline using Java SDK 2.
I schedule my Dataflow job from a Cloud Function that is triggered when files are uploaded to the bucket.
Following is the code for options:
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
I am trying to read this input file using following code:
PCollection<user> records = p.apply(
    AvroIO.read(user.class)
        .from(String.valueOf(options.getInputFile())));
I get following error while running the pipeline:
java.lang.IllegalArgumentException: Unable to find any files matching RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}
The same code works fine with TextIO.
How can I read the Avro file that was uploaded to trigger the Cloud Function, which in turn starts the Dataflow pipeline?

Please try ...from(options.getInputFile())) without converting it to a string.
For simplicity, you could even define your option as a simple String:
String getInputFile();
void setInputFile(String value);

You simply need to use from(options.getInputFile()): AvroIO explicitly supports reading from a ValueProvider.
Currently the code takes options.getInputFile(), which is a ValueProvider, calls Java's toString() on it, which yields the human-readable debug string "RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}", and passes that as the filename for AvroIO to read. Of course this string is not a valid filename, which is why the code currently doesn't work.
Also note that the whole point of ValueProvider is that it is a placeholder for a value that is not known while constructing the pipeline and will be supplied later (potentially the pipeline will be executed several times, supplying different values). Extracting the value of a ValueProvider at pipeline construction time is therefore impossible by design, because there is no value yet. At runtime though (e.g. in a DoFn) you can extract the value by calling .get() on it.
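Putting the two answers together, a minimal sketch of the fixed pipeline might look like this (the User class stands in for the user.class from the snippet above; the interface name and everything else not shown in the question are assumptions):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.AvroIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.ValueProvider;
import org.apache.beam.sdk.values.PCollection;

public interface MyOptions extends PipelineOptions {
    ValueProvider<String> getInputFile();
    void setInputFile(ValueProvider<String> value);
}

MyOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline p = Pipeline.create(options);

// Hand the ValueProvider to AvroIO directly; it is resolved at runtime,
// so the same template can be launched with a different file each time.
PCollection<User> records = p.apply("Read Avro",
    AvroIO.read(User.class).from(options.getInputFile()));

p.run();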

Related

Apache Beam: Wait for the AvroIO write step to finish before starting the ImportTransform Dataflow template

I'm using Apache Beam to create a pipeline that basically reads an input file, converts it to Avro, writes the Avro files to a bucket and then imports these Avro files into Spanner using a Dataflow template.
The problem I'm facing is that the last step (importing the Avro files into the database) starts before the previous one (writing the Avro files to the bucket) is done.
I tried to add the Wait.on function, but that only works on a PCollection, and writing the files with AvroIO returns PDone.
Example of the Code:
// Step 1: Read Files
PCollection<String> lines = pipeline.apply("Reading Input Data exported from Cassandra",
    TextIO.read().from(options.getInputFile()));

// Step 2: Convert to Avro
lines.apply("Write Item Avro File",
    AvroIO.writeGenericRecords(spannerItemAvroSchema)
        .to(options.getOutput())
        .withSuffix(".avro"));

// Step 3: Import to the DataBase
pipeline.apply(new ImportTransform(
    spannerConfig,
    options.getInputDir(),
    options.getWaitForIndexes(),
    options.getWaitForForeignKeys(),
    options.getEarlyIndexCreateFlag()));
Again, the problem is that Step 3 starts before Step 2 is done.
Any ideas?
This is a flaw in the API; see, for example, a recent discussion of this on the Beam dev mailing list. The only workarounds for now are either to fork AvroIO so that it returns a PCollection, or to run two pipelines sequentially.
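For the second workaround, a rough sketch of running the two pipelines sequentially (names are reused from the question; records stands for the PCollection<GenericRecord> produced by the elided conversion step, and how the options are shared between the two pipelines is an assumption):

// Pipeline 1: read the input, convert it to Avro and write the files.
Pipeline writePipeline = Pipeline.create(options);
PCollection<String> lines = writePipeline.apply("Reading Input Data exported from Cassandra",
    TextIO.read().from(options.getInputFile()));
// ... convert lines into a PCollection<GenericRecord> called records ...
records.apply("Write Item Avro File",
    AvroIO.writeGenericRecords(spannerItemAvroSchema)
        .to(options.getOutput())
        .withSuffix(".avro"));
// Block until every Avro file has been written.
writePipeline.run().waitUntilFinish();

// Pipeline 2: only now import the Avro files into Spanner.
Pipeline importPipeline = Pipeline.create(options);
importPipeline.apply(new ImportTransform(
    spannerConfig,
    options.getInputDir(),
    options.getWaitForIndexes(),
    options.getWaitForForeignKeys(),
    options.getEarlyIndexCreateFlag()));
importPipeline.run().waitUntilFinish();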

Does the latest version of TextIO (2.11 and above) have the ability to read lines from a file in parallel?

I read through the splittable DoFn blog post, and from what I could gather, this feature is already available in TextIO (for the Cloud Dataflow runner). The thing I am not clear on is: using TextIO, will I be able to read lines from a given file in parallel?
For Java only, the TextIO source will automatically read an uncompressed file in parallel.
This isn't officially documented, but the TextIO source is a subclass of FileBasedSource, which allows seeking. This means that if a worker decides to split the work, it can do so. See the code for FileBasedSource splitting here.
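For reference, a minimal sketch of the kind of read being discussed; nothing special is required to opt in, since the splitting happens inside the source (the bucket path is a placeholder):

Pipeline p = Pipeline.create(options);

// A plain TextIO read. On an uncompressed file the underlying FileBasedSource
// can be split into ranges, so different workers can read different parts of the file.
PCollection<String> lines = p.apply("Read lines",
    TextIO.read().from("gs://my-bucket/big-file.txt"));

p.run();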
The answer from Cubez is good. I would also like to add that TextIO, being both a PTransform and an I/O connector, has the expand() method implemented:
@Override
public PCollection<String> expand(PCollection<FileIO.ReadableFile> input) {
    return input.apply(
        "Read all via FileBasedSource",
        new ReadAllViaFileBasedSource<>(
            getDesiredBundleSizeBytes(),
            new CreateTextSourceFn(getDelimiter()),
            StringUtf8Coder.of()));
}
If we look further, we can see that the ReadAllViaFileBasedSource class also has an expand() method, defined as follows:
@Override
public PCollection<T> expand(PCollection<ReadableFile> input) {
    return input
        .apply("Split into ranges", ParDo.of(new SplitIntoRangesFn(desiredBundleSizeBytes)))
        .apply("Reshuffle", Reshuffle.viaRandomKey())
        .apply("Read ranges", ParDo.of(new ReadFileRangesFn<>(createSource)))
        .setCoder(coder);
}
This is how the underlying runner may distribute the PCollection across executors and read in parallel.

Assigning to GenericRecord the timestamp from inner object

Processing streaming events and writing files into hourly buckets is a challenge because of windowing, since some events from the incoming hour can end up in previous ones, and so on.
I've been digging around Apache Beam and its triggers, but I'm struggling to manage triggering based on timestamps, as follows...
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterProcessingTime
        .pastFirstElementInPane()
        .plusDelayOf(Duration.standardSeconds(1)))
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
This is what I've been doing so far: triggering one-minute windows no matter the timestamp. However, I would like to use the timestamp contained within the object, so that the window fires only for the elements whose timestamp falls inside it.
Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark
        .pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
The objects I'm dealing with have a timestamp field; however, it is a long field, not an Instant.
"{ \"name\": \"timestamp\", \"type\": \"long\", \"logicalType\": \"timestamp-micros\" },"
Keeping that long field in my POJO class, nothing gets triggered, but if I swap it for an Instant field and recreate the object accordingly, the following error is thrown whenever a Pub/Sub message is read.
Caused by: java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to java.lang.Long
I've also been thinking about creating a kind of wrapper class around GenericRecord that carries a timestamp, but I would then need to use just the GenericRecord part once it is ready to be written to .parquet with FileIO.
What other ways do I have to use watermark triggers?
EDIT: After @Anton's comments, I've tried the following.
.apply("Apply timestamps", WithTimestamps.of(
(SerializableFunction<GenericRecord, Instant>) item -> new Instant(Long.valueOf(item.get("timestamp").toString())))
.withAllowedTimestampSkew(Duration.standardSeconds(30)))
Even though it has been deprecated, this seems to pass through the pipeline, but the output is still not written (still getting discarded before writing, for some reason, by the previously shown trigger?).
I also tried the other approach mentioned, using outputWithTimestamp, but due to the delay it throws the following error...
Caused by: java.lang.IllegalArgumentException: Cannot output with timestamp 2019-06-12T18:59:58.609Z. Output timestamps must be no earlier than the timestamp of the current input (2019-06-12T18:59:59.848Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.
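The error message points at DoFn#getAllowedTimestampSkew(): if the new timestamp may be earlier than the element's current timestamp, the DoFn has to declare how much skew it tolerates. A minimal sketch of that approach (the field name comes from the schema above; the skew value is an arbitrary assumption, and getAllowedTimestampSkew() is deprecated just like withAllowedTimestampSkew()):

class ApplyEventTimestampFn extends DoFn<GenericRecord, GenericRecord> {

    @ProcessElement
    public void processElement(ProcessContext c) {
        GenericRecord record = c.element();
        // timestamp-micros is stored as a long in the record; adjust the unit if yours differs.
        long micros = (Long) record.get("timestamp");
        c.outputWithTimestamp(record, new Instant(micros / 1000L));
    }

    // Allow output timestamps up to one minute earlier than the input timestamp.
    @Override
    public Duration getAllowedTimestampSkew() {
        return Duration.standardMinutes(1);
    }
}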

Exported Dataflow Template Parameters Unknown

I've exported a Cloud Dataflow template from Dataprep as outlined here:
https://cloud.google.com/dataprep/docs/html/Export-Basics_57344556
In Dataprep, the flow pulls in text files via wildcard from Google Cloud Storage, transforms the data, and appends it to an existing BigQuery table. All works as intended.
However, when trying to start a Dataflow job from the exported template, I can't seem to get the startup parameters right. The error messages aren't overly specific but it's clear that for one thing, I'm not getting the locations (input and output) right.
The only Google-provided template for this use case (found at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud-storage-text-to-bigquery) doesn't apply, as it uses a UDF and also runs in batch mode, overwriting any existing BigQuery table rather than appending to it.
Inspecting the original Dataflow job details from Dataprep shows a number of parameters (found in the metadata file) but I haven't been able to get those to work within my code. Here's an example of one such failed configuration:
import time
from google.cloud import storage
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def dummy(event, context):
    pass

def process_data(event, context):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    data = event
    gsclient = storage.Client()
    file_name = data['name']
    time_stamp = time.time()
    GCSPATH = "gs://[path to template]"
    BODY = {
        "jobName": "GCS2BigQuery_{tstamp}".format(tstamp=time_stamp),
        "parameters": {
            "inputLocations": '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),
            "outputLocations": '{{\"location1\":\"[project]:[dataset].[table]\", [... other locations]"}}',
            "customGcsTempLocation": "gs://[my bucket]/dataflow"
        },
        "environment": {
            "zone": "us-east1-b"
        }
    }
    print(BODY["parameters"])
    request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
    response = request.execute()
    print(response)
The above example fails with an invalid field error for "location1", which I pulled from a completed Dataflow job. I know I need to specify the GCS location, the template location, and the BigQuery table, but I haven't found the correct syntax anywhere. As mentioned above, I found the field names and sample values in the job's generated metadata file.
I realize that this specific use case may not ring any bells but in general if anyone has had success determining and using the correct startup parameters for a Dataflow job exported from Dataprep, I'd be most grateful to learn more about that. Thx.
I think you need to review this document; it explains exactly the syntax required for passing the various pipeline options available, including the location parameters needed... 1
Specifically, in your code snippet the following line does not follow the correct syntax:
"inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name)
In addition to document 1, you should also review the available pipeline options and their correct syntax 2.
Please use the links... They are the official documentation links from Google. These links will never go stale or be removed; they are actively monitored and maintained by a dedicated team.

How to count total number of rows in a file using google dataflow

I would like to know if there is a way to find the total number of rows in a file using Google Dataflow. Any code sample or pointer would be a great help. Basically, I have a method such as
int getCount(String fileName) {}
So the above method will return the total count of rows, and its implementation will be Dataflow code.
Thanks
It seems your use case is one that doesn't require distributed processing, because the file is compressed and hence cannot be read in parallel. However, you may still find it useful to use the Dataflow APIs for the sake of their easy access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner, which runs in-process without talking to the Dataflow service or doing any distributed processing; in return, however, it provides the ability to extract PCollections into Java objects:
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);

PCollection<Long> countPC =
    p.apply(TextIO.Read.from("gs://..."))
     .apply(Count.<String>globally());

DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);
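Note that the snippet above uses the old Dataflow 1.x SDK (DirectPipelineRunner, TextIO.Read). Under the Beam / Dataflow 2.x SDK there is no EvaluationResults accessor, so a common pattern is to count with Count.globally() and materialize the single result yourself, for example by writing it back to GCS. A rough sketch (the bucket paths are placeholders):

Pipeline p = Pipeline.create(options);

p.apply(TextIO.read().from("gs://my-bucket/my-file.txt"))
 .apply(Count.<String>globally())
 // The result is a PCollection with a single Long element: the number of lines.
 .apply(MapElements.into(TypeDescriptors.strings()).via((Long c) -> String.valueOf(c)))
 .apply(TextIO.write().to("gs://my-bucket/row-count").withoutSharding());

p.run().waitUntilFinish();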
