Exported Dataflow Template Parameters Unknown - google-cloud-dataflow

I've exported a Cloud Dataflow template from Dataprep as outlined here:
https://cloud.google.com/dataprep/docs/html/Export-Basics_57344556
In Dataprep, the flow pulls in text files via wildcard from Google Cloud Storage, transforms the data, and appends it to an existing BigQuery table. All works as intended.
However, when trying to start a Dataflow job from the exported template, I can't seem to get the startup parameters right. The error messages aren't overly specific but it's clear that for one thing, I'm not getting the locations (input and output) right.
The only Google-provided template for this use case (found at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud-storage-text-to-bigquery) doesn't apply as it uses a UDF and also runs in Batch mode, overwriting any existing BigQuery table rather than append.
Inspecting the original Dataflow job details from Dataprep shows a number of parameters (found in the metadata file) but I haven't been able to get those to work within my code. Here's an example of one such failed configuration:
import time
from google.cloud import storage
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
def dummy(event, context):
pass
def process_data(event, context):
credentials = GoogleCredentials.get_application_default()
service = build('dataflow', 'v1b3', credentials=credentials)
data = event
gsclient = storage.Client()
file_name = data['name']
time_stamp = time.time()
GCSPATH="gs://[path to template]
BODY = {
"jobName": "GCS2BigQuery_{tstamp}".format(tstamp=time_stamp),
"parameters": {
"inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),
"outputLocations": '{{\"location1\":\"[project]:[dataset].[table]\", [... other locations]"}}',
"customGcsTempLocation": "gs://[my bucket]/dataflow"
},
"environment": {
"zone": "us-east1-b"
}
}
print(BODY["parameters"])
request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
response = request.execute()
print(response)
The above example indicates invalid field ("location1", which I pulled from a completed Dataflow job. I know I need to specify the GCS location, the template location, and the BigQuery table but haven't found the correct syntax anywhere. As mentioned above, I found the field names and sample values in the job's generated metadata file.
I realize that this specific use case may not ring any bells but in general if anyone has had success determining and using the correct startup parameters for a Dataflow job exported from Dataprep, I'd be most grateful to learn more about that. Thx.

I think you need to review this document it explains exactly the syntax required for passing the various pipeline options available including the location parameters needed... 1
Specifically with your code snippet the following does not follow the correct syntax
""inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name)"
In addition to document1, you should also review the available pipeline options and their correct syntax 2
Please use the links...They are the official documentation links from Google.These links will never go stale or be removed they are actively monitored and maintained by a dedicated team

Related

Apache Beam: Wait for AvroIO write step is done before start ImportTransform Dataflow template

I'm using apache beam to create a pipeline where basically reads an InputFile, Convert to Avro, write the AvroFile to a bucket and then Import these avro files to Spanner using Dataflow template
The problem that I'm facing is that the last step (Import the Avro files to the Database) is starting before the previous (write Avro Files to the bucket) is done.
I tried to add the Wait.on function but that only works if returns a PCollection, but when I write the files to avro it returns PDone.
Example of the Code:
// Step 1: Read Files
PCollection<String> lines = pipeline.apply("Reading Input Data exported from Cassandra",TextIO.read().from(options.getInputFile()));
// Step 2: Convert to Avro
lines .apply("Write Item Avro File",AvroIO.writeGenericRecords(spannerItemAvroSchema).to(options.getOutput()).withSuffix(".avro"));
// Step 3: Import to the DataBase
pipeline.apply( new ImportTransform(
spannerConfig,
options.getInputDir(),
options.getWaitForIndexes(),
options.getWaitForForeignKeys(),
options.getEarlyIndexCreateFlag()));
Again, the problem is because step 3 starts before Step 2 is done
any ideas?
This is a flaw in the API, see, e.g. a recent discussion on this on the beam dev list. The only solutions for now are to either fork AvroIO to return a PCollection or run two pipelines sequentially.

Are Google API token.json files single use?

I'm making a simple app to access google sheets that I have saved on google drive. I set up a project on google, created the Oauth credentials and ran the Python quickstart code to generate the token.json file.
Yesterday, after doing that, I ran this portion of the quickstart code and it ran perfectly and returned the rows from the sample spreadsheet:
###Add step to pull in previous staff comments, joino in MRN
from __future__ import print_function
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
###This only runs when not connected to NetExtender. Should work when ported to citrix but running locally for testing is going to be difficult
###Gsheets
SCOPES = ['https://www.googleapis.com/auth/drive','https://www.googleapis.com/auth/spreadsheets']
# The ID and range of a sample spreadsheet.
SAMPLE_SPREADSHEET_ID = '1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms'
SAMPLE_RANGE_NAME = 'Class Data!A2:E'
creds = Credentials.from_authorized_user_file('token.json', SCOPES)
service = build('sheets', 'v4', credentials=creds)
# Call the Sheets API
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
if not values:
print('No data found.')
else:
print('Name, Major:')
for row in values:
# Print columns A and E, which correspond to indices 0 and 4.
print('%s, %s' % (row[0], row[4]))
However, today when I run that code, it doesn't work anymore. I get an error:
('invalid_scope: Bad Request', {'error': 'invalid_scope', 'error_description':'Bad Request'})
Are the token files single use and would I need to generate a new one every time I want to run this (that's the only reason I can imagine it would work fine yesterday but not today)? If that is the issue, is there a way to program this so that I don't need to re-authenticate in Google and create a new token file every time I want to run this?
Thanks!

Failures in init.groovy.d scripts: null values returned

I'm trying to get Jenkins set up, with configuration, within a Docker environment. Per a variety of sources, it appears the suggested method is to insert scripts into JENKINS_HOME/init.groovy.d. I've taken scripts from places like the Jenkins wiki and made slight changes. They're only partially working. Here is one of them:
import java.util.logging.ConsoleHandler
import java.util.logging.FileHandler
import java.util.logging.SimpleFormatter
import java.util.logging.LogManager
import jenkins.model.Jenkins
// Log into a file
println("extralogging.groovy")
def RunLogger = LogManager.getLogManager().getLogger("hudson.model.Run")
def logsDir = new File("/var/log/jenkins")
if (!logsDir.exists()) { logsDir.mkdirs() }
FileHandler handler = new FileHandler(logsDir.absolutePath+"/jenkins-%g.log", 1024 * 1024, 10, true);
handler.setFormatter(new SimpleFormatter());
RunLogger.addHandler(handler)
This script fails on the last line, RunLogger.addHandler(handler).
2019-12-20 19:25:18.231+0000 [id=30] WARNING j.util.groovy.GroovyHookScript#execute: Failed to run script file:/var/lib/jenkins/init.groovy.d/02-extralogging.groovy
java.lang.NullPointerException: Cannot invoke method addHandler() on null object
I've had a number of other scripts return NULL objects from various gets similar to this one:
def RunLogger = LogManager.getLogManager().getLogger("hudson.model.Run")
My goal is to be able to develop (locally) a Jenkins implementation and then hand it to our sysops guys. Later, as I add pipelines and what not, I'd like to be able to also work on them in a local Jenkins configuration and then hand something for import into production Jenkins.
I'm not sure how to produce API documentation so I can chase this myself. Maybe I need to stop doing it this way and just grab the files that get modified when I do this via the GUI and just stuff the files into the right place.
Suggestions?

Google dataflow: AvroIO read from file in google storage passed as runtime parameter

I want to read Avro files in my dataflow using java SDK 2
I have schedule my dataflow using cloud function which are triggered based on the files uploaded to the bucket.
Following is the code for options:
ValueProvider <String> getInputFile();
void setInputFile(ValueProvider<String> value);
I am trying to read this input file using following code:
PCollection<user> records = p.apply(
AvroIO.read(user.class)
.from(String.valueOf(options.getInputFile())));
I get following error while running the pipeline:
java.lang.IllegalArgumentException: Unable to find any files matching RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}
Same code works fine in case of TextIO.
How can we read Avro file which is uploaded for triggering cloud function which triggers the dataflow pipeline?
Please try ...from(options.getInputFile())) without converting it to a string.
For simplicity, you could even define your option as simple string:
String getInputFile();
void setInputFile(String value);
You need to use simply from(options.getInputFile()): AvroIO explicitly supports reading from a ValueProvider.
Currently the code is taking options.getInputFile() which is a ValueProvider, calling the JavatoString() function on it which gives a human-readable debug string "RuntimeValueProvider{propertyName=inputFile, default=gs://test_bucket/user.avro, value=null}" and passing that as a filename for AvroIO to read, and of course this string is not a valid filename, that's why the code currently doesn't work.
Also note that the whole point of ValueProvider is that it is placeholder for a value that is not known while constructing the pipeline and will be supplied later (potentially the pipeline will be executed several times, supplying different values) - so extracting the value of a ValueProvider at pipeline construction time is impossible by design, because there is no value. At runtime though (e.g. in a DoFn) you can extract the value by calling .get() on it.

How to count total number of rows in a file using google dataflow

I would like to know if there is a way to find out total no rows in a file using google dataflow. Any code sample and pointer will be great help. Basically, I have a method as
int getCount(String fileName) {}
So, above method will return total count of rows and its implementation will be dataflow code.
Thanks
Seems like your use case is one that doesn't require distributed processing, because the file is compressed and hence can not be read in parallel. However, you may still find it useful to use Dataflow APIs for the sake of their ease of access to GCS and automatic decompression.
Since you also want to get the result out of your pipeline as an actual Java object, you need to use the Direct runner, which runs in-process, without talking to the Dataflow service or doing any distributed processing, however in return it provides the ability to extract PCollection's into Java objects:
Something like this:
PipelineOptions options = ...;
DirectPipelineRunner runner = DirectPipelineRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
PCollection<Long> countPC =
p.apply(TextIO.Read.from("gs://..."))
.apply(Count.<String>globally());
DirectPipelineRunner.EvaluationResults results = runner.run(p);
long count = results.getPCollection(countPC).get(0);

Resources