How can I programmatically create a Kubeflow recurring run from a pipeline function?

I am trying to create a recurring Kubeflow pipeline run as follows:
from kfp import compiler
compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipelines/my_pipeline.tgz')

from kfp.v2.google.client import AIPlatformClient
api_client = AIPlatformClient(project_id='...',
                              region='...')
api_client.create_schedule_from_job_spec(
    job_spec_path='pipelines/my_pipeline.tgz',
    schedule='* * * * *',
    time_zone='UTC',
    parameter_values=arguments
)
The first command creates the pipeline spec in YAML, but the second one expects JSON.
How else can I create the recurring run programmatically (rather than via the UI)?

You're using the v1 kfp.compiler, which produces YAML output, while your API client is v2, which expects JSON input. Use kfp.v2.compiler so the versions are consistent; the latter saves the pipeline spec as JSON, which is what your api_client expects.
The sample v1 hello world produces YAML.
The sample v2 hello world produces JSON.
You can find an example of creating a v2 pipeline on GCP here.
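For example, a minimal sketch of the v2 flow, carrying over my_pipeline, the project/region placeholders, and the arguments dict from your snippet (all of those are assumptions on my side):
# Compile with the v2 compiler so the output is a JSON job spec.
from kfp.v2 import compiler
from kfp.v2.google.client import AIPlatformClient

compiler.Compiler().compile(
    pipeline_func=my_pipeline,                  # your pipeline function, as in the question
    package_path='pipelines/my_pipeline.json')  # .json instead of .tgz

api_client = AIPlatformClient(project_id='...', region='...')

# Same scheduling call as before, now pointing at the JSON spec.
api_client.create_schedule_from_job_spec(
    job_spec_path='pipelines/my_pipeline.json',
    schedule='* * * * *',
    time_zone='UTC',
    parameter_values=arguments)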

Related

How do you authenticate with an API key inside an IBM Cloud Function?

I am writing an IBM Cloud Function which uses the Python SDK to interface with a Cloudant service. I have the Cloudant service up, the databases populated, and the service credentials / API key ready. However, when I try to instantiate the CloudantV1 service inside my Function I get a runtime error: "must provide authenticator".
I looked up the error in their git repos and it seems like it is trying to set up an authenticator object by looking up values from environment variables, which do not exist in the Function. I just want to pass my API key directly, but I have not found a method to do this. I am using basic code from the examples, so I think my calls are correct.
I have considered injecting the environment variables inside the Function, but that sounds like a major hack. I must be doing something incorrectly. Please help me understand what it is. Here is basic Function Python code which reproduces the error:
from ibmcloudant.cloudant_v1 import CloudantV1

def main(params_dict):
    service = CloudantV1.new_instance()
    # unreachable
    return { "message": "hello world" }
There is an example for programmatic authentication at https://cloud.ibm.com/apidocs/cloudant?code=python#programmatic-authentication - it basically looks like this:
from ibmcloudant.cloudant_v1 import CloudantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
authenticator = IAMAuthenticator('yourAPIkey')
service = CloudantV1(authenticator=authenticator)
service.set_service_url('https://yourserviceurl.example')
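Applied to the Cloud Function above, a minimal sketch could look like the following; passing the credentials in as action parameters is just one option, and the parameter names cloudant_apikey and cloudant_url are hypothetical:
from ibmcloudant.cloudant_v1 import CloudantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

def main(params_dict):
    # Hypothetical parameter names; bind them to the action however you prefer,
    # e.g. with --param on `ibmcloud fn action update`.
    authenticator = IAMAuthenticator(params_dict['cloudant_apikey'])
    service = CloudantV1(authenticator=authenticator)
    service.set_service_url(params_dict['cloudant_url'])

    # The client is now usable without any environment variables, e.g.:
    dbs = service.get_all_dbs().get_result()
    return { "databases": dbs }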

Apache Beam: Wait until the AvroIO write step is done before starting the ImportTransform Dataflow template

I'm using Apache Beam to create a pipeline that basically reads an input file, converts it to Avro, writes the Avro files to a bucket, and then imports those Avro files into Spanner using a Dataflow template.
The problem I'm facing is that the last step (importing the Avro files into the database) starts before the previous one (writing the Avro files to the bucket) is done.
I tried to add Wait.on, but that only works with a PCollection as the signal, and writing the files with AvroIO returns PDone.
Example of the code:
// Step 1: Read files
PCollection<String> lines = pipeline.apply("Reading Input Data exported from Cassandra",
    TextIO.read().from(options.getInputFile()));

// Step 2: Convert to Avro
lines.apply("Write Item Avro File",
    AvroIO.writeGenericRecords(spannerItemAvroSchema)
        .to(options.getOutput())
        .withSuffix(".avro"));

// Step 3: Import into the database
pipeline.apply(new ImportTransform(
    spannerConfig,
    options.getInputDir(),
    options.getWaitForIndexes(),
    options.getWaitForForeignKeys(),
    options.getEarlyIndexCreateFlag()));
Again, the problem is that step 3 starts before step 2 is done.
Any ideas?
This is a flaw in the API; see, e.g., a recent discussion on this on the Beam dev list. The only solutions for now are to either fork AvroIO so that it returns a PCollection, or to run two pipelines sequentially.
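For illustration, here is a minimal sketch of the "run two pipelines sequentially" option using the Beam Python SDK (the same pattern applies in the Java SDK with p.run().waitUntilFinish()); the transforms and paths are placeholders, not your actual write and import steps:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # set runner, project, temp_location, etc. as needed

# Pipeline 1: stand-in for "write the Avro files to the bucket".
with beam.Pipeline(options=options) as p1:
    (p1
     | "Create" >> beam.Create(["a", "b", "c"])
     | "Write" >> beam.io.WriteToText("/tmp/example/items"))  # placeholder output path
# Leaving the `with` block runs the pipeline and waits for it to finish.

# Pipeline 2: starts only after pipeline 1 has completely finished, so an
# "import the written files" step sees the full output.
with beam.Pipeline(options=options) as p2:
    (p2
     | "Read back" >> beam.io.ReadFromText("/tmp/example/items*")
     | "Import (placeholder)" >> beam.Map(print))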

Failures in init.groovy.d scripts: null values returned

I'm trying to get Jenkins set up, with configuration, within a Docker environment. Per a variety of sources, it appears the suggested method is to insert scripts into JENKINS_HOME/init.groovy.d. I've taken scripts from places like the Jenkins wiki and made slight changes. They're only partially working. Here is one of them:
import java.util.logging.ConsoleHandler
import java.util.logging.FileHandler
import java.util.logging.SimpleFormatter
import java.util.logging.LogManager
import jenkins.model.Jenkins
// Log into a file
println("extralogging.groovy")
def RunLogger = LogManager.getLogManager().getLogger("hudson.model.Run")
def logsDir = new File("/var/log/jenkins")
if (!logsDir.exists()) { logsDir.mkdirs() }
FileHandler handler = new FileHandler(logsDir.absolutePath+"/jenkins-%g.log", 1024 * 1024, 10, true);
handler.setFormatter(new SimpleFormatter());
RunLogger.addHandler(handler)
This script fails on the last line, RunLogger.addHandler(handler).
2019-12-20 19:25:18.231+0000 [id=30] WARNING j.util.groovy.GroovyHookScript#execute: Failed to run script file:/var/lib/jenkins/init.groovy.d/02-extralogging.groovy
java.lang.NullPointerException: Cannot invoke method addHandler() on null object
I've had a number of other scripts return null objects from get calls similar to this one:
def RunLogger = LogManager.getLogManager().getLogger("hudson.model.Run")
My goal is to be able to develop a Jenkins implementation locally and then hand it to our sysops guys. Later, as I add pipelines and whatnot, I'd like to be able to work on them in a local Jenkins configuration as well and then hand over something that can be imported into the production Jenkins.
I'm not sure how to produce API documentation so I can chase this myself. Maybe I need to stop doing it this way, grab the files that get modified when I make these changes via the GUI, and just stuff those files into the right place.
Suggestions?

Exported Dataflow Template Parameters Unknown

I've exported a Cloud Dataflow template from Dataprep as outlined here:
https://cloud.google.com/dataprep/docs/html/Export-Basics_57344556
In Dataprep, the flow pulls in text files via wildcard from Google Cloud Storage, transforms the data, and appends it to an existing BigQuery table. All works as intended.
However, when trying to start a Dataflow job from the exported template, I can't seem to get the startup parameters right. The error messages aren't overly specific, but it's clear that, for one thing, I'm not getting the input and output locations right.
The only Google-provided template for this use case (found at https://cloud.google.com/dataflow/docs/guides/templates/provided-templates#cloud-storage-text-to-bigquery) doesn't apply, as it uses a UDF and also runs in batch mode, overwriting any existing BigQuery table rather than appending to it.
Inspecting the original Dataflow job details from Dataprep shows a number of parameters (found in the metadata file) but I haven't been able to get those to work within my code. Here's an example of one such failed configuration:
import time
from google.cloud import storage
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

def dummy(event, context):
    pass

def process_data(event, context):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)
    data = event
    gsclient = storage.Client()
    file_name = data['name']
    time_stamp = time.time()
    GCSPATH = "gs://[path to template]"
    BODY = {
        "jobName": "GCS2BigQuery_{tstamp}".format(tstamp=time_stamp),
        "parameters": {
            "inputLocations": '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name),
            "outputLocations": '{{\"location1\":\"[project]:[dataset].[table]\", [... other locations]"}}',
            "customGcsTempLocation": "gs://[my bucket]/dataflow"
        },
        "environment": {
            "zone": "us-east1-b"
        }
    }
    print(BODY["parameters"])
    request = service.projects().templates().launch(projectId=PROJECT, gcsPath=GCSPATH, body=BODY)
    response = request.execute()
    print(response)
The above example produces an "invalid field" error for "location1", which I pulled from a completed Dataflow job. I know I need to specify the GCS location, the template location, and the BigQuery table, but I haven't found the correct syntax anywhere. As mentioned above, I found the field names and sample values in the job's generated metadata file.
I realize that this specific use case may not ring any bells, but in general, if anyone has had success determining and using the correct startup parameters for a Dataflow job exported from Dataprep, I'd be most grateful to learn more about that. Thx.
I think you need to review this document; it explains exactly the syntax required for passing the various pipeline options available, including the location parameters you need [1].
Specifically, the following line in your code snippet does not follow the correct syntax:
"inputLocations" : '{{\"location1\":\"[my bucket]/{filename}\"}}'.format(filename=file_name)
In addition to document [1], you should also review the available pipeline options and their correct syntax [2].
Please use the links; they are the official documentation links from Google and are actively monitored and maintained.
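To illustrate what the answer is pointing at, here is a minimal sketch that builds the location parameters with json.dumps instead of hand-escaped strings; the parameter names come from the question's metadata file, while the project, bucket, dataset, and table values are placeholder assumptions:
import json
import time
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials

PROJECT = "my-project"                                      # placeholder
GCSPATH = "gs://my-bucket/templates/my-dataprep-template"   # placeholder template path

def launch_template(file_name):
    credentials = GoogleCredentials.get_application_default()
    service = build('dataflow', 'v1b3', credentials=credentials)

    body = {
        "jobName": "GCS2BigQuery_{}".format(int(time.time())),
        "parameters": {
            # json.dumps yields a valid JSON string such as
            # {"location1": "gs://my-bucket/input.csv"} without manual escaping.
            "inputLocations": json.dumps({"location1": "gs://my-bucket/" + file_name}),
            "outputLocations": json.dumps({"location1": "my-project:my_dataset.my_table"}),
            "customGcsTempLocation": "gs://my-bucket/dataflow"
        },
        "environment": {
            "zone": "us-east1-b"
        }
    }

    request = service.projects().templates().launch(
        projectId=PROJECT, gcsPath=GCSPATH, body=body)
    return request.execute()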

Run a function / pipeline after a pipeline completes on Google Dataflow

I'm trying to run a function (or a pipeline) after a Beam pipeline completes, on Google DataFlow.
Currently I've built a hack that runs the function by mapping it over the results of the previous pipeline and discarding the output:
_ = existing_pipeline | "do next task" >> beam.Map(func)
...where func is:
def func(_):
    # do some work, and ignore `_`
But is there a better way?
Assuming you want the function to run on your machine and not in the Cloud, you should do something like this:
result = existing_pipeline.run()
result.wait_until_finish()
# do some work
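Put together, a minimal self-contained sketch of that pattern (the pipeline contents, output path, and follow-up work are placeholders, and the PipelineOptions would need your Dataflow runner settings):
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions()  # add runner='DataflowRunner', project, region, etc. as needed

pipeline = beam.Pipeline(options=options)
(pipeline
 | "Create" >> beam.Create([1, 2, 3])
 | "Square" >> beam.Map(lambda x: x * x)
 | "Write" >> beam.io.WriteToText("/tmp/squares"))  # placeholder output

result = pipeline.run()
result.wait_until_finish()  # blocks on the launching machine until the job completes

# Runs locally (not on Dataflow workers), after the pipeline has finished.
print("pipeline finished, running follow-up work")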
