How to specify machine type in Vertex AI pipeline - Kubeflow

I want to specify the machine type in a Vertex AI pipeline using the kfp SDK.
I don't know how to specify machine_type when executing it as a component of a pipeline.
I tried kfp.v2.google.experimental.run_as_aiplatform_custom_job, but it ran as a CustomJobExecution instead of a ContainerExecution.
The problem is that I want to use Artifacts, but artifacts are not mounted on this component.
Since I want to consume the artifacts of the previous components and use Output[Artifact], I want to execute it as a ContainerExecution instead of a CustomJobExecution.

You can use set_memory_limit and set_cpu_limit, just as you would with Kubeflow Pipelines. Vertex Pipelines will convert these limits to a machine type that satisfies your request.
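For example, with the kfp v2 SDK you can set these limits on the pipeline task. A minimal sketch (the component and the limit values here are illustrative):

from kfp.v2 import dsl

@dsl.component
def train_op() -> str:
    # Placeholder training step.
    return "done"

@dsl.pipeline(name="resource-demo")
def pipeline():
    train_task = train_op()
    # Vertex Pipelines picks a machine type that satisfies these limits,
    # so the step still runs as a regular ContainerExecution.
    train_task.set_cpu_limit("8")
    train_task.set_memory_limit("32G")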

Related

How to use a global pipeline library when defining an Active Choices Reactive Parameter?

Is there a way to use a global pipeline library when defining an Active Choices Reactive Parameter in Jenkins?
I have added a global pipeline library in Jenkins, let's say PipelineLibrary, and I am able to use it successfully in my pipelines by loading it with @Library('PipelineLibrary') _. In this library I have a global function foo.bar(), which I would like to use also in the Groovy Script box when adding an Active Choices Reactive Parameter to several of my jobs.
So I would like to have something like this in the Groovy Script box of that parameter:
// Somehow take into use PipelineLibrary
return foo.bar();
What is the correct syntax to load the library here?
Or is it even possible? If not, is there some other way to share the Groovy script to several places without just copy-pasting the code in the GUI?
I think you're banging on this door - JENKINS-46394
Actually, I have found a way to do it. You just need to define new global variables in your Jenkinsfile and assign them their values from the shared library, respectively. Afterwards you can use the newly defined global variables in your Active Choices Parameter script.

kubeflow OutputPath/InputPath question when writing/reading multiple files

I have a data-fetch stage where I get multiple DataFrames and serialize them. I'm currently treating OutputPath as a directory - creating it if it doesn't exist, etc. - and then serializing all the DataFrames into that path under a different name for each DataFrame.
In a subsequent pipeline stage (say, predict) I need to retrieve all of those through InputPath.
Now, from the documentation it seems InputPath/OutputPath are meant to be files. Does Kubeflow have any limitation if I use them as directories?
The ComponentSpec's {inputPath: input_name} and {outputPath: output_name} placeholders and their Python analogs (input_name: InputPath()/output_name: OutputPath()) are designed to support both files/blobs and directories.
They are expected to provide the path for the input/output data. No matter whether the data is a blob/file or a directory.
The only limitation is that UX might not be able to preview such artifacts.
But the pipeline itself would work.
I have experimented with a trivial pipeline - no issue is observed if InputPath/OutputPath is used as a directory.
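For reference, a minimal sketch of the directory pattern (the component names, file names and the pandas dependency are illustrative), using lightweight Python components:

from kfp.components import InputPath, OutputPath, create_component_from_func

def fetch_data(data_dir_path: OutputPath()):
    # Treat the output path as a directory and write several files into it.
    import os
    import pandas as pd
    os.makedirs(data_dir_path, exist_ok=True)
    for name in ("train", "test"):
        pd.DataFrame({"x": [1, 2, 3]}).to_csv(
            os.path.join(data_dir_path, name + ".csv"), index=False)

def predict(data_dir_path: InputPath()):
    # Read back every file that the previous step wrote into the directory.
    import os
    import pandas as pd
    for file_name in os.listdir(data_dir_path):
        df = pd.read_csv(os.path.join(data_dir_path, file_name))
        print(file_name, len(df))

fetch_data_op = create_component_from_func(fetch_data, packages_to_install=["pandas"])
predict_op = create_component_from_func(predict, packages_to_install=["pandas"])

kfp strips the "_path" suffix from the parameter name, so in a pipeline the output can be wired up as fetch_data_op().outputs["data_dir"].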

Dataflow: Consuming runtime parameters in template

I'm trying to create a template for a Dataflow job.
Is there any way to generate a template with runtime parameters?
So far, the template only uses the parameter values that were supplied at the time it was created; when I try passing different values for those variables at execution time, it does not pick up the runtime values.
If any additional details are needed, I will provide them.
You can use value providers in your pipeline options to have runtime arguments in a pipeline.
But I'm afraid this is limited in terms of where you can use these parameters (mostly inside a DoFn).
This behaviour is expected from a Dataflow template, as it is a representation of the pipeline rather than the code itself.
Please bear in mind that you cannot create a Dataflow template with dynamic processing steps based on the values passed.
The steps are hard-coded into the template and cannot be changed unless the code that generates the template is executed again.
A parameter needs to be wrapped inside of a ValueProvider object in order for the template pipeline to access the runtime value of that parameter. All of the example templates provided here demonstrate how ValueProvider can be used to parameterize a template pipeline.
Take a look at the WordCount pipeline as an example.
As you can see, the pipeline uses a ValueProvider (instead of a simple String) to read the path of the file on which a WordCount needs to be executed:
@Description("Path of the file to read from")
ValueProvider<String> getInputFile();
void setInputFile(ValueProvider<String> value);
Since the value of inputFile is unknown until runtime (when the template is actually executed with valid inputs), the transform using the ValueProvider will defer reading the value of the parameter until runtime (e.g. inside a DoFn).
The native TextIO.Read Beam transform provides support for reading from a ValueProvider in addition to reading from String.
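The Python SDK has the same mechanism. A rough sketch (option and file names are illustrative), using add_value_provider_argument so the value is wrapped in a ValueProvider:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class WordCountOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Wrapped in a ValueProvider, so it is resolved when the template
        # is executed rather than when the template is created.
        parser.add_value_provider_argument(
            "--input_file",
            type=str,
            help="Path of the file to read from")

options = PipelineOptions().view_as(WordCountOptions)
with beam.Pipeline(options=options) as p:
    # ReadFromText accepts a ValueProvider as well as a plain string.
    lines = p | "Read" >> beam.io.ReadFromText(options.input_file)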

Perform action after Dataflow pipeline has processed all data

Is it possible to perform an action once a batch Dataflow job has finished processing all data? Specifically, I'd like to move the text file that the pipeline just processed to a different GCS bucket. I'm not sure where to place that in my pipeline to ensure it executes once after the data processing has completed.
I don't see why you need to do this post pipeline execution. You could use side outputs to write the file to multiple buckets, and save yourself the copy after the pipeline finishes.
If that's not going to work for you (for whatever reason), then you can simply run your pipeline in blocking execution mode i.e. use pipeline.run().waitUntilFinish(), and then just write the rest of your code (which does the copy) after that.
[..]
// do some stuff before the pipeline runs
Pipeline pipeline = ...
pipeline.run().waitUntilFinish();
// do something after the pipeline finishes here
[..]
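In the Python SDK, the same blocking pattern plus the GCS move might look like the following sketch (bucket and object names are placeholders):

import apache_beam as beam
from google.cloud import storage

def run():
    pipeline = beam.Pipeline()
    _ = (pipeline
         | "Read" >> beam.io.ReadFromText("gs://source-bucket/input.txt")
         | "Write" >> beam.io.WriteToText("gs://sink-bucket/output"))

    # Block until the batch job has processed all of the data.
    pipeline.run().wait_until_finish()

    # Only now is it safe to move the processed file to another bucket.
    client = storage.Client()
    source_bucket = client.bucket("source-bucket")
    blob = source_bucket.blob("input.txt")
    source_bucket.copy_blob(blob, client.bucket("archive-bucket"), "input.txt")
    blob.delete()

if __name__ == "__main__":
    run()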
Here's a little trick I picked up from reading the source code of Apache Beam's PassThroughThenCleanup.java.
Right after your reader, create a side input that 'combines' the entire collection (in the source code, it is the View.asIterable() PTransform) and connect its output to a DoFn. This DoFn will be called only after the reader has finished reading ALL elements.
P.S. The code literally names the operation cleanupSignalView, which I found really clever.
Note that you can achieve the same effect using Combine.globally() (java) or beam.CombineGlobally() (python). For more info check out section 4.2.4.3 here
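A rough Python sketch of that trick (file paths are illustrative): combine the whole collection into a single "done" signal, pass it as a side input, and run the cleanup in a DoFn that can only fire once that signal exists:

import apache_beam as beam

class CleanupFn(beam.DoFn):
    def process(self, element, unused_done_signal):
        # Runs only once the side input is available, i.e. after the
        # reader has produced all of its elements.
        print("all input consumed, performing cleanup")
        yield element

with beam.Pipeline() as p:
    lines = p | "Read" >> beam.io.ReadFromText("gs://source-bucket/input.txt")

    # Combine the entire collection into one signal value.
    done_signal = lines | "WaitForAll" >> beam.CombineGlobally(
        beam.combiners.CountCombineFn())

    _ = (p
         | "Once" >> beam.Create([None])
         | "Cleanup" >> beam.ParDo(
             CleanupFn(), beam.pvalue.AsSingleton(done_signal)))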
I think two options can help you here:
1) Use TextIO to write to the bucket or folder you want, specifying the exact GCS path (e.g. gs://sandbox/other-bucket)
2) Use Object Change Notifications in combination with Cloud Functions. You can find a good primer on doing this here and the SDK for GCS in JS here. What you will do in this option is basically set up a trigger that fires when something drops into a certain bucket, and move it to another one using your self-written Cloud Function.
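For option 2, a minimal Cloud Functions sketch (Python runtime; the destination bucket name is a placeholder) that moves an object once it is finalized in the watched bucket:

from google.cloud import storage

client = storage.Client()

def move_to_processed(event, context):
    # Background function triggered by google.storage.object.finalize;
    # `event` carries the bucket and object name of the new file.
    source_bucket = client.bucket(event["bucket"])
    blob = source_bucket.blob(event["name"])

    # Copy into the destination bucket, then delete the original.
    destination_bucket = client.bucket("processed-bucket")
    source_bucket.copy_blob(blob, destination_bucket, event["name"])
    blob.delete()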

No suitable ClassLoader found for grab when using #GrabConfig

I'm attempting to write a global function script that uses groovy.sql.Sql.
When I add the annotation @GrabConfig(systemClassLoader=true) I get an exception when using the global function in a Jenkinsfile.
Here is the exception:
hudson.remoting.ProxyException: org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during conversion: No suitable ClassLoader found for grab
Here is my code:
@GrabResolver(name='nexus', root='http://internal.repo.com')
@GrabConfig(systemClassLoader=true)
@Grab('com.microsoft.sqlserver:sqljdbc4:4.0')
import groovy.sql.Sql
import com.microsoft.sqlserver.jdbc.SQLServerDriver

def call(name) {
    echo "Hello world, ${name}"
    def sql = Sql.newInstance("jdbc:sqlserver://ipaddress/dbname", "username", "password", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    // sql.execute "select count(*) from TableName"
}
Ensure that the "Use groovy sandbox" checkbox is unticked (it's below the pipeline script text box).
As explained here, Pipeline "scripts" are not simple Groovy scripts, they are heavily transformed before running, some parts on master, some parts on slaves, with their state (variable values) serialized and passed to the next step. As such, not every Groovy feature is supported.
I'm not sure about #Grab support. It is discussed in JENKINS-26192 (which is declared as resolved, so maybe it works now).
Extract from a very interesting comment:
If you need to perform some complex or expensive tasks with
unrestricted Groovy physically running on a slave, it may be simplest
and most effective to simply write that code in a *.groovy file in
your workspace (for example, in an SCM checkout) and then use tool and
sh/bat to run Groovy as an external process; or even put this stuff
into a Gradle script, Groovy Maven plugin execution, etc. The workflow
script itself should be limited to simple and extremely lightweight
logical operations focused on orchestrating the overall flow of
control and interacting with other Jenkins features—slave allocation,
user input, and the like.
In short, if you can move that custom part that needs SQL to an external script and execute that in a separate process (called from your Pipeline script), that should work. But doing this in the Pipeline script itself is more complicated.
