Kubeflow pipelines: Component stuck - kubeflow

I am trying to run probably the most basic Kubeflow pipeline, as described at this link: https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/
The pipeline just calls an "add" function twice inside a dsl.pipeline. Here is the code:
def add(a: float, b: float) -> float:
    print('Adding 2 numbers:' + str(a) + ' and ' + str(b))
    return a + b

add_op = create_component_from_func(
    add, output_component_file='add_component.yaml')

@dsl.pipeline(
    name='Addition pipeline',
    description='An example pipeline that performs addition calculations.'
)
def add_pipeline(a='1', b='7'):
    first_add_task = add_op(a, 4)
    second_add_task = add_op(first_add_task.output, b)

arguments = {'a': '7', 'b': '8'}
client.create_run_from_pipeline_func(add_pipeline,
                                     arguments=arguments,
                                     run_name=exp_name + '-' + str(datetime.now()),
                                     experiment_name=exp_name)
The pipeline is created and even prints the variables passed to the first component, but the status stays "Running" and the second component does not start.
Here is the screenshot:
Can anyone please suggest how to move forward here? Note that I am able to successfully run a one-component pipeline on this KFP instance, but multi-component pipelines are getting stuck.
This is the error in the wait container:
time="2022-09-06T06:24:49.366Z" level=info msg="Copying /tmp/outputs/Output/data from container base image layer to /tmp/argo/outputs/artifacts/add-Output.tgz"
time="2022-09-06T06:24:49.390Z" level=info msg="/var/run/argo/outputs/artifacts/tmp/outputs/Output/data.tgz -> /tmp/argo/outputs/artifacts/add-Output.tgz"
time="2022-09-06T06:24:49.390Z" level=error msg="executor error: You need to configure artifact storage. More information on how to do this can be found in the docs: https://argoproj.github.io/argo-workflows/configure-artifact-repository/"
time="2022-09-06T06:24:49.530Z" level=info msg="Create workflowtaskresults 403"

Related

Return value between Pipelines

I have two pipelines that work in isolation from each other, and a third that invokes both of them and requires information returned by one of them.
From pipeline C I invoke service-docker-build, and I want to obtain the Docker image tag created in that pipeline:
def dockerBuildResponse = build job: "service-docker-build", propagate: false
IMAGE_TAG = dockerBuildResponse.getBuildVariables()["IMAGE_TAG"]
In the service-docker-build pipeline I define this variable:
IMAGE_TAG = Globals.nexus3PrdUrl.replaceAll("^https?://", "") + "/" + Globals.image.imageName()
But in pipeline C, IMAGE_TAG is still null.
Any idea how I can pass this value to the invoking pipeline?
Regards
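One common pattern (an assumption about the setup, not something stated in the question) is to export the value as an environment variable in the downstream pipeline so that it is picked up by the upstream build step's buildVariables. A minimal sketch, assuming service-docker-build is itself a pipeline job:
// In service-docker-build: assign to env.IMAGE_TAG rather than a plain
// Groovy variable, so the value is exported with the build's environment.
env.IMAGE_TAG = Globals.nexus3PrdUrl.replaceAll("^https?://", "") + "/" + Globals.image.imageName()

// In pipeline C: read it back from the RunWrapper returned by the build step.
def dockerBuildResponse = build job: 'service-docker-build', propagate: false, wait: true
def imageTag = dockerBuildResponse.buildVariables['IMAGE_TAG']
echo "Downstream image tag: ${imageTag}"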

How does Jenkins choose a pipeline agent between multiple labels?

I have a Jenkins pipeline that I'd like to run on either a params-specified agent or master. The pipeline code that implements this is:
pipeline {
    agent { label "${params.agent} || master" }
    ...
}
I've surmised, from the following posts, that the || operator needs to be inside the (double-)quotes:
Can I define multiple agent labels in a declarative Jenkins Pipeline?
https://serverfault.com/questions/1074089/how-to-apply-multiple-labels-to-jenkins-nodes
Jenkinsfile - agent matching multiple labels
When I run this job, it seems to always run on master.
When I switch the order of ${params.agent} and master in the agent statement, it seems to still always run on master.
If I remove " || master" from the agent statement, then the job runs on the params-specified agent.
Question: Is my observation that Jenkins "prefers" master a coincidence, or is there something wrong with the syntax that's making Jenkins default to master?
Is there some way to have Jenkins prefer not-master so that I can test validity of the agent statement?
So, when Jenkins encounters the line
agent { label "${params.agent} || master" }
it will do exactly one of the following:
schedule your job on one of the nodes that match that label; or
get stuck until there's a node that matches that label, or until aborted.
With regards to option 1, there's no guarantee that it will do a round-robin, a random choice, or prefer some nodes but not the others, etc. In practice, when several nodes match, Jenkins will prefer the node that ran your pipeline in the past. This is a reasonable behavior — if there's a workspace already on that node, some operations (like git checkout) may happen faster, saving time.
With regards to option 2, that's also a reasonable behavior. We actually use that to schedule a job on a non-existing label, while manipulating the labels to produce one that would match.
So, there's nothing wrong with your syntax, and Jenkins is behaving as designed.
If you want to implement some custom rule — like "always try a different node", or "try to use master as little as possible" — you have to code that.
Example pipeline (note I haven't checked it):
import hudson.model.Hudson

properties([
    parameters([
        string(name: 'DEPLOY_ON', defaultValue: 'node_name',
               description: 'try to run on this node, or master'),
    ])
])

resulting_node_name = ''

pipeline {
    agent { node { label 'master' } }
    stages {
        stage('Do on master') {
            steps {
                script {
                    resulting_node_name = params.DEPLOY_ON
                    // note: this gets node by name, but you can get by label if you wish
                    def slave = Jenkins.instance.getNode(resulting_node_name)
                    if (slave == null) {
                        currentBuild.result = 'FAILURE'
                    }
                    def computer = slave.computer
                    if (computer == null || computer.getChannel() == null || slave.name != params.DEPLOY_ON) {
                        println "Something wrong with the slave object, setting master"
                        resulting_node_name = 'master'
                    }
                    printSlaveInfo(slave)
                    computer = null
                    slave = null
                }
            }
        }
        stage('Do on actual node') {
            agent { node { label resulting_node_name } }
            steps {
                script {
                    println "Running on ${env.NODE_NAME}"
                }
            }
        }
    }
}

@NonCPS
def printSlaveInfo(slave) {
    // some info that you can use to choose the least-busy, best-equipped, etc.
    println('====================')
    println('Name: ' + slave.name)
    println('getLabelString: ' + slave.getLabelString())
    println('getNumExecutors: ' + slave.getNumExecutors())
    println('getRemoteFS: ' + slave.getRemoteFS())
    println('getMode: ' + slave.getMode())
    println('getRootPath: ' + slave.getRootPath())
    println('getDescriptor: ' + slave.getDescriptor())
    println('getComputer: ' + slave.getComputer())
    def computer = slave.computer
    println('\tcomputer.isAcceptingTasks: ' + computer.isAcceptingTasks())
    println('\tcomputer.isLaunchSupported: ' + computer.isLaunchSupported())
    println('\tcomputer.getConnectTime: ' + computer.getConnectTime())
    println('\tcomputer.getDemandStartMilliseconds: ' + computer.getDemandStartMilliseconds())
    println('\tcomputer.isOffline: ' + computer.isOffline())
    println('\tcomputer.countBusy: ' + computer.countBusy())
    println('\tcomputer.getLog: ' + computer.getLog())
    println('\tcomputer.getBuilds: ' + computer.getBuilds())
    println('====================')
}
Each pipeline/job has a trusted agent list: the more jobs that succeed on an agent, the higher that agent sits in the list, and the pipeline will pick the top agent from the list.
If your pipeline has already run on master several times and always succeeded, then even if you give it another agent to choose from, the pipeline will always pick the most trusted agent first.

Check if Jenkins node is online for the job, otherwise send email alert

Having a Jenkins job dedicated to a special node, I'd like to have a notification if the job can't be run because the node is offline. Is it possible to set up this functionality?
In other words, the default Jenkins behavior is to wait for the node if the job has been started while the node is offline (the job gets a 'pending' status). In this case I want to fail the job (or not start it at all) and send a 'node offline' mail.
This node checking should happen inside the job, because the job is executed rarely and I don't care if the node is offline when it's not needed. I've tried an external node-watching plugin, but it doesn't do exactly what I want: it triggers an email every time the node goes offline, which is redundant in my case.
I found an answer here.
You can add a command-line or PowerShell block which invokes the curl command and processes the result:
curl --silent $JENKINS_URL/computer/$JENKINS_NODENAME/api/json
The result JSON contains an offline property with a true/false value.
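As a rough illustration (not part of the original answer), a scripted pipeline step along these lines could run the same check and fail the build before the node is ever needed; the node name is a placeholder and unauthenticated access to the Jenkins API is assumed:
import groovy.json.JsonSlurper

// Hypothetical name of the dedicated node.
def nodeName = 'special_node'

node('master') {
    def json = sh(
        script: "curl --silent ${env.JENKINS_URL}/computer/${nodeName}/api/json",
        returnStdout: true
    ).trim()
    // 'offline' is true when the node is down; parsing with JsonSlurper may
    // require script approval in a sandboxed pipeline.
    if (new JsonSlurper().parseText(json).offline) {
        error "Node ${nodeName} is offline"
    }
}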
I don't think checking whether the node is available can be done inside the job you want to run (e.g. JobX). The act of checking, specifically for your JobX at the time of execution, will itself need a job to run, and I don't know of a plugin or configuration option that will do this. JobX can't check if the node is free for JobX.
I use a lot of flow jobs (in the process of converting to pipeline logic) where JobA triggers JobB; so JobA could run on master, check the node for JobB (JobX in your case), and trigger it if the node is up.
JobA would need to be a freestyle job running an 'Execute system Groovy script > Groovy command' build step. The Groovy code below is pulled together from a number of working examples, so it's untested:
import hudson.model.*;
import hudson.AbortException;
import java.util.concurrent.CancellationException;

def allNodes = jenkins.model.Jenkins.instance.nodes
def triggerJob = false

for (node in allNodes) {
    if ( node.getComputer().isOnline() && node.nodeName == "special_node" ) {
        println node.nodeName + " " + node.getComputer().countBusy() + " " + node.getComputer().getOneOffExecutors().size
        triggerJob = true
        break
    }
}

if (triggerJob) {
    println("triggering child build as node available")
    def job = Hudson.instance.getJob('JobB')
    def anotherBuild
    try {
        def params = [
            new StringParameterValue('ParamOne', '123'),
        ]
        def future = job.scheduleBuild2(0, new Cause.UpstreamCause(build), new ParametersAction(params))
        anotherBuild = future.get()
    } catch (CancellationException x) {
        throw new AbortException("${job.fullDisplayName} aborted.")
    }
} else {
    println("failing parent build as node not available")
    build.getExecutor().interrupt(hudson.model.Result.FAILURE)
    throw new InterruptedException()
}
To get the 'node offline' email, you could then add a post-build action that sends an email on failure.
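If the parent job is ever converted to a pipeline, the same notification could be expressed with a post section; this is a rough sketch under that assumption (the recipient address is a placeholder), not something from the original answer:
pipeline {
    agent { label 'master' }
    stages {
        stage('Check node and trigger downstream') {
            steps {
                // placeholder for the node check / downstream trigger shown above;
                // fail the build here when the node is offline
                error 'node offline'
            }
        }
    }
    post {
        failure {
            // Requires a mail server configured in Jenkins; the address is a placeholder.
            mail to: 'team@example.com',
                 subject: "Node offline: ${env.JOB_NAME} #${env.BUILD_NUMBER}",
                 body: 'The dedicated node was offline, so the job did not run.'
        }
    }
}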

How to get the current build's node name in jenkins using groovy

I have a pipeline job running in Jenkins and I want to know the name of the node it's running on. Is there a way to get the node name from within the job's Groovy script?
I have tried the below code:
print currentBuild.getBuiltOn().getNodeName()
the error is:
org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException: unclassified method org.jenkinsci.plugins.workflow.job.WorkflowRun getBuiltOn
I also tried this:
def build = currentBuild.build()
print build.getExecutor().getOwner().getNode().getNodeName()
but the result is ''.
There is an environment variable 'NODE_NAME' which has this.
You can access it like this:
echo "NODE_NAME = ${env.NODE_NAME}"
When you are editing a pipeline job, you can find all the available environment variables by going to the "Pipeline Syntax" help link (on the left of the page), looking for the "Global Variables" section, and clicking through to the "Global Variables Reference". There is a section "env" that lists the available environment variables.
It is not documented, but Node and Executor objects can indeed be obtained from the CpsThread class of the pipeline. Of course, they are defined only inside a node { } block:
import org.jenkinsci.plugins.workflow.cps.CpsThread
import hudson.model.Computer
import hudson.model.Executor

@NonCPS
def obtainContextVariables() {
    return CpsThread.current().getContextVariables().values
}

node('myNode') {
    print('Node: ' + obtainContextVariables().findAll(){ x -> x instanceof Computer }[0].getNode())
    print('Executor: ' + obtainContextVariables().findAll(){ x -> x instanceof Executor }[0])
}

Temporarily disable SCM polling on Jenkins Server in System Groovy

We have a Jenkins server which is running somewhere between 20 and 30 jobs.
Since the build process is reasonably complex, we've broken the actual build down into a number of sub-builds, some of which can run concurrently, while others have to follow previous build steps. As a result we've grouped the build steps into 3 groups, which block while the builds are in progress.
For example:
Main Build : GroupA : Builds A1, A2 & A3
           : GroupB : Builds B1, B2 & B3
           : GroupC : Builds C1, C2, C3, C4, C5 & C6
           : GroupD : HW_Tests T1, T2, T3, T4 & T5
Builds B1, B2 & B3 rely on the output from A1, A2, A3, etc.
Since there are builds and tests running pretty much 24/7, I am finding it difficult to schedule a restart of the Jenkins master. Choosing "Prepare for shutdown" means new jobs are queued, but it will invariably block a running job since, to use the example above, if GroupB is active, builds C1, C2, etc. will also be queued and the Main Build will be blocked.
As a workaround, I would like to disable SCM polling on the server until all running jobs have finished. This will prevent new jobs from triggering but still allow the running jobs to finish. I can then restart Jenkins, re-enable SCM polling, and allow normal service to resume.
The SCM we are using is Perforce.
I have not been able to find anything which suggests the above is possible; however, I am sure it must be feasible in System Groovy ... just not sure how. Does anyone here have any ideas please?
Many Thanks
You could disable only those jobs which have an SCM polling trigger. This groovy script will do that:
Hudson.instance.items.each { job ->
    if ( job.getTrigger( hudson.triggers.SCMTrigger ) != null ) {
        println "will disable job ${job.name}"
        job.disable()
    }
}
Re-enabling the jobs will be left as an exercise : )
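For completeness, a sketch of the corresponding re-enable pass (not part of the original answer), using the same selection criterion:
Hudson.instance.items.each { job ->
    // note: this re-enables every disabled job with an SCM trigger,
    // including any that were disabled for other reasons
    if ( job.getTrigger( hudson.triggers.SCMTrigger ) != null && job.disabled ) {
        println "will re-enable job ${job.name}"
        job.enable()
    }
}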
A Jenkins 2.204.1+ version of using the Groovy script console to disable all jobs with an SCM trigger:
Jenkins.instance.getAllItems(Job.class).each { job ->
    if ( job.getSCMTrigger() != null ) {
        println "will disable job ${job.name}"
        job.setDisabled(true)
    }
}
Try the following Groovy script to comment out SCM polling:
// WARNING: Use at your own risk! Without any warranty!
import hudson.triggers.SCMTrigger
import hudson.triggers.TriggerDescriptor

// from: https://issues.jenkins-ci.org/browse/JENKINS-12785
TriggerDescriptor SCM_TRIGGER_DESCRIPTOR = Hudson.instance.getDescriptorOrDie(SCMTrigger.class)
assert SCM_TRIGGER_DESCRIPTOR != null;

MAGIC = "#MAGIC# "

// comment out SCM Trigger
def disable_scmpoll_trigger(trig){
    if ( !trig.spec.startsWith(MAGIC) ){
        return new SCMTrigger(MAGIC + trig.spec)
    }
    return null
}

// enable commented out SCM Trigger
def enable_scmpoll_trigger(trig){
    if ( trig.spec.startsWith(MAGIC) ){
        return new SCMTrigger(trig.spec.substring(MAGIC.length()))
    }
    return null
}

Hudson.instance.items.each { job ->
    //println("Checking job ${job.name} of type ${job.getClass().getName()} ...")
    // from https://stackoverflow.com/a/39100687
    def trig = job.getTrigger( hudson.triggers.SCMTrigger )
    if ( trig == null ) return

    println("Job ${job.name} has SCMTrigger: '${trig.spec}'")
    SCMTrigger newTrig = disable_scmpoll_trigger(trig)
    // SCMTrigger newTrig = enable_scmpoll_trigger(trig)
    if (newTrig != null ){
        newTrig.ignorePostCommitHooks = trig.ignorePostCommitHooks
        newTrig.job = job
        println("Updating SCMTrigger '${trig.spec}' -> '${newTrig.spec}' for job: ${job.name}")
        job.removeTrigger(SCM_TRIGGER_DESCRIPTOR)
        job.addTrigger(newTrig)
        job.save()
    }
}
return ''
To enable SCM polling again, just swap these two lines:
//SCMTrigger newTrig = disable_scmpoll_trigger(trig)
SCMTrigger newTrig = enable_scmpoll_trigger(trig)
Tested on Jenkins ver. 2.121.3
Known limitations:
supports a single-line "Schedule" (spec property) only
If it is only one or two builds that are configured to do SCM polling, you can go into the configuration of each build and uncheck the box. It is that simple :)
If you are using Jenkins Job Builder, it should be even easier to change the configuration of several jobs at a time.
Whether you are using slaves or even the master, SCM polling depends on Java: temporarily remove Java from the machine, from the location where it is configured in the master Jenkins, and the polling will fail. This is a stupid hack ;)
I hope that helps!
