Allow failure in Kubeflow Pipelines - kubeflow

Context
I have a kubeflow pipeline running multiple stages with python scripts. In one of the inner stages, I use kfp.dsl.ParallelFor to run 5-6 deep learning models, and in the next stage, I choose the best one with respect to a validation metric.
Problem
The issue is that if one of the models fails, the whole pipeline fails. It complains that the dependencies of the next stage are not satisfied. However, if model A fails while model B is still running, the pipeline state stays at running until model B finishes, and it changes only once all models in that stage have ended.
Question
How can I allow partial failures in a stage? As long as at least one of the models works, the next stage can run. How do I make this happen in Kubeflow? For example, I have set up CI in GitLab, which supports this.
If this is not possible, I want the pipeline to fail as soon as one model fails, rather than wait for the others only to fail later, which can be much later depending on the configuration.
Obviously, one way to avoid failure is to wrap the Python script in a top-level try/except so that it always returns exit code 0. However, this way there is no visual indication that one (or more) models failed. It can be recovered from the logs, but logs are rarely monitored in a scheduled pipeline when the entire run status is successful.

Related

How can I debug why my Dataflow job is stuck?

I have a Dataflow job that is not making progress - or it is making very slow progress, and I do not know why. How can I start looking into why the job is slow / stuck?
The first resource that you should check is Dataflow documentation. It should be useful to check these:
Troubleshooting your Pipeline
Common error guidance
If these resources don't help, I'll try to summarize some reasons why your job may be stuck, and how you can debug it. I'll separate these issues depending on which part of the system is causing the trouble. Your job may be:
Job stuck at startup
A job can get stuck being received by the Dataflow service, or starting up new Dataflow workers. Some risk factors for this are:
Did you add a custom setup.py file?
Do you have any dependencies that require a special setup on worker startup?
Are you manipulating the worker container?
To debug this sort of issue I usually open StackDriver logging and look for worker-startup logs. These logs are written by the worker as it starts up a Docker container with your code and your dependencies. If you see any problem here, it would indicate an issue with your setup.py, your job submission, staged artifacts, etc.
Another thing you can do is to keep the same setup, and run a very small pipeline that stages everything:
import logging
import apache_beam as beam

with beam.Pipeline(...) as p:
    (p
     | beam.Create(['test element'])
     | beam.Map(lambda x: logging.info(x)))
If you don't see your logs in StackDriver, then you can continue to debug your setup. If you do see the log in StackDriver, then your job may be stuck somewhere else.
Job seems stuck in user code
Something else that could happen is that your job is performing some operation in user code that is stuck or slow. Some risk factors for this are:
Is your job performing operations that require you to wait for them? (e.g. loading data to an external service, waiting for promises/futures)
Note that some of the builtin transforms of Beam do exactly this (e.g. the Beam IOs like BigQueryIO, FileIO, etc).
Is your job loading very large side inputs into memory? This may happen if you are using View.AsList for a side input.
Is your job loading very large iterables after GroupByKey operations?
A symptom of this kind of issue can be that the pipeline's throughput is lower than you would expect. Another symptom is seeing the following line in the logs:
Processing stuck in step <STEP_NAME>/<...>/<...> for at least <TIME> without outputting or completing in state <STATE>
.... <a stacktrace> ....
In cases like these it makes sense to look at which step is consuming the most time in your pipeline, and inspect the code for that step, to see what may be the problem.
Some tips:
Very large side inputs can be troublesome, so if your pipeline relies on accessing a very large side input, you may need to redesign it to avoid that bottleneck.
It is possible to have asynchronous requests to external services, but I recommend that you commit / finalize work on startBundle and finishBundle calls.
If your pipeline's throughput is not what you would normally expect, it may be because you don't have enough parallelism. This can be fixed by a Reshuffle, or by sharding your existing keys into subkeys (Beam often does processing per-key, and so if you have too few keys, your parallelism will be low) - or using a Combiner instead of GroupByKey + ParDo.
Another reason that your throughput is low may be that your job is waiting too long on external calls. You can try addressing this by trying out batching strategies, or async IO.
In general, there's no silver bullet to improve your pipeline's throughput, and you'll need to experiment.
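The key-sharding tip above can be illustrated without Beam. The sketch below is plain Python, with dictionaries standing in for per-key processing; `NUM_SHARDS`, `shard_key`, and `two_stage_sum` are illustrative assumptions, not Beam API. A hot key is split into subkeys so the first stage could be spread across workers, and a second stage merges the partial results.

```python
import random
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical fan-out; tune to the available parallelism


def shard_key(key, num_shards=NUM_SHARDS):
    """Attach a random shard number so one hot key becomes several subkeys."""
    return (key, random.randrange(num_shards))


def two_stage_sum(records):
    # Stage 1: aggregate per (key, shard). In a real pipeline each subkey
    # could be processed by a different worker.
    partial = defaultdict(int)
    for key, value in records:
        partial[shard_key(key)] += value

    # Stage 2: drop the shard number and merge the partial sums per key.
    total = defaultdict(int)
    for (key, _shard), subtotal in partial.items():
        total[key] += subtotal
    return dict(total)


records = [("hot", 1)] * 1000 + [("cold", 5)] * 3
totals = two_stage_sum(records)
print(totals)
```

This only works when the aggregation can be computed in pieces and merged (sums, counts, maxima and similar), which is also the condition under which a Combiner can replace GroupByKey + ParDo.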
The data freshness or system lag are increasing
First of all, I'd recommend you check out this presentation on watermarks.
For streaming, the advance of the watermarks is what drives the pipeline to make progress, thus, it is important to be watchful of things that could cause the watermark to be held back, and stall your pipeline downstream. Some reasons why the watermark may become stuck:
One possibility is that your pipeline is hitting an unresolvable error condition. When a bundle fails processing, your pipeline will continue to attempt to execute that bundle indefinitely, and this will hold the watermark back.
When this happens, you will see errors in your Dataflow console, and the count will keep climbing as the bundle is retried.
You may have a bug when associating the timestamps to your data. Make sure that the resolution of your timestamp data is the correct one!
Although unlikely, it is possible that you've hit a bug in Dataflow. If neither of the other tips helps, please open a support ticket.
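A common form of the timestamp bug mentioned above is mixing resolutions, for example passing epoch milliseconds where seconds are expected. The sketch below is plain Python; `to_epoch_seconds` and the unit names are hypothetical helpers for illustration, and the assumption that the pipeline expects seconds should be checked against the SDK you use.

```python
from datetime import datetime, timezone

_UNITS_PER_SECOND = {"s": 1, "ms": 1_000, "us": 1_000_000}


def to_epoch_seconds(value, unit):
    """Convert an epoch timestamp in the given unit to float seconds."""
    if unit not in _UNITS_PER_SECOND:
        raise ValueError(f"unknown unit: {unit!r}")
    return value / _UNITS_PER_SECOND[unit]


event_ms = 1_600_000_000_000  # epoch milliseconds (September 2020)

# Declaring the unit explicitly gives the intended event time.
correct = to_epoch_seconds(event_ms, "ms")
print(datetime.fromtimestamp(correct, tz=timezone.utc).isoformat())

# Treating the same number as seconds lands tens of thousands of years in
# the future, which is clearly not the intended event time.
wrong = to_epoch_seconds(event_ms, "s")
print(wrong / (365.25 * 24 * 3600))  # approximate years since 1970
```

Converting at the point where timestamps are attached, with the unit stated explicitly, makes this kind of mismatch easier to spot in review.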

How do I structure Jobs in Jenkins?

I have been tasked with setting up automated deployment and, after some research, settled on Jenkins to get the job done. Prior to this I had approximately zero knowledge of Jenkins, beyond hearing the name. I have no real knowledge of Devops beyond what I have learnt in the last couple of weeks; no formal training, no actual books, just Google searches.
We are not running a full blown/classic CI/CD process; this is a business decision. The basic requirements are:
Source code will be stored in GitHub.
Pull requests must be peer approved.
Pull requests must pass build/unit/db deploy tests.
Commits to specific branches must trigger a deployment to a related specific environment (Production, Staging or Development).
The basic functionality that I am attempting to support covers (what I currently see as) two separate processes:
On creation of a pull request, application is built, unit tests run, and db deploy tested. Status info must be passed to GitHub.
On commit to one of three specific branches (master, staging and dev) the application should be built, and deployed to one of three environments (production, staging and dev).
I have managed to cobble together a pipeline that does the first task rather well. I am using the generic web hook trigger, and manually handling all steps using a declarative pipeline stored in source control. This works rather well so far and, after much hacking, I am quite happy with the shape of it.
I am now starting work on the next bit, automated deployment.
On to my actual question(s).
In short, how do I split this up into Jobs in Jenkins?
To my mind, there are 1, 2 or 4 Jobs to be created:
One Job to Rule them All
This seems sub-optimal to me, as the pipeline will include relatively complex conditional logic and, depending on whether the Job is triggered by a Pull Request or a Commit, different stages will be run. The historical data will be so polluted as to be near useless.
OR
One job for handling pull requests
One job for handling commits
Historical data for deployments across all environments will be intermixed. I am a little concerned that I will end up with >1 Jenkinsfile in my repository. Although I see no technical reason why I can't have >1 Jenkinsfile, every example I see uses a single file. Is it OK to have >1 Jenkinsfile (Jenkinsfile_Test & Jenkinsfile_Deploy) in the repository?
OR
One job for handling pull requests
One job for handling commits to Development
One job for handling commits to Staging
One job for handling commits to Production
This seems to have some benefit over the previous option, because historical data for deployments into each environment will not be cross polluting each other. But now we're well over the >1 Jenkinsfile (perceived) limit, and I will end up with (Jenkinsfile_Test, Jenkinsfile_Deploy_Development, Jenkinsfile_Deploy_Staging and Jenkinsfile_Deploy_Production). This method also brings either extra complexity (common code in a shared library) or copy/paste code reuse, which I certainly want to avoid.
My primary objective is for this to be maintainable by someone other than myself, because Bus Factor. A real Devops/Jenkins person will have to update/manage all of this one day, and I would strongly prefer them not to suffer from my ignorance.
I have done countless searches, but I haven't found anything that provides the direction I need here. Searches for best practices make no mention on handling >1 Jenkinsfile, instead focusing on the contents of a single pipeline.
After further research, I have found an answer to my core question. This might not be the absolute correct answer, but it makes sense to me, and serves my needs.
While it is technically possible to have >1 Jenkinsfile in a project, that does not appear to align with best practices.
The best practice appears to be to create a separate repository for each Jenkinsfile, which maps 1:1 with a Job in Jenkins.
To support my specific use case I have removed the Jenkinsfile from my main source code repository. I then create 4 new repositories:
Project_Jenkinsfile_Test
Project_Jenkinsfile_Deploy_Development
Project_Jenkinsfile_Deploy_Staging
Project_Jenkinsfile_Deploy_Production
Each repository contains a single Jenkinsfile and a readme.md that, in theory, contains useful information.
This separation gives me a nice view of the historical success/failure of the Test runs as a whole, and Deployments to each environment separately.
It is highly likely that I will later create a fifth repository:
Project_Jenkinsfile_Deploy_SharedLibrary
This last repository would contain pipeline code that is shared amongst the four 'core' pipelines. Once I have the 'core' pipelines up and running properly, I will consider refactoring what I can into this shared library.
I will not accept my own answer at this point, in the hope that more answers are forthcoming.
Here's a proposal I would try for your requirements based on the experience at my last job.
Job1: builds and runs unit tests on every commit on master or whatever your main dev branch is (checks every 20 minutes or whatever suits you); this job usually finds compile and unit test issues very fast
Job2 (optional): runs integration tests and various static code checks (e.g. clang-tidy, valgrind, cppcheck, etc.) every night, if the last run of Job1 was successful; this job usually finds lots of things, but probably takes a lot of time, so let it run only at night
Job3: builds and tests every pull request for release branches, so your pull requests show whether the change is mature enough to be merged into the release branches
Job4: deploys to the appropriate environment on every commit on a release branch; on dev and staging you could probably trigger some more tests, if you have them
So Job1, Job2 and Job3 should run all the time. If pull requests to your release branches are approved by QA (i.e. reviews OK and tests successful) and merged to release branches, the deployment is done by Job4 automatically.
Whether you want to trigger Job4 only manually instead depends on your requirements and your dev process.

Jenkins - How to handle concurrent jobs that use a limited resource pool?

I'm trying to improve some of the testing procedures at work and, since I'm not an expert on Jenkins, was hoping you could point me in the right direction.
Our current situation is like this. We have a huge suite of E2E tests that run regularly. These tests rely on a pool of limited resources (AWS VMs that are used to run each test). We have 2 test suites: a full-blown regression that consumes, at its peak, a total of ~80% of those resources, and a much more lightweight smoke run that uses just 15% or so.
Right now I'm using the lockable resources plugin. When the Test Run step comes, it checks whether you are running a regression or not, and if you are, it requests the single lock. If the lock is available, all good; if not, it waits until the lock becomes available before continuing. This ensures that there is never more than 1 regression running at the same time, but it has a lot of gaps. For example, a regression could be running when several smoke runs are triggered, which exhausts the resource pool.
What I would like to accomplish, in the best case, is some sort of conditional rules that decide whether the test execution step can go forward or not, based on something like this:
Only 1 regression can be running at the same time.
If a regression is running allow only 1 smoke run to be run in parallel.
If no regression is running then allow up to 5 or 6 smoke tests.
If 2 or more smoke tests are running do not allow a regression to launch.
Would something like that be possible from a Jenkins pipeline? In this case I'm using the declarative pipeline with a bunch of helper groovy code I've put together over time. My first idea is to see if there's a way to check if a lockable resource is available or not (but without actually requesting it yet) and then go through a bunch of if/then/else to set up the logic. But again I'm not sure if there's a way to check a lockable resource state or how many of a kind have already been requested.
Honestly, something this complex might probably be outside of what Jenkins is supposed to handle but I'm not sure and figured asking here would be a good start.
Thanks!
Create a declarative pipeline with steps that build individual jobs. Don't allow people to run the jobs ad-hoc, or when changes are pushed to the repository, and force a build schedule.
How can this solve your issue:
Only 1 regression can be running at the same time.
Put all these jobs in sequence in a declarative pipeline.
If a regression is running allow only 1 smoke run to be run in parallel.
Put smoke tests that are related to the regression test in sequence, just after the regression build, but run the smoke tests in parallel, prior to the next regression build.
If no regression is running then allow up to 5 or 6 smoke tests.
See previous
If 2 or more smoke tests are running do not allow a regression to launch.
It will never happen if you run things in sequence.
Here is an ugly picture explaining what I am talking about.
You can manually create the pipeline, or use the coolness of blue ocean to give you a graphical interface to put the steps in sequence or in parallel:
https://jenkins.io/doc/tutorials/create-a-pipeline-in-blue-ocean/
The downside is that if one of those jobs fails, it will stop the build, but that is not necessarily a bad thing if the jobs are highly correlated.
Completely forgot to update this but after reading and experimenting a bit more with the lockable resources plugin I found out you could have several resources under the same label and request a set quantity whenever a specific job starts.
I defined 5 resources and set the Jenkinsfile to check whether you are running the test suite with the parameter regression or not. If you are running a full regression it will try to request 4 locks, while a smoke test will only try to request 1. This way, when there aren't enough locks available, the job waits until either enough locks become available or the timeout expires.
Here's a snippet from my Jenkinsfile:
stage('Test') {
    steps {
        lock(resource: null, label: 'server-farm-tokens', quantity: getQuantityBySuiteType()) {
            // <<whatever I do to run my tests here>>
        }
    }
}
resource has to be null due to a bug in Jenkins' declarative pipeline. If you're using the scripted one you can ignore that parameter.
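To see how the quantity-based locking behaves, the token arithmetic can be simulated outside Jenkins. The sketch below is plain Python, not the Lockable Resources plugin; `TokenPool` and `COST` are hypothetical names, and the numbers (5 tokens, 4 per regression, 1 per smoke run) are the ones described above.

```python
COST = {"regression": 4, "smoke": 1}  # tokens requested per suite type


class TokenPool:
    """Counts free tokens; a job starts only if its full quantity is free."""

    def __init__(self, size=5):
        self.free = size

    def try_acquire(self, suite):
        needed = COST[suite]
        if needed > self.free:
            return False  # in Jenkins the job would wait here instead
        self.free -= needed
        return True

    def release(self, suite):
        self.free += COST[suite]


pool = TokenPool()
regression_started = pool.try_acquire("regression")  # takes 4 of 5 tokens
second_regression = pool.try_acquire("regression")   # only 1 token left
first_smoke = pool.try_acquire("smoke")               # takes the last token
second_smoke = pool.try_acquire("smoke")              # pool is exhausted

pool.release("regression")
pool.release("smoke")                                 # all 5 tokens free again
smokes = [pool.try_acquire("smoke") for _ in range(2)]  # 2 smoke runs, 3 free
regression_blocked = not pool.try_acquire("regression")
print(regression_started, second_regression, first_smoke, second_smoke,
      smokes, regression_blocked)
```

With these numbers the pool allows one regression at a time, at most one smoke run alongside a regression, up to five smoke runs when no regression is running, and no regression while two or more smoke runs hold tokens.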

Massive-Distributed Parallel Execution of tasks

We are currently struggling with the following task. We need to run a Windows application (it only works as a single instance) 1000 times with different input parameters. One run of this application can take several hours. It feels like the same problem any video rendering farm has, where each frame of a video should be calculated independently and in parallel, except that our workload is not rendering.
So far we have tried to execute it with Jenkins and Pipeline jobs. We used the parallel step in the pipeline and let Jenkins queue and execute the application. We use the Jenkins Label Expression to let Jenkins choose which job can run on which node.
The current limitation in Jenkins is with massively parallel jobs (https://issues.jenkins-ci.org/browse/JENKINS-47724). When the queue contains several hundred jobs, adding new jobs takes much longer, and it gets worse as the queue grows. The main problem: Jenkins starts executing the parallel parts of the pipeline only after all of them have been added to the queue.
We have already investigated some ideas for solving this problem:
Python Distributed: https://distributed.readthedocs.io/en/latest/
a. For single functions it looks great, but for a complete run like the one we have in Jenkins, deploying and collecting results looks complex
b. Bidirectional client-server communication is needed, so there is no chance of bringing it online through a NAT (VM server)
BOINC: https://boinc.berkeley.edu/
a. As we understand it, we would have to extend the backend massively to get our jobs working; to configure the jobs in BOINC we would have to write a lot of new automation code
b. Currently we need a predeployed application, which can differ between inputs, and there is no equivalent of the Jenkins Label Expression
Any ideas how to solve it?
Thanks in advance

How to disable a CodedUI Test Agent from code?

We have a service to pick up custom tests in XML and convert those to CodedUI tests. We then start a process for MSTest to load the tests into the Test Controller, which then distributes the tests across various Agents. We run regression tests at night, so no one is around to fix a system if something goes wrong. When certain exceptions occur in the test program, it pops open an error window and no more tests can run on the system. Subsequent tests are loaded into the agent and fail immediately because they cannot perform their assigned tasks. Thousands of tests that should take all night on multiple systems now fail in minutes.
We can detect that an error occurred by how quickly a test is returned, but we don't know how to disable the agent so as not to allow it to pick up any more tests.
addendum:
If the test has failed so miserably that no more tests can attempt a successful run (as noted, we may not have an action to handle some, likely new, popup), then we want to disable that agent as no more tests need to run on it: they will all fail. As we have many agents running concurrently, if one fails (and gets disabled), the load can still be distributed without a long string of failures. These other regression tests can still have a chance to succeed (everything works) or fail (did we miss another popup, or is this an actual regression failure).
2000 failures in 20 seconds doesn't say anything except that 1 system had a problem that no one realized it would have, and now we have wasted a whole night of tests. 2 failures (1 natural, 1 caused by an issue from the previous failure) and 1 system down means the total night's run might be extended by an hour or two, and we have useful data on how to start the day: fix 1 test and rerun both failures.
One would need to abort the test run in that case. If you are running MSTest yourself, you would need to inject a ^C into the command-line process. But if no one is around to fix it, why does it matter that the subsequent tests fail? If it's just to quickly see which test caused the error, why not add a Coded UI check to see whether the message box is there and mark the test inconclusive with Assert.Inconclusive? The causing test would then stand out like a flag.
If you can detect the point at which you want to disable the agent, then you can disable it by running "TestAgentConfig.exe delete", which resets the agent to an unconfigured state.
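The detection step described in the question (tests that come back far too quickly) can be combined with the command above. The sketch below is plain Python; `FAST_FAILURE_SECONDS`, `CONSECUTIVE_LIMIT`, and `should_disable_agent` are hypothetical, the thresholds are made-up values, and the command is only built here, not executed. It assumes TestAgentConfig.exe is reachable on the agent machine.

```python
FAST_FAILURE_SECONDS = 2.0  # hypothetical: anything faster never really ran
CONSECUTIVE_LIMIT = 3       # hypothetical: tolerate a couple of quick failures


def should_disable_agent(results):
    """results: list of (passed, duration_seconds) in execution order.

    Returns True when the most recent CONSECUTIVE_LIMIT tests all failed
    faster than FAST_FAILURE_SECONDS.
    """
    if len(results) < CONSECUTIVE_LIMIT:
        return False
    recent = results[-CONSECUTIVE_LIMIT:]
    return all(not passed and duration < FAST_FAILURE_SECONDS
               for passed, duration in recent)


def disable_command():
    # Command taken from the answer above; it could be handed to
    # subprocess.run on the agent machine.
    return ["TestAgentConfig.exe", "delete"]


healthy = [(True, 40.0), (False, 35.0), (True, 52.0)]
stuck = [(True, 40.0), (False, 0.4), (False, 0.3), (False, 0.5)]

print(should_disable_agent(healthy))
print(should_disable_agent(stuck))
```

Requiring several consecutive fast failures keeps a single genuine quick failure from taking an agent out of the pool.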
