Identify when retry happened in Jenkins pipeline - jenkins

I have implemented retries around some code in Jenkins pipeline.
retry(2) {
    // code
}
Is there a way to identify when a retry has happened, besides just manually checking the console logs?
I would like to identify builds that have retried some flaky code and send a notification.

I don't know of a way to detect that from inside the retry { } block.
You could try a fairly generic solution (maybe even as a global variable in a shared pipeline library), such as using a try/catch block (since retry only re-runs the block after an exception):
int retryAttempts = 0
retry(2) {
    try {
        if (retryAttempts > 0) {
            // a retry is occurring
            // Do pre-retry logic, if needed
            ...
        }
        // Do stuff
        ...
    } catch (e) {
        retryAttempts++ // a retry WILL occur
        throw e // rethrow to trigger the retry
    }
}
if (retryAttempts > 0) {
    // a retry has occurred
    // Do post-retry logic, if needed (e.g. send the notification about the flaky code here)
    ...
}

Related

Test using StepVerifier blocks when using Spring WebClient with retry

EDIT: here is a repository with all the code: https://github.com/wujek-srujek/reactor-retry-test
I have the following Spring WebClient code to POST to a remote server (Kotlin code without imports for brevity):
private val logger = KotlinLogging.logger {}

@Component
class Client(private val webClient: WebClient) {

    companion object {
        const val maxRetries = 2L
        val firstBackOff = Duration.ofSeconds(5L)
        val maxBackOff = Duration.ofSeconds(20L)
    }

    fun send(uri: URI, data: Data): Mono<Void> {
        return webClient
            .post()
            .uri(uri)
            .contentType(MediaType.APPLICATION_JSON)
            .bodyValue(data)
            .retrieve()
            .toBodilessEntity()
            .doOnSubscribe {
                logger.info { "Calling backend, uri: $uri" }
            }
            .retryExponentialBackoff(maxRetries, firstBackOff, maxBackOff, jitter = false) {
                logger.debug { "Call to $uri failed, will retry (#${it.iteration()} of max $maxRetries)" }
            }
            .doOnError {
                logger.error { "Call to $uri with $maxRetries retries failed with $it" }
            }
            .doOnSuccess {
                logger.info { "Call to $uri succeeded" }
            }
            .then()
    }
}
(It returns an empty Mono as we don't expect an answer, nor do we care about it.)
I would like to test two cases, and one of them is giving me headaches: the one in which I want to verify that all the retries have been fired. We are using MockWebServer (https://github.com/square/okhttp/tree/master/mockwebserver) and the StepVerifier from reactor-test. (The test for success is easy, doesn't need any virtual time scheduler magic, and works just fine.) Here is the code for the failing case:
@JsonTest
@ContextConfiguration(classes = [Client::class, ClientConfiguration::class])
class ClientITest @Autowired constructor(
    private val client: Client
) {

    lateinit var server: MockWebServer

    @BeforeEach
    fun `init mock server`() {
        server = MockWebServer()
        server.start()
    }

    @AfterEach
    fun `shutdown server`() {
        server.shutdown()
    }

    @Test
    fun `server call is retried and eventually fails`() {
        val data = Data()
        val uri = server.url("/server").uri()
        val responseStatus = HttpStatus.INTERNAL_SERVER_ERROR
        repeat((0..Client.maxRetries).count()) {
            server.enqueue(MockResponse().setResponseCode(responseStatus.value()))
        }
        StepVerifier.withVirtualTime { client.send(uri, data) }
            .expectSubscription()
            .thenAwait(Duration.ofSeconds(10)) // wait for the first retry
            .expectNextCount(0)
            .thenAwait(Duration.ofSeconds(20)) // wait for the second retry
            .expectNextCount(0)
            .expectErrorMatches {
                val cause = it.cause
                it is RetryExhaustedException &&
                    cause is WebClientResponseException &&
                    cause.statusCode == responseStatus
            }
            .verify()
        // assertions
    }
}
I am using withVirtualTime because I don't want the test to actually take that many seconds.
The problem is that the test blocks indefinitely. Here is the (simplified) log output:
okhttp3.mockwebserver.MockWebServer : MockWebServer[51058] starting to accept connections
Calling backend, uri: http://localhost:51058/server
MockWebServer[51058] received request: POST /server HTTP/1.1 and responded: HTTP/1.1 500 Server Error
Call to http://localhost:51058/server failed, will retry (#1 of max 2)
Calling backend, uri: http://localhost:51058/server
MockWebServer[51058] received request: POST /server HTTP/1.1 and responded: HTTP/1.1 500 Server Error
Call to http://localhost:51058/server failed, will retry (#2 of max 2)
As you can see, the first retry works, but the second one blocks. I don't know how to write the test so that it doesn't happen. To make matters worse, the client will actually use jitter, which will make the timing hard to anticipate.
The following test using StepVerifier but without WebClient works fine, even with more retries:
@Test
fun test() {
    StepVerifier.withVirtualTime {
        Mono
            .error<RuntimeException>(RuntimeException())
            .retryExponentialBackoff(5,
                Duration.ofSeconds(5),
                Duration.ofMinutes(2),
                jitter = true) {
                println("Retrying")
            }
            .then()
    }
        .expectSubscription()
        .thenAwait(Duration.ofDays(1)) // doesn't matter
        .expectNextCount(0)
        .expectError()
        .verify()
}
Could anybody help me fix the test, and ideally, explain what is wrong?
This is a limitation of virtual time and the way the clock is manipulated in StepVerifier. The thenAwait methods are not synchronized with the underlying scheduling (which happens, for example, as part of the retryBackoff operation). This means the operator submits retry tasks at a point where the clock has already been advanced by one day, so the second retry is scheduled for +1 day and 10 seconds, since the clock is already at +1 day. After that, the clock is never advanced, so the additional request is never made to MockWebServer.
Your case is made even more complicated by the fact that there is an additional component involved, the MockWebServer, which still works "in real time".
Although advancing the virtual clock is a very quick operation, the response from the MockWebServer still goes through a socket and thus arrives with some latency relative to the retry scheduling, which makes things more complicated from a test-writing perspective.
One possible solution to explore would be to externalize the creation of the VirtualTimeScheduler and tie advanceTimeBy calls to mockServer.takeRequest(), in a parallel thread.
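For illustration only, here is a rough sketch of that direction in plain Java (reactor-test and MockWebServer are Java APIs; the short real-time sleep, the 30-second jumps and the helper-thread shape are assumptions meant to show the idea, not a verified fix):
// Sketch only. Uses reactor.test.StepVerifier, reactor.test.scheduler.VirtualTimeScheduler,
// okhttp3.mockwebserver.MockResponse, java.time.Duration, java.util.concurrent.TimeUnit,
// plus the same Client, Data and server fields as in the test above.
@Test
void serverCallIsRetriedAndEventuallyFails() throws Exception {
    Data data = new Data();
    URI uri = server.url("/server").uri();
    for (int i = 0; i <= Client.maxRetries; i++) {
        server.enqueue(new MockResponse().setResponseCode(500));
    }

    // Create the scheduler ourselves so a helper thread can advance it.
    VirtualTimeScheduler vts = VirtualTimeScheduler.create();

    Thread clock = new Thread(() -> {
        try {
            for (int i = 0; i <= Client.maxRetries; i++) {
                if (server.takeRequest(5, TimeUnit.SECONDS) == null) {
                    return;                                // no request arrived; give up
                }
                Thread.sleep(200);                         // let the 500 reach the client and the retry get scheduled
                vts.advanceTimeBy(Duration.ofSeconds(30)); // > maxBackOff, so the pending retry becomes due
            }
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
    });
    clock.start();

    StepVerifier.withVirtualTime(() -> client.send(uri, data), () -> vts, Long.MAX_VALUE)
            .expectSubscription()
            .expectErrorMatches(t -> t.getCause() instanceof WebClientResponseException)
            .verify(Duration.ofSeconds(10));               // real-time safety net so the test cannot hang
    clock.join();
}
The real-time pause before each advanceTimeBy is there precisely because of the latency described above: the 500 response still has to travel back over the socket and the retry has to be scheduled before advancing the clock has any effect.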

How to handle FileSystemException in dart when watching a directory

I have written a simple command-line tool in Dart that watches for changes in a directory. If the directory does not exist, I get a FileSystemException.
I have tried to handle it using a try/catch clause, but when the exception occurs the code in the catch clause is not executed:
try {
    watcher.events.listen((event) {
        if (event.type == ChangeType.ADD) {
            print("THE FILE WAS ADDED");
            print(event.path);
        } else if (event.type == ChangeType.MODIFY) {
            print("THE FILE WAS MODIFIED");
            print(event.path);
        } else {
            print("THE FILE WAS REMOVED");
            print(event.path);
        }
    });
} on FileSystemException {
    print("Exception Occurs");
}
I expect the console to print "Exception Occurs"
There are two possibilities:
1. The exception is happening outside this block (maybe where the watcher is constructed?).
2. The exception is an unhandled async exception. It might be coming from the Stream, or it might be getting fired from some other Future, like perhaps the ready Future.
You can add handlers for the async exceptions like this:
try {
    // If this is in an `async` method, use `await` within the try block
    await watcher.ready;
    // Otherwise add an error handler on the Future
    watcher.ready.catchError((e) {
        print('Exception in the ready Future');
    });
    watcher.events.listen((event) {
        ...
    }, onError: (e) {
        print('Exception in the Stream');
    });
} on FileSystemException {
    print("Exception Occurs");
}
My guess would be it's an error surfaced through the ready Future.

Jenkins pipeline - custom timeout behavior

I need custom behavior for the timeout function. For example, when I use:
timeout(time: 10, unit: 'MINUTES') {
    doSomeStuff()
}
it terminates the doSomeStuff() function.
What I want to achieve is not to terminate the execution of the function, but to call another function every 10 minutes until doSomeStuff() has finished executing.
I can't use the Build-timeout Plugin from Jenkins since I need to apply this behavior to pipelines.
Any help would be appreciated.
In case anyone else has the same issue: after some research, the only way I found to solve my problem was to modify the notification plugin for Jenkins pipelines by adding a new field that holds the time (in minutes) by which to delay invoking the URL. In the plugin code where the URL is invoked, I moved those lines into a new thread and let that thread sleep for the required amount of time before executing the remaining code. Something like this:
@Override
public void onStarted(final Run r, final TaskListener listener) {
    HudsonNotificationProperty property =
            (HudsonNotificationProperty) r.getParent().getProperty(HudsonNotificationProperty.class);
    int invokeUrlTimeout = 0;
    if (property != null && !property.getEndpoints().isEmpty()) {
        invokeUrlTimeout = property.getEndpoints().get(0).getInvokeUrlTimeout();
    }
    int finalInvokeUrlTimeout = invokeUrlTimeout;
    new Thread(() -> {
        try {
            // delay the notification by the configured number of minutes
            Thread.sleep(finalInvokeUrlTimeout * 60L * 1000);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            return;
        }
        Executor e = r.getExecutor();
        Phase.QUEUED.handle(r, TaskListener.NULL,
                e != null ? System.currentTimeMillis() - e.getTimeSpentInQueue() : 0L);
        Phase.STARTED.handle(r, listener, r.getTimeInMillis());
    }).start();
}
Maybe not the best solution but it works for me, and I hope it helps other people too.

Jenkins timeout/abort exception

We have a Jenkins pipeline script that requests approval from the user after all the preparatory steps are complete, before it actually applies the changes.
We want to add a timeout to this step so that the build is aborted if there is no input from the user. We are currently working with this kind of method:
try {
    timeout(time: 30, unit: 'SECONDS') {
        userInput = input("Apply changes?")
    }
} catch (err) {
    def user = err.getCauses()[0].getUser()
    if (user.toString() == 'SYSTEM') { // if it's SYSTEM, it's a timeout
        didTimeout = true
        echo "Build timed out at approval step"
    } else if (userInput == false) { // if not, and input is false, it's the user
        echo "Build aborted by: [${user}]"
    }
}
This code is based on examples found at https://support.cloudbees.com/hc/en-us/articles/226554067-Pipeline-How-to-add-an-input-step-with-timeout-that-continues-if-timeout-is-reached-using-a-default-value and in other places online, but I really dislike catching all errors and then working out what caused the exception using err.getCauses()[0].getUser(). I'd rather explicitly catch (TimeoutException) or something like that.
So my question is: what are the actual exceptions that would be thrown by either the approval step timing out or the userInput being false? I haven't been able to find anything about this in the docs or the Jenkins codebase so far.
The exception class they are referring to is org.jenkinsci.plugins.workflow.steps.FlowInterruptedException.
I cannot believe that this is an example provided by CloudBees.
Most (or probably all?) other exceptions won't even have the getCauses() method, which would of course throw yet another exception from within the catch block.
Furthermore, as you already mentioned, it is not a good idea to just catch all exceptions.
Edit:
By the way: scrolling further down that post, in the comments, you'll find an example of catching a FlowInterruptedException.
Rather old topic, but it helped me, and I've done some more research on it.
As I figured out, FlowInterruptedException's getCauses()[0] has a .getUser() method only when the class of getCauses()[0] is org.jenkinsci.plugins.workflow.support.steps.input.Rejection, which is the case only when the timeout occurred while the input was active. If the timeout occurred outside the input, getCauses()[0] will contain an object of another class, org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ExceededTimeout, directly indicating a timeout.
So I ended up with this:
def is_interrupted_by_timeout(org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e, Boolean throw_again = true) {
    // if the cause cannot be determined, re-throw the exception
    try {
        def cause = e.getCauses()[0]
        def cause_class = cause.getClass()
        //echo("cause ${cause} class: ${cause_class}")
        if (cause_class == org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution$ExceededTimeout) {
            // strong detection
            return true
        } else if (cause_class == org.jenkinsci.plugins.workflow.support.steps.input.Rejection) {
            // indirect detection
            def user = cause.getUser()
            if (user.toString().equals('SYSTEM')) {
                return true
            } else {
                return false
            }
        }
    } catch (org.jenkinsci.plugins.scriptsecurity.sandbox.RejectedAccessException e_access) {
        // Here we may be dealing with restricted methods that have not been approved:
        // show a message so a Jenkins admin can copy/paste and execute them once per Jenkins installation.
        error('''
            To run this job, a Jenkins admin needs to approve some Java methods.
            There are two possible ways to do this:
            1. (better) run this code in the Jenkins script console (URL: /script):
            import org.jenkinsci.plugins.scriptsecurity.scripts.ScriptApproval;
            def scriptApproval = ScriptApproval.get()
            scriptApproval.approveSignature('method org.jenkinsci.plugins.workflow.steps.FlowInterruptedException getCauses')
            scriptApproval.approveSignature('method org.jenkinsci.plugins.workflow.support.steps.input.Rejection getUser')
            scriptApproval.save()
        '''.stripIndent())
        return null
    }
    if (throw_again) {
        throw e
    } else {
        return null
    }
}
And now, you may catch it with something like this:
try {
    ...
} catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException err) {
    if (is_interrupted_by_timeout(err)) {
        echo('It is timeout!')
    }
}
P.S. I agree, this is bad Jenkins design.

Waiting for running Reactor Mono instances to complete

I wrote this code to spin off a large number of WebClient requests (limited by reactor.ipc.netty.workerCount), start each Mono immediately, and wait for all the Monos to complete:
List<Mono<List<MetricDataModel>>> monos = new ArrayList<>(metricConfigs.size());
for (MetricConfig metricConfig : metricConfigs) {
    try {
        monos.add(extractMetrics.queryMetricData(metricConfig)
                .doOnSuccess(result -> {
                    metricDataList.addAll(result);
                })
                .cache());
    } catch (Exception e) {
    }
}
Mono.when(monos)
        .doFinally(onFinally -> {
            Map<String, Date> latestMap;
            try {
                latestMap = extractInsights.queryInsights();
                Transform transform = new Transform(copierConfig.getEventType());
                ArrayList<Event> eventList = transform.toEvents(latestMap, metricDataList);
            } catch (Exception e) {
                log.error("copy: mono: when: {}", e.getMessage(), e);
            }
        })
        .block();
It 'works'; that is, the results are as expected.
Two questions:
Is this correct? Does cache() result in when() waiting for all the Monos to complete?
Is it efficient? Is there a way to make this faster?
Thanks.
You should try as much as possible to:
use Reactor operators and compose a single reactive chain
avoid using doOn* operators for something other than side-effects (like logging)
avoid shared state
Your code could look a bit more like this:
List<MetricConfig> metricConfigs = //...
Mono<List<MetricDataModel>> data = Flux.fromIterable(metricConfigs)
        .flatMap(config -> extractMetrics.queryMetricData(config))
        .flatMapIterable(list -> list) // flatten each List<MetricDataModel> into its elements
        .collectList();
Also, the cache() operator does not wait for the stream to complete (that is actually then()'s job).
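For what it's worth, here is a sketch of what that single chain could look like once the post-processing from the question's doFinally is folded in. The extractMetrics, extractInsights, Transform and copierConfig names are taken from the question; the error handling and the blocking call at the edge are assumptions, not a drop-in replacement:
// Sketch only: Flux/Mono from reactor-core, the rest from the question's code.
Mono<ArrayList<Event>> events = Flux.fromIterable(metricConfigs)
        .flatMap(config -> extractMetrics.queryMetricData(config)
                // mirrors the question's try/catch: a failing config is skipped, not fatal
                .onErrorResume(e -> {
                    log.error("queryMetricData failed: {}", e.getMessage(), e);
                    return Mono.empty();
                }))
        .flatMapIterable(list -> list)       // Flux<List<MetricDataModel>> -> Flux<MetricDataModel>
        .collectList()
        .map(metricDataList -> {
            // the work that used to live in doFinally, now part of the chain
            // (if queryInsights() blocks, consider Mono.fromCallable(...).subscribeOn(Schedulers.boundedElastic()))
            Map<String, Date> latestMap = extractInsights.queryInsights();
            Transform transform = new Transform(copierConfig.getEventType());
            return transform.toEvents(latestMap, metricDataList);
        });

ArrayList<Event> eventList = events.block(); // block only at the outermost edge, if at all
This keeps all the state inside the chain, so there is no shared metricDataList to mutate concurrently.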
