JSR-352 Java Batch: why does JobListener.afterJob() always get batch status STARTED?

I'm a Java Batch newbie. I deployed a simple batch job that includes a JobListener on WebSphere Liberty 17.0.0.4 (note that I'm using IBM's JSR-352 implementation, not Spring Batch). The batch job itself runs as expected: it reads an input file, does a simple data transformation, and writes to a DB. But in the JobListener's afterJob() method for a successful execution, I see a batch status of STARTED and an exit status of null. (These are the same values I see logged in the beforeJob() method). I expected afterJob() to see a status of COMPLETED unless there were exceptions.
The execution id logged by afterJob() is the same value returned by JobOperator.start() when I kick off the job, so I know I'm getting the status of the correct job execution.
I couldn't find any examples of a JobListener that fetches a batch status, so there's probably a simple error in my JSL, or I'm fetching the batch status incorrectly. Or do I need to explicitly set a status somewhere in the implementation of the step? I'd appreciate any pointers to the correct technique for setting and getting a job execution's final batch status and exit status.
Here's the JSL:
<job ...>
    <properties>...</properties>
    <listeners>
        <listener ref="jobCompletionNotificationListener"/>
    </listeners>
    <flow id="flow1">
        <step id="step1">...</step>
    </flow>
</job>
Here's the listener's definition in batch.xml:
<ref id="jobCompletionNotificationListener"
     class="com.llbean.batch.translatepersonnames.jobs.JobCompletedListener"/>
Here's the JobListener implementation:
@Dependent
@Named("jobCompletedListener")
public class JobCompletedListener implements JobListener {
    ...
    @Inject
    private JobContext jobContext;

    @Override
    public void afterJob() {
        long executionId = jobContext.getExecutionId();
        JobExecution jobExecution = BatchRuntime.getJobOperator().getJobExecution(executionId);
        BatchStatus batchStatus = jobExecution.getBatchStatus();
        String exitStatus = jobExecution.getExitStatus();
        logger.info("afterJob(): Job id " + executionId + " batch status = " + batchStatus +
                ", exit status = " + exitStatus);
        ...
    }
}
I tried adding <end on="*" exit-status="COMPLETED"/> to the <job> and <flow> in the JSL, but that had no effect or resulted in a status of FAILED.

Good question. Let me tack on a couple of points to @cheng's answer.
First, to understand why we implemented it this way, consider the case where the JobListener throws an exception. Should that fail the job? In Liberty we decided it should. But if the job already had a COMPLETED status, well, that would imply that it was... completed, and shouldn't be able to fail at that point.
So afterJob() is really more like "end of job" (or you could think of it as "after job steps").
Second, one reason to even ask this question is that you want to know, in the afterJob() method, whether the job executed successfully or not.
Well, in the Liberty implementation at least (which I work on, for IBM), you can indeed tell: a previous failure will have set the BatchStatus to FAILED, while a successful (so far) execution will still have its status at STARTED.
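So inside afterJob() you could branch on that. A minimal sketch of the check just described, reusing the injected jobContext and logger from the question (this relies on Liberty's behavior, not something the 1.0 spec guarantees):
JobExecution execution = BatchRuntime.getJobOperator()
        .getJobExecution(jobContext.getExecutionId());
if (execution.getBatchStatus() == BatchStatus.FAILED) {
    // a step failed earlier in this execution
    logger.info("afterJob(): job failed");
} else {
    // still STARTED here means no failure so far
    logger.info("afterJob(): job successful so far");
}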
(For what it's worth, this was an area we realized could use more attention and standardization a bit late in the 1.0 spec effort, and I hope we can address more in the future.)
If it helps and you're interested, you can see the basic logic in the flow leading up to and including the WorkUnitThreadControllerImpl.endOfWorkUnit call here.

It's because the job listener's afterJob() method is part of the job execution. So when you call getBatchStatus() inside afterJob(), the job execution is still going on and not yet completed, hence the batch status STARTED.
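That also means the final status can only be observed after the execution ends, e.g. from the code that submitted the job. A minimal polling sketch, assuming a placeholder job name "myJob" and a 500 ms poll interval (the JobExecution is re-fetched each pass because implementations may hand back snapshots; InterruptedException handling is omitted):
// needs javax.batch.operations.JobOperator, javax.batch.runtime.*, java.util.Properties
JobOperator jobOperator = BatchRuntime.getJobOperator();
long executionId = jobOperator.start("myJob", new Properties());
JobExecution execution = jobOperator.getJobExecution(executionId);
while (execution.getBatchStatus() == BatchStatus.STARTING
        || execution.getBatchStatus() == BatchStatus.STARTED) {
    Thread.sleep(500); // crude polling; adjust for your job's runtime
    execution = jobOperator.getJobExecution(executionId);
}
System.out.println("Final batch status: " + execution.getBatchStatus()
        + ", exit status: " + execution.getExitStatus());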


Is there a way to make a repeatedly-forever Apache Beam trigger execute only after the previous execution is completed?

I am using a global window with a repeatedly-forever after-processing-time trigger to process streaming data from Pub/Sub, as below:
PCollection<KV<String, SMMessage>> perMSISDNLatestEvents = messages
    .apply("Apply global window", Window.<SMMessage>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1))))
        .discardingFiredPanes())
    .apply("Convert into kv of msisdn and SM message", ParDo.of(new SmartcareMessagetoKVFn()))
    .apply("Get per MSISDN latest event", Latest.perKey())
    .apply("Write into Redis", ParDo.of(new WriteRedisFn()));
Is there a way to make the repeatedly-forever trigger execute only after the previous execution is completed? The reason for my question is that the next trigger's processing will need to read data from Redis that was written by the previous trigger execution.
Thank you
The trigger here fires at the interval you provided. The trigger is not aware of any downstream processing, so it is unable to depend on such steps of your pipeline.
Instead of depending on the trigger for consistency here, you could add a barrier (a DoFn) before the Write step that only gives up execution after it sees the previous data in Redis, as sketched below.
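For illustration, a rough sketch of such a barrier (the RedisBarrierFn name, the Jedis client, the endpoint, and the per-key scheme are all assumptions; the very first firing would need its key pre-seeded, and blocking a worker thread like this is a blunt instrument):
// assumes org.apache.beam.sdk.* and redis.clients.jedis.Jedis on the classpath
static class RedisBarrierFn extends DoFn<KV<String, SMMessage>, KV<String, SMMessage>> {
    @ProcessElement
    public void processElement(ProcessContext c) throws InterruptedException {
        KV<String, SMMessage> element = c.element();
        try (Jedis jedis = new Jedis("redis-host", 6379)) {  // assumed endpoint
            // wait until the previous firing's write for this key is visible
            while (!jedis.exists(element.getKey())) {
                Thread.sleep(500);  // crude back-off
            }
        }
        c.output(element);
    }
}
You would then insert it just before the write: .apply("Barrier", ParDo.of(new RedisBarrierFn())).apply("Write into Redis", ParDo.of(new WriteRedisFn())).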
You could also try to explicitly declare the global window trigger, as in the example below:
Trigger subtrigger = AfterProcessingTime.pastFirstElementInPane();
Trigger maintrigger = Repeatedly.forever(subtrigger);
// then apply it: Window.<SMMessage>into(new GlobalWindows()).triggering(maintrigger)...
I think triggers would help in your case, since they let you control when panes fire, so the repeatedly-forever firing would only happen after the previous trigger has finished. I found this documentation, which might guide you on the triggers you are trying to create.

Is DRBD split-brain handler called only when after-sb-Xpri is set to disconnect?

I want to run a script when a split brain is detected. The documentation mentions that we can do that by providing the path of the script, like this:
resource <resource>
  handlers {
    split-brain <handler>;
    ...
  }
But below that, for the after-sb-0pri configuration, it is mentioned that:
disconnect: Do not recover automatically, simply invoke the split-brain handler script (if configured), drop the connection and continue in disconnected mode.
So my question is: will the configured script be run only when after-sb-0pri is set to disconnect, or will it run for any configured value?
Document Link: https://linbit.com/drbd-user-guide/users-guide-drbd-8-4/#s-configure-split-brain-behavior
DRBD should invoke the split-brain handler whenever a split brain is detected, i.e. anything that would log a "Split-Brain detected ..." message in the system/kernel logs. The documentation attempts to explain this at the beginning of chapter 5.17.1:
DRBD invokes the split-brain handler, if configured, at any time split
brain is detected.
Additionally, disconnect is the default value for after-sb-0pri. So even if it is not explicitly set, that will still be the behavior.
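Putting both pieces together, a minimal resource stanza might look like this (the resource name and handler path are placeholders; DRBD ships an example notify-split-brain.sh handler, but any executable script works):
resource r0 {
  handlers {
    # invoked whenever a split brain is detected, regardless of the after-sb-* policies
    split-brain "/usr/lib/drbd/notify-split-brain.sh root";
  }
  net {
    after-sb-0pri disconnect;   # explicit here, but disconnect is already the default
  }
  ...
}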

How can I programmatically cancel a Dataflow job that has run for too long?

I'm using Apache Beam on Dataflow through the Python API to read data from BigQuery, process it, and dump it into a Datastore sink.
Unfortunately, quite often the job just hangs indefinitely and I have to stop it manually. While the data does get written to Datastore and Redis, I can see from the Dataflow graph that only a couple of entries get stuck and leave the job hanging.
As a result, when a job with fifteen 16-core machines is left running for 9 hours (normally the job runs for 30 minutes), it leads to huge costs.
Maybe there is a way to set a timer that would stop a Dataflow job if it exceeds a time limit?
It would be great if you could create a customer support ticket so we could try to debug this with you.
Maybe there is a way to set a timer that would stop a Dataflow job if
it exceeds a time limit?
Unfortunately the answer is no: Dataflow does not have an automatic way to cancel a job after a certain time. However, it is possible to do this using the APIs: call wait_until_finish() with a timeout, then cancel() the pipeline.
You would do this like so:
p = beam.Pipeline(options=pipeline_options)
p | ... # Define your pipeline code
pipeline_result = p.run() # submits the job and returns without blocking
pipeline_result.wait_until_finish(duration=TIME_DURATION_IN_MS)
pipeline_result.cancel() # If the pipeline has not finished, you can cancel it
To sum up, with the help of @ankitk's answer, this works for me (Python 2.7, SDK 2.14):
pipe = beam.Pipeline(options=pipeline_options)
... # main pipeline code
run = pipe.run() # doesn't do anything
run.wait_until_finish(duration=3600000) # (ms) actually starts a job
run.cancel() # cancels if can be cancelled
Thus, if the job finished successfully within the duration passed to wait_until_finish(), then cancel() will just print a warning ("already closed"); otherwise it will close the running job.
P.S. If you try to print the state of the job:
state = run.wait_until_finish(duration=3600000)
logging.info(state)
it will be RUNNING for a job that didn't finish within wait_until_finish(), and DONE for a finished job.
Note: this technique will not work when running Beam from within a Flex Template job...
The run.cancel() method doesn't work if you are writing a template, and I haven't seen any successful workaround for it...

How to make Postman/Newman to fail a test after certain time has passed?

So, in my collection I have about ten requests, with the last two being:
/Wait 10 seconds
/Check Complete
The first makes a call to Postman's echo service (delay by 10 seconds) and the second is the call to my system to check for the complete status. Now, if the status is unavailable, I wait another 10s:
postman.setNextRequest("Wait 10 seconds");
The complete status on my system can appear in a minute or so. Now, as one can see, this is an infinite loop if something goes wrong with the system and the status never becomes complete. Is there a way in a Postman/Newman test to fail the run if it has been going for more than 2 minutes, for example?
Additionally, this will be executed in Jenkins from the command line, so I am not really looking at Postman settings or delays between requests in the runner.
You may have a look at the newman options here: https://www.npmjs.com/package/newman#newman-run-collection-file-source-options. The interesting option is --timeout-request: it will surely fulfill your need.
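For instance (the collection file name is a placeholder; values are in milliseconds):
newman run my-collection.json --timeout-request 120000
Note that --timeout-request caps each individual request; newman also has a --timeout option that caps the entire collection run, which may match the 2-minute budget more directly:
newman run my-collection.json --timeout 120000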
In Postman itself, you may test responseTime. I recall that there is a snippet, in the snippets panel on the right, which looks like this:
tests["Response time is less than 200ms"] = responseTime < 200;
and which could help you, as the test fails if the response does not arrive within the requested time.
Alexandre
If you are going to be using a Jenkins pipeline, you can use the timeout step to cause long-running jobs to result in failure; here's one for 2 minutes:
timeout(time: 2, unit: 'MINUTES') {
    node {
        sh 'newman command'
    }
}
Check out the "Pipeline Syntax" editor in Jenkins to generate your code block, and look for other useful functions.

How can I set the job timeout for all jobs using the Jenkins DSL

I read How can I set the job timeout using the Jenkins DSL. That sets the timeout for one job; I want to set it for all jobs, and with slightly different settings: 150%, averaged over the last 10 builds, with a max of 30 minutes.
According to the relevant job-dsl-plugin documentation I should use this syntax:
job('example-3') {
    wrappers {
        timeout {
            elastic(150, 10, 30)
            failBuild()
            writeDescription('Build failed due to timeout after {0} minutes')
        }
    }
}
I tested in http://job-dsl.herokuapp.com/ and this is the relevant XML part:
<buildWrappers>
  <hudson.plugins.build__timeout.BuildTimeoutWrapper>
    <strategy class='hudson.plugins.build_timeout.impl.ElasticTimeOutStrategy'>
      <timeoutPercentage>150</timeoutPercentage>
      <numberOfBuilds>10</numberOfBuilds>
      <timeoutMinutesElasticDefault>30</timeoutMinutesElasticDefault>
    </strategy>
    <operationList>
      <hudson.plugins.build__timeout.operations.FailOperation></hudson.plugins.build__timeout.operations.FailOperation>
      <hudson.plugins.build__timeout.operations.WriteDescriptionOperation>
        <description>Build failed due to timeout after {0} minutes</description>
      </hudson.plugins.build__timeout.operations.WriteDescriptionOperation>
    </operationList>
  </hudson.plugins.build__timeout.BuildTimeoutWrapper>
</buildWrappers>
I verified with a job I edited manually before, and the XML is correct. So I know that the Jenkins DSL syntax up to here is correct.
Now I want to apply this to all jobs. First I tried to list all the job names:
import jenkins.model.*
jenkins.model.Jenkins.instance.items.findAll().each {
    println("Job: " + it.name)
}
This works too; all job names are printed to the console.
Now I want to plug it all together. This is the full code I use:
import jenkins.model.*
jenkins.model.Jenkins.instance.items.findAll().each {
    job(it.name) {
        wrappers {
            timeout {
                elastic(150, 10, 30)
                failBuild()
                writeDescription('Build failed due to timeout after {0} minutes')
            }
        }
    }
}
When I push this code and Jenkins runs the DSL seed job, I get this error:
ERROR: Type of item "jobname" does not match existing type, item type can not be changed
What am I doing wrong here?
The Job-DSL plugin can only be used to maintain jobs that were created by that plugin in the first place. You're trying to modify the configuration of jobs that were created in some other way; this will not work.
For mass-modification of existing jobs (like, in your case, adding the timeout), the most straightforward way is to change the job's XML specification directly,
either by changing the config.xml file on disk, or
by using the REST or CLI API, as in the example below.
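For example, via the REST API (host, job name, and credentials are placeholders):
curl -u user:apitoken https://jenkins.example.com/job/example-3/config.xml -o config.xml
# edit config.xml locally, e.g. splicing in the BuildTimeoutWrapper block shown above, then:
curl -u user:apitoken -X POST https://jenkins.example.com/job/example-3/config.xml --data-binary @config.xml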
xmlstarlet is a powerful tool for performing such edits directly at the shell level.
Alternatively, it is possible to perform the change via a Groovy script from the "Script Console" (a rough sketch follows), but for that you need some understanding of Jenkins' internal workings and data structures.
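Here is such a sketch (test it on a single job first; it assumes freestyle jobs whose config.xml contains a non-empty <buildWrappers> element, and it skips jobs that already have a timeout wrapper):
import jenkins.model.Jenkins
import hudson.model.AbstractItem
import javax.xml.transform.stream.StreamSource

// The wrapper block generated by the Job-DSL snippet above.
def wrapperXml = '''<hudson.plugins.build__timeout.BuildTimeoutWrapper>
  <strategy class='hudson.plugins.build_timeout.impl.ElasticTimeOutStrategy'>
    <timeoutPercentage>150</timeoutPercentage>
    <numberOfBuilds>10</numberOfBuilds>
    <timeoutMinutesElasticDefault>30</timeoutMinutesElasticDefault>
  </strategy>
  <operationList>
    <hudson.plugins.build__timeout.operations.FailOperation/>
  </operationList>
</hudson.plugins.build__timeout.BuildTimeoutWrapper>'''

Jenkins.instance.getAllItems(AbstractItem.class).each { item ->
    def xml = item.getConfigFile().asString()
    // splice the wrapper in right after the opening <buildWrappers> tag
    if (!xml.contains('BuildTimeoutWrapper') && xml.contains('<buildWrappers>')) {
        def updated = xml.replace('<buildWrappers>', '<buildWrappers>' + wrapperXml)
        item.updateByXml(new StreamSource(new ByteArrayInputStream(updated.getBytes('UTF-8'))))
        println("Updated: " + item.fullName)
    }
}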
