In an EMR cluster, I run multiple Spark steps.
Steps may or may not have the same name.
I want to monitor the number of failed steps, grouped by the step name.
EMR emits EventBridge events on step status changes, but I want numbers: the goal is to trigger an alarm if more than (say) 5 steps with the same name failed within (say) the last hour.
I was hoping to get a CloudWatch metric counting failed steps, with the step name as a dimension. Can I achieve that?
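As far as I know, EMR's built-in CloudWatch metrics don't break failures down by step name, but the EventBridge events mentioned above can be turned into exactly that: a rule matching failed EMR steps can invoke a small Lambda that publishes a custom metric with the step name as a dimension. Below is a minimal sketch of such a handler; the namespace Custom/EMRSteps and metric name FailedSteps are made up, and the detail fields are the ones the "EMR Step Status Change" event normally carries, so verify them against a real event.

```python
# Hypothetical Lambda target for an EventBridge rule matching
# source "aws.emr", detail-type "EMR Step Status Change", detail.state "FAILED".
# It publishes one data point per failed step, with the step name as a dimension.
import boto3

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    detail = event.get("detail", {})
    step_name = detail.get("name", "unknown")   # step name as reported in the event

    cloudwatch.put_metric_data(
        Namespace="Custom/EMRSteps",            # made-up namespace
        MetricData=[{
            "MetricName": "FailedSteps",        # made-up metric name
            "Dimensions": [{"Name": "StepName", "Value": step_name}],
            "Value": 1,
            "Unit": "Count",
        }],
    )
    return {"stepName": step_name}
```

A CloudWatch alarm on FailedSteps for a given StepName, with statistic Sum, a 3600-second period and a threshold of 5, then matches the "more than 5 failures in the last hour" goal.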
I have a scenario with N nodes and N tests, and the tests are distributed across the nodes. My nodes have the label Windows.
Here's an example:
I have a pipeline job that manages the distribution of the tests to the nodes. If I set the pipeline job to run 10 tests on 10 VMs with the label Windows, it runs smoothly. However, one of my requirements is to run that pipeline job concurrently, and that's where the problem appears: if the first run puts 10 tests on VMs 1-10 and a second run starts with 5 tests meant for VMs 11-15, then, because both runs use the Windows label, Jenkins might assign tests intended for VMs 11-15 to VMs 1-10, or vice versa.
The solution I came up with is to dynamically change the label of the VMs from one of the jobs to a unique label that will only be used for that job. Unfortunately, I still don't know how to do that.
Basically, I just want to logically group my nodes via label on demand in my pipeline script.
I've searched all over the internet and still wasn't able to find a solution that fits my needs.
Kindly help me with this. I'm also open to using a different approach.
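One way to script the "unique label per run" idea from this question is through each agent's config.xml, which Jenkins exposes over HTTP at /computer/<node>/config.xml. A rough sketch in Python follows; the Jenkins URL, credentials, node names and labels are placeholders, and you would relabel the nodes back (e.g. in a cleanup step) once the run finishes.

```python
# Sketch: give a group of agents a run-specific label so only one pipeline run
# targets them. URL, credentials, node names and labels are placeholders.
import re
import requests
from requests.auth import HTTPBasicAuth

JENKINS = "https://jenkins.example.com"
AUTH = HTTPBasicAuth("user", "api-token")   # API-token auth; older Jenkins may also need a CSRF crumb

def set_node_label(node, label):
    cfg_url = f"{JENKINS}/computer/{node}/config.xml"
    xml = requests.get(cfg_url, auth=AUTH).text
    # Swap the contents of the <label> element for the new label.
    xml = re.sub(r"<label>.*?</label>", f"<label>{label}</label>", xml, flags=re.S)
    requests.post(cfg_url, data=xml, auth=AUTH,
                  headers={"Content-Type": "application/xml"}).raise_for_status()

run_label = "windows-run-42"                 # unique per pipeline run
for vm in (f"vm{i}" for i in range(11, 16)): # reserve VMs 11-15 for this run
    set_node_label(vm, run_label)
# Point the run's agent/label expression at run_label, then restore "Windows"
# on the same nodes when the run is done.
```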
I'm running a Google Cloud Dataflow job and want individual execution times for all the steps in my pipeline, including nested transforms. It's a streaming pipeline, and it currently looks like this:
[screenshot: current Dataflow pipeline]
Can anyone please suggest a solution?
The answer is wall time. You can see it by clicking any of the steps in your pipeline (even nested ones).
The elapsed time of a job is the total time it takes for the whole Dataflow job to complete, while wall time is the sum of the time the assigned workers spend running each step.
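If you want the same per-step numbers programmatically rather than by clicking through the monitoring UI, the Dataflow service exposes them through the projects.locations.jobs.getMetrics method of its REST API. A rough sketch with the Google API client is below; project, region and job ID are placeholders, and the exact metric names you get back vary by SDK/runner version, so inspect the response before filtering.

```python
# Sketch: list per-step metrics for a Dataflow job via the v1b3 REST API.
# Project, region and job ID are placeholders; uses application default credentials.
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")
resp = (
    dataflow.projects()
    .locations()
    .jobs()
    .getMetrics(projectId="my-project", location="us-central1", jobId="JOB_ID")
    .execute()
)

for metric in resp.get("metrics", []):
    name = metric["name"]                        # structured name: name, origin, context
    step = name.get("context", {}).get("step")   # transform/step the metric belongs to, if any
    print(name.get("name"), step, metric.get("scalar"))
```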
IBM's implementation of JSR-352 provides a REST API that can be used to trigger jobs, restart them, and get the job logs. Can it also be used to get the status of each step and of each partition of a step?
I want to build a job monitoring console from which I can trigger jobs and monitor the status of the steps and partitions in real time, without actually having to look into the job log (after I trigger the job, it should periodically give me the status of the steps and partitions).
How should I go about doing this?
You can subscribe to our batch events, a JMS topic tree where we publish messages at various stages in the batch job lifecycle (job started/ended, step checkpointed, etc.).
See the Knowledge Center documentation and this whitepaper as well for more information.
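For the polling side of the question, the Liberty batch REST API also exposes step executions (and their partitions) under each job execution, so a console can poll that resource instead of, or alongside, the JMS events. A rough sketch follows; the host, credentials and JSON field names are assumptions based on the documented /ibm/api/batch/* resources, so verify them against the Knowledge Center for your Liberty version.

```python
# Sketch: poll step/partition status for a job execution via Liberty's batch
# REST API. Host, credentials and field names are placeholders/assumptions.
import time
import requests

BASE = "https://liberty-host:9443/ibm/api/batch"
AUTH = ("batchAdmin", "password")

def poll_steps(job_execution_id, interval=10):
    while True:
        resp = requests.get(f"{BASE}/jobexecutions/{job_execution_id}/stepexecutions",
                            auth=AUTH, verify=False)   # verify=False only for a quick local test
        resp.raise_for_status()
        steps = resp.json()
        for step in steps:
            partitions = [p.get("batchStatus") for p in step.get("partitions", [])]
            print(step.get("stepName"), step.get("batchStatus"), partitions)
        if steps and all(s.get("batchStatus") in ("COMPLETED", "FAILED", "STOPPED")
                         for s in steps):
            return
        time.sleep(interval)
```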
I have around three Jenkins slaves that are configured to run the same job, allowing only one concurrent run on each slave. Each of these slaves is connected to an embedded device that we run the job on. The total duration of the job is around 2 hours: the first 1 hour 50 minutes is spent compiling and configuring the slave, and only the last 10 minutes actually use the embedded device. So I'm looking for something I can lock on for just those last 10 minutes, which would allow us to run multiple concurrent builds on the same slave.
Locks from the Locks and Latches plugin are shared across nodes.
What I am looking for is a node-specific lock.
If you can separate the problematic section from the compilation process, you can create another job to handle the last 10 minutes and call it using the Parameterized Trigger Plugin. That job will run one instance at a time and will act as a native blocker for the run. That way you can configure concurrent executions and throttling (if needed) on the main job, and create a "gate" in front of the problematic section.
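If you end up scripting the hand-off instead of using the plugin's post-build trigger, the same gate job can be called over Jenkins' buildWithParameters endpoint, passing the current node so the device-dependent part targets the right hardware. A rough sketch is below; the job name, parameter name and credentials are made up, and note that unlike the plugin's "block until the triggered build completes" option, this call returns as soon as the build is queued, so the main job would still need to wait on the queued build if it must block.

```python
# Sketch: trigger the single-instance "gate" job from the main build, passing
# the node it ran on. Job name, parameter name and credentials are placeholders.
import os
import requests
from requests.auth import HTTPBasicAuth

JENKINS = "https://jenkins.example.com"
AUTH = HTTPBasicAuth("user", "api-token")

resp = requests.post(
    f"{JENKINS}/job/device-test-gate/buildWithParameters",
    params={"TARGET_NODE": os.environ.get("NODE_NAME", "")},  # NODE_NAME is set by Jenkins during a build
    auth=AUTH,
)
resp.raise_for_status()   # 201 means the build was queued, not that it finished
```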
I set my Jenkins job to build automatically many times a day via the scheduler.
If the build fails, it sends mail to my team.
However, I don't want to spam the mailbox. How can I set a condition to stop the scheduled builds if the job has failed more than 10 times?
Rather than scheduling the job continuously, try the continuous integration paradigm, like this:
Unconditionally schedule the job only rarely. Perhaps once per day, just to ensure that any external factors (missing resources, changed interfaces, etc.) haven't come into play.
Trigger the job when any known source or dependency changes (e.g. source code, jar in your artifact repository, DB schema change, etc.)
Use a suitable plugin to retry failures.
I recommend the Naginator plugin for this. It can nag a limited number of times, and it auto-throttles: it nags frequently to begin with, then less frequently after a protracted period of failure.
Even if you don't change how the job is triggered, Naginator is probably a good solution for you. Use it to send your emails, instead of an unconditional on-failure step.